Continuous Recrawling Phase C Design Notes

Alex Osborne edited this page Jul 4, 2018 · 2 revisions

Phase C Design Notes: Settings & Checkpoint Updates

Phase C covers the implementation of more sophisticated revisit heuristics and their testing through realistic crawling.

Plan: implement smarter revisit scheduling

  • allow different minimum and maximum revisit intervals for different sites and URL patterns
  • implement simple exponential growth/decay of the revisit interval in reaction to observed change
  • implement additional heuristics based on HTTP headers
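
The first two items can be illustrated with a minimal sketch. It is not Heritrix code: the `RevisitPolicy` class, the `POLICIES` list, the example hostnames, and the doubling/halving factor are all assumptions made for illustration. The idea shown is that the revisit interval shrinks when a change is observed, grows when the content is unchanged, and is always clamped to the minimum and maximum of the first matching URL pattern.

```python
import re
from dataclasses import dataclass


@dataclass
class RevisitPolicy:
    """Hypothetical per-pattern bounds; not an actual Heritrix setting."""
    pattern: str         # regex matched against the URL
    min_interval: float  # shortest allowed revisit interval, in seconds
    max_interval: float  # longest allowed revisit interval, in seconds


# Assumed example policies. The first match wins; the last entry is a catch-all.
POLICIES = [
    RevisitPolicy(r"^https?://news\.example\.com/", 3600, 86400),
    RevisitPolicy(r".*", 86400, 30 * 86400),
]


def policy_for(url):
    """Return the first policy whose pattern matches the URL."""
    for policy in POLICIES:
        if re.search(policy.pattern, url):
            return policy
    raise LookupError("no revisit policy matches " + url)


def next_interval(url, current, changed, factor=2.0):
    """Exponentially shrink the interval on change, grow it otherwise.

    The result is clamped to the [min, max] bounds of the matching policy.
    """
    policy = policy_for(url)
    proposed = current / factor if changed else current * factor
    return max(policy.min_interval, min(policy.max_interval, proposed))
```

With these assumed policies, a page on `news.example.com` currently revisited every two hours would move to one hour after an observed change and to four hours after an unchanged fetch, and would never drop below one hour or rise above one day.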

Plan: monitor progress/coverage & support operator adjustments

  • extend dynamic queue analysis to warn when target visitation rates are unattainable
  • add UI for viewing queue progress rates/expected finish times
  • provide bulk queue edit/reschedule operations
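
One way to picture the attainability warning is the sketch below. It assumes a simplified model that is not taken from Heritrix: each queue holds a fixed number of URLs, fetches are spaced by a fixed politeness delay, and every URL should be revisited once per target interval. The `QueueStats` class and all field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueueStats:
    """Hypothetical summary of one queue; field names are assumptions."""
    name: str
    url_count: int          # URLs that must be revisited
    target_interval: float  # desired seconds between visits to each URL
    fetch_delay: float      # seconds between consecutive fetches (politeness)


def pass_duration(queue):
    """Estimated seconds needed to visit every URL in the queue once."""
    return queue.url_count * queue.fetch_delay


def unattainable(queue):
    """True if one full pass takes longer than the target revisit interval."""
    return pass_duration(queue) > queue.target_interval


def warnings(queues):
    """Build one warning message per queue that cannot meet its target."""
    messages = []
    for queue in queues:
        if unattainable(queue):
            messages.append(
                "%s: full pass needs %.0f s but target interval is %.0f s"
                % (queue.name, pass_duration(queue), queue.target_interval)
            )
    return messages
```

Under this model, a queue of 100,000 URLs with a 2-second delay needs 200,000 seconds per pass, so a daily (86,400-second) target is flagged, while a 1,000-URL queue with the same settings is not. The same `pass_duration` figure could feed an expected-finish-time display.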

Plan: run 'concrete scenario' for 3 months+

  • use to evaluate performance & usability
