Spring Crawl Configuration
Alex Osborne edited this page Jul 4, 2018 · 2 revisions
Heritrix3 now uses the Spring container (and its XML-based configuration format) to assemble a runnable crawl, choosing among alternative compatible implementations and settings values.
Developers will find it helpful to review the relevant chapter of Spring's reference documentation, which covers all the options provided by the container and its configuration format:
Spring Framework, Chapter 3: The IoC Container
Some key points for understanding this model:
- Applications are large groupings of collaborating components, and often components have alternate, swappable implementations. (In our case, one runnable crawl job, with chosen settings and options, is one application.)
- The configuration file(s) declare all participating components and, where necessary, their initial settings values.
- The 'container' uses the configuration file(s), plus other hints derived from the components themselves (like compatible types and settings names), to assemble all components with their initial state and direct references to their collaborators. If a component is needed (as implied by other components) but is insufficiently declared, the container reports an error.
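The points above can be illustrated with a minimal Spring XML sketch. It is not taken from a Heritrix job configuration: the `com.example.crawl` classes, the bean ids, and the `timeoutSeconds` property are hypothetical names chosen for illustration, and only the `<beans>`, `<bean>`, and `<property>` elements and the `autowire` attribute are standard Spring.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only: all com.example.crawl classes, bean ids,
     and property names below are hypothetical, not Heritrix classes. -->
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd">

  <!-- A participating component, declared with an initial settings value. -->
  <bean id="fetcher" class="com.example.crawl.SimpleFetcher">
    <property name="timeoutSeconds" value="20"/>
  </bean>

  <!-- A swappable implementation: editing the class attribute selects an
       alternative, provided it is type-compatible with what its
       collaborators expect. -->
  <bean id="frontier" class="com.example.crawl.InMemoryFrontier"/>

  <!-- A component wired to its collaborators. The fetcher is referenced
       explicitly by id; autowire="byType" asks the container to fill the
       remaining settable properties from declared beans of a matching type. -->
  <bean id="controller" class="com.example.crawl.Controller" autowire="byType">
    <property name="fetcher" ref="fetcher"/>
  </bean>

</beans>
```

In a file like this, a `ref` that names a bean id not declared anywhere would cause the container to fail while assembling the application, which is one concrete form of the "insufficiently declared" error described above.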