infinite crawl #12
Comments
A site will have a finite number of pages, and the crawler avoids cycles by keeping a set of previously visited URLs. For example, if page A links to B, B is crawled, and B links back to A, then A won't be recrawled, since its URL is already in the set of seen URLs. In addition, all URLs are normalized before they are crawled and stored in the visited-URLs set, which helps avoid duplicate page crawls. Here's an example: http://foo.com/people?age=30&filter=joe&sort=up. Variants of this URL (say, with the query parameters in a different order) differ as strings, but in most cases they produce the same response, so normalization collapses them into a single entry. You can read more about roboto's normalization routine here: https://github.com/jculvey/roboto#url-normalization
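For illustration, here is a minimal sketch of the two mechanisms described above: a normalization step plus a visited-URL set. The helper names (normalizeUrl, shouldCrawl) and the exact normalization rules (sorted query parameters, lowercased host, stripped fragment and trailing slash) are assumptions made for this example, not roboto's API; roboto's actual rules are documented at the README link above.

```ts
// Illustrative sketch only; normalizeUrl and shouldCrawl are hypothetical names,
// and the normalization rules here are assumptions, not roboto's implementation.
import { URL } from "node:url";

function normalizeUrl(raw: string, base?: string): string {
  const url = new URL(raw, base);
  url.hostname = url.hostname.toLowerCase();
  url.hash = "";               // fragments don't change the response
  url.searchParams.sort();     // ?b=2&a=1 and ?a=1&b=2 become identical
  // Drop a trailing slash on the path (except for the root path).
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}

const visited = new Set<string>();

// Returns true the first time a (normalized) URL is seen and false on every
// revisit, so a cycle like A -> B -> A terminates after each page is fetched once.
function shouldCrawl(raw: string): boolean {
  const normalized = normalizeUrl(raw);
  if (visited.has(normalized)) return false;
  visited.add(normalized);
  return true;
}

// These two strings differ, but they normalize to the same URL,
// so the second call returns false and the page is not re-crawled.
shouldCrawl("http://foo.com/people?age=30&filter=joe&sort=up"); // true
shouldCrawl("http://foo.com/people?sort=up&filter=joe&age=30"); // false
```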
One last question here: does it support a stop-and-resume feature?
Nope, not yet. Sorry :/ It's one of the things people have asked for. I'll look into adding it soon. Would having something like Redis or SQLite as a dependency be an issue for you?
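To make that trade-off concrete, here is a hedged sketch of what a stop-and-resume feature could look like if the crawl frontier were persisted in SQLite (assuming the better-sqlite3 package). Nothing here exists in roboto today; the table layout and function names are illustrative only.

```ts
// Hypothetical sketch of SQLite-backed resumable crawl state; not roboto's API.
import Database from "better-sqlite3";

const db = new Database("crawl-state.db");
db.exec(
  "CREATE TABLE IF NOT EXISTS frontier (url TEXT PRIMARY KEY, done INTEGER NOT NULL DEFAULT 0)"
);

const enqueue  = db.prepare("INSERT OR IGNORE INTO frontier (url) VALUES (?)");
const nextUrl  = db.prepare("SELECT url FROM frontier WHERE done = 0 LIMIT 1");
const markDone = db.prepare("UPDATE frontier SET done = 1 WHERE url = ?");

// On startup, rows with done = 0 are simply picked up again, so a crawl that
// was stopped (or crashed) resumes where it left off.
async function resumeLoop(fetchAndExtractLinks: (url: string) => Promise<string[]>) {
  let row = nextUrl.get() as { url: string } | undefined;
  while (row) {
    const links = await fetchAndExtractLinks(row.url);
    for (const link of links) enqueue.run(link);
    markDone.run(row.url);
    row = nextUrl.get() as { url: string } | undefined;
  }
}
```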
Great, thanks. How about a Waterline adapter, so developers can choose any database available in the Node.js ecosystem? Here is the reference for the Waterline adapter.
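For reference, the appeal of an adapter layer is that the crawler would only depend on a small storage contract, and any backend (Redis, SQLite, MongoDB, and so on) could implement it. The interface below is a hypothetical TypeScript sketch of that idea, not Waterline's actual adapter API.

```ts
// Hypothetical storage contract the crawler could depend on; names are illustrative.
interface CrawlStateAdapter {
  markVisited(url: string): Promise<void>;
  isVisited(url: string): Promise<boolean>;
  enqueue(url: string): Promise<void>;
  dequeue(): Promise<string | undefined>;
}

// A trivial in-memory adapter, useful for tests or small crawls.
class MemoryAdapter implements CrawlStateAdapter {
  private visited = new Set<string>();
  private queue: string[] = [];

  async markVisited(url: string) { this.visited.add(url); }
  async isVisited(url: string) { return this.visited.has(url); }
  async enqueue(url: string) { this.queue.push(url); }
  async dequeue() { return this.queue.shift(); }
}
```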
Will this cause an infinite crawl for a bigger site? What strategy can be used to crawl a website efficiently?