infinite crawl #12
Comments
A site will have a finite number of pages, and the crawler avoids cycles by keeping a set of previously visited URLs. For example, if page A links to B, B is crawled, and B links back to A, then A won't be recrawled, since its URL is already in the set of seen URLs. In addition, all URLs are normalized before they are crawled and stored in the visited-URLs set, which helps avoid duplicate page crawls. Here's an example: http://foo.com/people?age=30&filter=joe&sort=up. Variants of this URL (say, with the query parameters in a different order) differ as strings, but in most cases they produce the same response, so normalization collapses them into a single entry. You can read more about roboto's normalization routine here: https://github.com/jculvey/roboto#url-normalization
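For illustration, here is a minimal sketch of the two mechanisms described above: a normalization step plus a visited-URL set. The helper names (normalizeUrl, shouldCrawl) and the exact normalization rules (sorted query parameters, lowercased host, stripped fragment and trailing slash) are assumptions made for this example, not roboto's API; roboto's actual rules are documented at the README link above.

```ts
// Illustrative sketch only; normalizeUrl and shouldCrawl are hypothetical names,
// and the normalization rules here are assumptions, not roboto's implementation.
import { URL } from "node:url";

function normalizeUrl(raw: string, base?: string): string {
  const url = new URL(raw, base);
  url.hostname = url.hostname.toLowerCase();
  url.hash = "";               // fragments don't change the response
  url.searchParams.sort();     // ?b=2&a=1 and ?a=1&b=2 become identical
  // Drop a trailing slash on the path (except for the root path).
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}

const visited = new Set<string>();

// Returns true the first time a (normalized) URL is seen and false on every
// revisit, so a cycle like A -> B -> A terminates after each page is fetched once.
function shouldCrawl(raw: string): boolean {
  const normalized = normalizeUrl(raw);
  if (visited.has(normalized)) return false;
  visited.add(normalized);
  return true;
}

// These two strings differ, but they normalize to the same URL,
// so the second call returns false and the page is not re-crawled.
shouldCrawl("http://foo.com/people?age=30&filter=joe&sort=up"); // true
shouldCrawl("http://foo.com/people?sort=up&filter=joe&age=30"); // false
```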
One last question here: does it support a stop-and-resume feature?
Nope, not yet. Sorry :/ It's one of the things people have asked for. I'll look into adding it soon. Would having something like Redis or SQLite as a dependency be an issue for you?
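To make that trade-off concrete, here is a hedged sketch of what a stop-and-resume feature could look like if the crawl frontier were persisted in SQLite (assuming the better-sqlite3 package). Nothing here exists in roboto today; the table layout and function names are illustrative only.

```ts
// Hypothetical sketch of SQLite-backed resumable crawl state; not roboto's API.
import Database from "better-sqlite3";

const db = new Database("crawl-state.db");
db.exec(
  "CREATE TABLE IF NOT EXISTS frontier (url TEXT PRIMARY KEY, done INTEGER NOT NULL DEFAULT 0)"
);

const enqueue  = db.prepare("INSERT OR IGNORE INTO frontier (url) VALUES (?)");
const nextUrl  = db.prepare("SELECT url FROM frontier WHERE done = 0 LIMIT 1");
const markDone = db.prepare("UPDATE frontier SET done = 1 WHERE url = ?");

// On startup, rows with done = 0 are simply picked up again, so a crawl that
// was stopped (or crashed) resumes where it left off.
async function resumeLoop(fetchAndExtractLinks: (url: string) => Promise<string[]>) {
  let row = nextUrl.get() as { url: string } | undefined;
  while (row) {
    const links = await fetchAndExtractLinks(row.url);
    for (const link of links) enqueue.run(link);
    markDone.run(row.url);
    row = nextUrl.get() as { url: string } | undefined;
  }
}
```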
Great, thanks. How about a Waterline adapter, so developers can choose any database available in the Node.js ecosystem? Here is the reference for the Waterline adapter.
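For reference, the appeal of an adapter layer is that the crawler would only depend on a small storage contract, and any backend (Redis, SQLite, MongoDB, and so on) could implement it. The interface below is a hypothetical TypeScript sketch of that idea, not Waterline's actual adapter API.

```ts
// Hypothetical storage contract the crawler could depend on; names are illustrative.
interface CrawlStateAdapter {
  markVisited(url: string): Promise<void>;
  isVisited(url: string): Promise<boolean>;
  enqueue(url: string): Promise<void>;
  dequeue(): Promise<string | undefined>;
}

// A trivial in-memory adapter, useful for tests or small crawls.
class MemoryAdapter implements CrawlStateAdapter {
  private visited = new Set<string>();
  private queue: string[] = [];

  async markVisited(url: string) { this.visited.add(url); }
  async isVisited(url: string) { return this.visited.has(url); }
  async enqueue(url: string) { this.queue.push(url); }
  async dequeue() { return this.queue.shift(); }
}
```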
Will this cause an infinite crawl for a bigger site? What strategy can be used to crawl a website efficiently?