mids-251-web-crawler

A repository for all code written for the MIDS-251 Final project by Chuck, Rama, Chris, and Andrew.

This web crawler is a closed-domain crawler for information on medical doctors and procedures. It provides simple query capabilities for users interested in finding websites related to medical search terms. The application operates in four steps (an illustrative sketch of each step follows the list):

  • Given a set of seed URLs, the crawler identifies all URLs referenced in each seed page and stores them. The crawler can also be configured to process each new URL recursively, to as many levels of recursion as are specified.
  • Download the HTML from each valid URL produced by the previous step and store the content in a MongoDB NoSQL document store.
  • Parse the HTML documents stored in MongoDB to extract meaningful keywords and store the results in Cassandra, a NoSQL database.
  • Through a simple query interface, extract and serve the URLs relevant to a keyword search.
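
A minimal sketch of the first step, assuming a Python implementation built on requests and BeautifulSoup (the repository's actual code may differ); discover_urls, depth, and seen are illustrative names, not the project's API:

```python
# Depth-limited link discovery: collect every URL referenced by a seed
# page, then optionally recurse into each newly found URL.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def discover_urls(seed_url, depth=0, seen=None):
    """Return the set of URLs referenced by seed_url, recursing `depth` levels."""
    if seen is None:
        seen = set()
    try:
        html = requests.get(seed_url, timeout=10).text
    except requests.RequestException:
        return seen  # skip unreachable pages
    for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        url = urljoin(seed_url, anchor["href"])  # resolve relative links
        if url in seen:
            continue
        seen.add(url)
        if depth > 0:
            discover_urls(url, depth - 1, seen)  # recurse one level deeper
    return seen
```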
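
The download-and-store step might look like the following sketch, assuming a local MongoDB instance; the "crawler" database and "pages" collection names are assumptions:

```python
# Fetch a page and persist its raw HTML as one MongoDB document.
import requests
from pymongo import MongoClient

pages = MongoClient("mongodb://localhost:27017")["crawler"]["pages"]

def store_page(url):
    response = requests.get(url, timeout=10)
    if response.ok:
        # Upsert so re-crawling a URL refreshes its stored HTML.
        pages.replace_one(
            {"url": url},
            {"url": url, "html": response.text},
            upsert=True,
        )
```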
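
For the keyword-extraction step, a sketch along these lines reads each stored document, strips the markup, and writes (keyword, url) pairs to Cassandra; the "crawler" keyspace, the "keywords" table, and the naive tokenizer are all assumptions rather than the project's actual schema:

```python
# Extract keywords from stored HTML and index them in Cassandra.
from bs4 import BeautifulSoup
from cassandra.cluster import Cluster
from pymongo import MongoClient

pages = MongoClient("mongodb://localhost:27017")["crawler"]["pages"]
session = Cluster(["127.0.0.1"]).connect("crawler")  # assumed keyspace

insert = session.prepare("INSERT INTO keywords (keyword, url) VALUES (?, ?)")

for doc in pages.find():
    text = BeautifulSoup(doc["html"], "html.parser").get_text()
    # Naive tokenizer: distinct lowercase words longer than three characters.
    for word in {w.lower() for w in text.split() if len(w) > 3}:
        session.execute(insert, (word, doc["url"]))
```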
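
Finally, the query interface reduces to a lookup against that table. This sketch assumes keyword is the table's partition key, so an equality lookup is valid CQL:

```python
# Serve the URLs indexed under a given search term.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("crawler")

def urls_for(term):
    rows = session.execute(
        "SELECT url FROM keywords WHERE keyword = %s", (term.lower(),)
    )
    return [row.url for row in rows]

print(urls_for("cardiology"))  # e.g., every stored URL indexed under the term
```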

Additional information about the contents of each folder can be found in that folder's README file.
