mids-251-web-crawler

A repository for all code written for the MIDS-251 Final project by Chuck, Rama, Chris, and Andrew.

This web crawler is a closed-domain crawler for information on medical doctors and procedures. It provides simple query capabilities for users interested in finding websites related to medical search terms. The application operates in four steps (an illustrative sketch of each step follows the list):

  • Given a set of seed URLs, the crawler identifies all URLs referenced in each seed page and stores them. The crawler can also be configured to process each new URL recursively, to as many levels of recursion as are specified.
  • Download the HTML from each valid URL produced by the previous step and store the content in a MongoDB NoSQL document store.
  • Parse the HTML documents stored in MongoDB to extract meaningful keywords and store the results in Cassandra, a NoSQL database.
  • Through a simple query interface, extract and serve the URLs relevant to a keyword search.
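
A minimal sketch of the first step, assuming a Python implementation built on requests and BeautifulSoup (the repository's actual code may differ); discover_urls, depth, and seen are illustrative names, not the project's API:

```python
# Depth-limited link discovery: collect every URL referenced by a seed
# page, then optionally recurse into each newly found URL.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def discover_urls(seed_url, depth=0, seen=None):
    """Return the set of URLs referenced by seed_url, recursing `depth` levels."""
    if seen is None:
        seen = set()
    try:
        html = requests.get(seed_url, timeout=10).text
    except requests.RequestException:
        return seen  # skip unreachable pages
    for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        url = urljoin(seed_url, anchor["href"])  # resolve relative links
        if url in seen:
            continue
        seen.add(url)
        if depth > 0:
            discover_urls(url, depth - 1, seen)  # recurse one level deeper
    return seen
```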
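
The download-and-store step might look like the following sketch, assuming a local MongoDB instance; the "crawler" database and "pages" collection names are assumptions:

```python
# Fetch a page and persist its raw HTML as one MongoDB document.
import requests
from pymongo import MongoClient

pages = MongoClient("mongodb://localhost:27017")["crawler"]["pages"]

def store_page(url):
    response = requests.get(url, timeout=10)
    if response.ok:
        # Upsert so re-crawling a URL refreshes its stored HTML.
        pages.replace_one(
            {"url": url},
            {"url": url, "html": response.text},
            upsert=True,
        )
```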
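
For the keyword-extraction step, a sketch along these lines reads each stored document, strips the markup, and writes (keyword, url) pairs to Cassandra; the "crawler" keyspace, the "keywords" table, and the naive tokenizer are all assumptions rather than the project's actual schema:

```python
# Extract keywords from stored HTML and index them in Cassandra.
from bs4 import BeautifulSoup
from cassandra.cluster import Cluster
from pymongo import MongoClient

pages = MongoClient("mongodb://localhost:27017")["crawler"]["pages"]
session = Cluster(["127.0.0.1"]).connect("crawler")  # assumed keyspace

insert = session.prepare("INSERT INTO keywords (keyword, url) VALUES (?, ?)")

for doc in pages.find():
    text = BeautifulSoup(doc["html"], "html.parser").get_text()
    # Naive tokenizer: distinct lowercase words longer than three characters.
    for word in {w.lower() for w in text.split() if len(w) > 3}:
        session.execute(insert, (word, doc["url"]))
```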
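
Finally, the query interface reduces to a lookup against that table. This sketch assumes keyword is the table's partition key, so an equality lookup is valid CQL:

```python
# Serve the URLs indexed under a given search term.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("crawler")

def urls_for(term):
    rows = session.execute(
        "SELECT url FROM keywords WHERE keyword = %s", (term.lower(),)
    )
    return [row.url for row in rows]

print(urls_for("cardiology"))  # e.g., every stored URL indexed under the term
```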

Additional information about the contents of each folder can be found in that folder's README file.
