Autonomous web text crawling (googling) for big data (natural language processing)
For a given string (e.g., "apple"), this code (1) searches Google for the string, (2) retrieves the resulting HTML pages, (3) extracts the visible text from each page, and (4) compresses all of the text into a zip file.
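Step (4) can be sketched with the standard library's zipfile module. This is a minimal illustration, not the repository's actual code; the function name zip_texts and the page_000.txt naming scheme are assumptions.

```python
import zipfile


def zip_texts(texts, zip_path):
    """Store each extracted text as page_000.txt, page_001.txt, ... in one zip.

    `texts` is an iterable of strings (one per crawled page).
    The file naming scheme is hypothetical, chosen only for this sketch.
    """
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for i, text in enumerate(texts):
            # writestr encodes str data as UTF-8 before writing it.
            zf.writestr(f"page_{i:03d}.txt", text)
    return zip_path


if __name__ == "__main__":
    zip_texts(["text of the first page", "text of the second page"], "apple.zip")
```

Writing the texts straight into the archive avoids leaving one temporary .txt file per page on disk.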
To run, you'll need a key.json file, which this repository does not include. It should follow the format below, with your own values in place of the placeholders. Both values are required by Google.
{
"api_key": "your-google-api-key",
"cse_id": "your-cse-id"
}
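Reading the file might look like the sketch below. The helper name load_keys is an assumption for illustration; it only checks that both fields shown above are present and non-empty.

```python
import json

REQUIRED_KEYS = ("api_key", "cse_id")


def load_keys(path="key.json"):
    """Return (api_key, cse_id) from key.json, failing early if one is missing."""
    with open(path, encoding="utf-8") as f:
        keys = json.load(f)
    missing = [k for k in REQUIRED_KEYS if not keys.get(k)]
    if missing:
        raise ValueError(f"{path} is missing: {', '.join(missing)}")
    return keys["api_key"], keys["cse_id"]
```

Failing early with a clear message is a small convenience: a missing key would otherwise only surface later as an error from the Google API.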
I referred to http://stackoverflow.com/questions/37083058/programmatically-searching-google-in-python-using-custom-search.
pip install google-api-python-client
pip install html2text
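As a rough sketch of how these two packages could cover steps (1) and (3): the first function queries the Custom Search API and the second converts HTML to plain text. The function names google_links and html_to_text, and the optional service parameter (there only so the search can be exercised without network access), are assumptions and not taken from this repository.

```python
def google_links(query, api_key, cse_id, num=10, service=None):
    """Return the result URLs for `query` from a Custom Search engine.

    `num` is the number of results per request (the API allows at most 10).
    """
    if service is None:
        # Imported lazily so this module loads even without the package.
        from googleapiclient.discovery import build
        service = build("customsearch", "v1", developerKey=api_key)
    response = service.cse().list(q=query, cx=cse_id, num=num).execute()
    # "items" is absent from the response when there are no results.
    return [item["link"] for item in response.get("items", [])]


def html_to_text(html):
    """Convert an HTML string to readable text, dropping links and images."""
    import html2text
    converter = html2text.HTML2Text()
    converter.ignore_links = True
    converter.ignore_images = True
    return converter.handle(html)
```

Note that html2text produces Markdown-flavoured text (for example, bold text is wrapped in asterisks), so some light cleanup may still be wanted before using the output for language processing.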
If there is a simpler or easier way to do this, please enlighten me.
** This is still under construction.