-
Notifications
You must be signed in to change notification settings - Fork 1
CommonCrawl Hadoop Support Library
abayer/commoncrawl
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
The start of the shared commoncrawl code repository. Please set hadoop.version and hadoop.path in build.properties to point to your version of hadoop. Once commoncrawl.jar has been built, you can execute a job/sample via the bin/launcher.sh script. For example, to run the BasicArcFileReaderSample against the ARC file 2010/01/07/18/1262876244253_18.arc.gz in the main commoncrawl bucket, commoncrawl-crawl-002, you would run the following command line: bin/launcher.sh org.commoncrawl.samples.BasicArcFileReaderSample <<AWS ACCESS KEY>> <<AWS SECRET KEY>> commoncrawl-crawl-002 2010/01/07/18/1262876244253_18.arc.gz The luancher runs the command in the background. You can monitor progress via either ./logs/<<ClassName>>.log for LOG output, or ./logs/<<ClassName>>_run.log for stdout output.
About
CommonCrawl Hadoop Support Library
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published