- Elastic MapReduce
- Brew
- Amazon
- CDH and Hortonworks
- MortarData and TreasureData
The examples in this book use a set of freely-redistributable datasets, converted to simple standard formats, with traceable provenance and documented schemas. The datasets are:
- Wikipedia English-language Article Corpus (`wikipedia_corpus`; 38 GB, 619 million records, 4 billion tokens): the full text of every English-language Wikipedia article.
- Wikipedia Pagelink Graph (`wikipedia_pagelinks`): every page-to-page hyperlink in Wikipedia.
- Wikipedia Pageview Stats (`wikipedia_pageviews`; 2.3 TB, about 250 billion records (FIXME: verify num records)): hour-by-hour pageviews for all of Wikipedia.
- ASA SC/SG Data Expo Airline Flights (`airline_flights`; 12 GB, 120 million records): every US airline flight from 1987-2008, with information on arrival/departure times and delay causes, and accompanying data on airlines, airports and airplanes.
- NCDC Hourly Global Weather Measurements, 1929-2009 (`ncdc_weather_hourly`; 59 GB, XX billion records): hour-by-hour weather from the National Climatic Data Center for the entire globe, with reasonably dense spatial coverage back to the 1950s and in some cases coverage back to 1929.
- 1998 World Cup access logs (`access_logs/ita_world_cup_apachelogs`; 123 GB, 1.3 billion records): every request made to the 1998 World Cup Web site between April 30, 1998 and July 26, 1998, in Apache log format.
- 60,000 UFO Sightings
- Retrosheet Game Logs: every recorded baseball game back to the 1890s.
- Gutenberg Corpus
- Trendingtopics.org sample: a 150 GB sample of the data used to power trendingtopics.org, including a full three months of hourly page traffic statistics from Wikipedia (1/1/2011-3/31/2011).
- Twilio/Wigle.net Street Vector Data Set (geo): a database of mapped US street names and address ranges.
- 2008 TIGER/Line Shapefiles (125 GB; geo): a complete set of Census 2000 and current shapefiles for American states, counties, subdivisions, districts, places, and areas. The data is available as shapefiles suitable for use in GIS, along with their associated metadata. The official source of this data is the US Census Bureau, Geography Division.
This data set is from the [ASA Statistical Computing / Statistical Graphics](http://stat-computing.org/dataexpo/2009/the-data.html) section's 2009 contest, "Airline Flight Status — Airline On-Time Statistics and Delay Causes". The documentation below is largely adapted from that site.
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.
The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, taking up 1.6 gigabytes compressed and 12 gigabytes uncompressed.
The data comes originally from the DOT's [Research and Innovative Technology Administration (RITA)](http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp) group, where it is [described in detail](http://www.transtats.bts.gov/Fields.asp?Table_ID=236). You can download the original data there. The files here have derivable variables removed, are packaged in yearly chunks, and have been more heavily compressed than the originals.
Here are a few ideas to get you started exploring the data:
- When is the best time of day/day of week/time of year to fly to minimise delays? (For a starting point, see the sketch after this list.)
- Do older planes suffer more delays?
- How does the number of people flying between different locations change over time?
- How well does weather predict plane delays?
- Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
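For instance, here is a minimal sketch for the first question, in plain Ruby. It assumes the published Dataexpo field layout (CRSDepTime, the scheduled departure time in hhmm form, as field 6; DepDelay in minutes as field 16, counting from 1); verify those positions against the dataset documentation before trusting the output.

```ruby
# Average departure delay by scheduled hour of day, from one or more of
# the yearly dataexpo CSV files, e.g. `ruby delay_by_hour.rb 2007.csv`.
# Assumes CRSDepTime is field 6 and DepDelay is field 16 (1-indexed).
sums = Hash.new { |h, k| h[k] = [0, 0] }   # hour => [total delay, flight count]

ARGF.each_line do |line|
  fields = line.chomp.split(',')
  next if fields[0] == 'Year'              # skip the header row
  crs_dep, delay = fields[5], fields[15]   # 0-indexed here
  next if delay.nil? || delay == 'NA'      # cancelled or missing flights
  hour = crs_dep.to_i / 100                # hhmm -> hh
  sums[hour][0] += delay.to_i
  sums[hour][1] += 1
end

sums.sort.each do |hour, (total, count)|
  printf("%02d:00  %9d flights  avg delay %+6.1f min\n",
         hour, count, total.to_f / count)
end
```

A single year's file fits comfortably on one machine; once you want all twenty-one years at once, the same logic becomes a natural map/reduce job.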
- Openflights.org (ODbL-licensed): user-generated datasets on the world of air flight.
- Dataexpo (public domain): the core airline flights database, which includes
  - `dataexpo_airports.tsv` (original): info on about 3,400 US airports; slightly cleaner but less comprehensive than the Openflights.org data.
  - `dataexpo_airplanes.tsv` (original): info on about 5,030 US commercial airplanes, by tail number.
  - `dataexpo_airlines.tsv` (original): info on about 1,500 US airline carriers; slightly cleaner but less comprehensive than the Openflights.org data.
- Wikipedia.org (CC-BY-SA license): airport identifiers.

The airport datasets contain errors and conflicts; we've done some hand-curation and verification to reconcile them. The file `wikipedia_conflicting.tsv` shows where our patience wore out.
- US City Populations (20 GB; geo, stats)
- Retrosheet: MLB play-by-play, high detail, 1840-2011 (`ripd/www.retrosheet.org-2007/boxesetc/2006`)
- Retrosheet: MLB box scores, 1871-2011 (`ripd/www.retrosheet.org-2007/boxesetc/2006`)
The main Gutenberg collection is about 650 GB (as of October 2011), with nearly 2 million files, 60 languages, and dozens of different file formats.
- Gutenberg collection catalog as RDF
- rsync, repeatable: Mirroring How-To
- wget, zip files: Gutenberg corpus download instructions. From a helpful Stack Overflow thread, you can run `wget -r -w 2 -m 'http://www.gutenberg.org/robot/harvest?filetypes=txt&langs[]=en'`
- see also: ZIP file input format
- A working mirror invocation: `` cd /mnt/gutenberg/ ; mkdir -p logs/gutenberg ; nohup rsync -aviHS --max-size=10M --delete --delete-after --exclude=\*.{m4a,m4b,ogg,spx,tei,md5,db,mus,mid,rst,sib,xml,zip,pdf,jpg,jpeg,png,htm,html,mp3,rtf,wav,mov,mp4,avi,mpg,mpeg,doc,docx,xml,tex,ly,eps,iso,rar} --exclude={\*-h*,old,images} --exclude '*cache/generated' {[email protected]::gutenberg,/mnt/gutenberg/ftp.ibiblio.org}/2 >> /mnt/gutenberg/logs/gutenberg/rsync-gutenberg-`datename`-part_2.log 2>&1 & ``
- [US Nuclear Regulatory Commission event reports](http://www.nrc.gov/reading-rm/doc-collections/event-status/event/): a collection of "events" (i.e., things that went wrong). Can you find features and predictive patterns of oopsies?
| character | meaning |
|---|---|
| `.` | any character |
| `\w` | any word character: a-z, A-Z, 0-9 or `_` |
| `\s` | any whitespace, whether space, tab (`\t`), newline (`\n`) or carriage return (`\r`) |
| `\xNN` | the character with that hexadecimal encoding |
| `\b` | word boundary (zero-width) |
| `^` | start of line; use `\A` for start of string |
| `$` | end of line; use `\z` for end of string |
| `[abc]` | match character in set |
| `[^abc]` | reject characters in set |
| `a\|b\|c` | a or b or c |
| `(...)` | group |
| `(?:...)` | non-capturing group |
| `(?<name>...)` | named group |
| `*`, `+` | zero or more, one or more; greedy (captures the longest possible match) |
| `*?`, `+?` | non-greedy zero-or-more, non-greedy one-or-more |
| `{n,m}` | repeats: at least n, at most m times |
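A few of these in action, in Ruby (the sample string is invented for the demo):

```ruby
line = 'GET /weather/2009-07-26 HTTP/1.1'

# named groups: (?<name>...) captures under a name instead of a number
if (m = line.match(/(?<verb>GET|POST)\s+(?<path>\S+)/))
  puts m[:verb]  #=> "GET"
  puts m[:path]  #=> "/weather/2009-07-26"
end

# greedy vs. non-greedy: .* grabs as much as it can, .*? as little
puts '<a><b>'.match(/<(.*)>/)[1]   #=> "a><b"
puts '<a><b>'.match(/<(.*?)>/)[1]  #=> "a"
```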
This shows the sort order of the simple (7-bit) ASCII characters, along with the primary regular expression escapes they match.
| Name | Hex | Representation | Word | Whitespace | Control |
|---|---|---|---|---|---|
| Control Characters I | 00-08 | | | | `\c` |
| Whitespace | 09-0D | `\t` `\n` `\v` `\f` `\r` | | `\s` | `\c` |
| Control Characters II | 0E-1F | | | | `\c` |
| Space | 20 | ` ` | | `\s` | |
| Punctuation I | 21-2F | `! " # $ % & ' ( ) * + , - . /` | | | |
| Numerals | 30-39 | `0` .. `9` | `\w` | | |
| Punctuation II | 3A-40 | `: ; < = > ? @` | | | |
| Uppercase | 41-5A | `A` .. `Z` | `\w` | | |
| Punctuation III | 5B-5E | `[ \ ] ^` | | | |
| Underscore | 5F | `_` | `\w` | | |
| Punctuation IV | 60 | `` ` `` | | | |
| Lowercase | 61-7A | `a` .. `z` | `\w` | | |
| Punctuation V | 7B-7E | `{ \| } ~` | | | |
| Control Characters III | 7F | DEL | | | `\c` |
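If you'd like to check the Word/Whitespace columns yourself, this short Ruby loop prints which classes each printable ASCII character matches:

```ruby
# For each printable 7-bit ASCII character, show which regexp
# character classes it belongs to.
(0x20..0x7e).each do |code|
  ch      = code.chr
  classes = []
  classes << '\w' if ch =~ /\w/
  classes << '\s' if ch =~ /\s/
  printf("0x%02x %s  %s\n", code, ch.inspect, classes.join(' '))
end
```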
These example regular expressions are for practical extraction (identifying potential examples), not validation (ensuring correctness). They may let nitpicks through that oughtn't: a time zone of `-0000` is illegal by the spec, but will pass the date regexp given below. As always, modify them in your actual code to be as restrictively brittle as reasonable.
| intent | Regular Expression | Comment |
|---|---|---|
| Double-quoted string | `%r{"((?:\\.\|[^\"])*)"}` | any backslash-escaped character, or non-quote characters, up to the first unescaped quote |
| Decimal number with sign | `%r{[\+\-]?\d+\.\d+}` | optional sign; digits-dot-digits |
| Floating-point number | `%r{[\+\-]?\d+\.\d+(?:[eE][\+\-]?\d+)?}` | optional sign; digits-dot-digits; optional exponent |
| ISO date | `%r{\b(\d\d\d\d)-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)([\+\-]\d\d:?\d\d\|[\+\-]\d\d\|Z)\b}` | Capture groups are the year, month, day, hour, minute, second and time zone respectively. |
| URL | | |
| Email Address | | |
| Emoticon | (see sample code) | |
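For instance, to pull timestamps out of log lines with the ISO date pattern above (a sketch reading from standard input):

```ruby
# Extract ISO-8601 timestamps from lines on STDIN using the date
# regexp from the table above; emits one tab-separated row per match.
ISO_DATE_RE = %r{\b(\d\d\d\d)-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)([\+\-]\d\d:?\d\d|[\+\-]\d\d|Z)\b}

$stdin.each_line do |line|
  line.scan(ISO_DATE_RE) do |year, month, day, hour, min, sec, tz|
    puts [year, month, day, hour, min, sec, tz].join("\t")
  end
end
```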
| action | command |
|---|---|
| list files | `hadoop fs -ls <path>` |
| list files' disk usage | `hadoop fs -du <path>` |
| total HDFS usage/available | visit the namenode console |
| copy local → HDFS | `hadoop fs -put <localsrc> <dst>` |
| copy HDFS → local | `hadoop fs -get <src> <localdst>` |
| copy HDFS → remote HDFS | `hadoop distcp <src> <dst>` |
| make a directory | `hadoop fs -mkdir <path>` |
| move/rename | `hadoop fs -mv <src> <dst>` |
| dump file to console | `hadoop fs -cat <file>` |
| remove a file | `hadoop fs -rm <file>` |
| remove a directory tree | `hadoop fs -rmr <path>` |
| remove a file, skipping Trash | `hadoop fs -rm -skipTrash <file>` |
| empty the trash NOW | `hadoop fs -expunge` |
| health check of HDFS | `hadoop fsck /` |
| report block usage of files | `hadoop fsck <path> -files -blocks -locations` |
| decommission nodes | list them in the excludes file, then run `hadoop dfsadmin -refreshNodes` |
| list running jobs | `hadoop job -list` |
| kill a job | `hadoop job -kill <job_id>` |
| kill a task attempt | `hadoop job -kill-task <task_attempt_id>` |
| action | command |
|---|---|
| CPU usage by process | `htop` |
| Disk activity | `iostat -x 5` |
| Network activity | `ifstat 5` |
| tuning | `csshX`, `htop`, `dstat`, `ulimit` |
| also useful | `cat`, `echo`, `true`, `false`, `yes`, `tee`, `time`, `watch` |
| dos-to-unix line endings | `tr -d '\r' < file.dos > file.unix` |
=== Unix Filesystem Commands ===

| action | command | flags |
|---|---|---|
| Sort data | `sort` | reverse the sort: `-r`; sort numerically: `-n`; sort on a field: `-t [delimiter] -k [index]` |
| Sort a large amount of data | `sort --parallel=4 -S 500M` | use four cores and a 500-megabyte sort buffer |
| Cut delimited field | `cut -f 1,3-7 -d ','` | emit comma-separated fields one and three through seven |
| Cut range of characters | `cut -c 1,3-7` | emit characters one and three through seven |
| Split on spaces | `ruby -ne 'puts $_.split(/\s+/).join("\t")'` | split on continuous runs of whitespace, re-emit as tab-separated |
| Distinct fields | `sort \| uniq` | only dupes: `-d` |
| Quickie histogram | `sort \| uniq -c` | |
| Per-process usage | `htop` | may need to be installed first |
| Running system usage | `dstat -drnycmf -t 5` | 5-second rolling system stats; you will likely have to install dstat yourself. If that's not an option, use `iostat -x 5 & sleep 3 ; ifstat 5` for an interleaved 5-second running average |
For example, `cat * | cut -c 1-4 | sort | uniq -c` takes the first four characters of each line and produces a histogram of those prefixes. Not all commands are available on all platforms; OS X users should use Homebrew, Windows users should use Cygwin.

=== Pig Operators ===

| action | operator |
|---|---|
| JOIN | `joined = JOIN a BY key, b BY key;` |
| FILTER | `matching = FILTER recs BY condition;` |
=== Hadoop Tunables Cheatsheet ===

=== SQL-to-Pig-to-Hive Cheatsheet ===

* SELECT..WHERE
* SELECT..LIMIT
* GROUP BY…HAVING
* SELECT WHERE… ORDER BY
* SELECT WHERE… SORT BY (just use reducer sort) ~~ (does the reducer in Pig guarantee this?)
* SELECT … DISTRIBUTE BY … SORT BY …
* SELECT … CLUSTER BY (equivalent of DISTRIBUTE BY X SORT BY X)
* Indexing tips
* CASE…WHEN…THEN
* Block Sampling / Input pruning
* SELECT country_name, indicator_name, …
* The general shape of a SQL query: `SELECT columns or computations FROM table WHERE condition GROUP BY columns HAVING condition ORDER BY column [ASC | DESC]`
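To make the GROUP BY…HAVING line concrete, here is the same query sketched in both SQL and Pig. The `flights` table and its `carrier` and `dep_delay` fields are hypothetical stand-ins, not part of the datasets above.

```sql
-- carriers with more than 1,000 flights, with their average departure delay
SELECT carrier, AVG(dep_delay) AS avg_delay, COUNT(*) AS ct
  FROM flights
 GROUP BY carrier
HAVING COUNT(*) > 1000;
```

```pig
-- the HAVING clause becomes a FILTER applied after the GROUP/aggregate
grouped  = GROUP flights BY carrier;
stats    = FOREACH grouped GENERATE
             group                  AS carrier,
             AVG(flights.dep_delay) AS avg_delay,
             COUNT(flights)         AS ct;
big_ones = FILTER stats BY ct > 1000;
```

In Hive the SQL form works nearly verbatim; the DISTRIBUTE BY / SORT BY / CLUSTER BY lines above only matter when you need to control which reducer sees which keys.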