- Elastic MapReduce
- Brew
- Amazon
- CDH and Hortonworks
- MortarData and TreasureData
The examples in this book use a set of freely-redistributable datasets, converted to simple standard formats, with traceable provenance and documented schemas. The datasets are:
- Wikipedia English-language Article Corpus (`wikipedia_corpus`; 38 GB, 619 million records, 4 billion tokens): the full text of every English-language Wikipedia article.
- Wikipedia Pagelink Graph (`wikipedia_pagelinks`): every page-to-page hyperlink in Wikipedia.
- Wikipedia Pageview Stats (`wikipedia_pageviews`; 2.3 TB, about 250 billion records (FIXME: verify num records)): hour-by-hour pageviews for all of Wikipedia.
- ASA SC/SG Data Expo Airline Flights (`airline_flights`; 12 GB, 120 million records): every US airline flight from 1987-2008, with information on arrival/departure times and delay causes, and accompanying data on airlines, airports and airplanes.
- NCDC Hourly Global Weather Measurements, 1929-2009 (`ncdc_weather_hourly`; 59 GB, XX billion records): hour-by-hour weather from the National Climatic Data Center for the entire globe, with reasonably dense spatial coverage back to the 1950s and in some cases coverage back to 1929.
- 1998 World Cup access logs (`access_logs/ita_world_cup_apachelogs`; 123 GB, 1.3 billion records): every request made to the 1998 World Cup Web site between April 30, 1998 and July 26, 1998, in Apache log format.
- 60,000 UFO Sightings
- Retrosheet Game Logs: every recorded baseball game back to the 1890s.
- Gutenberg Corpus
- Trendingtopics.org sample: a 150 GB sample of the data used to power trendingtopics.org, including a full three months of hourly page traffic statistics from Wikipedia (1/1/2011-3/31/2011).
- Twilio/Wigle.net Street Vector Data Set (geo): a database of mapped US street names and address ranges.
- 2008 TIGER/Line Shapefiles (125 GB; geo): a complete set of Census 2000 and current shapefiles for American states, counties, subdivisions, districts, places, and areas. The data is available as shapefiles suitable for use in GIS, along with their associated metadata. The official source of this data is the US Census Bureau, Geography Division.
This data set is from the [ASA Statistical Computing / Statistical Graphics](http://stat-computing.org/dataexpo/2009/the-data.html) section's 2009 contest, "Airline Flight Status — Airline On-Time Statistics and Delay Causes". The documentation below is largely adapted from that site.
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.
The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, taking up 1.6 gigabytes compressed and 12 gigabytes uncompressed.
The data comes originally from the DOT's [Research and Innovative Technology Administration (RITA)](http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp) group, where it is [described in detail](http://www.transtats.bts.gov/Fields.asp?Table_ID=236). You can download the original data there. The files here have derivable variables removed, are packaged in yearly chunks, and have been more heavily compressed than the originals.
Here are a few ideas to get you started exploring the data:
- When is the best time of day/day of week/time of year to fly to minimise delays? (For a starting point, see the sketch after this list.)
- Do older planes suffer more delays?
- How does the number of people flying between different locations change over time?
- How well does weather predict plane delays?
- Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
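For instance, here is a minimal sketch for the first question, in plain Ruby. It assumes the published Dataexpo field layout (CRSDepTime, the scheduled departure time in hhmm form, as field 6; DepDelay in minutes as field 16, counting from 1); verify those positions against the dataset documentation before trusting the output.

```ruby
# Average departure delay by scheduled hour of day, from one or more of
# the yearly dataexpo CSV files, e.g. `ruby delay_by_hour.rb 2007.csv`.
# Assumes CRSDepTime is field 6 and DepDelay is field 16 (1-indexed).
sums = Hash.new { |h, k| h[k] = [0, 0] }   # hour => [total delay, flight count]

ARGF.each_line do |line|
  fields = line.chomp.split(',')
  next if fields[0] == 'Year'              # skip the header row
  crs_dep, delay = fields[5], fields[15]   # 0-indexed here
  next if delay.nil? || delay == 'NA'      # cancelled or missing flights
  hour = crs_dep.to_i / 100                # hhmm -> hh
  sums[hour][0] += delay.to_i
  sums[hour][1] += 1
end

sums.sort.each do |hour, (total, count)|
  printf("%02d:00  %9d flights  avg delay %+6.1f min\n",
         hour, count, total.to_f / count)
end
```

A single year's file fits comfortably on one machine; once you want all twenty-one years at once, the same logic becomes a natural map/reduce job.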
- Openflights.org (ODbL-licensed): user-generated datasets on the world of air flight.
- Dataexpo (public domain): the core airline flights database, which includes
  - `dataexpo_airports.tsv` (original): info on about 3,400 US airports; slightly cleaner but less comprehensive than the Openflights.org data.
  - `dataexpo_airplanes.tsv` (original): info on about 5,030 US commercial airplanes, by tail number.
  - `dataexpo_airlines.tsv` (original): info on about 1,500 US airline carriers; slightly cleaner but less comprehensive than the Openflights.org data.
- Wikipedia.org (CC-BY-SA license): airport identifiers.

The airport datasets contain errors and conflicts; we've done some hand-curation and verification to reconcile them. The file `wikipedia_conflicting.tsv` shows where our patience wore out.
- US City Populations (20 GB; geo, stats)
- Retrosheet: MLB play-by-play, high detail, 1840-2011 (`ripd/www.retrosheet.org-2007/boxesetc/2006`)
- Retrosheet: MLB box scores, 1871-2011 (`ripd/www.retrosheet.org-2007/boxesetc/2006`)
The main Gutenberg collection is about 650 GB (as of October 2011), with nearly 2 million files, 60 languages, and dozens of different file formats.
- Gutenberg collection catalog as RDF
- rsync, repeatable: Mirroring How-To
- wget, zip files: Gutenberg corpus download instructions. From a helpful Stack Overflow thread, you can run `wget -r -w 2 -m 'http://www.gutenberg.org/robot/harvest?filetypes=txt&langs[]=en'`
- see also: ZIP file input format
- A working mirror invocation: `` cd /mnt/gutenberg/ ; mkdir -p logs/gutenberg ; nohup rsync -aviHS --max-size=10M --delete --delete-after --exclude=\*.{m4a,m4b,ogg,spx,tei,md5,db,mus,mid,rst,sib,xml,zip,pdf,jpg,jpeg,png,htm,html,mp3,rtf,wav,mov,mp4,avi,mpg,mpeg,doc,docx,xml,tex,ly,eps,iso,rar} --exclude={\*-h*,old,images} --exclude '*cache/generated' {[email protected]::gutenberg,/mnt/gutenberg/ftp.ibiblio.org}/2 >> /mnt/gutenberg/logs/gutenberg/rsync-gutenberg-`datename`-part_2.log 2>&1 & ``
- [US Nuclear Regulatory Commission event reports](http://www.nrc.gov/reading-rm/doc-collections/event-status/event/): a collection of "events" (i.e., things that went wrong). Can you find features and predictive patterns of oopsies?
| character | meaning |
|---|---|
| `.` | any character |
| `\w` | any word character: a-z, A-Z, 0-9 or `_` |
| `\s` | any whitespace, whether space, tab (`\t`), newline (`\n`) or carriage return (`\r`) |
| `\xNN` | the character with that hexadecimal encoding |
| `\b` | word boundary (zero-width) |
| `^` | start of line; use `\A` for start of string |
| `$` | end of line; use `\z` for end of string |
| `[abc]` | match character in set |
| `[^abc]` | reject characters in set |
| `a\|b\|c` | a or b or c |
| `(...)` | group |
| `(?:...)` | non-capturing group |
| `(?<name>...)` | named group |
| `*`, `+` | zero or more, one or more; greedy (captures the longest possible match) |
| `*?`, `+?` | non-greedy zero-or-more, non-greedy one-or-more |
| `{n,m}` | repeats: at least n, at most m times |
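A few of these in action, in Ruby (the sample string is invented for the demo):

```ruby
line = 'GET /weather/2009-07-26 HTTP/1.1'

# named groups: (?<name>...) captures under a name instead of a number
if (m = line.match(/(?<verb>GET|POST)\s+(?<path>\S+)/))
  puts m[:verb]  #=> "GET"
  puts m[:path]  #=> "/weather/2009-07-26"
end

# greedy vs. non-greedy: .* grabs as much as it can, .*? as little
puts '<a><b>'.match(/<(.*)>/)[1]   #=> "a><b"
puts '<a><b>'.match(/<(.*?)>/)[1]  #=> "a"
```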
This shows the sort order of the simple (7-bit) ASCII characters, along with the primary regular expression escapes they match.
| Name | Hex | Representation | Word | Whitespace | Control |
|---|---|---|---|---|---|
| Control Characters I | 00-08 | | | | `\c` |
| Whitespace | 09-0D | `\t` `\n` `\v` `\f` `\r` | | `\s` | `\c` |
| Control Characters II | 0E-1F | | | | `\c` |
| Space | 20 | ` ` | | `\s` | |
| Punctuation I | 21-2F | `! " # $ % & ' ( ) * + , - . /` | | | |
| Numerals | 30-39 | `0` .. `9` | `\w` | | |
| Punctuation II | 3A-40 | `: ; < = > ? @` | | | |
| Uppercase | 41-5A | `A` .. `Z` | `\w` | | |
| Punctuation III | 5B-5E | `[ \ ] ^` | | | |
| Underscore | 5F | `_` | `\w` | | |
| Punctuation IV | 60 | `` ` `` | | | |
| Lowercase | 61-7A | `a` .. `z` | `\w` | | |
| Punctuation V | 7B-7E | `{ \| } ~` | | | |
| Control Characters III | 7F | DEL | | | `\c` |
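If you'd like to check the Word/Whitespace columns yourself, this short Ruby loop prints which classes each printable ASCII character matches:

```ruby
# For each printable 7-bit ASCII character, show which regexp
# character classes it belongs to.
(0x20..0x7e).each do |code|
  ch      = code.chr
  classes = []
  classes << '\w' if ch =~ /\w/
  classes << '\s' if ch =~ /\s/
  printf("0x%02x %s  %s\n", code, ch.inspect, classes.join(' '))
end
```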
These example regular expressions are for practical extraction (identifying potential examples), not validation (ensuring correctness). They may let nitpicks through that oughtn't: a time zone of `-0000` is illegal by the spec, but will pass the date regexp given below. As always, modify them in your actual code to be as restrictively brittle as reasonable.
| intent | Regular Expression | Comment |
|---|---|---|
| Double-quoted string | `%r{"((?:\\.\|[^\"])*)"}` | any backslash-escaped character, or non-quote characters, up to the first unescaped quote |
| Decimal number with sign | `%r{[\+\-]?\d+\.\d+}` | optional sign; digits-dot-digits |
| Floating-point number | `%r{[\+\-]?\d+\.\d+(?:[eE][\+\-]?\d+)?}` | optional sign; digits-dot-digits; optional exponent |
| ISO date | `%r{\b(\d\d\d\d)-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)([\+\-]\d\d:?\d\d\|[\+\-]\d\d\|Z)\b}` | Capture groups are the year, month, day, hour, minute, second and time zone respectively. |
| URL | | |
| Email Address | | |
| Emoticon | (see sample code) | |
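For instance, to pull timestamps out of log lines with the ISO date pattern above (a sketch reading from standard input):

```ruby
# Extract ISO-8601 timestamps from lines on STDIN using the date
# regexp from the table above; emits one tab-separated row per match.
ISO_DATE_RE = %r{\b(\d\d\d\d)-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)([\+\-]\d\d:?\d\d|[\+\-]\d\d|Z)\b}

$stdin.each_line do |line|
  line.scan(ISO_DATE_RE) do |year, month, day, hour, min, sec, tz|
    puts [year, month, day, hour, min, sec, tz].join("\t")
  end
end
```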
| action | command |
|---|---|
| list files | `hadoop fs -ls <path>` |
| list files' disk usage | `hadoop fs -du <path>` |
| total HDFS usage/available | visit the namenode console |
| copy local → HDFS | `hadoop fs -put <localsrc> <dst>` |
| copy HDFS → local | `hadoop fs -get <src> <localdst>` |
| copy HDFS → remote HDFS | `hadoop distcp <src> <dst>` |
| make a directory | `hadoop fs -mkdir <path>` |
| move/rename | `hadoop fs -mv <src> <dst>` |
| dump file to console | `hadoop fs -cat <file>` |
| remove a file | `hadoop fs -rm <file>` |
| remove a directory tree | `hadoop fs -rmr <path>` |
| remove a file, skipping Trash | `hadoop fs -rm -skipTrash <file>` |
| empty the trash NOW | `hadoop fs -expunge` |
| health check of HDFS | `hadoop fsck /` |
| report block usage of files | `hadoop fsck <path> -files -blocks -locations` |
| decommission nodes | list them in the excludes file, then run `hadoop dfsadmin -refreshNodes` |
| list running jobs | `hadoop job -list` |
| kill a job | `hadoop job -kill <job_id>` |
| kill a task attempt | `hadoop job -kill-task <task_attempt_id>` |
| action | command |
|---|---|
| CPU usage by process | `htop` |
| Disk activity | `iostat -x 5` |
| Network activity | `ifstat 5` |
| tuning | `csshX`, `htop`, `dstat`, `ulimit` |
| also useful | `cat`, `echo`, `true`, `false`, `yes`, `tee`, `time`, `watch` |
| dos-to-unix line endings | `tr -d '\r' < file.dos > file.unix` |
=== Unix Filesystem Commands ===

| action | command | flags |
|---|---|---|
| Sort data | `sort` | reverse the sort: `-r`; sort numerically: `-n`; sort on a field: `-t [delimiter] -k [index]` |
| Sort a large amount of data | `sort --parallel=4 -S 500M` | use four cores and a 500-megabyte sort buffer |
| Cut delimited field | `cut -f 1,3-7 -d ','` | emit comma-separated fields one and three through seven |
| Cut range of characters | `cut -c 1,3-7` | emit characters one and three through seven |
| Split on spaces | `ruby -ne 'puts $_.split(/\s+/).join("\t")'` | split on continuous runs of whitespace, re-emit as tab-separated |
| Distinct fields | `sort \| uniq` | only dupes: `-d` |
| Quickie histogram | `sort \| uniq -c` | |
| Per-process usage | `htop` | may need to be installed first |
| Running system usage | `dstat -drnycmf -t 5` | 5-second rolling system stats; you will likely have to install dstat yourself. If that's not an option, use `iostat -x 5 & sleep 3 ; ifstat 5` for an interleaved 5-second running average |
For example, `cat * | cut -c 1-4 | sort | uniq -c` takes the first four characters of each line and produces a histogram of those prefixes. Not all commands are available on all platforms; OS X users should use Homebrew, Windows users should use Cygwin.

=== Pig Operators ===

| action | operator |
|---|---|
| JOIN | `joined = JOIN a BY key, b BY key;` |
| FILTER | `matching = FILTER recs BY condition;` |
=== Hadoop Tunables Cheatsheet ===

=== SQL-to-Pig-to-Hive Cheatsheet ===

* SELECT..WHERE
* SELECT..LIMIT
* GROUP BY…HAVING
* SELECT WHERE… ORDER BY
* SELECT WHERE… SORT BY (just use reducer sort) ~~ (does the reducer in Pig guarantee this?)
* SELECT … DISTRIBUTE BY … SORT BY …
* SELECT … CLUSTER BY (equivalent of DISTRIBUTE BY X SORT BY X)
* Indexing tips
* CASE…WHEN…THEN
* Block Sampling / Input pruning
* SELECT country_name, indicator_name, …
* The general shape of a SQL query: `SELECT columns or computations FROM table WHERE condition GROUP BY columns HAVING condition ORDER BY column [ASC | DESC]`
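To make the GROUP BY…HAVING line concrete, here is the same query sketched in both SQL and Pig. The `flights` table and its `carrier` and `dep_delay` fields are hypothetical stand-ins, not part of the datasets above.

```sql
-- carriers with more than 1,000 flights, with their average departure delay
SELECT carrier, AVG(dep_delay) AS avg_delay, COUNT(*) AS ct
  FROM flights
 GROUP BY carrier
HAVING COUNT(*) > 1000;
```

```pig
-- the HAVING clause becomes a FILTER applied after the GROUP/aggregate
grouped  = GROUP flights BY carrier;
stats    = FOREACH grouped GENERATE
             group                  AS carrier,
             AVG(flights.dep_delay) AS avg_delay,
             COUNT(flights)         AS ct;
big_ones = FILTER stats BY ct > 1000;
```

In Hive the SQL form works nearly verbatim; the DISTRIBUTE BY / SORT BY / CLUSTER BY lines above only matter when you need to control which reducer sees which keys.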