
Appendixes

Appendix 1: Acquiring a Hadoop Cluster

  • Elastic MapReduce

  • Brew

  • Amazon

  • CDH and Hortonworks

  • Mortar Data and Treasure Data

Appendix 2: Overview of Datasets

The examples in this book use a set of freely-redistributable datasets, converted to simple standard formats, with traceable provenance and documented schema. The datasets are:

  • Wikipedia English-language Article Corpus (wikipedia_corpus; 38 GB, 619 million records, 4 billion tokens): the full text of every English-language Wikipedia article.

  • Wikipedia Pagelink Graph (wikipedia_pagelinks): every page-to-page hyperlink in Wikipedia.

  • Wikipedia Pageview Stats (wikipedia_pageviews; 2.3 TB, about 250 billion records (FIXME: verify num records)): hour-by-hour pageviews for all of Wikipedia.

  • ASA SC/SG Data Expo Airline Flights (airline_flights; 12 GB, 120 million records): every US airline flight from 1987-2008, with information on arrival/departure times and delay causes, and accompanying data on airlines, airports and airplanes.

  • NCDC Hourly Global Weather Measurements, 1929-2009 (ncdc_weather_hourly; 59 GB, XX billion records): hour-by-hour weather from the National Climatic Data Center for the entire globe, with reasonably dense spatial coverage back to the 1950s and in some cases coverage back to 1929.

  • 1998 World Cup access logs (access_logs/ita_world_cup_apachelogs; 123 GB, 1.3 billion records): every request made to the 1998 World Cup Web site between April 30, 1998 and July 26, 1998, in apache log format.

  • 60,000 UFO Sightings

  • Retrosheet Game Logs — every recorded baseball game back to the 1890s

  • Gutenberg Corpus

  • Wikipedia Page Traffic Statistics V3 — 150 GB — a sample of the data used to power trendingtopics.org. It includes a full 3 months of hourly page traffic statistics from Wikipedia (1/1/2011-3/31/2011).

  • Twilio/Wigle.net Street Vector Data Set — geo — Twilio/Wigle.net database of mapped US street names and address ranges.

  • 2008 TIGER/Line Shapefiles — 125 GB — geo — This data set is a complete set of Census 2000 and Current shapefiles for American states, counties, subdivisions, districts, places, and areas. The data is available as shapefiles suitable for use in GIS, along with their associated metadata. The official source of this data is the US Census Bureau, Geography Division.

ASA SC/SG Data Expo Airline Flights

This data set is from the http://stat-computing.org/dataexpo/2009/the-data.html[ASA Statistical Computing / Statistical Graphics] sections' 2009 Data Expo contest, "Airline Flight Status — Airline On-Time Statistics and Delay Causes". The documentation below is largely adapted from that site.

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on the BTS website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.

The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, taking up 1.6 gigabytes compressed and 12 gigabytes uncompressed.

The data comes originally from the DOT’s http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp[Research and Innovative Technology Administration (RITA)] group, where it is http://www.transtats.bts.gov/Fields.asp?Table_ID=236[described in detail]. You can download the original data there. The files here have derivable variables removed, are packaged in yearly chunks, and have been more heavily compressed than the originals.

Here are a few ideas to get you started exploring the data:

  • When is the best time of day/day of week/time of year to fly to minimise delays?

  • Do older planes suffer more delays?

  • How does the number of people flying between different locations change over time?

  • How well does weather predict plane delays?

  • Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
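
For instance, the first question above might start out as a Pig script along the following lines. This is only a sketch: the relation and field names (crs_dep_time, dep_delay, and so on) are assumptions about how you have loaded the flights table, not the dataset's actual headers.

[source,pig]
----
-- Sketch: average departure delay by scheduled hour of day.
-- Field names here (crs_dep_time, dep_delay) are assumed, not the dataset's real schema.
flights   = LOAD 'airline_flights' AS (flight_date:chararray, crs_dep_time:int, dep_delay:int);
by_hour   = GROUP flights BY (crs_dep_time / 100);   -- scheduled time is hhmm, so hh = value / 100
avg_delay = FOREACH by_hour GENERATE group AS hour, AVG(flights.dep_delay) AS avg_dep_delay;
ranked    = ORDER avg_delay BY avg_dep_delay ASC;
DUMP ranked;
----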

Support data
  • Openflights.org (ODBL-licensed): user-generated datasets on the world of air flight.

    • openflights_airports.tsv (original) — info on about 7000 airports.

    • openflights_airlines.tsv (original) — info on about 6000 airline carriers

    • openflights_routes.tsv (original) — info on about 60,000 routes between 3000 airports on 531 airlines.

  • Dataexpo (public domain): supplied along with the core airline flights data:

    • dataexpo_airports.tsv (original) — info on about 3400 US airports; slightly cleaner but less comprehensive than the Openflights.org data.

    • dataexpo_airplanes.tsv (original) — info on about 5030 US commercial airplanes by tail number.

    • dataexpo_airlines.tsv (original) — info on about 1500 US airline carriers; slightly cleaner but less comprehensive than the Openflights.org data.

  • Wikipedia.org (CC-BY-SA license): Airport identifiers

    • wikipedia_airports_iata.tsv (original) — user-generated dataset pairing airports with their IATA (and often ICAO and FAA) identifiers.

    • wikipedia_airports_icao.tsv (original) — user-generated dataset pairing airports with their ICAO (and often IATA and FAA) identifiers.

The airport datasets contain errors and conflicts; we’ve done some hand-curation and verification to reconcile them. The file wikipedia_conflicting.tsv shows where our patience wore out.

ITA World Cup Apache Logs

  • 1998 World Cup access logs (access_logs/ita_world_cup_apachelogs; 123 GB, 1.3 billion records): every request made to the 1998 World Cup Web site between April 30, 1998 and July 26, 1998, in apache log format.

Daily Global Weather Measurements, 1929-2009 (NCDC, GSOD)

20 GB — geo, stats

Retrosheet

Gutenberg corpus

The main collection is about 650GB (as of October 2011) with nearly 2 million files, 60 languages, and dozens of different file formats.

    # Mirror part 2 of the Gutenberg collection, skipping audio, image, archive and other non-text formats.
    # (`datename` is assumed to be a local alias that emits a datestamp for the log file's name.)
    cd /mnt/gutenberg/ ; mkdir -p logs/gutenberg
    nohup rsync -aviHS --max-size=10M --delete --delete-after --exclude=\*.{m4a,m4b,ogg,spx,tei,md5,db,mus,mid,rst,sib,xml,zip,pdf,jpg,jpeg,png,htm,html,mp3,rtf,wav,mov,mp4,avi,mpg,mpeg,doc,docx,xml,tex,ly,eps,iso,rar} --exclude={\*-h*,old,images} --exclude '*cache/generated' {[email protected]::gutenberg,/mnt/gutenberg/ftp.ibiblio.org}/2 >> /mnt/gutenberg/logs/gutenberg/rsync-gutenberg-`datename`-part_2.log 2>&1 &

Ideas for other datasets you could harvest

Appendix: Cheatsheets

Regular Expressions

TODO

.Regular Expression Cheatsheet
[options="header"]
|=======
| character                | meaning
| .                        | any character
| \w                       | any word character: a-z, A-Z, 0-9 or the _ underscore. Use [:word:] to match extended alphanumeric characters (accented characters and so forth)
| \s                       | any whitespace, whether space, tab (\t), newline (\n) or carriage return (\r)
| \d                       | any digit: 0-9
| \x42 (or any hex number) | the character with that hexadecimal encoding
| \b                       | word boundary (zero-width)
| ^                        | start of line; use \A for start of string (disregarding newlines) (zero-width)
| $                        | end of line; use \z for end of string (disregarding newlines) (zero-width)
| [a-mA-M]                 | match any character in the set
| [^a-mA-M]                | match any character not in the set
| a\|b\|c                  | a or b or c
| (…)                      | group
| (?:…)                    | non-capturing group
| (?<varname>…)            | named group
| *, +                     | zero or more, one or more; greedy (captures the longest possible match)
| *?, +?                   | non-greedy zero-or-more, non-greedy one-or-more
| {n,m}                    | repeats n or more, but m or fewer times
|=======

Simple Characters

This table shows the sort order of the simple (7-bit) ASCII characters, along with the primary regular-expression escapes that match them.

.Sort order of the simple (7-bit) ASCII characters
[options="header"]
|=======
| Name                    | Hex       | Representation     | Word | Whitespace | Control
| Control Characters I    | \x00-\x08 |                    |      |            | \c
| Whitespace              | \x09-\x0d | \t, \n, \v, \f, \r |      | \s         | \c
| Control Characters II   | \x0e-\x1f |                    |      |            | \c
| Space                   | \x20      | ` `                |      | \s         |
| Punctuation I           | \x21-\x2f | !"#$%&'()*+,-./    |      |            |
| Numerals                | \x30-\x39 | 0-9                | \w   |            |
| Punctuation II          | \x3a-\x40 | :;<=>?@            |      |            |
| Uppercase               | \x41-\x5a | A-Z                | \w   |            |
| Punctuation III         | \x5b-\x60 | [\]^_`             |      |            |
| Lowercase               | \x61-\x7a | a-z                | \w   |            |
| Punctuation IV          | \x7b-\x7e | {\|}~              |      |            |
| Control Characters III  | \x7f-\x80 |                    |      |            | \c
|=======

Useful Regular Expression Examples

These example regular expressions are intended for practical extraction (identifying likely examples), not validation (ensuring correctness). They will let some things through that oughtn’t pass: a time zone of -0000 is illegal by the spec, but will match the date regexp given below. As always, modify them in your actual code to be as restrictively brittle as is reasonable.

.Example Regular Expressions
[options="header"]
|=======
| intent                   | Regular Expression | Comment
| Double-quoted string     | %r{"((?:\\.\|[^\"])*)"} | any backslash-escaped character or non-quote character, repeated up to the first unescaped quote
| Decimal number with sign | %r{([+\-]?\d+\.\d+)} | optional sign; digits-dot-digits
| Floating-point number    | %r{([+\-]?\d+\.\d+(?:[eE][+\-]?\d+)?)} | optional sign; digits-dot-digits; optional exponent
| ISO date                 | %r{\b(\d\d\d\d)-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)([\+\-]\d\d:?\d\d\|[\+\-]\d\d\|Z)\b} | Capture groups are the year, month, day, hour, minute, second and time zone respectively.
| URL                      | |
| Email Address            | |
| Emoticon                 | (see sample code) |
|=======
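
As a quick, hedged illustration (separate from the sample code referenced above), a pattern like the ISO date regexp can be applied in Pig with the built-in REGEX_EXTRACT function, which takes Java-flavored regular expressions. The relation, file and field names below are made up for the example.

[source,pig]
----
-- Sketch: pull the year out of an ISO-8601 timestamp field.
-- 'events', 'some_events' and the field names are hypothetical;
-- REGEX_EXTRACT(str, regex, n) returns capture group n, or null if there is no match.
events = LOAD 'some_events' AS (event_time:chararray, payload:chararray);
years  = FOREACH events GENERATE
           REGEX_EXTRACT(event_time, '(\\d{4})-(\\d{2})-(\\d{2})T.*', 1) AS year;
DUMP years;
----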

Hadoop Filesystem Commands

.Hadoop Filesystem Commands
[options="header"]
|=======
| action                        | command
| list files                    | hadoop fs -ls
| list files' disk usage        | hadoop fs -du
| total HDFS usage/available    | visit namenode console
| copy local → HDFS             | hadoop fs -put ${LOCALFILE} ${DIR}
| copy HDFS → local             | hadoop fs -get ${FILE} ${LOCALDIR}
| copy HDFS → remote HDFS       | hadoop distcp ${SRC_URL} ${DEST_URL}
| make a directory              | hadoop fs -mkdir ${DIR}
| move/rename                   | hadoop fs -mv ${FILE} ${NEW_NAME}
| dump file to console          | hadoop fs -cat ${FILE} \| cut -c 1-10000 \| head -n 10000
| remove a file                 | hadoop fs -rm ${FILE}
| remove a directory tree       | hadoop fs -rmr ${DIR}
| remove a file, skipping Trash | hadoop fs -rm -skipTrash ${FILE}
| empty the trash NOW           | hadoop fs -expunge
| health check of HDFS          | hadoop fsck /
| report block usage of files   | hadoop fsck ${PATH} -files -blocks
| decommission nodes            | list them in the dfs.hosts.exclude file, then hadoop dfsadmin -refreshNodes
| list running jobs             | hadoop job -list
| kill a job                    | hadoop job -kill ${JOB_ID}
| kill a task attempt           | hadoop job -kill-task ${TASK_ATTEMPT_ID}
| CPU usage by process          | htop, or top if that’s not installed
| Disk activity                 | iostat -x 5
| Network activity              | ifstat 5
| lines matching a pattern      | grep -e '[regexp]'
| first/last lines of a file    | head, tail
| count lines, words and bytes  | wc
| count runs of identical lines | uniq -c
| sort numerically on field 2   | sort -n -k2
| tuning                        | csshX, htop, dstat, ulimit
| also useful                   | cat, echo, true, false, yes, tee, time, watch
| dos-to-unix line endings      | ruby -ne 'puts $_.gsub(/\r\n?/, "\n")'
|=======


=== Unix Filesystem Commands ===

.UNIX commandline tricks
[options="header"]
|=======
| action                    | command | Flags
| Sort data                 | sort | reverse the sort: -r; sort numerically: -n; sort on a field: -t [delimiter] -k [index]
| Sort large amount of data | sort --parallel=4 -S 500M | use four cores and a 500 megabyte sort buffer
| Cut delimited field       | cut -f 1,3-7 -d ',' | emit comma-separated fields one and three through seven
| Cut range of characters   | cut -c 1,3-7 | emit characters one and three through seven
| Split on spaces           | ruby -ne 'puts $_.split(/\s+/).join("\t")' | split on continuous runs of whitespace, re-emit as tab-separated
| Distinct fields           | sort \| uniq | only dupes: -d
| Quickie histogram         | sort \| uniq -c | TODO: check the rendering for backslash
| Per-process usage         | htop | Installed
| Running system usage      | dstat -drnycmf -t 5 | 5-second rolling system stats. You likely will have to install dstat yourself. If that’s not an option, use iostat -x 5 & sleep 3 ; ifstat 5 for an interleaved 5-second running average.
|=======

For example: `cat * | cut -c 1-4 | sort | uniq -c` cuts out the first four characters of each line and gives the count of each distinct value.

Not all commands are available on all platforms; OS X users should use Homebrew, Windows users should use Cygwin.
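
The same quickie histogram translates naturally to Pig. Here is a rough sketch; the relation names, path and field names are made up for illustration.

[source,pig]
----
-- Sketch: count distinct four-character prefixes, the Pig analogue of
--   cat * | cut -c 1-4 | sort | uniq -c
lines    = LOAD 'input_dir' AS (line:chararray);
prefixes = FOREACH lines GENERATE SUBSTRING(line, 0, 4) AS prefix;
grouped  = GROUP prefixes BY prefix;
counts   = FOREACH grouped GENERATE group AS prefix, COUNT(prefixes) AS ct;
DUMP counts;
----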

=== Pig Operators ===

.Pig Operator Cheatsheet
[options="header"]
|=======
| action                                    | operator
| combine records from two tables on a key  | JOIN
| keep only the records that pass a test    | FILTER
|=======
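
A rough illustration of those two operators, using made-up field names for the airline flights and airports tables (not the datasets' actual schemas):

[source,pig]
----
-- Sketch of FILTER and JOIN; relation and field names here are assumptions.
flights  = LOAD 'airline_flights'   AS (flight_date:chararray, origin:chararray, dest:chararray, dep_delay:int);
airports = LOAD 'dataexpo_airports' AS (iata:chararray, airport_name:chararray, city:chararray, state:chararray);

late     = FILTER flights BY dep_delay > 60;       -- keep only badly delayed flights
late_air = JOIN late BY origin, airports BY iata;  -- decorate each late flight with its origin airport
DUMP late_air;
----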

=== Hadoop Tunables Cheatsheet

=== SQL-to-Pig-to-Hive Cheatsheet

* SELECT…WHERE
* SELECT…LIMIT
* GROUP BY…HAVING
* SELECT WHERE… ORDER BY
* SELECT WHERE… SORT BY (just use reducer sort) ~~ (does reducer in Pig guarantee this?)
* SELECT … DISTRIBUTE BY … SORT BY …
* SELECT … CLUSTER BY (equivalent of DISTRIBUTE BY X SORT BY X)
* Indexing tips
* CASE…WHEN…THEN
* Block Sampling / Input pruning
* SELECT country_name, indicator_name, 2011 AS trade_2011 FROM wdi WHERE (indicator_name = 'Trade (% of GDP)' OR indicator_name = 'Broad money (% of GDP)') AND 2011 IS NOT NULL CLUSTER BY indicator_name;
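
For instance, a hedged Pig version of the last Hive query above might look like the following; the wdi field names (yr_2011 and so on) are assumptions, not the table's real column names.

[source,pig]
----
-- Rough Pig equivalent of the Hive query above; wdi field names (yr_2011 etc.) are assumed.
wdi      = LOAD 'wdi' AS (country_name:chararray, indicator_name:chararray, yr_2011:double);
matching = FILTER wdi BY
             (indicator_name == 'Trade (% of GDP)' OR indicator_name == 'Broad money (% of GDP)')
             AND yr_2011 IS NOT NULL;
trade    = FOREACH matching GENERATE country_name, indicator_name, yr_2011 AS trade_2011;
-- Hive's CLUSTER BY distributes and sorts by the key; grouping by indicator_name in Pig
-- similarly brings each indicator's records together on one reducer.
by_ind   = GROUP trade BY indicator_name;
DUMP by_ind;
----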

SELECT columns or computations FROM table WHERE condition GROUP BY columns HAVING condition ORDER BY column [ASC