Much of Hadoop’s adoption is driven by organizations realizing they have the opportunity to measure every aspect of their operation (website hits, purchases, ad impressions, support calls, sensor readings and more), unify those data sources, and act on the patterns that Hadoop and other Big Data tools uncover.
For an e-commerce site, an advertising broker, or a hosting provider, there’s no wonder inherent in being able to measure every customer interaction, no controversy that uncovering patterns in those interactions is enormously valuable, and no lack of tools to act on those patterns in real time.
A political campaign, for example, can use the clickstream of interactions with each email it sends; this was one of the cardinal advantages cited in the success of Barack Obama’s 2012 campaign.
This chapter’s techniques will help, say, a hospital process the stream of data from every ICU patient, a retailer follow the entire purchase-decision process, or a political campaign understand and tune the response to each email batch and advertising placement.
Hadoop cut its teeth at Yahoo, where it was primarily used for processing internet-sized web crawls (see the next chapter on text processing) and for crunching server logs much like the ones we’ll work with here.
Quite likely, server log processing either (a) is the reason you got this book or (b) seems utterly boring. For the latter folks, stick with it: hidden in this chapter are basic problems of statistics (finding the histogram of pageviews), text processing (regular expressions for parsing), and graphs (constructing the tree of paths that users take through a website).
We’ll represent loglines with the following model definition:
link:code/serverlogs/models/logline--fields.rb[role=include]
Since most of our questions are about what visitors do, we’ll mainly use `visitor_id` (to identify common requests for a visitor), `uri_str` (what they requested), `requested_at` (when they requested it) and `referer` (the page they clicked on). Don’t worry if you’re not deeply familiar with the rest of the fields in our model — they’ll become clear in context.
Two notes, though. In these explorations, we’re going to use the `ip_address` field for the `visitor_id`. That’s good enough if you don’t mind artifacts, like every visitor from the same coffee shop being treated identically. For serious use, though, many web applications assign an identifying "cookie" to each visitor and add it as a custom logline field. Following good practice, we’ve built this model with a `visitor_id` method that decouples the semantics ("visitor") from the data ("the IP address they happen to have visited from"). Also please note that, though the dictionary blesses the term 'referrer', the early authors of the web used the spelling 'referer' and we’re now stuck with it in this context.
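To picture that decoupling, a `visitor_id` method might be as simple as the following sketch (the real model’s version lives in the listing above); swapping in a cookie-based id later means touching only this one spot:

[source,ruby]
----
# Sketch only -- the actual definition is in the Logline model listing above.
def visitor_id
  ip_address   # later: return the tracking-cookie field instead
end
----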
The core thing to know about simple log parsing is that it is a line-at-a-time affair: each request is recorded as a single, self-contained line of text.
Webserver logs typically follow some variant of the "Apache Common Log" format — a series of lines describing each web request:
154.5.248.92 - - [30/Apr/2003:13:17:04 -0700] "GET /random/video/Star_Wars_Kid.wmv HTTP/1.0" 206 176250 "-" "Mozilla/4.77 (Macintosh; U; PPC)"
Our first task is to leave that arcane format behind and extract healthy structured models. Since every line stands alone, the parse script is simple as can be: a transform-only script that passes each line to the `Logline.parse` method and emits the model object it returns.
link:code/serverlogs/parser--processor.rb[role=include]
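In shape, that processor is as small as transform-only scripts get. Here is a minimal sketch (the real listing is included above); it assumes the same `processor` DSL used later in the chapter, and that `Logline.parse` returns nil for unparseable lines:

[source,ruby]
----
# Sketch of a transform-only parser: one log line in, one model object out.
processor :logline_parser do
  def process(line)
    logline = Logline.parse(line)
    yield logline if logline   # skip lines the parser rejects
  end
end
----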
For sample data, we’ll use the webserver logs released by blogger Andy Baio. In 2003, he posted the famous "Star Wars Kid" video, which for several years ranked as the biggest viral video of all time. (It augments a teenager’s awkward Jedi-style fighting moves with the special effects of a real lightsaber.) Here’s his description:
I’ve decided to release the first six months of server logs from the meme’s spread into the public domain — with dates, times, IP addresses, user agents, and referer information. … On April 29 at 4:49pm, I posted the video, renamed to "Star_Wars_Kid.wmv" — inadvertently giving the meme its permanent name. (Yes, I coined the term "Star Wars Kid." It’s strange to think it would’ve been "Star Wars Guy" if I was any lazier.) From there, for the first week, it spread quickly through news sites, blogs and message boards, mostly oriented around technology, gaming, and movies. …
This file is a subset of the Apache server logs from April 10 to November 26, 2003. It contains every request for my homepage, the original video, the remix video, the mirror redirector script, the donations spreadsheet, and the seven blog entries I made related to Star Wars Kid. I included a couple weeks of activity before I posted the videos so you can determine the baseline traffic I normally received to my homepage. The data is public domain. If you use it for anything, please drop me a note!
The details of parsing are mostly straightforward — we use a regular expression to pick apart the fields in each line. That regular expression, however, is another story:
link:code/serverlogs/models/logline--parse.rb[role=include]
It may look terrifying, but taken piece-by-piece it’s not actually that bad. Regexp-fu is an essential skill for data science in practice — you’re well advised to walk through it. Let’s do so.
- The meat of each line describes the contents to match: `\S+` for "a sequence of non-whitespace", `\d+` for "a sequence of digits", and so forth. If you’re not already familiar with regular expressions at that level, consult the excellent tutorial at regular-expressions.info. (A condensed example follows this list.)
- This is an 'extended-form' regular expression, as requested by the `x` at the end of it. An extended-form regexp ignores whitespace and treats `#` as a comment delimiter — constructing a regexp this complicated would be madness otherwise. Be careful to backslash-escape spaces and hash marks.
- The `\A` and `\z` anchor the regexp to the absolute start and end of the string respectively.
- Fields are selected using 'named capture group' syntax: `(?<ip_address>\S+)`. You can retrieve its contents using `match[:ip_address]`, or get all of them at once using `captures_hash` as we do in the `parse` method.
- Build your regular expressions to be 'good brittle'. If you only expect HTTP request methods to be uppercase strings, make your program reject records that are otherwise. When you’re processing billions of events, those one-in-a-million deviants start occurring thousands of times.
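To make those pieces concrete, here is a condensed, hypothetical version of such a pattern (far simpler than the chapter’s real `Logline` regexp, and with illustrative field names), matched against the sample line above:

[source,ruby]
----
# A condensed sketch, not the book's full Logline pattern: an extended-form
# regexp with named capture groups for the Apache common-log fields.
LOG_RE = %r{\A
  (?<ip_address>\S+)          \s  # client IP address
  (?<identd>\S+)              \s  # identd result (usually '-')
  (?<authuser>\S+)            \s  # authenticated username (usually '-')
  \[(?<requested_at>[^\]]+)\] \s  # request timestamp
  "(?<http_method>[A-Z]+)     \s
   (?<uri_str>\S+)            \s
   (?<protocol>HTTP/[\d.]+)"  \s
  (?<response_code>\d+)       \s
  (?<bytesize>\d+|-)          \s
  "(?<referer>[^"]*)"         \s
  "(?<user_agent>[^"]*)"
\z}x

line = '154.5.248.92 - - [30/Apr/2003:13:17:04 -0700] ' \
       '"GET /random/video/Star_Wars_Kid.wmv HTTP/1.0" 206 176250 "-" "Mozilla/4.77 (Macintosh; U; PPC)"'
match = LOG_RE.match(line)
match[:ip_address]  # => "154.5.248.92"
match[:uri_str]     # => "/random/video/Star_Wars_Kid.wmv"
----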
That regular expression does almost all the heavy lifting, but isn’t sufficient to properly extract the `requested_at` time. Wukong models provide a "security gate" for each field in the form of the `receive_(field name)` method. The setter method (`requested_at=`) applies a new value directly, but the `receive_requested_at` method is expected to appropriately validate and transform the given value. The default method performs simple 'do the right thing'-level type conversion, sufficient to (for example) faithfully load an object from a plain JSON hash. But for complicated cases you’re invited to override it as we do here.
link:code/serverlogs/models/logline--requested_at.rb[role=include]
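The override boils down to parsing Apache’s timestamp format into a proper Time object. Here is a sketch of the idea only; the chapter’s actual implementation is the listing above, and the rescue behavior shown here is our own assumption:

[source,ruby]
----
require 'time'   # for Time.strptime

# Sketch: turn "30/Apr/2003:13:17:04 -0700" into a Time, and decline politely
# rather than crash when a timestamp is malformed.
def receive_requested_at(val)
  val = Time.strptime(val, '%d/%b/%Y:%H:%M:%S %z') if val.is_a?(String)
  super(val)
rescue ArgumentError
  super(nil)
end
----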
There’s a general lesson here for data-parsing scripts. Don’t try to be a hero and get everything done in one giant method. The giant regexp just coarsely separates the values; any further special handling happens in isolated methods.
Test the script in local mode:
link:code/serverlogs/output-parser-map.log[role=include]
Then run it on the full dataset to produce the starting point for the rest of our work:
TODO
Our goal for this first pass is a simple histogram: pageviews for each page, bucketed by hour. Let’s start exploring the dataset with the mapper:
link:code/serverlogs/old/logline-02-histograms-mapper.rb[role=include]
We want to group on `date_hr`, so just add a 'virtual accessor' — a method that behaves like an attribute but derives its value from another field:
link:code/serverlogs/old/logline-00-model-date_hr.rb[role=include]
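A sketch of what such an accessor can look like (the real definition is the listing above; the exact bucket format it returns is our guess):

[source,ruby]
----
# Sketch of a virtual accessor: derives an hourly bucket from requested_at.
# The "YYYYMMDDHH"-style format here is illustrative.
def date_hr
  requested_at.strftime('%Y%m%d%H')
end
----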
This is the advantage of having a model and not just a passive sack of data.
Run it in map mode:
link:code/serverlogs/old/logline-02-histograms-02-mapper-wu-lign-sort.log[role=include]
TODO: digression about wu-lign.
Sort and save the map output; then write and debug your reducer.
link:code/serverlogs/old/logline-02-histograms-full.rb[role=include]
When things are working, this is what you’ll see. Notice that the `…/Star_Wars_Kid.wmv` file already has five times the pageviews of the site root (`/`).
link:code/serverlogs/old/logline-02-histograms-03-reduce.log[role=include]
You’re ready to run the script in the cloud! Fire it off and you’ll see dozens of workers start processing the data.
link:code/serverlogs/old/logline-02-histograms-04-freals.log[role=include]
We can use the user logs to assemble a picture of how users navigate the site — 'sessionizing' their pageviews. Marketing and e-commerce sites have a great deal of interest in optimizing their "conversion funnel", the sequence of pages that visitors follow before filling out a contact form, or buying those shoes, or whatever it is the site exists to serve. Visitor sessions are also useful for defining groups of related pages, in a manner far more robust than what simple page-to-page links would define. A recommendation algorithm using those relations would for example help an e-commerce site recommend teflon paste to a visitor buying plumbing fittings, or help a news site recommend an article about Marilyn Monroe to a visitor who has just finished reading an article about John F Kennedy. Many commercial web analytics tools don’t offer a view into user sessions — assembling them is extremely challenging for a traditional datastore. It’s a piece of cake for Hadoop, as you’re about to see.
NOTE: Take a moment and think about the locality: what feature(s) do we need to group on? What additional feature(s) should we sort with?
The mapper spits out `[ip, date_hr, visit_time, path]`.
link:code/serverlogs/old/logline-03-breadcrumbs-full.rb[role=include]
You might ask why we don’t partition directly on, say, both `visitor_id` and date (or some other time bucket). Partitioning by date would break the locality of any visitor session that crossed midnight: some of its requests would land in one day, the rest in the next.
Run it in map mode:
link:code/serverlogs/old/logline-03-breadcrumbs-02-mapper.log[role=include]
Then group on the visitor and run the reducer:
link:code/serverlogs/old/logline-03-breadcrumbs-03-reducer.log[role=include]
We use the secondary sort so that each visit is in strict order of time within a session.
You might ask why that is necessary — surely each mapper reads the lines in order? Yes, but you have no control over what order the mappers run, or where their input begins and ends.
This script will accumulate multiple visits to a page.
The secondary sort works by splitting the key’s duties: we partition and group on one part (the visitor) while sorting on another (the request time), so each reducer receives a visitor’s entire clickstream already in time order. A news site’s session that read the John F Kennedy article and then the Marilyn Monroe article, for instance, arrives at the reducer in exactly that order, ready to be strung into a breadcrumb trail.
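If it helps to see the shape of the reduce step, here is a hypothetical sketch. It assumes plain tab-separated `[visitor_id, requested_at, path]` records arriving partitioned by visitor and secondary-sorted by time; the layout and names are ours, not the chapter’s actual files:

[source,ruby]
----
# Hypothetical sessionizer sketch: roll up each visitor's pageviews, in time
# order, into one breadcrumb-trail line per visitor.
current_visitor = nil
pages           = []

emit_trail = lambda do
  puts([current_visitor, *pages].join("\t")) unless pages.empty?
end

ARGF.each_line do |line|
  visitor, _requested_at, path = line.chomp.split("\t")
  if visitor != current_visitor
    emit_trail.call                  # flush the previous visitor's trail
    current_visitor, pages = visitor, []
  end
  pages << path                      # already time-ordered by the secondary sort
end
emit_trail.call                      # don't forget the final visitor
----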
What can you do with the sessionized logs? Well, each row lists a visitor-session on the left and a bunch of pages on the right. We’ve been thinking about that as a table, but it’s also a graph — actually, a bunch of graphs! The serverlogs_affinity_graph figure describes an affinity graph, but we can build a simpler graph that just connects pages to pages, by counting the number of times a pair of pages were visited by the same session. Every time a person requests the `/archive/2003/04/03/typo_pop.shtml` page and the `/archive/2003/04/29/star_war.shtml` page in the same visit, that’s one point toward their similarity. The chapter on [graph_processing] has lots of fun things to do with a graph like this, so for now we’ll just lay the groundwork by computing the page-page similarity graph defined by visitor sessions.
link:code/serverlogs/old/logline-04-page_page_edges-full.rb[role=include]
link:code/serverlogs/old/logline-04-page_page_edges-03-reducer.log[role=include]
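As a rough illustration of the counting idea (not the chapter’s actual script), a pass over sessionized rows like the ones sketched earlier could emit one count per co-visited page pair, leaving a follow-on reducer to sum the counts into edge weights:

[source,ruby]
----
# Hypothetical page-page edge emitter over "visitor<TAB>page1<TAB>page2..." rows.
ARGF.each_line do |line|
  _visitor, *pages = line.chomp.split("\t")
  pages.uniq.sort.combination(2).each do |page_a, page_b|
    puts [page_a, page_b, 1].join("\t")   # one point toward this pair's similarity
  end
end
----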
First, you can think of it as an affinity graph pairing visitor sessions with pages. The Netflix prize motivated a lot of work to help us understand affinity graphs — in that case, a pairing of Netflix users with movies. Affinity graph-based recommendation engines simultaneously group similar users with similar movies (or similar sessions with similar pages). Imagine the following device. Set up two long straight lengths of wire, with beads that can slide along them. Beads on the left represent visitor sessions, ones on the right represent pages. These are magic beads, in that they can slide through each other, and they can clamp themselves to the wire. They are also slightly magnetic, so with no other tension they would not clump together but instead arrange themselves at some separated interval along the wire. Clamp all the beads in place for a moment and tie a small elastic string between each session bead and each page in that session. (These elastic bands also magically don’t interfere with each other.) To combat the crawler-robot effect, choose tighter strings when there are few pages in the session, and weaker strings when there are lots of pages in the session. Once you’ve finished stringing this up, unclamp one of the session beads. It will snap to a position opposite the middle of all the pages it is tied to. If you now unclamp each of those page beads, they’ll move to sit opposite that first session bead. As you continue to unclamp all the beads, you’ll find that they organize into clumps along the wire: when a bunch of sessions link to a common set of pages, their mutual forces combine to drag them opposite each other. That’s the intuitive view; there are proper mathematical treatments, of course, for this kind of co-clustering.
TODO: figure showing bipartite session-page graph
You can learn a lot about your site’s audience in aggregate by mapping IP addresses to geolocation. Not just in itself, but joined against other datasets, like census data, store locations, weather and time. [1]
Maxmind makes their GeoLite IP-to-geo database available under an open license (CC-BY-SA)[2]. Out of the box, its columns are `beg_ip`, `end_ip` and `location_id`, where the first two columns show the low and high ends (inclusive) of a range that maps to that location. Every address lies in at most one range; locations may have multiple ranges.
This arrangement caters to range queries in a relational database, but isn’t suitable for our needs. A single IP-geo block can span thousands of addresses.
To get the right locality, take each range and break it at some block level. Instead of having `1.2.3.4` to `1.2.5.6` on one line, let’s use the first three quads (first 24 bits) and emit rows for `1.2.3.4` to `1.2.3.255`, `1.2.4.0` to `1.2.4.255`, and `1.2.5.0` to `1.2.5.6`. This lets us use the first segment as the partition key, and the full IP address as the sort key.
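Here is a sketch of that splitting step using Ruby’s standard IPAddr library; the function name and row layout are illustrative:

[source,ruby]
----
require 'socket'
require 'ipaddr'

# Break an inclusive IP range into segments that each lie within a single /24.
def split_on_24_bit_blocks(beg_ip, end_ip, location_id)
  lo, hi = IPAddr.new(beg_ip).to_i, IPAddr.new(end_ip).to_i
  rows = []
  while lo <= hi
    seg_hi = [lo | 0xFF, hi].min        # end of this /24, or of the range
    rows << [IPAddr.new(lo,     Socket::AF_INET).to_s,
             IPAddr.new(seg_hi, Socket::AF_INET).to_s,
             location_id]
    lo = seg_hi + 1
  end
  rows
end

split_on_24_bit_blocks('1.2.3.4', '1.2.5.6', 42)
# => [["1.2.3.4", "1.2.3.255", 42], ["1.2.4.0", "1.2.4.255", 42], ["1.2.5.0", "1.2.5.6", 42]]
----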
[options="header"]
|===
| lines | bytes | description | file
| 15_288_766 | 1_094_541_688 | 24-bit partition key | maxmind-geolite_city-20121002.tsv
| 2_288_690 | 183_223_435 | 16-bit partition key | maxmind-geolite_city-20121002-16.tsv
| 2_256_627 | 75_729_432 | original (not denormalized) | GeoLiteCity-Blocks.csv
|===
In a nutshell, a range query asks which stored interval contains a given point; here, which IP block contains a given address. What we just did generalizes into an approach for doing range queries at scale:
- Choose a regular interval, fine enough to avoid skew but coarse enough to avoid ballooning the dataset size.
- Wherever a range crosses an interval boundary, split it into multiple records, each filling or lying within a single interval.
- Emit a compound key of `[interval, join_handle, beg, end]`, where `interval` is the bucket the record lies within (it serves as the partition key), and `join_handle` identifies the originating table, so that records are grouped for a join (this is what ensures that the ranges and the values to look up within them arrive at the same reducer). If the interval is transparently a prefix of the index (as it is here), you can instead just ship the remainder: `[interval, join_handle, beg_suffix, end_suffix]`.
- Use the interval as the partition key and the full compound key as the sort key.
In the geodata section, the "quadtile" scheme is (if you bend your brain right) something of an extension on this idea — instead of splitting ranges on regular intervals, we’ll split regions on a regular grid scheme.
Hadoop is engineered to consume the full capacity of every available resource up to the currently-limiting one. So in general, you should never issue requests against external services from a Hadoop job — one-by-one queries against a database; crawling web pages; requests to an external API. The resulting load spike will effectively be attempting what web security folks call a "DDoS", or distributed denial of service attack.
Unless of course you are trying to test a service for resilience against an adversarial DDoS — in which case that assault is a feature, not a bug!
[source,ruby]
----
require 'faraday'

processor :elephant_stampede do

  def process(logline)
    beg_at = Time.now.to_f
    resp = Faraday.get url_to_fetch(logline)
    yield summarize(resp, beg_at)
  end

  def summarize(resp, beg_at)
    duration = Time.now.to_f - beg_at
    bytesize = resp.body.bytesize
    { duration: duration, bytesize: bytesize }
  end

  def url_to_fetch(logline)
    logline.url
  end
end

flow(:mapper){ input > parse_loglines > elephant_stampede }
----
You must use Wukong’s eventmachine bindings to make more than one simultaneous request per mapper.