Space doesn’t allow treating HBase in any depth, but it’s worth equipping you with a few killer dance moves for the most important part of using it well: data modeling. It’s also good for your brain — optimizing data at rest presents new locality constraints, dual to the ones you’ve by now mastered for data in motion. You’ll get far more out of this chapter if you play with the tool first: consult other references (like "HBase: The Definitive Guide" (TODO:reference) or the free HBase Book online), load a ton of data into it, play around, then come back to enjoy this chapter.
You’re probably familiar with some database or another: MySQL, MongoDB, Oracle and so forth. These are passenger vehicles of various sorts, with a range of capabilities and designed for the convenience of the humans that use them. HBase is not a passenger vehicle — it is a big, powerful dump truck. It has no A/C, no query optimizer and it cannot perform joins or groups. You don’t drive this dump truck for its ergonomics or its frills; you drive it because you need to carry a ton of raw data-mining ore to the refinery. Once you learn to play to its strengths, though, you’ll find it remarkably powerful.
Here is most of what you can ask HBase to do, roughly in order of efficiency:
1. Given a row key: get, put or delete a single value into which you’ve serialized a whole record.
2. Given a row key: get, put or delete a hash of column/value pairs, sorted by column key.
3. Given a key: find the first row whose key is equal or larger, and read a hash of column/value pairs (sorted by column key).
4. Given a row key: atomically increment one or several counters and receive their updated values.
5. Given a range of row keys: get a hash of column/value pairs (sorted by column key) from each row in the range. The lowest key in the range is examined, but the highest is not. (If the amount of data is small and uniform for each row, the performance of this type of query isn’t too different from case (3). If there are potentially many rows, or more data than would reasonably fit in one RPC call, this becomes far less performant.)
6. Feed a map/reduce job by scanning an arbitrarily large range of values.
That’s pretty much it! There are some conveniences (versioning by timestamp, time-expirable values, custom filters, and a type of vertical partitioning known as column families); some tunables (read caching, fast rejection of missing rows, and compression); and some advanced features, not covered here (transactions, and a kind of stored procedures/stored triggers called coprocessors). For the most part, however, those features just ameliorate the access patterns listed above.
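To make the access patterns above concrete, here is roughly what they look like from client code. This is a sketch against the HBase 1.x-era Java client; the table, family and column names are made up for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class AccessPatterns {
  public static void demo(Connection conn) throws IOException {
    Table table = conn.getTable(TableName.valueOf("pageviews"));   // hypothetical table

    // Patterns 1 and 2: given a row key, put or get values
    Put put = new Put(Bytes.toBytes("some_row"));
    put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("record"), Bytes.toBytes("...serialized bytes..."));
    table.put(put);
    Result row = table.get(new Get(Bytes.toBytes("some_row")));

    // Pattern 4: atomically increment a counter and receive the updated value
    long count = table.incrementColumnValue(
        Bytes.toBytes("some_row"), Bytes.toBytes("f"), Bytes.toBytes("views"), 1L);

    // Patterns 3 and 5: scan a range of row keys in sorted order.
    // The start key is included; the stop key is not.
    Scan scan = new Scan(Bytes.toBytes("aaa"), Bytes.toBytes("aab"));
    try (ResultScanner results = table.getScanner(scan)) {
      for (Result r : results) {
        // each r is a sorted hash of column/value pairs for one row
      }
    }
    table.close();
  }
}
```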
Here’s a partial list of features you do not get in HBase:
- efficient querying or sorting by cell value
- group by, join or secondary indexes
- text indexing or string matching (apart from row-key prefixes)
- arbitrary server-side calculations on query
- any notion of a datatype apart from counters; everything is bytes in/bytes out
- auto-generated serial keys
Sometimes you can partially recreate those features, and often you can accomplish the same tasks you’d use those features for, but only with significant constraints or tradeoffs. (You can pick up the kids from daycare in a dump truck, but only an idiot picks up their prom date in a dump truck, and in neither case is it the right choice).
More than most engineering tools, it’s essential to play to HBase’s strengths, and in general the simpler your schema, the better HBase will serve you. Somehow, though, the sparsity of its feature set amplifies the siren call of even those few features. Resist, resist. The more stoically you treat HBase’s small feature set, the better you will realize how surprisingly large HBase’s solution set is.
A good HBase data model is "designed for reads", and your goal is to make one read per customer request (or as close as possible). If you do so, HBase will yield response times on the order of 1ms for a cache hit and 10ms for a cache miss, even with billions of rows and millions of columns.
An HBase data model is typically designed around multiple tables, each serving one or a small number of online queries or batch jobs. These are the questions to ask:

- What query do you want to make that must happen at millisecond speed?
- Of the related queries and batch jobs, which would you like to be efficient?

If you are using it primarily for batch use:

- What is the batch job you are most interested in simplifying?
- Of the related queries and batch jobs, which would you like to be efficient?
Let’s sketch the implementation of an autocomplete API on Wikipedia page titles, an example that truly plays to HBase’s strengths. As a visitor types characters into a search bar, the browser will request a JSON-encoded list of the top 10 most likely completions for that prefix. Responsiveness is essential: at most 50 milliseconds end-to-end response time. Several approaches might spring to mind: a range query on titles, a prefix query against a text search engine, or a specialized "trie" data structure. HBase provides a much stupider, far superior solution.
Instead, we’ll enumerate every possible completion. This blows the dataset into the billion-row range, but it makes each request a highly cache-efficient key/value lookup. Given an average title length of (TODO: insert numbers), the full completion set weighs in at "only" (TODO: numbers) rows and XXX raw data size — a walk in the park for HBase. (A sketch of how you might do this: first, use Pig to join on the pagerank table (see TODO: ref) to attach a "prominence" to each page. Next, write a map-reduce job to group the prefixes. The mapper takes each title and emits the first three, four, five, up to say twelve characters along with the pagerank. Use the prefix as partition key, and the prefix-rank as a descending sort key. Within each prefix group, the first ten records will be the ten most prominent completions; store them as a JSON-ized list and ignore all following completions for that prefix.)
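The heart of that job is just string slicing plus a bounded top-ten per prefix. Here is the core logic as a plain, single-process Java sketch (in the real map/reduce job the prefix grouping would be done by the shuffle); it assumes each input record carries a title and a pagerank-style prominence score.

```java
import java.util.*;

public class AutocompleteBuilder {
  /** One candidate completion: a page title plus its pagerank-style prominence. */
  static class Completion {
    final String title; final double rank;
    Completion(String title, double rank) { this.title = title; this.rank = rank; }
  }

  private static final Comparator<Completion> BY_RANK =
      Comparator.comparingDouble((Completion c) -> c.rank);

  /** Map each 3..12-character prefix to its ten most prominent completions, best first. */
  public static Map<String, List<String>> topCompletions(Iterable<Completion> pages) {
    Map<String, PriorityQueue<Completion>> heaps = new HashMap<>();
    for (Completion page : pages) {
      String title = page.title.toLowerCase();
      for (int len = 3; len <= Math.min(12, title.length()); len++) {
        // min-heap capped at ten: the least prominent candidate is always the one evicted
        PriorityQueue<Completion> heap =
            heaps.computeIfAbsent(title.substring(0, len), k -> new PriorityQueue<>(BY_RANK));
        heap.add(page);
        if (heap.size() > 10) heap.poll();
      }
    }
    Map<String, List<String>> completions = new HashMap<>();
    for (Map.Entry<String, PriorityQueue<Completion>> entry : heaps.entrySet()) {
      List<Completion> top = new ArrayList<>(entry.getValue());
      top.sort(BY_RANK.reversed());                    // most prominent first
      List<String> titles = new ArrayList<>();
      for (Completion c : top) titles.add(c.title);
      completions.put(entry.getKey(), titles);         // serialize this list as the JSON value
    }
    return completions;
  }
}
```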
What will we store into HBase? Your first instinct might be to store each of the ten titles, each in its own cell. Reasonable, but still too clever. Instead, serialize the full JSON-encoded response as a single value. This minimizes the cell count (memory- and disk-efficient), lets the API front end put the value straight onto the wire (speed and lines-of-code efficient), and puts us in the most efficient access pattern: single row, single value.
table | row key | column family | column qualifier | value | options
title_autocomp | | 'j' | | JSON-encoded response |
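Serving a completion request is then a single get against that table. A sketch with the HBase 1.x-era Java client, assuming the row key is the typed prefix and the JSON was written under family 'j' with an empty qualifier (the qualifier choice is an assumption):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class AutocompleteLookup {
  /** Return the pre-serialized JSON response for a typed prefix, or null if unknown. */
  public static byte[] completionsFor(Connection conn, String prefix) throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("title_autocomp"))) {
      Get get = new Get(Bytes.toBytes(prefix.toLowerCase()));
      get.addColumn(Bytes.toBytes("j"), Bytes.toBytes(""));   // family 'j', empty qualifier
      Result result = table.get(get);
      // The stored value is already the JSON the browser wants; hand it straight back.
      return result.isEmpty() ? null : result.getValue(Bytes.toBytes("j"), Bytes.toBytes(""));
    }
  }
}
```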
In the autocomplete example, many requests will be for non-existent rows (eg "hdaoop"). These will of course be cache misses (there’s nothing to cache), making the queries not just useless but also costly. Luckily, there’s a specialized data structure known as a "Bloom Filter" that lets you very efficiently test set membership. If you explicitly enable it [1], HBase will capture all row keys into a Bloom Filter. On each request, it will quickly make sure it’s worth trying to retrieve a value before doing so. Data blocks for lame prefixes (hda…) will be left unread, so that blocks for fecund prefixes (had…) can be kept in RAM.
There’s another reason HBase is a great match for this problem: row locality. HBase stores all rows in sorted order on disk, so when a visitor has typed chim, the rows for chime and chimp and so forth are nearby on disk. Whatever next character the visitor types, the operating system is likely to have the right block hot in cache.
That also makes the autocomplete table especially well-suited for compression. Compression drives down the data size, which of course economizes disk capacity — more importantly, though, it means that the drive head has less data to seek past, and the IO bus has less data to stream off disk. Row locality often means nearby data elements are highly repetitive (definitely true here), so you realize a great compression ratio. There are two tradeoffs: first, a minor CPU hit to decompress the data; worse, that you must decompress whole blocks at a time even if you only want one cell. In the case of autocomplete, row locality means you’re quite likely to use some of those other cells.
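Both of those tunables, the row-level Bloom Filter and block compression, are set per column family when you create or alter the table. A sketch with the HBase 1.x-era admin API; Snappy is one reasonable codec choice, assuming the codec is installed on your regionservers.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class CreateAutocompleteTable {
  public static void create(Connection conn) throws IOException {
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("title_autocomp"));
    HColumnDescriptor fam = new HColumnDescriptor("j");
    fam.setBloomFilterType(BloomType.ROW);                   // reject missing rows cheaply
    fam.setCompressionType(Compression.Algorithm.SNAPPY);    // compress data blocks on disk
    desc.addFamily(fam);
    try (Admin admin = conn.getAdmin()) {
      admin.createTable(desc);
    }
  }
}
```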
For our next example, let’s look at geographic data: the Geonames dataset of places, the Natural Earth dataset of region boundaries, and our Voronoi-spatialized version of the NCDC weather observations (TODO: ref).
We require two things. First, direct information about each feature. Here no magic is called for: compose a row key from the feature type and id, and store the full serialized record as the value. It’s important to keep row keys short and sortable, so map the region types to single-byte ids (say, 'a' for country, 'b' for admin 1, etc) and use standard ISO identifiers for the region id ('us' for the USA, 'dj' for Djibouti, etc).
More interestingly, we would like a "slippy map" (eg Google Maps or Leaflet) API: given the set of quadtiles in view, return partial records (coordinates and names) for each feature. To ensure a responsive user experience, we need low latency, concurrent access and intelligent caching — HBase is a great fit.
The boundaries dataset gives coordinates for continents, countries, states ("admin1"), and so forth. In (TODO: ref the Geographic Data chapter), we fractured those boundaries into quadtiles for geospatial analysis, which is the first thing we need.
We need to choose a base zoom level: fine-grained enough that the records are of manageable size to send back to the browser, but coarse-grained enough that we don’t flood the database with trivial tiles ("In Russia". "Still in Russia". "Russia, next 400,000 tiles"…). Consulting the (TODO: ref "How big is a Quadtile") table, zoom level 13 means 67 million quadtiles, each about 4 km per side; this is a reasonable balance based on our boundary resolution.
ZL | recs | @64 kB/qk | reference size
12 | 17 M | 1 TB | Manhattan
13 | 67 M | 4 TB |
14 | 260 M | 18 TB | about 2 km per side
15 | 1024 M | 70 TB | about 1 km per side
For API requests at finer zoom levels, we’ll just return the ZL 13 tile and crop it (at the API or browser stage). You’ll need to run a separate job (not described here, but see the references (TODO: ref migurski boundary thingy)) to create simplified boundaries for each of the coarser zoom levels. Store these in HBase with three-byte row keys built from the zoom level (byte 1) and the quadtile id (bytes 2 and 3); the value should be the serialized GeoJSON record we’ll serve back.
We want to serve several kinds of regions: countries, states, metropolitan areas, counties, voting districts and so forth. It’s reasonable for a request to specify one, some combination or all of the region types, and so given our goal of "one read per client request" we should store the popular region types in the same table. The most frequent requests will be for one or two region types, though.
HBase lets you partition values within a row into "Column Families". Each column family has its own set of store files and bloom filters and block cache (TODO verify caching details), and so if only a couple column families are requested, HBase can skip loading the rest [2].
We’ll store each region type (using the scheme above) as the column family, and the feature ID ('us', 'jp', etc) as the column qualifier. This means I can

- request all region boundaries on a quadtile by specifying no column constraints;
- request country, state and voting district boundaries by specifying those three column families;
- request only Japan’s boundary on the quadtile by specifying the column key 'a:jp'.
Most client libraries will return the result as a hash mapping column keys (combined family and qualifier) to cell values; it’s easy to reassemble this into a valid GeoJSON feature collection without even parsing the field values.
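In client code, those three flavors of request differ only in how much of the column coordinate you pin down. A sketch against the HBase 1.x-era Java client, using the tile_bounds-style schema above (quadtile row key, region type as family, feature ID as qualifier); the 'v' family for voting districts is an assumption.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class BoundaryRequests {
  public static void demo(Connection conn, byte[] quadtileKey) throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("tile_bounds"))) {
      // All region boundaries on the quadtile: no column constraints at all
      Result everything = table.get(new Get(quadtileKey));

      // Only country ('a'), state ('b') and voting-district ('v') boundaries: name the families
      Get someFamilies = new Get(quadtileKey);
      someFamilies.addFamily(Bytes.toBytes("a"));
      someFamilies.addFamily(Bytes.toBytes("b"));
      someFamilies.addFamily(Bytes.toBytes("v"));
      Result several = table.get(someFamilies);

      // Only Japan's boundary: pin down the full column key a:jp
      Get oneColumn = new Get(quadtileKey);
      oneColumn.addColumn(Bytes.toBytes("a"), Bytes.toBytes("jp"));
      Result japan = table.get(oneColumn);
    }
  }
}
```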
HBase tutorials generally have to introduce column families early, as they’re present in every request and when you define your tables. This unfortunately makes them seem far more prominent and useful than they really are. They should be used only when clearly required: they incur some overhead, and they cause some internal processes to become governed by the worst-case pattern of access among all the column families in a row. So consider first whether separate tables, a scan of adjacent rows, or just plain column qualifiers in one family would work. Tables with a high write impact shouldn’t use more than two or three column families, and no table should use more than a handful.
The Geonames dataset has 7 million points of interest spread about the globe.
Rendering these each onto quadtiles at some resolution, as we did above, is fine for slippy-map rendering. But if we could somehow index points at a finer resolution, developers would have a simple effective way to do "nearby" calculations.
At zoom level 16, each quadtile covers about four blocks, and its packed quadkey exactly fills a 32-bit integer; this seems like a good choice. We’re not going to render all the ZL16 quadtiles, though — that would require 4 billion rows.
Instead, we’ll render each point as its own row, indexed by the row key quadtile_id16-feature_id. To see the points on any given quadtile, I just need to do a row scan from the quadkey index of its top left corner to that of its bottom right corner (both left-aligned).
012100-a 012100-b 012101-c 012102-d 012102-e 012110-f 012121-g 012121-h 012121-i 012123-j 012200-k
To find all the points in quadtile 0121, scan from 012100 to 012200 (returning a through j). Scans ignore the last index in their range, so k is excluded as it should be. To find all the points in quadtile 012121, scan from 012121 to 012122 (returning g, h and i).
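Here’s what that range scan looks like in code: a sketch with the HBase 1.x-era Java client, against a points table whose name (geonames_points) is made up here. Note the scanner-caching setting, discussed in the note below.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PointsOnTile {
  /** Scan every point on a quadtile, given its left-aligned corner quadkeys, eg "012100" to "012200". */
  public static void pointsBetween(Connection conn, String startQuadkey, String stopQuadkey)
      throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("geonames_points"))) {
      // The start key is included, the stop key is not -- exactly the behavior described above
      Scan scan = new Scan(Bytes.toBytes(startQuadkey), Bytes.toBytes(stopQuadkey));
      scan.setCaching(500);   // "scanner caching": how many rows each network call returns
      try (ResultScanner results = table.getScanner(scan)) {
        for (Result point : results) {
          // each result holds the columns for one point of interest
        }
      }
    }
  }
}
```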
Don’t store the quadkeys as the base-4 strings that we use for processing: the efficiency gained by packing them into 16- or 32-bit integers is worth the trouble. The quadkey '12301230' is eight bytes as the string "12301230", two bytes as the 16-bit integer 27756.
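Packing is just a base-4 parse; a small sketch follows (the key length tells you how many bytes you need).

```java
import org.apache.hadoop.hbase.util.Bytes;

public class QuadkeyPacking {
  /** Pack a base-4 quadkey string like "12301230" into a compact big-endian byte array. */
  public static byte[] pack(String quadkey) {
    long packed = Long.parseLong(quadkey, 4);                          // "12301230" -> 27756
    if (quadkey.length() <= 8)  return Bytes.toBytes((short) packed);  // up to ZL 8: 2 bytes
    if (quadkey.length() <= 16) return Bytes.toBytes((int) packed);    // up to ZL 16: 4 bytes
    return Bytes.toBytes(packed);                                      // beyond: 8 bytes
  }

  public static void main(String[] args) {
    System.out.println(Long.parseLong("12301230", 4));   // prints 27756
  }
}
```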
Note: When you are using this "Rows as Columns" technique, or any time you’re using a scan query, make sure you set "scanner caching" on. It’s an incredibly confusing name (it does not control a cache of scanner objects). Instead, think of it as "batch size", allowing many rows of data to be sent per network call.
Typically with a keyspace this sparse you’d use a bloom filter, but we won’t be doing direct gets and so it’s not called for here (Bloom Filters are not consulted in a scan).
Use column families to hold high-, medium- and low-importance points: at coarse zoom levels, return only the few high-prominence points; at fine zoom levels, return points from all the column families.
There are many kinds of features, and some of them are distinctly more populous and interesting. Roughly speaking, the Geonames feature classes are:

- A (XXX million): political features (states, counties, etc)
- H (XXX million): water-related features (rivers, wells, swamps, …)
- P (XXX million): populated places (cities, county seats, capitals, …)
- …
- R (): roads, railroads, …
- S (): spots, buildings, farms
- …
Very frequently we want only one feature type (only cities, say, or only roads), but it’s also common to want several or all at a time.
You could further nest the feature codes; to select a subset of columns within a single get, you’d need to use a ColumnPrefixFilter (see the sketch below).
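For example, to pull only the populated places (feature codes starting with 'P') out of a tile row, you could attach a ColumnPrefixFilter to the get. A sketch, with the tile_places table from the schema below and the prefix convention as assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FeatureCodePrefix {
  /** Fetch only columns whose qualifier starts with a feature-code prefix, eg "P" for populated places. */
  public static Result featuresOfClass(Connection conn, byte[] tileRowKey, String codePrefix)
      throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("tile_places"))) {
      Get get = new Get(tileRowKey);
      get.setFilter(new ColumnPrefixFilter(Bytes.toBytes(codePrefix)));
      return table.get(get);
    }
  }
}
```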
The weather-station regions table is the most interesting of all.
map from weather station to quadkeys, pre-calculated map from observation to quadkeys, accumulate on tile
We want to serve boundaries out in tiles, but records are heavyweight.
If we store the whole globe at ZL 14 (2 km blocks), a 1 kB record size becomes 275 GB of data. Multiply by the hours in 50 years (50 * 365.25 * 24 ≈ 438,000 hours) and you’re well into the petabytes.
20,000 weather stations 1 M records = 50x data size; 10 TB becomes 0.5 PB.
0111230~~ 011123100 011123101 011123102 011123103 01112311~ 011123120 011123121 011123122 011123123 01112313~ ... 011130~~~
Retrieve the next existing tile: it’s a one-row operation, but we specify a range from the specific tile to the maximum tile ID. The next tile is either the specific one with that key, or the first parent. (Note: a "next interesting record" scan like this doesn’t use the bloom filter.) To do a range query on a zoomed-out tile, say all the cells in 011123, scan from 011123000 to 011123~.
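The "next existing tile" trick is just a scan that we stop after one row. A sketch with the HBase 1.x-era client; weather_tiles is a made-up name for the weather-station regions table, and the stop key is simply something that sorts after any key we would ever store.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class NextExistingTile {
  /** Return the first stored tile whose key is equal to or larger than the requested one, or null. */
  public static Result nextTile(Connection conn, String tileKey) throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("weather_tiles"))) {
      // '~' sorts after the digits, so this stop key is past every real tile id
      Scan scan = new Scan(Bytes.toBytes(tileKey), Bytes.toBytes("~~~~~~~~~~"));
      scan.setCaching(1);                 // we only ever want the first row back
      try (ResultScanner results = table.getScanner(scan)) {
        return results.next();            // the specific tile, or its first existing parent
      }
    }
  }
}
```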
table | row key | column family | column qualifier | value | options
region_info | | 'r' | (none) | serialized record |
geonames_info | | 'i' | (none) | serialized record |
tile_bounds | | (region type) | | GeoJSON-encoded path |
tile_places | | (feature class) | | name |
Hadoop was developed largely to process and analyze high-scale server logs for Nutch and Yahoo!. The recent addition of real-time streaming data tools like Storm+Kafka to the Hadoop/HBase ecosystem unlocks transformative new ways to see your data. It’s not just that it’s real-time; it’s that it’s multi-latency. As long as you provision enough capacity, you can make multiple writes to the database (letting you "optimize for reads"); execute transactional requests against legacy datastores; ping YouTube or Twitter or other only-mostly-dependable external APIs; and much more. All of a sudden some of your most cumbersome or impractical batch jobs become simple, reliable stream decorators. From where we stand, a best-of-class big data stack has three legs: Hadoop, one or more scalable databases, and multi-latency streaming analytics.
A high-volume website might have 2 million unique daily visitors, causing 100 M requests/day on average (4000 requests/second peak), and say 600 bytes per log line from 20-40 servers. Over a year, that becomes about 40 billion records and north of 20 terabytes of raw data. Feed that to most databases and they will crumble. Feed it to HBase and it will smile, belch and ask for seconds and thirds — which in fact we will. Designing for reads means aggressively denormalizing data, to an extent that turns the stomach and tests the will of traditional database experts. Use a streaming data pipeline such as Storm+Kafka or Flume, or a scheduled batch job, to denormalize the data.
Webserver log lines contain these fields: ip_address, cookie (a unique ID assigned to each visitor), url (the page viewed), referer_url (the page they arrived from), status_code (success or failure of the request) and duration (time taken to render the page). We’ll add a couple more fields as we go along.
We’d like to understand user journeys through the site:
(Here’s what you should not do: use a row key of timebucket-cookie; see [adjacency_bad_good].) To sort the values in descending timestamp order, instead use a reverse timestamp: LONG_MAX - timestamp. (You can’t simply use the negative of the timestamp — since sorts are always lexicographic, -1000 sorts before -9999.)
By using a row key of cookie-rev_time (see the sketch after this list):

- we can scan with a prefix of just the cookie to get all pageviews per visitor, ever.
- we can scan with a prefix of the cookie, limit one row, to get only the most recent session.
- if all you want are the distinct pages (not each page view), specify versions = 1 in your request.
- in a map-reduce job, using the column key and the referring page url gives a graph view of the journey; using the column key and the timestamp gives a timeseries view of the journey.
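A sketch of that row key and the cookie-prefix scan, against the HBase 1.x-era Java client; the visits table name is taken from the schema further down, and the key layout here is one reasonable encoding, not the only one.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class VisitorJourneys {
  /** Row key: the cookie, then a reverse timestamp so newer pageviews sort first. */
  public static byte[] rowKey(String cookie, long timestampMillis) {
    return Bytes.add(Bytes.toBytes(cookie + "-"), Bytes.toBytes(Long.MAX_VALUE - timestampMillis));
  }

  /** All pageviews ever for one visitor: scan every row whose key starts with the cookie. */
  public static void journeysFor(Connection conn, String cookie) throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("visits"))) {
      // '.' sorts immediately after '-', so the stop key excludes every other cookie
      Scan scan = new Scan(Bytes.toBytes(cookie + "-"), Bytes.toBytes(cookie + "."));
      scan.setCaching(100);
      try (ResultScanner results = table.getScanner(scan)) {
        for (Result visit : results) {
          // most recent pageviews come back first, thanks to the reverse timestamp
        }
      }
    }
  }
}
```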
Row keys determine data locality. When activity is focused on a set of similar and thus adjacent rows, it can be very efficient or very problematic.
Adjacency is good: Most of the time, adjacency is good (hooray locality!). When common data is stored together, it enables
- range scans: retrieve all pageviews having the same path prefix, or a continuous map region.
- sorted retrieval: ask for the earliest entry, or the top-k rated entries
- space-efficient caching: map cells for New York City will be much more commonly referenced than those for Montana. Storing records for New York City together means fewer HDFS blocks are hot, which means the operating system is better able to cache those blocks.
- time-efficient caching: if I retrieve the map cell for Minneapolis, I’m much more likely to next retrieve the adjacent cell for nearby St. Paul. Adjacency means that cell will probably be hot in the cache.
Adjacency is bad: if everyone targets a narrow range of keyspace, all that activity will hit a single regionserver and your wonderful massively-distributed database will limp along at the speed of one abused machine.
This could happen because of high skew: for example, if your row keys were URL paths, the pages in the /product namespace would see far more activity than pages under laborday_2009_party/photos (unless they were particularly exciting photos). Similarly, a phenomenon known as Benford’s law means that addresses beginning with '1' are far more frequent than addresses beginning with '9' [3]. In this case, managed splitting (pre-assigning a rough partition of the keyspace to different regions) is likely to help.
Managed splitting won’t help for timestamp keys and other monotonically increasing values though, because the focal point moves constantly. You’d often like to spread the load out a little, but still keep similar rows together. Options include:
- swap your first two key levels. If you’re recording time series metrics, use metric_name-timestamp, not timestamp-metric_name, as the row key.
- add some kind of arbitrary low-cardinality prefix: a server or shard id, or even the least-significant bits of the row key. To retrieve whole rows, issue a batch request against each prefix at query time (see the sketch below).
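Here’s what that low-cardinality prefix (often called "salting") looks like in practice; a sketch, with the bucket count and the metrics table name as assumptions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKeys {
  static final int BUCKETS = 16;   // low-cardinality prefix; tune to your cluster

  /** Writer side: prepend its own shard id (0..BUCKETS-1) so hot key ranges spread across regions. */
  public static byte[] saltedKey(int shardId, String logicalKey) {
    return Bytes.add(new byte[] { (byte) shardId }, Bytes.toBytes(logicalKey));
  }

  /** Reader side: the salt isn't derivable from the key, so try every bucket in one batch request. */
  public static Result[] getAllBuckets(Connection conn, String logicalKey) throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("metrics"))) {
      List<Get> gets = new ArrayList<>();
      for (int salt = 0; salt < BUCKETS; salt++) {
        gets.add(new Get(Bytes.add(new byte[] { (byte) salt }, Bytes.toBytes(logicalKey))));
      }
      return table.get(gets);
    }
  }
}
```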
You could also track the most recently-viewed pages directly. In the cookie_stats table, add a column family r having VERSIONS: 5. Now each time the visitor loads a page, write its URL to that exact cell; HBase will retain the five most recent versions, giving you the visitor’s five most recently-viewed pages in a single get.
HBase store files record the timestamp range of their contained records. If your request is limited to values less than one hour old, HBase can ignore all store files older than that.
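You opt into that pruning simply by putting a time range on the request; a short sketch:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.util.Bytes;

public class RecentValuesOnly {
  /** Ask only for cells written in the last hour; older store files can be skipped entirely. */
  public static Get lastHour(String rowKey) throws IOException {
    long now = System.currentTimeMillis();
    Get get = new Get(Bytes.toBytes(rowKey));
    get.setTimeRange(now - 3600 * 1000L, now);   // [min, max) in epoch milliseconds
    return get;
  }
}
```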
It’s often best to store URLs in "domain-reversed" form, where the hostname segments are placed in reverse order: eg "org.apache.hbase/book.html" for "hbase.apache.org/book.html". The domain-reversed URL orders pages served from different hosts within the same organization ("org.apache.hbase" and "org.apache.kafka" and so forth) adjacently.
To get a picture of inbound traffic
One of the elephants recounts this tale:

In my land it’s essential that every person’s prayer be recorded. One way is to have diligent monks add a grain of rice to a bowl on each event, then in a daily ritual recount them from beginning to end. Another is to have an enlightened Bodhisattva hold the single running value in mindfulness. You and I might instead use a threadsafe [UUID](http://en.wikipedia.org/wiki/Universally_unique_identifier) library to create a guaranteed-unique ID.

However, neither grains of rice nor time-based UUIDs can easily be put in time order. Since monks may neither converse (it’s incommensurate with mindfulness) nor own fancy wristwatches (vow of poverty and all that), a strict ordering is impossible. Instead, a monk writes on each grain of rice the date and hour, his name, and the index of that grain of rice this hour. You can read a great writeup of distributed UUID generation in Boundary’s [Flake project announcement](http://boundary.com/blog/2012/01/12/flake-a-decentralized-k-ordered-unique-id-generator-in-erlang/) (see also Twitter’s [Snowflake](https://github.com/twitter/snowflake)). You can also "block grant" counters: a central server gives me a lease on a block of IDs to hand out. And HBase actually provides atomic counters, as we’ll see below.
Expect on the order of 1 million counter updates per second on 100 nodes (10k ops per node). Use a different column family for month, day, hour, etc (each with a different TTL) for rolling increments. For more on counters and TTLs, see http://grokbase.com/t/hbase/user/119x0yjg9b/settimerange-for-hbase-increment
Second, for each visitor we want to keep a live count of the times they’ve viewed each distinct URL. In principle you could use the cookie_url table, but maintaining a consistent count is harder than it looks: for example, it does not work to read a value from the database, add one to it, and write the new value back. Some other client may be busy doing the same, and so one of the counts will be off. Without native support for counters, this simple process requires locking, retries, or other complicated machinery.
HBase offers atomic counters: a single incr command that adds or subtracts a given value, responding with the new value. From the client’s perspective it’s done in a single action (hence, "atomic") with guaranteed consistency. That makes the visitor-URL tracking trivial. Build a table called cookie_url_count, with a column family u. On each page view:

- Increment the counter for that URL: count = incr(table: "cookie_url_count", row: cookie, col: "u:#{url}"). The return value of the call has the updated count. You don’t have to initialize the cell; if it was missing, HBase will treat it as having had a count of zero. (See the Java sketch below.)
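With the Java client the same call looks like this; a sketch, reusing the cookie_url_count table and 'u' family named above.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class UrlViewCounter {
  /** Atomically bump the per-visitor counter for this URL and return the new count. */
  public static long countPageview(Connection conn, String cookie, String url) throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("cookie_url_count"))) {
      // Missing cells are treated as zero, so no initialization step is needed
      return table.incrementColumnValue(
          Bytes.toBytes(cookie), Bytes.toBytes("u"), Bytes.toBytes(url), 1L);
    }
  }
}
```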
We’d also like to track, for each visitor, the most frequent ("top-k") URLs they visit. This might sound like the previous table, but it’s very different — locality issues typically make such queries impractical. In the previous table, all the information we need (visitor, url, increment) to read or write is close at hand. But you can’t query that table by "most viewed" without doing a full scan; HBase doesn’t and won’t directly support requests indexed by value. You might also think "I’ll keep a top-k leaderboard, and update it if the currently-viewed URL is on it" — but this exposes the consistency problem you were just warned about above.
There is, however, a filthy hack that will let you track the single most frequent element, by abusing HBase’s timestamp feature. Create a table cookie_stats with a column family c having VERSIONS: 1. Then on each pageview:
- As before, increment the counter for that URL: count = incr(table: "cookie_url_count", row: cookie, col: "u:#{url}"). The return value of the call has the updated count.
- Store the URL in the cookie_stats table, but use a timestamp equal to that URL’s count — not the current time — in your request: put("cookie_stats", row: cookie, col: "c", timestamp: count, value: url).
To find the value of the most-frequent URL for a given cookie, do a get(table: "cookie_stats", row: cookie, col: 'c'). HBase will return the "most recent" value, namely the one with the highest timestamp, which means the value with the highest count. Although we’re constantly writing in values with lower "timestamps" (counts), HBase ignores them on queries and eventually compacts them away.
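Put together, the whole filthy hack is only a few lines. A sketch with the HBase 1.x-era client, reusing the cookie_url_count and cookie_stats tables described above; the empty qualifier is an arbitrary choice.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class TopUrlTracker {
  static final byte[] COUNTS = Bytes.toBytes("u");   // family in cookie_url_count
  static final byte[] STATS  = Bytes.toBytes("c");   // family in cookie_stats, VERSIONS: 1

  /** On each pageview: bump the counter, then write the URL with timestamp = count. */
  public static void recordPageview(Connection conn, String cookie, String url) throws IOException {
    try (Table counts = conn.getTable(TableName.valueOf("cookie_url_count"));
         Table stats  = conn.getTable(TableName.valueOf("cookie_stats"))) {
      long count = counts.incrementColumnValue(Bytes.toBytes(cookie), COUNTS, Bytes.toBytes(url), 1L);
      Put put = new Put(Bytes.toBytes(cookie));
      put.addColumn(STATS, Bytes.toBytes(""), count, Bytes.toBytes(url));  // timestamp abused as count
      stats.put(put);
    }
  }

  /** The "most recent" version is the one with the highest count: the most-viewed URL. */
  public static String mostViewedUrl(Connection conn, String cookie) throws IOException {
    try (Table stats = conn.getTable(TableName.valueOf("cookie_stats"))) {
      Result r = stats.get(new Get(Bytes.toBytes(cookie)).addColumn(STATS, Bytes.toBytes("")));
      return r.isEmpty() ? null : Bytes.toString(r.getValue(STATS, Bytes.toBytes("")));
    }
  }
}
```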
For this hack to work, the value must be forever monotonically increasing (that is, it can never decrease). The value "total lifetime pageviews" can only go up; "pageviews in the last 30 days" will go up or down over time, so the trick doesn’t apply to it.
These high-volume tables consume significant space and memory; it might make sense to discard data older than say 60 days. HBase lets you set a "TTL" (time-to-live) on any column family; records whose timestamp is farther in the past than that TTL won’t be returned in gets or scans, and they’ll be removed at the next compaction (TODO: major or minor?) [4].
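A TTL is another per-family setting; here is a sketch of declaring a 60-day TTL at table-creation time (the table and family names are placeholders).

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

public class SixtyDayTable {
  public static void create(Connection conn) throws IOException {
    HColumnDescriptor fam = new HColumnDescriptor("v");
    fam.setTimeToLive(60 * 24 * 60 * 60);      // TTL in seconds: records expire after 60 days
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("visits"));
    desc.addFamily(fam);
    try (Admin admin = conn.getAdmin()) {
      admin.createTable(desc);
    }
  }
}
```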
- Besides the pedestrian janitorial work of keeping table sizes in check, TTLs are another feature to joyfully abuse. Describe how you would use TTLs to track time-based rolling aggregates, like "average air-speed velocity over last 10 minutes".
table | row key | family | qualifier | value | options
visits | | 'r' (referer) | | - |
visits | | 's' (search) | | - |
visits | | 'p' (product) | | - |
visits | | 'z' (checkout) | | |
cookie_urls | | 'u' (url) | | |
ip_tbs | | | | |
An increasing number of websites personalize content for each reader. Retailers find that even something as simple as saying "Free Shipping" or "No Sales Tax" (each true only for people in certain geographic areas) dramatically increases sales. HBase’s speed and simplicity shine for a high-stakes, low-latency task like estimating the geographic location of a visitor based on their IP address.
If you recall from (TODO ref server logs chapter), the Geo-IP dataset stores information about IP addresses a block at a time.
- Fields: IP address, ISP, latitude, longitude, quadkey
- Query: given an IP address, retrieve geolocation and metadata with very low latency
table | row key | column families | column qualifiers | versions | value
ip | | field name | | none |
Store the upper range of each IP address block in hexadecimal as the row key. To look up an IP address, do a scan query, max 1 result, on the range from the given ip_address to a value larger than the largest 32-bit IP address. A get is simply a scan-with-equality-max-1, so there’s no loss of efficiency here.
Since row keys are sorted, the first value equal-or-larger than your key is the end of the block it lies on. For example, say we had block "A" covering 50.60.a0.00 to 50.60.a1.08, "B" covering 50.60.a1.09 to 50.60.a1.d0, and "C" covering 50.60.a1.d1 to 50.60.a1.ff. We would store 50.60.a1.08 ⇒ {…A…}, 50.60.a1.d0 ⇒ {…B…}, and 50.60.a1.ff ⇒ {…C…}. Looking up 50.60.a1.09 would get block B, because 50.60.a1.d0 is lexicographically after it. So would 50.60.a1.d0; range queries are inclusive on the lower and exclusive on the upper bound, so the row key for block B matches as it should.
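In code, the lookup is a scan capped at one result; a sketch with the HBase 1.x-era client, assuming the table is named ip and the keys are the dotted-hex form described above.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class GeoIpLookup {
  /** Find the block containing this IP: the first row key equal to or larger than it. */
  public static Result blockFor(Connection conn, String hexIp) throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("ip"))) {
      // "g" sorts after "ff.ff.ff.ff", the largest possible 32-bit address key
      Scan scan = new Scan(Bytes.toBytes(hexIp), Bytes.toBytes("g"));
      scan.setCaching(1);                 // we only want the first matching row
      try (ResultScanner results = table.getScanner(scan)) {
        return results.next();            // the enclosing block's metadata, or null
      }
    }
  }
}
```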
As for column keys, it’s a tossup based on your access pattern. If you always request full rows, store a single value holding the serialized IP block metadata. If you often want only a subset of fields, store each field into its own column.
table | row key | family | qualifier | value |
articles | | | text | |
article_versions | | | text | timestamp: updated_time |
article_revisions | | | text, user_id, comment | |
categories | | | | |
redirects | | | | |
Just as we saw with Hadoop, there are two sound choices for storing a graph: as an edge list of (from, into) pairs, or as an adjacency list of all into nodes for each from node.
table | row key | column families | column qualifiers | value | options
page_page | | | (none) | (none) |
page_links | | | | (none) |
page_links_ro | | | | |
If we were serving a live wikipedia site, every time a page was updated I’d calculate its adjacency list and store it as a static, serialized value.
For a general graph in HBase, here are some tradeoffs to consider:
- The pagelink graph never has more than a few hundred links for each page, so there are no concerns about having too many columns per row. On the other hand, there are many celebrities on the Twitter "follower" graph with millions of followers or followees. You can shard those cases across multiple rows, or use an edge list instead.
- An edge list gives you fast "are these two nodes connected" lookups, using the bloom filter on misses and read cache for frequent hits.
- If the graph is read-only (eg a product-product similarity graph prepared from server logs), it may make sense to serialize the adjacency list for each node into a single cell. You could also run a regular map/reduce job to roll up the adjacency list into its own column family, and store deltas to that list between rollups.
- I’ve drawn heavily on the wisdom of the HBase Book
- Thanks to Lars George for many of these design guidelines, and the "Design for Reads" motto.
- HBase Advanced Schema Design by Lars George
- http://www.quora.com/What-are-the-best-tutorials-on-HBase-schema
- encoding numbers for lexicographic sorting:
  - an insane but interesting scheme: http://www.zanopha.com/docs/elen.pdf
  - a Java library for wire-efficient encoding of many datatypes: https://github.com/mrflip/orderly