This gem is a pure Ruby implementation of the HyperLogLog algorithm for estimating cardinalities of sets observed via a stream of events. A Redis instance is used for storing the counters. A minimal example:
require 'redis'
require 'hyperloglog-redis'
counter = HyperLogLog::Counter.new(Redis.new)
['john', 'paul', 'george', 'ringo', 'john', 'paul'].each do |beatle|
counter.add('beatles', beatle)
end
puts "There are approximately #{counter.count('beatles')} distinct Beatles"
Each HyperLogLog counter uses a small, fixed amount of space but can
estimate the cardinality of any set of up to around a billion values with
relative error of 1.04 / Math.sqrt(2 ** b) with high probability, where b is a
parameter passed to the HyperLogLog::Counter
initializer that defaults to 10.
With b = 10, each counter is represented by a 1 KB string in Redis and we get
an expected relative error of 3%. Contrast this with the amount of space needed
to compute set cardinality exactly, which is over 100 MB for a even a bit vector
representing a set with a billion values.
The basic idea of HyperLogLog (and its predecessors PCSA, LogLog, and others) is to apply a good hash function to each value observed in the stream and record the longest run of zeros seen as a prefix of any hashed value. If the hash function is good, the bits in any hashed value should be close to statistically independent, so seeing a value that starts with exactly X zeros should happen with probability close to 2 ** -(X + 1). So, if you've seen a run of 5 zeros in one of your hash values, you're likely to have around 2 ** 6 = 64 values in the underlying set. The actual implementation and analysis are much more advanced than this, but that's the idea.
This gem implements a few useful extensions to the basic HyperLogLog algorithm which allow you to estimate unions and intersections of counters as well as counts within specific time ranges. These extensions are described in detail below.
The HyperLogLog algorithm is described and analyzed in the paper "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm" by Flajolet, Fusy, Gandouet, and Meunier. Our implementation closely follows the program described in Section 4 of that paper.
You can also ask for an estimate of the union from multiple counters:
['joe', 'denny', 'linda', 'jimmy', 'paul'].each do |wings_member|
counter.add('wings', wings_member)
end
puts "There are approximately #{counter.union(['beatles', 'wings'])} people who were in the Beatles or Wings"
The same relative error guarantee above applies to unions: a union of
size N can be estimated to within +/- N * (1.04 / Math.sqrt(2 ** b)) elements,
regardless of how many HyperLogLog counters that union spans. You can store
a unioned counter for querying or combining later with union_store
:
counter.union_store('all_beatles_and_wings_members', ['beatles', 'wings'])
puts "There are approximately #{counter.count('all_beatles_and_wings_members'}} people who were in the Beatles or Wings"
Intersections can also be estimated:
puts "There are approximately #{counter.intersection(['beatles', 'wings'])} people who were in both the Beatles and Wings"
However, intersections of HyperLogLog counters are calculated indirectly via the inclusion/exclusion principle as a sum of unions and there aren't good theoretical bounds on the error of that sum. In practice, the estimates that come out of small intersections tend to follow the same relative error patterns, but beware using this type of estimation on intersections of large numbers of sets, both because the errors can be much larger than those guaranteed for unions and the complexity of computing intersections grows exponentially with the number of sets in the intersection.
All examples up until now use HyperLogLog::Counter
, which stores HyperLogLog
counters as (2 ** b)-byte Redis strings. hyperloglog-redis also contains the counter implementation
HyperLogLog::TimeSeriesCounter
, which uses a little more space (Redis strings of up to
4 * (32 - b) * (2 ** b) bytes) but allows you to estimate the cardinality of sets during
certain time windows.
Using HyperLogLog::TimeSeriesCounter
, you can get estimates of the number of distinct
elements added to a set in the past X seconds, for any value of X. A HyperLogLog::TimeSeriesCounter
is initialized with the same arguments as a regular Counter
but implements a
superset of HyperLogLog::Counter
's interface. Namely, each of the methods add
,
count
, union
, intersection
, and union_store
take an optional final time argument,
either a Ruby Time
or an integer representing seconds since the epoch.
When passed a time argument t, add
registers an addition to the set at time t. When no
time is passed, the current system time is used. The methods count
, union
,
intersection
, and union_store
all estimate set cardinality for the time interval
consisting of all events that happened after time t when t is passed as a final argument.
For example, to get the number of distinct user logins within the past week, we might call:
one_week = 60 * 60 * 24 * 7
logins_in_past_week = counter.count('user_logins', Time.now - one_week)
With a parameter b
in the range [4..16], HyperLogLog counters provide a relative
error of 1.04/sqrt(2 ** b) with high probability. When unions, intersections, and
time range queries are involved, it's sometimes not clear what the relative error
is relative to, so here is some clarification:
-
For a union of counters, the relative error applies to the size of the union. Taking the union of counters is lossless in the sense that you end up with the same counter you would have arrived at had you observed the union of all of the individual events.
-
For an intersection of counters, there's no good theoretical bound on the relative error. In practice, and especially for intersections involving a small number of sets, the relative error you obtain tends to be in relation to the size of the union of the sets involved. For example, if you have two sets, each of cardinality 5000 and observe both sets through HyperLogLog counters with parameter b=10 (3% relative error), you can expect the intersection estimate to be within 10000 * 0.03 = 300 of the actual intersection size.
-
For time queries, the relative error applies to the size of the set within the time range you've queried. For example, given a set of cardinality 1,000,000 that has had 100 distinct additions within the last 10 minutes, if you observe such a set with a HyperLogLog counter with parameter b=10 (3% relative error), you can expect the count returned from a query about the last 10 minutes to be within 3 of 100.
When trying to optimize for space, two well-known alternatives to HyperLogLog exist:
- Bit vectors: you provide some near-perfect hash function between keys in your domain and an interval of integers, then represent that interval of integers with bits.
- Bloom filters with counters: use a Bloom filter to keep track of items seen; on insert, when the Bloom filter tells you that the item seen is not in the set, increment the counter.
Both bit vectors and bloom filters can be augmented to hold timestamps for entries in the
data structures and simulate counters for time-ranges like HyperLogLog::TimeSeriesCounter
.
Bit vectors give exact counts, but the space complexity is linear with the size of the set, and you must either allocate a large bit vector upfront or cope with the complexity of dynamically resizing your bit vector as the set grows. Providing a manual mapping from members of your set to an interval of integers is sometimes a non-trivial task. Counts, unions, and intersections are all linear-time operations in the size of the universe of the set being represented.
Bloom filters can be much more compact than bit vectors, but the actual count associated with a Bloom filter is an artifact of the construction of the data structure, so the cost of estimating a union or intersection is linear in the size of the Bloom filter. Getting high probability guarantees on the quality of the estimate of Bloom filter counts requires several "good" hash functions that have some degree of independence from each other; in practice, coming up with several independent implementations of good hash functions is difficult. Bloom filters require that all of their space be allocated upfront (re-hashing isn't possible without replaying all events), so in practice you need some estimate of how large the counters are going to be before allocating the counter.
HyperLogLog counters take up less space than either of the above approaches and provide
constant-time implementations (in the size of the sets being represented) of unions,
intersections, and time range queries. A HyperLogLog::Counter
with parameter b will
be stored in a Redis string of length at most 2 ** b bytes, whereas a HyperLogLog::TimeSeriesCounter
with parameter
b will be stored in a Redis string of length at most 4 * (32 - b) * (2 ** b) bytes. For counters representing smaller sets,
the size taken up by a HyperLogLog::TimeSeriesCounter
can be significantly less. Here
are some examples for specific values of b:
- With b = 7, a
HyperLogLog::Counter
uses at most 128 bytes and aHyperLogLog::TimeSeriesCounter
uses at most 13 KB while providing a relative error of 9%. - With b = 11, a
HyperLogLog::Counter
uses at most 2 KB and aHyperLogLog::TimeSeriesCounter
uses at most 168 KB while providing a relative error of 2% - With b = 16, a
HyperLogLog::Counter
uses at most 64 KB and aHyperLogLog::TimeSeriesCounter
uses at most 4 MB while providing a relative error of less than half a percent.
gem install hyperloglog-redis