Beyond Enumerable: Counting Distinct with HyperLogLog

How do you count distinct values when you can’t afford to remember them? Counting distinct requires remembering what you’ve already seen, and that memory grows with the data. A Set of a billion user IDs is gigabytes. The streaming tricks from the last post don’t help because there’s no way to combine “is this new?” into a running total without holding onto what you’ve already seen.

In the last post we made a trade to offload batches to files to avoid maxing out RAM, and in this post we’re going to make another trade which will make developers uncomfortable: we’re trading exactness for space savings at scale.

We called heads

Humor me for a moment. Say that you were flipping coins and someone happened to get four heads in a row, not exactly uncommon but it’s much less likely than say one or two heads in a row. How about twenty? Each flip is 50/50, so twenty in a row is ½ multiplied twenty times, which is 1 in 2**20, or one in about a million. That’s a couple of weeks of non-stop flipping, give or take. If someone managed to land that you’d think it’s either a rigged coin or they’ve been at it for a while. Either that or they’re really lucky, but hold onto that thought, we’ll get back to it.

There’s an interesting insight in there though: We know immediately how likely that is, without ever flipping those coins.

What does that have to do with this article though? Coin flips are, in essence, binary: two outcomes, heads or tails, ones or zeros. If you convert something into a stream of bits (where each position can only be 0 or 1) you have a collection of those exact coin flips. As it happens it’s also pretty rare to have a bunch of zeroes at the start of a binary string.

In programming we have hashing functions which can transform something into an evenly-spread series of ones and zeroes, where each bit is equally likely to be 0 or 1. Why not look at the data directly? Because real data isn’t evenly spread. User IDs are sequential, names cluster, timestamps are monotonic. Leading-zero runs on raw data would be biased. Hashing scrambles any input into uniform bits, which is the only reason the coin-flip math applies.

What’s a hash function? It takes any input (a string, a number, an object) and produces a fixed-size number. The same input always gives the same output, but different inputs give wildly different outputs with no discernible pattern. Think of it like a fingerprint: deterministic, fixed size, and scrambled enough that similar inputs don’t produce similar outputs.

module Hashing
  module_function

  # A stable 64-bit integer for any item.
  def to_64_bits(item)
    Digest::SHA256.hexdigest(item.to_s)[0, 16].to_i(16)
  end
end

We’re using SHA-256 truncated to 64 bits. The specific algorithm doesn’t matter, what matters is that its output bits are evenly spread. Each hex character packs 4 bits, so the first 16 hex characters give us 64 bits, sixty-four coin flips to judge rarity by.

Step one: a filling bitmap

What if we allocated a big chunk of bits, all starting at zero, and used each hashed value to pick one of them to flip on?

That’s a bitmap: a fixed-size array where each position is a single bit, either 0 (never seen) or 1 (something hashed here). It doesn’t store the items themselves, it stores evidence that something landed in this spot. Two items hashing to the same position flip the same bit, so we lose the ability to tell them apart, but we don’t care about identity. We care about how many distinct things have left marks.

The simplest version is a boolean array:

class LinearCountingString
  def initialize(bits: 1 << 20)
    # Picture a row of a million light switches, all starting OFF.
    @bit_count = bits
    @bitmap = Array.new(bits, false)
  end

  def add(item)
    # Hash the item to get a number.
    # Use that number to pick one switch and flip it ON.
    # If two different items pick the same switch, oh well,
    # they share it. We're counting approximately, not exactly.
    position = Hashing.to_64_bits(item) % @bit_count
    @bitmap[position] = true
  end

  def estimate
    # How many switches are still OFF?
    zeros = @bitmap.count(false)

    # If every switch is ON, we've run out of room to count.
    return @bit_count if zeros.zero?

    # Fewer OFF switches means more distinct items passed through.
    # This formula turns that ratio into an approximate count.
    (-@bit_count * Math.log(zeros.to_f / @bit_count)).round
  end
end

As more distinct items arrive, more positions flip to true. Duplicates land on positions that are already true and leave no trace. So the number of positions still false tells us how many distinct items have passed through.

How do we turn that into a number? Here’s the math:

def linear_counting_estimate(total_bits, zeros)
  # We have a bitmap with `total_bits` switches.
  # `zeros` of them are still OFF (never been hit).
  #
  # The ratio zeros/total_bits tells us "what fraction is still empty?"
  #   - If 90% are still OFF, not much has come through yet.
  #   - If only 1% are still OFF, a LOT of distinct items have passed through.
  #
  # Math.log (natural logarithm) converts that fraction into a rate.
  # Think of it as answering: "how many times did we have to throw darts
  # at this board to leave only this many empty spots?"
  #
  # The result of Math.log(zeros / total_bits) is always negative
  # (because zeros/total_bits is less than 1, and log of a fraction is negative).
  # Multiplying by -total_bits flips it positive and scales it up
  # to give us the estimated count.
  #
  # .to_f ensures we get decimal division, not integer division.
  # Without it, 900000 / 1000000 would give 0 instead of 0.9.

  -total_bits * Math.log(zeros.to_f / total_bits)
  # => approximate distinct count
end

That’s Linear Counting (Whang, Vander-Zanden, Taylor, 1990). A 128 KB map estimates fifty thousand distinct items to within a few percent, and a thousand copies of one value still read as one.

The problem happens when the entire thing fills and you end up dividing by zero, breaking the estimate. That’s called saturation, the point at which the structure can’t tell the difference between “a lot” and “a lot plus one.” To count billions of items you’d need a bitmap so large that it defeats the point, so we’re going to need something which doesn’t fill up.

Step two: the leading-zeros insight

If bitmaps count by how full it got, is there a different signal we could use? One that doesn’t fill?

Remember the coin flips? Each bit of a hash value is an independent flip. A run of leading zeros is like a run of consecutive heads, and the longer the run the rarer it is:

def leading_zeros_string_examples
  # Take a hash value and look at it as a string of zeros and ones.
  # Count how many zeros come before the first one.
  # That count is our "coin flip run length."

  bits_a = "10110100"  # starts with 1, so zero leading zeros
  bits_a.index("1")    # => 0

  bits_b = "00010110"  # three zeros before the first 1
  bits_b.index("1")    # => 3

  bits_c = "00000001"  # seven zeros before the first 1 (very rare!)
  bits_c.index("1")    # => 7
end

A hash value with no leading zeros (first bit is 1) shows up about half the time. Ten leading zeros shows up about once in 2**10 distinct values. Twenty shows up once in about a million.

Why does that matter? Flip the logic around: if the longest run you’ve ever seen happens about once in a million, you’ve probably had about a million chances at it. A maximum run of 30 means you’ve likely seen around 2**30 distinct values. Unlike the bitmap, this never saturates because longer runs keep appearing as more distinct values arrive. The problem is that a single really lucky streak can throw the whole guess off.

Step three: taming the noise

One unlucky (or really lucky, depending on your POV) series of flips and this entire idea goes out a window. Does that make this algorithm invalid? No, not really, we handle it much the same as we handle other noisy signals: We take a lot of them and average.

But what kind of average? A regular (arithmetic) average lets one outlier drag everything up, so we’re going to need something else. The harmonic mean weights toward smaller values, so a single large number barely registers:

def harmonic_mean(values)
  # Arithmetic mean: sum all values, divide by count.
  #   [1, 1, 1, 100] => (1+1+1+100) / 4 = 25.75
  #   One outlier dominates.
  #
  # Harmonic mean: count divided by the sum of reciprocals.
  #   A reciprocal is 1 divided by a number (the "flip" of it).
  #   [1, 1, 1, 100] => 4 / (1/1 + 1/1 + 1/1 + 1/100)
  #                    = 4 / (1 + 1 + 1 + 0.01)
  #                    = 4 / 3.01
  #                    ≈ 1.33
  #   The outlier barely registers.
  #
  # This is why HyperLogLog uses harmonic mean:
  #
  # A single measurement that got "lucky" (saw a long run of zeros)
  # can't drag the entire estimate up the way arithmetic would.

  values.size.to_f / values.sum { |value| 1.0 / value }
end

The code comments walk through a worked example, but the short version: if three buckets say “1” and one bucket says “100” because it got lucky, the arithmetic mean gives you 25.75 (dominated by the outlier). The harmonic mean gives you 1.33 (the lucky one barely moves the needle). That’s the property we want when one bucket’s measurement might be really danged lucky.

Step four: many registers, one estimate

We need many independent measurements from a single stream. Use part of the hash value to pick a bucket, and the rest to do the leading-zero measurement. Each bucket runs its own coin-flip experiment independently, so no single lucky hash value can corrupt more than one.

Here’s the whole thing, using string operations so every step is visible:

class HyperLogLogString
  HASH_BITS = 64

  def initialize(precision: 14)
    @precision = precision

    # How many buckets? 2^precision of them.
    # More buckets = more accurate = more memory.
    # 2^14 = 16,384 buckets. That's what Redis uses.
    @register_count = 2**precision

    # Each bucket remembers one number: the longest run of
    # leading zeros it has ever seen. Starts at 0 ("nothing yet").
    @registers = Array.new(@register_count, 0)
  end

  def add(item)
    # Step 1: Turn the hash value into a string of 64 zeros and ones.
    # "%064b" means: format as binary, padded to 64 characters.
    # Example result: "0010110100001011..." (always 64 chars)
    bits = format("%064b", Hashing.to_64_bits(item))

    # Step 2: The first few characters decide which bucket.
    # .to_i(2) reads the string as a base-2 (binary) number.
    # With precision 14, this gives a number from 0 to 16,383.
    # Think of it as: "which table does this person sit at?"
    index = bits[0, @precision].to_i(2)

    # Step 3: Everything after that is what we measure.
    rest = bits[@precision..]

    # Step 4: Count how many zeros appear before the first one.
    # This is our "coin flip run." More zeros = rarer = more distinct items.
    # .index("1") finds the position of the first "1" character.
    # If there's no "1" at all, every character is a zero (maximum rarity).
    zeros = rest.index("1") || rest.length

    # Step 5: Add 1 so that "no leading zeros" is 1, not 0.
    # That keeps 0 meaning "bucket has never been used."
    rank = zeros + 1

    # Step 6: Only keep it if it's bigger than what we had.
    # We want the rarest thing this bucket has ever seen.
    @registers[index] = [@registers[index], rank].max
  end

  def estimate
    # Combine all the buckets using harmonic mean.
    # This averages them in a way that ignores lucky outliers.
    harmonic = @registers.sum { |rank| 2.0**-rank }
    raw = alpha * @register_count * @register_count / harmonic
    empty = @registers.count(&:zero?)

    # If most buckets are still empty, use a simpler formula
    # that's more accurate at small counts.
    if raw <= 2.5 * @register_count && empty.positive?
      (@register_count * Math.log(@register_count.to_f / empty)).round
    else
      raw.round
    end
  end

  def merge(other)
    # To combine two counters: take the bigger number from each bucket.
    # If counter A saw a run of 5 in bucket 7, and counter B saw 9,
    # then across both of them, someone saw 9. Keep 9.
    raise ArgumentError, "precision mismatch" unless @precision == other.precision
    merged = HyperLogLogString.new(precision: @precision)
    new_regs = @registers.zip(other.registers).map { |a, b| [a, b].max }
    merged.instance_variable_set(:@registers, new_regs)
    merged
  end

  attr_reader :precision, :registers

  private

  # Without correction, the estimate runs about 40% too high.
  # This happens because some items inevitably land in the same bucket,
  # inflating the harmonic mean. Alpha scales the result back down.
  # The specific numbers (0.7213, 1.079) come from the paper's math
  # and work for any bucket count above 128.
  def alpha = 0.7213 / (1.0 + (1.079 / @register_count))
end

More buckets means less noise, same reason polling more people gives a better survey. With 16,384 buckets the error is about 0.8%, meaning if the true count is a million the estimate will land between roughly 992,000 and 1,008,000.

That’s HyperLogLog (Flajolet, Fusy, Gandouet, Meunier, 2007).

Watching it work

The algorithm is easier to believe when you can see it filling up. Here’s a tiny version with 16 buckets processing ten names:

def hyperloglog_trace
  # Small example: 16 buckets so we can see everything.
  hll = HyperLogLogString.new(precision: 4)
  names = %w[alice bob carol dave eve frank grace heidi ivan judy]

  names.each do |name|
    hll.add(name)

    # Show what happened for this name:
    # - Which bucket did it land in?
    # - How many leading zeros did it have (rank)?
    # - What do all 16 buckets look like now?
    bits = format("%064b", Hashing.to_64_bits(name))
    index = bits[0, 4].to_i(2)
    rest = bits[4..]
    zeros = rest.index("1") || rest.length
    rank = zeros + 1
    puts format("add %-6s -> bucket %2d, rank %2d   registers: %s",
      name, index, rank, hll.registers.inspect)
  end

  puts "\nestimate after #{names.size} distinct: #{hll.estimate}"
end
# Output:
#   add alice  -> bucket  2, rank  1   registers: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
#   add bob    -> bucket  8, rank  4   registers: [0, 0, 1, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0]
#   add carol  -> bucket  4, rank  1   registers: [0, 0, 1, 0, 1, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0]
#   add dave   -> bucket  6, rank  4   registers: [0, 0, 1, 0, 1, 0, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0]
#   add eve    -> bucket  8, rank  2   registers: [0, 0, 1, 0, 1, 0, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0]
#   add frank  -> bucket  7, rank  2   registers: [0, 0, 1, 0, 1, 0, 4, 2, 4, 0, 0, 0, 0, 0, 0, 0]
#   add grace  -> bucket 14, rank  8   registers: [0, 0, 1, 0, 1, 0, 4, 2, 4, 0, 0, 0, 0, 0, 8, 0]
#   add heidi  -> bucket  0, rank  2   registers: [2, 0, 1, 0, 1, 0, 4, 2, 4, 0, 0, 0, 0, 0, 8, 0]
#   add ivan   -> bucket 12, rank  1   registers: [2, 0, 1, 0, 1, 0, 4, 2, 4, 0, 0, 0, 1, 0, 8, 0]
#   add judy   -> bucket  7, rank  4   registers: [2, 0, 1, 0, 1, 0, 4, 4, 4, 0, 0, 0, 1, 0, 8, 0]
#
#   estimate after 10 distinct: 11

Notice how eve lands in bucket 8 with rank 2, but bucket 8 already has rank 4 from bob, so the register doesn’t change. That’s the “keep the max” rule doing its job.

The estimate after 10 distinct items is 11. Off by one. With only 16 buckets the noise is high, but bump to 16,384 and it tightens to under 1%.

Merge: combining counters across servers

Say you have ten servers, each counting distinct users independently. You want the total across all of them. With a Set you’d have to ship every set to one machine and union them. With HyperLogLog you ship a few kilobytes from each and combine in one pass.

How? Each bucket stores “the longest run I’ve ever seen.” To combine two counters, take the bigger number from each bucket.

def hyperloglog_merge_example
  a = HyperLogLogString.new(precision: 14)
  b = HyperLogLogString.new(precision: 14)

  5_000.times { |i| a.add("server_a_user_#{i}") }
  5_000.times { |i| b.add("server_b_user_#{i}") }

  merged = a.merge(b)
  merged.estimate
  # => ~10000 (combined distinct count from both servers)
end

Redis exposes this as PFMERGE. This is the property that makes HLL worth reaching for over a simple counter: you get composability for free.

Where you’ve already used this

If you use Redis, you’ve used this. PFADD adds an item, PFCOUNT returns the estimated distinct count, PFMERGE combines two counters. The “PF” stands for Philippe Flajolet, one of the paper’s authors.

Redis uses 16,384 buckets, 6 bits per bucket, twelve kilobytes total, 0.81% error. Same structure we built. Twelve kilobytes whether the count is a thousand or five billion.

Making it fast: the bit-packed version

For teaching purposes the string version is more accessible. The problem is that it’s slow. Converting to a 64-character string on every add is expensive in a hot loop. The production version does the same operations with bit shifts and masks.

If you’re comfortable with bit manipulation, here’s the optimized version. If not, skip this section. The algorithm is identical:

class HyperLogLog
  HASH_BITS = 64

  def initialize(precision: 14)
    @precision = precision
    @register_count = 1 << precision
    @registers = Array.new(@register_count, 0)
  end

  def add(item)
    hash = Hashing.to_64_bits(item)

    # Split the 64-bit hash value into two parts:
    #   [  14 bits  |          50 bits              ]
    #   [ register  |  remainder (count zeros here) ]
    #
    # Top 14 bits: which register (0..16383)
    register_index = hash >> (HASH_BITS - @precision)

    # Bottom 50 bits: the remainder we count leading zeros in.
    remainder = hash & ((1 << (HASH_BITS - @precision)) - 1)

    # rank = leading zeros + 1
    rank = (HASH_BITS - @precision) - remainder.bit_length + 1

    # Keep the largest rank this register has ever seen.
    @registers[register_index] = [@registers[register_index], rank].max
  end

  def estimate
    harmonic = @registers.sum { |rank| 2.0**-rank }
    raw = alpha * @register_count * @register_count / harmonic
    empty = @registers.count(&:zero?)

    if raw <= 2.5 * @register_count && empty.positive?
      (@register_count * Math.log(@register_count.to_f / empty)).round
    else
      raw.round
    end
  end

  def merge(other)
    raise ArgumentError, "precision mismatch" unless @precision == other.precision
    merged = HyperLogLog.new(precision: @precision)
    new_regs = @registers.zip(other.registers).map { |a, b| [a, b].max }
    merged.instance_variable_set(:@registers, new_regs)
    merged
  end

  attr_reader :precision, :registers

  private

  # Same bias correction as the string version (see comments there).
  def alpha = 0.7213 / (1.0 + (1.079 / @register_count))
end

Each operation maps to the string version:

hash >> (64 - precision) grabs which bucket
hash & ((1 << remaining) - 1) grabs the remainder
remainder.bit_length finds the first 1-bit

The bit-packed Linear Counting packs 64 switches per integer instead of using a boolean array:

class LinearCounting
  def initialize(bits: 1 << 20)
    @bit_count = bits
    @words = Array.new((bits + 63) / 64, 0)
  end

  def add(item)
    position = Hashing.to_64_bits(item) % @bit_count
    @words[position >> 6] |= (1 << (position & 63))
  end

  def estimate
    set_bits = @words.sum { |word| word.to_s(2).count("1") }
    zero_bits = @bit_count - set_bits
    return @bit_count if zero_bits.zero?
    (-@bit_count * Math.log(zero_bits.to_f / @bit_count)).round
  end
end

Both produce identical results to the string versions. The difference is speed. At precision 14 the implementation lands within 0.1% at a million distinct, 0.4% at 100k, and 0.65% at 5M, all inside the 0.81% error the math predicts.

Wrapping Up

Twelve kilobytes of memory, sub-1% error, works the same at a thousand items or five billion, and two counters merge by taking the max of each bucket. That’s the trade you’re making when you give up exactness: a structure that never fills up and never needs to remember what it’s seen.

Next up we apply the same ideas to a different question: not “how many distinct” but “have I seen this one before.” That’s Bloom filters.