CSE 350 - Data Sketching

March 25, 2021

Garcia-Molina/Ullman/Widom: (readings only)
  • SELECT COUNT(DISTINCT A) FROM R
  • SELECT A, COUNT(*) FROM R GROUP BY A
  • SELECT A, COUNT(*) ... ORDER BY COUNT(*) DESC LIMIT 10

These are all "Holistic" aggregates ($O(|A|)$ memory). What happens when you run out of memory?
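For reference, the exact versions of the three queries can be sketched in Python as below. The list `values` is made-up illustrative data standing in for column $A$ of $R$ (an assumption, not part of the slides). Each structure keeps one entry per distinct value, which is where the $O(|A|)$ memory goes.

```python
from collections import Counter

# Hypothetical values for column A (illustrative data only).
values = [3, 5, 4, 4, 2, 4, 3]

# SELECT COUNT(DISTINCT A) FROM R
# The set holds every distinct value seen so far.
distinct_count = len(set(values))

# SELECT A, COUNT(*) FROM R GROUP BY A
# The Counter holds one counter per distinct value.
group_counts = Counter(values)

# SELECT A, COUNT(*) ... ORDER BY COUNT(*) DESC LIMIT 10
# Exact top-k still needs all of the per-value counters first.
top_10 = group_counts.most_common(10)

print(distinct_count)
print(dict(group_counts))
print(top_10)
```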

Sketching: Hash function tricks used to estimate useful statistical properties.

  • Flajolet-Martin Sketches (HyperLogLog): Estimating Count-Distinct
  • Count Sketches: Estimating Count-GroupBy
  • Count-Min Sketches: Estimating Count-GroupBy-TopK

Count-Distinct

$3$ $5$ $4$ $4$ $2$ $4$ $3$ $\ldots$
$3$ $5$ $4$ $2$ $\ldots$

Challenge: To avoid double counting, we need to track which values of $A$ we've seen. $O(|A|)$ memory required.

A brief digression

The Coin Flip Game

Start with 0 points and flip a coin.
  • Tails (🐕): Get a point and flip again.
  • Heads (👽): Game over.

| Flips | Score |
|---|---|
| (👽) | 0 |
| (🐕) (👽) | 1 |
| (🐕) (🐕) (🐕) (🐕) (🐕) (👽) | 5 |

| Flips | Score | Probability | E[# Games] |
|---|---|---|---|
| (👽) | 0 | 0.5 | 2 |
| (🐕)(👽) | 1 | 0.25 | 4 |
| (🐕)(🐕)(👽) | 2 | 0.125 | 8 |
| (🐕)$\times N$ (👽) | $N$ | $\frac{1}{2^{N+1}}$ | $2^{N+1}$ |

If I told you that in a series of games, my best score was $N$, you might expect that I played $2^{N+1}$ games.

To do that, I only need to track my top score!
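The game is easy to simulate, which may help make the "track only the top score" idea concrete. This is a minimal sketch: the function names, the seed, and the number of games are all arbitrary choices made for illustration.

```python
import random

def play_game(rng: random.Random) -> int:
    """Play one game: one point per tails, stop at the first heads."""
    score = 0
    while rng.random() < 0.5:  # tails with probability 1/2
        score += 1
    return score

def estimate_games(best_score: int) -> int:
    """Guess how many games were played from the best score alone."""
    return 2 ** (best_score + 1)

rng = random.Random(350)  # arbitrary seed so the run is repeatable
games_played = 1000

# Only the running maximum is kept; no per-game history is stored.
best = 0
for _ in range(games_played):
    best = max(best, play_game(rng))

print(f"best score {best}: estimated {estimate_games(best)} games, "
      f"actually played {games_played}")
```

A single run can be off by a large factor, since one lucky streak doubles the estimate for every extra tails.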

Idea: Simulate coin flips with a hash function

... take the index of the lowest-order nonzero bit

| Object | Hash Bits | Score |
|---|---|---|
| $O_1$ | 01011011 | 0 |
| $O_2$ | 00110111 | 0 |
| $O_3$ | 00111000 | 3 |
| $O_4$ | 10010010 | 1 |
| $O_3$ | 00111000 | 3 |

Top score: 3

Estimate: $2^{3+1} = 16$

Duplicates can't raise the top score!
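The scoring rule can be checked against the hash bits listed in the table. The sketch below hard-codes those bit patterns instead of calling a real hash function, and the helper name `score` is chosen here for illustration.

```python
def score(hash_value: int) -> int:
    """Index of the lowest-order nonzero bit (number of trailing zeros)."""
    if hash_value == 0:
        raise ValueError("an all-zero hash has no nonzero bit")
    # x & -x isolates the lowest set bit; bit_length() - 1 gives its index.
    return (hash_value & -hash_value).bit_length() - 1

# Hash bits copied from the table; the stream sees O_3 twice.
hash_bits = {
    "O_1": 0b01011011,
    "O_2": 0b00110111,
    "O_3": 0b00111000,
    "O_4": 0b10010010,
}
stream = ["O_1", "O_2", "O_3", "O_4", "O_3"]

top_score = 0
for obj in stream:
    # A duplicate hashes to the same bits, so it can never raise the maximum.
    top_score = max(top_score, score(hash_bits[obj]))

estimate = 2 ** (top_score + 1)
print(top_score, estimate)  # 3 16
```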

Problem: Noisy estimate!

Idea 1: Instead of your top score, track the lowest score you have not gotten yet ($R$).

| Object | Hash Bits | Score |
|---|---|---|
| $O_1$ | 01011011 | 0 |
| $O_2$ | 00110111 | 0 |
| $O_3$ | 00111000 | 3 |
| $O_4$ | 10010010 | 1 |
| $O_3$ | 00111000 | 3 |

Scores seen: {0, 1, 3}

$R = 2$

Estimate: $\frac{2^R}{\phi} = \frac{2^{2}}{0.77351} \approx 5.2$
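The arithmetic for this variant is short enough to check directly. In the sketch below, `lowest_missing` is a helper name introduced for illustration, and the set of scores is the one from the example.

```python
PHI = 0.77351  # correction factor quoted in the estimate above

def lowest_missing(scores: set) -> int:
    """Smallest non-negative integer that is not in `scores`."""
    r = 0
    while r in scores:
        r += 1
    return r

scores_seen = {0, 1, 3}
R = lowest_missing(scores_seen)
estimate = 2 ** R / PHI

print(R, round(estimate, 1))  # 2 5.2
```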

Idea 2: Compute several estimates in parallel and average estimates.

Flajolet-Martin Sketches

($\approx$ HyperLogLog)

  1. For each record...
    1. Hash the record
    2. Find the index of the lowest-order nonzero bit
    3. Add the index of the bit to a set
  2. Find $R$, the lowest index not in the set
  3. Estimate Count-Distinct as $\frac{2^R}{\phi}$ ($\phi \approx 0.77351$)
  4. Repeat (in parallel) with independent hash functions and average the estimates
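The steps above can be put together roughly as follows. This is a sketch under several assumptions that are not in the slides: salted SHA-256 stands in for a family of independent hash functions, the class and helper names are invented here, each trial's set is stored as an integer bitmap, and the trials are combined by averaging $R$ before applying $\frac{2^R}{\phi}$ (averaging the per-trial estimates directly is another reasonable choice).

```python
import hashlib

PHI = 0.77351
HASH_BITS = 64

def hash_record(record, trial: int) -> int:
    """64-bit hash of a record; `trial` salts the hash so trials differ."""
    digest = hashlib.sha256(f"{trial}:{record}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def lowest_set_bit(x: int) -> int:
    """Index of the lowest-order nonzero bit; HASH_BITS if x is zero."""
    return (x & -x).bit_length() - 1 if x else HASH_BITS

def lowest_zero_bit(bitmap: int) -> int:
    """R: the lowest index whose bit is not set in the bitmap."""
    r = 0
    while (bitmap >> r) & 1:
        r += 1
    return r

class FlajoletMartin:
    def __init__(self, trials: int = 32):
        # One bitmap per trial; bit i is set once score i has been seen.
        self.bitmaps = [0] * trials

    def add(self, record) -> None:
        for t in range(len(self.bitmaps)):
            self.bitmaps[t] |= 1 << lowest_set_bit(hash_record(record, t))

    def estimate(self) -> float:
        rs = [lowest_zero_bit(bitmap) for bitmap in self.bitmaps]
        mean_r = sum(rs) / len(rs)
        return 2 ** mean_r / PHI

fm = FlajoletMartin(trials=32)
for i in range(1000):
    fm.add(f"user-{i}")
first_pass = fm.estimate()

# Seeing every record a second time leaves the sketch unchanged.
for i in range(1000):
    fm.add(f"user-{i}")
second_pass = fm.estimate()

print(round(first_pass), round(second_pass))
```

Memory is one small bitmap per trial regardless of how many records are added, which is the point of the sketch.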

Group-By Count

Problem: Need a counter for each individual A

Idea: Keep only one counter!

No... seriously

$$\delta(O_i) = \begin{cases} -1 & \textbf{if } h(O_i) \equiv 0 \pmod 2 \\ +1 & \textbf{if } h(O_i) \equiv 1 \pmod 2 \end{cases}$$
$$\sum_i \delta(O_i)$$

| Object | $\delta(O_i)$ | Running Count |
|---|---|---|
| $O_3$ | -1 | -1 |
| $O_1$ | +1 | 0 |
| $O_4$ | -1 | -1 |
| $O_2$ | +1 | 0 |
| $O_4$ | -1 | -1 |
| $O_1$ | +1 | 0 |
| $O_3$ | -1 | -1 |
| $O_3$ | -1 | -2 |
| $O_1$ | +1 | -1 |

$$Total = \texttt{COUNT_OF}(O_i) \cdot \delta(O_i) + \sum_{j \neq i}\texttt{COUNT_OF}(O_j) \cdot \delta(O_j)$$

$$E\left[\sum_{j \neq i}\texttt{COUNT_OF}(O_j) \cdot \delta(O_j)\right] = \frac{1}{2}\sum_{j \neq i} \texttt{COUNT_OF}(O_j) - \frac{1}{2}\sum_{j \neq i} \texttt{COUNT_OF}(O_j) = 0$$

$$Total \approx \texttt{COUNT_OF}(O_i) \cdot \delta(O_i) + 0$$
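One step is implicit in turning this into an estimate for a single object: because $\delta(O_i)$ is always $+1$ or $-1$, $\delta(O_i)^2 = 1$, so multiplying the total by $\delta(O_i)$ isolates the count.

$$\delta(O_i) \cdot Total \approx \texttt{COUNT_OF}(O_i) \cdot \delta(O_i)^2 = \texttt{COUNT_OF}(O_i)$$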

Running total was $-1$

| Object | $\delta(O_i)$ | Estimate |
|---|---|---|
| $O_1$ | +1 | -1 |
| $O_2$ | +1 | -1 |
| $O_3$ | -1 | +1 |
| $O_4$ | -1 | +1 |
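The single-counter example can be replayed in a few lines. The signs and the stream below are copied from the tables; the dictionary `delta` stands in for a sign derived from a real hash function, which is an assumption of this sketch.

```python
# Sign assignments from the example (normally derived from h(O_i) mod 2).
delta = {"O_1": +1, "O_2": +1, "O_3": -1, "O_4": -1}
stream = ["O_3", "O_1", "O_4", "O_2", "O_4", "O_1", "O_3", "O_3", "O_1"]

# One counter for the whole stream.
total = 0
for obj in stream:
    total += delta[obj]

# Estimate for each object: total * delta(object).
estimates = {obj: total * sign for obj, sign in delta.items()}
true_counts = {obj: stream.count(obj) for obj in delta}

print(total)
print(estimates)
print(true_counts)
```

Comparing `estimates` with `true_counts` shows how far off a single shared counter can be, including negative estimates.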

Not... so... great

Problem 1: All of the objects use the same counter (no way to differentiate an estimate for $O_1$ from $O_2$).

Problem 2: The estimate is really noisy

Idea 1: Multiple Buckets ($h(x)$ picks a bucket)

Idea 2: Multiple Trials ($h \rightarrow h_1, h_2, \ldots$; $\delta \rightarrow \delta_1, \delta_2, \ldots$)

| Object | $h_1(O_i)$ | $\delta_1(O_i)$ | $h_2(O_i)$ | $\delta_2(O_i)$ |
|---|---|---|---|---|
| $O_1$ | Bucket 1 | -1 | Bucket 2 | 1 |
| $O_2$ | Bucket 1 | -1 | Bucket 1 | -1 |
| $O_3$ | Bucket 2 | 1 | Bucket 1 | -1 |
| $O_4$ | Bucket 1 | -1 | Bucket 1 | 1 |

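Objects Seen: (none)

| | Bucket 1 | Bucket 2 |
|---|---|---|
| Trial 1 | 0 | 0 |
| Trial 2 | 0 | 0 |

| Object | Trial 1 | Trial 2 | Estimate | Real |
|---|---|---|---|---|
| $O_1$ | 0 | 0 | 0.0 | 0 |
| $O_2$ | 0 | 0 | 0.0 | 0 |
| $O_3$ | 0 | 0 | 0.0 | 0 |
| $O_4$ | 0 | 0 | 0.0 | 0 |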

Objects Seen: $O_2$

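| | Bucket 1 | Bucket 2 |
|---|---|---|
| Trial 1 | -1 | 0 |
| Trial 2 | -1 | 0 |

| Object | Trial 1 | Trial 2 | Estimate | Real |
|---|---|---|---|---|
| $O_1$ | 1 | 0 | 0.5 | 0 |
| $O_2$ | 1 | 1 | 1.0 | 1 |
| $O_3$ | 0 | 1 | 0.5 | 0 |
| $O_4$ | 1 | -1 | 0.0 | 0 |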

Objects Seen: $O_2,O_1$

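| | Bucket 1 | Bucket 2 |
|---|---|---|
| Trial 1 | -2 | 0 |
| Trial 2 | -1 | 1 |

| Object | Trial 1 | Trial 2 | Estimate | Real |
|---|---|---|---|---|
| $O_1$ | 2 | 1 | 1.5 | 1 |
| $O_2$ | 2 | 1 | 1.5 | 1 |
| $O_3$ | 0 | 1 | 0.5 | 0 |
| $O_4$ | 2 | -1 | 0.5 | 0 |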

Objects Seen: $O_2,O_1,O_4$

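| | Bucket 1 | Bucket 2 |
|---|---|---|
| Trial 1 | -3 | 0 |
| Trial 2 | 0 | 1 |

| Object | Trial 1 | Trial 2 | Estimate | Real |
|---|---|---|---|---|
| $O_1$ | 3 | 1 | 2.0 | 1 |
| $O_2$ | 3 | 0 | 1.5 | 1 |
| $O_3$ | 0 | 0 | 0.0 | 0 |
| $O_4$ | 3 | 0 | 1.5 | 1 |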

Objects Seen: $O_2,O_1,O_4,O_1$

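| | Bucket 1 | Bucket 2 |
|---|---|---|
| Trial 1 | -4 | 0 |
| Trial 2 | 0 | 2 |

| Object | Trial 1 | Trial 2 | Estimate | Real |
|---|---|---|---|---|
| $O_1$ | 4 | 2 | 3.0 | 2 |
| $O_2$ | 4 | 0 | 2.0 | 1 |
| $O_3$ | 0 | 0 | 0.0 | 0 |
| $O_4$ | 4 | 0 | 2.0 | 1 |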

Objects Seen: $O_2,O_1,O_4,O_1,O_2$

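| | Bucket 1 | Bucket 2 |
|---|---|---|
| Trial 1 | -5 | 0 |
| Trial 2 | -1 | 2 |

| Object | Trial 1 | Trial 2 | Estimate | Real |
|---|---|---|---|---|
| $O_1$ | 5 | 2 | 3.5 | 2 |
| $O_2$ | 5 | 1 | 3.0 | 2 |
| $O_3$ | 0 | 1 | 0.5 | 0 |
| $O_4$ | 5 | -1 | 2.0 | 1 |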

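In practice, combine trials with the median rather than the mean (the Estimate column above is the mean of the two trials); the median is less sensitive to a single trial with a bad collision.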
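The walkthrough above can be reproduced with a small Count Sketch class. This is a sketch under assumptions: the class and method names are invented here, the lookup tables `H` and `D` stand in for real hash functions $h_t$ and $\delta_t$ (values copied from the example, with buckets 0-indexed), and the combining function is left as a parameter so either the mean or the median can be used.

```python
from statistics import mean, median

class CountSketch:
    def __init__(self, bucket_fns, sign_fns, num_buckets):
        # bucket_fns[t](obj) -> bucket index; sign_fns[t](obj) -> +1 or -1
        self.bucket_fns = bucket_fns
        self.sign_fns = sign_fns
        self.counters = [[0] * num_buckets for _ in bucket_fns]

    def add(self, obj):
        for t, (h, d) in enumerate(zip(self.bucket_fns, self.sign_fns)):
            self.counters[t][h(obj)] += d(obj)

    def trial_estimates(self, obj):
        # Per trial: the object's bucket value times the object's sign.
        return [self.counters[t][h(obj)] * d(obj)
                for t, (h, d) in enumerate(zip(self.bucket_fns, self.sign_fns))]

    def estimate(self, obj, combine=median):
        return combine(self.trial_estimates(obj))

# Stand-ins for h_1, h_2 (Bucket 1 -> 0, Bucket 2 -> 1) and delta_1, delta_2.
H = [{"O_1": 0, "O_2": 0, "O_3": 1, "O_4": 0},
     {"O_1": 1, "O_2": 0, "O_3": 0, "O_4": 0}]
D = [{"O_1": -1, "O_2": -1, "O_3": +1, "O_4": -1},
     {"O_1": +1, "O_2": -1, "O_3": -1, "O_4": +1}]

sketch = CountSketch(
    bucket_fns=[H[0].__getitem__, H[1].__getitem__],
    sign_fns=[D[0].__getitem__, D[1].__getitem__],
    num_buckets=2,
)
for obj in ["O_2", "O_1", "O_4", "O_1", "O_2"]:
    sketch.add(obj)

print(sketch.counters)
for obj in ["O_1", "O_2", "O_3", "O_4"]:
    print(obj, sketch.trial_estimates(obj), sketch.estimate(obj, combine=mean))
```

With only two trials the mean and the median coincide; the difference shows up once there are three or more trials.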

Top-K Group-By Count

Problem: "Heavy Hitters" overwhelm smaller counts

Idea: Give up. Drop $\delta$.

Count-Min Sketch

| Object | Appearances | $h_1(O_i)$ | $h_2(O_i)$ |
|---|---|---|---|
| $O_1$ | 10 | Bucket 1 | Bucket 2 |
| $O_2$ | 32 | Bucket 1 | Bucket 1 |
| $O_3$ | 1002 | Bucket 2 | Bucket 1 |
| $O_4$ | 500 | Bucket 1 | Bucket 1 |

| | Bucket 1 | Bucket 2 |
|---|---|---|
| Trial 1 | 542 | 1002 |
| Trial 2 | 1534 | 10 |

| Object | Appearances | Estimate 1 | Estimate 2 | Min |
|---|---|---|---|---|
| $O_1$ | 10 | 542 | 10 | 10 |
| $O_2$ | 32 | 542 | 1534 | 542 |
| $O_3$ | 1002 | 1002 | 1534 | 1002 |
| $O_4$ | 500 | 542 | 1534 | 542 |
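The Count-Min example can be reproduced the same way. As before this is only a sketch: the class name and the `count` parameter are choices made here, and the lookup tables in `H` stand in for real hash functions $h_1$ and $h_2$ (values copied from the example, with buckets 0-indexed).

```python
class CountMinSketch:
    def __init__(self, bucket_fns, num_buckets):
        # bucket_fns[t](obj) -> bucket index for trial t
        self.bucket_fns = bucket_fns
        self.counters = [[0] * num_buckets for _ in bucket_fns]

    def add(self, obj, count=1):
        # No sign: every appearance only ever increases a bucket.
        for t, h in enumerate(self.bucket_fns):
            self.counters[t][h(obj)] += count

    def estimate(self, obj):
        # Every bucket over-counts, so the smallest one is the closest.
        return min(self.counters[t][h(obj)]
                   for t, h in enumerate(self.bucket_fns))

# Stand-ins for h_1 and h_2 (Bucket 1 -> 0, Bucket 2 -> 1).
H = [{"O_1": 0, "O_2": 0, "O_3": 1, "O_4": 0},
     {"O_1": 1, "O_2": 0, "O_3": 0, "O_4": 0}]
appearances = {"O_1": 10, "O_2": 32, "O_3": 1002, "O_4": 500}

sketch = CountMinSketch([H[0].__getitem__, H[1].__getitem__], num_buckets=2)
for obj, n in appearances.items():
    sketch.add(obj, count=n)

estimates = {obj: sketch.estimate(obj) for obj in appearances}
heaviest = max(appearances, key=sketch.estimate)

print(sketch.counters)
print(estimates)
print(heaviest)
```

Because counts are never negative, an estimate can be too high but never too low, so the heaviest hitter ($O_3$ here) is not drowned out by collisions, while small objects such as $O_2$ can be badly over-estimated.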