March 25, 2021
SELECT COUNT(DISTINCT A) FROM R
SELECT A, COUNT(*) FROM R GROUP BY A
SELECT A, COUNT(*) ... ORDER BY COUNT(*) DESC LIMIT 10
These are all "Holistic" aggregates ($O(|A|)$ memory). What happens when you run out of memory?
Sketching: Hash function tricks used to estimate useful statistical properties.
Challenge: To avoid double counting, we need to track which values of $A$ we've seen. $O(|A|)$ memory required.
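
For contrast, here is a minimal sketch of the exact approach in Python. The function name and the sample column are made up for illustration; the point is that the `seen` set holds one entry per distinct value, which is the $O(|A|)$ cost.

```python
def count_distinct_exact(values):
    """Exact COUNT(DISTINCT A): remembers every distinct value seen so far."""
    seen = set()          # grows to one entry per distinct value of A
    for v in values:
        seen.add(v)       # re-adding a duplicate is a no-op
    return len(seen)

column_a = ["x", "y", "x", "z", "y", "x"]   # hypothetical contents of R.A
print(count_distinct_exact(column_a))       # 3
```
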
A brief digression
Flips | Score |
---|---|
(👽) | 0 |
(🐕) (👽) | 1 |
(🐕) (🐕) (🐕) (🐕) (🐕) (👽) | 5 |
Flips | Score | Probability | E[# Games] |
---|---|---|---|
(👽) | 0 | 0.5 | 2 |
(🐕)(👽) | 1 | 0.25 | 4 |
(🐕)(🐕)(👽) | 2 | 0.125 | 8 |
(🐕)$\times N$ (👽) | $N$ | $\frac{1}{2^{N+1}}$ | $2^{N+1}$ |
If I told you that in a series of games, my best score was $N$, you might expect that I played $2^{N+1}$ games.
To make that guess, I only need to track my top score!
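
A rough simulation of the game, assuming a fair coin where 🐕 counts as heads and 👽 as tails. The helper names and the seed are illustrative choices, not part of the original notes.

```python
import random

def play_game(rng):
    """Flip a fair coin until the first tails; the score is the number of heads."""
    score = 0
    while rng.random() < 0.5:    # heads with probability 1/2
        score += 1
    return score

def estimate_games(best_score):
    """Guess how many games were played from the best score alone."""
    return 2 ** (best_score + 1)

rng = random.Random(0)           # fixed seed so the run is repeatable
best = max(play_game(rng) for _ in range(1000))
print(best, estimate_games(best))
```

Only `best` is carried from game to game, so the memory used does not depend on how many games are played.
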
Idea: Simulate coin flips with a hash function
... take the index of the lowest-order nonzero bit
Object | Hash Bits | Score |
---|---|---|
$O_1$ | 01011011 | 0 |
$O_2$ | 00110111 | 0 |
$O_3$ | 00111000 | 3 |
$O_4$ | 10010010 | 1 |
$O_3$ | 00111000 | 3 |
Top score | | 3 |
Estimate: $2^{3+1} = 16$
Duplicates can't raise the top score!
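
The table can be reproduced with a few lines of Python. This is a sketch that takes the hash bits from the table as given rather than computing a real hash; the `score` helper and the all-zero convention are assumptions.

```python
def score(hash_bits):
    """Index of the lowest-order nonzero bit (the number of trailing zeros)."""
    value = int(hash_bits, 2)
    if value == 0:
        return len(hash_bits)    # assumption: an all-zero hash scores the full width
    return (value & -value).bit_length() - 1

# Hash bits copied from the table; O3 appears twice.
stream = [("O1", "01011011"), ("O2", "00110111"), ("O3", "00111000"),
          ("O4", "10010010"), ("O3", "00111000")]

top_score = max(score(bits) for _, bits in stream)
estimate = 2 ** (top_score + 1)
print(top_score, estimate)       # 3 16
```
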
Problem: Noisy estimate!
Idea 1: Instead of your top score, track the lowest score you have not gotten yet ($R$).
Object | Hash Bits | Score |
---|---|---|
$O_1$ | 01011011 | 0 |
$O_2$ | 00110111 | 0 |
$O_3$ | 00111000 | 3 |
$O_4$ | 10010010 | 1 |
$O_3$ | 00111000 | 3 |
Scores seen: {0, 1, 3} | | $R = 2$ |
Estimate: $\frac{2^R}{\phi} = \frac{2^{2}}{0.77351} \approx 5.2$
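
A small sketch of Idea 1, assuming the seen scores are kept in an integer bitmap (one bit per possible score). The function name is illustrative.

```python
PHI = 0.77351    # correction factor used in the estimate above

def lowest_missing_score(scores):
    """Smallest non-negative integer that never appeared as a score."""
    bitmap = 0
    for s in scores:
        bitmap |= 1 << s         # remember that score s was seen
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r

scores = [0, 0, 3, 1, 3]         # Score column of the table
r = lowest_missing_score(scores)
estimate = 2 ** r / PHI
print(r, round(estimate, 1))     # 2 5.2
```
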
Idea 2: Compute several estimates in parallel and average them.
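
One way Idea 2 might look, assuming each parallel estimator gets its own hash function by salting SHA-256 with a seed. The seeding scheme, the 32-bit width, and the default of 16 estimators are all assumptions made for this sketch.

```python
import hashlib

PHI = 0.77351

def hash_bits(obj, seed):
    """Hypothetical seeded hash: the first 32 bits of SHA-256 over seed and object."""
    digest = hashlib.sha256(f"{seed}:{obj}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def trailing_zeros(value, width=32):
    """Index of the lowest-order nonzero bit, or the full width for zero."""
    return width if value == 0 else (value & -value).bit_length() - 1

def estimate_distinct(stream, num_estimators=16):
    bitmaps = [0] * num_estimators       # one bitmap of seen scores per estimator
    for obj in stream:
        for seed in range(num_estimators):
            bitmaps[seed] |= 1 << trailing_zeros(hash_bits(obj, seed))
    estimates = []
    for bitmap in bitmaps:
        r = 0
        while bitmap & (1 << r):         # lowest score not seen yet
            r += 1
        estimates.append(2 ** r / PHI)
    return sum(estimates) / len(estimates)

print(estimate_distinct(["a", "b", "c", "a", "c"]))
```

Because each bitmap only records which scores occurred, replaying duplicates leaves every estimator, and therefore the average, unchanged.
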
Problem: Exact per-group counts need a counter for each distinct value of $A$.
Idea: Keep only one counter!
No... seriously
Object | $\delta(O_i)$ | Running Count |
---|---|---|
$O_3$ | -1 | -1 |
$O_1$ | +1 | 0 |
$O_4$ | -1 | -1 |
$O_2$ | +1 | 0 |
$O_4$ | -1 | -1 |
$O_1$ | +1 | 0 |
$O_3$ | -1 | -1 |
$O_3$ | -1 | -2 |
$O_1$ | +1 | -1 |
$$Total = \texttt{COUNT_OF}(O_i) \cdot \delta(O_i) + \sum_{j \neq i}\texttt{COUNT_OF}(O_j) \cdot \delta(O_j)$$
Each $\delta(O_j)$ is $+1$ or $-1$ with equal probability, so the other objects cancel in expectation:
$$E\left[\sum_{j \neq i}\texttt{COUNT_OF}(O_j) \cdot \delta(O_j)\right] = \frac{1}{2}\sum_{j \neq i} \texttt{COUNT_OF}(O_j) - \frac{1}{2}\sum_{j \neq i} \texttt{COUNT_OF}(O_j) = 0$$
$$Total \approx \texttt{COUNT_OF}(O_i) \cdot \delta(O_i) + 0$$
The running total ended at $-1$, so each object's estimate is $-1 \cdot \delta(O_i)$:
Object | $\delta(O_i)$ | Estimate |
---|---|---|
$O_1$ | +1 | -1 |
$O_2$ | +1 | -1 |
$O_3$ | -1 | +1 |
$O_4$ | -1 | +1 |
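
The single-counter walkthrough can be replayed in Python. This sketch hard-codes the $\delta$ values from the table instead of deriving them from a hash function.

```python
DELTA = {"O1": +1, "O2": +1, "O3": -1, "O4": -1}    # delta(O_i) from the table
stream = ["O3", "O1", "O4", "O2", "O4", "O1", "O3", "O3", "O1"]

total = 0
for obj in stream:
    total += DELTA[obj]          # the only state kept between objects

def estimate(obj):
    """Estimated COUNT_OF(obj) = total * delta(obj)."""
    return total * DELTA[obj]

print(total)                                   # -1
print({obj: estimate(obj) for obj in DELTA})   # {'O1': -1, 'O2': -1, 'O3': 1, 'O4': 1}
```
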
Not... so... great
Problem 1: All of the objects use the same counter (no way to differentiate an estimate for $O_1$ from $O_2$).
Problem 2: The estimate is really noisy
Idea 1: Multiple Buckets ($h(x)$ picks a bucket)
Idea 2: Multiple Trials ($h \rightarrow h_1, h_2, \ldots$; $\delta \rightarrow \delta_1, \delta_2, \ldots$)
Object | $h_1(O_i)$ | $\delta_1(O_i)$ | $h_2(O_i)$ | $\delta_2(O_i)$ |
---|---|---|---|---|
$O_1$ | Bucket 1 | -1 | Bucket 2 | 1 |
$O_2$ | Bucket 1 | -1 | Bucket 1 | -1 |
$O_3$ | Bucket 2 | 1 | Bucket 1 | -1 |
$O_4$ | Bucket 1 | -1 | Bucket 1 | 1 |
Objects Seen: (none yet)
Trial | Bucket 1 | Bucket 2 |
---|---|---|
Trial 1 | 0 | 0 |
Trial 2 | 0 | 0 |
Object | Trial 1 | Trial 2 | Estimate | Real |
---|---|---|---|---|
$O_1$ | 0 | 0 | 0.0 | 0 |
$O_2$ | 0 | 0 | 0.0 | 0 |
$O_3$ | 0 | 0 | 0.0 | 0 |
$O_4$ | 0 | 0 | 0.0 | 0 |
Objects Seen: $O_2$
Trial | Bucket 1 | Bucket 2 |
---|---|---|
Trial 1 | -1 | 0 |
Trial 2 | -1 | 0 |
Object | Trial 1 | Trial 2 | Estimate | Real |
---|---|---|---|---|
$O_1$ | 1 | 0 | 0.5 | 0 |
$O_2$ | 1 | 1 | 1.0 | 1 |
$O_3$ | 0 | 1 | 0.5 | 0 |
$O_4$ | 1 | -1 | 0.0 | 0 |
Objects Seen: $O_2,O_1$
Trial | Bucket 1 | Bucket 2 |
---|---|---|
Trial 1 | -2 | 0 |
Trial 2 | -1 | 1 |
Object | Trial 1 | Trial 2 | Estimate | Real |
---|---|---|---|---|
$O_1$ | 2 | 1 | 1.5 | 1 |
$O_2$ | 2 | 1 | 1.5 | 1 |
$O_3$ | 0 | 1 | 0.5 | 0 |
$O_4$ | 2 | -1 | 0.5 | 0 |
Objects Seen: $O_2,O_1,O_4$
Trial | Bucket 1 | Bucket 2 |
---|---|---|
Trial 1 | -3 | 0 |
Trial 2 | 0 | 1 |
Object | Trial 1 | Trial 2 | Estimate | Real |
---|---|---|---|---|
$O_1$ | 3 | 1 | 2.0 | 1 |
$O_2$ | 3 | 0 | 1.5 | 1 |
$O_3$ | 0 | 0 | 0.0 | 0 |
$O_4$ | 3 | 0 | 1.5 | 1 |
Objects Seen: $O_2,O_1,O_4,O_1$
Trial | Bucket 1 | Bucket 2 |
---|---|---|
Trial 1 | -4 | 0 |
Trial 2 | 0 | 2 |
Object | Trial 1 | Trial 2 | Estimate | Real |
---|---|---|---|---|
$O_1$ | 4 | 2 | 3.0 | 2 |
$O_2$ | 4 | 0 | 2.0 | 1 |
$O_3$ | 0 | 0 | 0.0 | 0 |
$O_4$ | 4 | 0 | 2.0 | 1 |
Objects Seen: $O_2,O_1,O_4,O_1,O_2$
Trial | Bucket 1 | Bucket 2 |
---|---|---|
Trial 1 | -5 | 0 |
Trial 2 | -1 | 2 |
Object | Trial 1 | Trial 2 | Estimate | Real |
---|---|---|---|---|
$O_1$ | 5 | 2 | 3.5 | 2 |
$O_2$ | 5 | 1 | 3.0 | 2 |
$O_3$ | 0 | 1 | 0.5 | 0 |
$O_4$ | 5 | -1 | 2.0 | 1 |
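
The bucket-and-trial walkthrough can be checked with a short script. This sketch looks up $h_t$ and $\delta_t$ from the table (buckets as 0-based indexes) rather than computing real hashes, and it combines trials with the mean, as the Estimate column does.

```python
# h_t(O_i) as a 0-based bucket index and delta_t(O_i), copied from the table.
H = [{"O1": 0, "O2": 0, "O3": 1, "O4": 0},             # h_1
     {"O1": 1, "O2": 0, "O3": 0, "O4": 0}]             # h_2
DELTA = [{"O1": -1, "O2": -1, "O3": +1, "O4": -1},     # delta_1
         {"O1": +1, "O2": -1, "O3": -1, "O4": +1}]     # delta_2

counters = [[0, 0], [0, 0]]      # counters[trial][bucket]

def observe(obj):
    """Add delta_t(obj) to the bucket h_t(obj) picks in every trial."""
    for t in range(2):
        counters[t][H[t][obj]] += DELTA[t][obj]

def estimate(obj):
    """Mean over trials of (bucket counter * delta_t(obj))."""
    per_trial = [counters[t][H[t][obj]] * DELTA[t][obj] for t in range(2)]
    return sum(per_trial) / len(per_trial)

for obj in ["O2", "O1", "O4", "O1", "O2"]:
    observe(obj)

print(counters)                                          # [[-5, 0], [-1, 2]]
print({o: estimate(o) for o in ["O1", "O2", "O3", "O4"]})
# {'O1': 3.5, 'O2': 3.0, 'O3': 0.5, 'O4': 2.0}
```
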
In practice, use the median rather than the mean to combine trials.
Problem: "Heavy Hitters" overwhelm smaller counts
Idea: Give up. Drop $\delta$ and take the minimum across trials.
Object | Appearances | $h_1(O_i)$ | $h_2(O_i)$ |
---|---|---|---|
$O_1$ | 10 | Bucket 1 | Bucket 2 |
$O_2$ | 32 | Bucket 1 | Bucket 1 |
$O_3$ | 1002 | Bucket 2 | Bucket 1 |
$O_4$ | 500 | Bucket 1 | Bucket 1 |
Trial | Bucket 1 | Bucket 2 |
---|---|---|
Trial 1 | 542 | 1002 |
Trial 2 | 1534 | 10 |
Object | Appearances | Estimate 1 | Estimate 2 | Min |
---|---|---|---|---|
$O_1$ | 10 | 542 | 10 | 10 |
$O_2$ | 32 | 542 | 1534 | 542 |
$O_3$ | 1002 | 1002 | 1534 | 1002 |
$O_4$ | 500 | 542 | 1534 | 542 |
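
The same numbers fall out of a few lines of Python. As a sketch, it takes the bucket assignments from the table (0-based indexes) and adds each object's appearance count in one step instead of replaying the stream.

```python
# h_t(O_i) as a 0-based bucket index, copied from the table.
H = [{"O1": 0, "O2": 0, "O3": 1, "O4": 0},     # h_1
     {"O1": 1, "O2": 0, "O3": 0, "O4": 0}]     # h_2
APPEARANCES = {"O1": 10, "O2": 32, "O3": 1002, "O4": 500}

counters = [[0, 0], [0, 0]]      # counters[trial][bucket]
for obj, n in APPEARANCES.items():
    for t in range(2):
        counters[t][H[t][obj]] += n      # same effect as observing obj n times

def estimate(obj):
    """Minimum over trials of the counter in the bucket obj hashes to."""
    return min(counters[t][H[t][obj]] for t in range(2))

print(counters)                                  # [[542, 1002], [1534, 10]]
print({o: estimate(o) for o in APPEARANCES})     # {'O1': 10, 'O2': 542, 'O3': 1002, 'O4': 542}
```

Without $\delta$, counters only ever grow, so every bucket is at least the true count of each object in it. The minimum can overestimate (as for $O_2$ and $O_4$) but never underestimates.
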