March 18, 2021
Scan 1 PB at 300 MB/s (SATA revision 2)
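To put that scan in perspective, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 PB = 10^15 bytes) and perfectly even striping across disks; the `disks` parameter and the 1000-disk case are illustrative and not from the slides.

```python
# Back-of-the-envelope scan time. Decimal units are an assumption.
PETABYTE = 10**15            # bytes
THROUGHPUT = 300 * 10**6     # bytes per second for one SATA revision 2 link


def scan_seconds(data_bytes, bytes_per_second, disks=1):
    """Seconds to scan data split evenly across `disks` drives read in parallel."""
    return data_bytes / (bytes_per_second * disks)


one_disk_days = scan_seconds(PETABYTE, THROUGHPUT) / 86_400
thousand_disk_minutes = scan_seconds(PETABYTE, THROUGHPUT, disks=1000) / 60

print(f"1 disk:     {one_disk_days:.1f} days")
print(f"1000 disks: {thousand_disk_minutes:.1f} minutes")
```

A single disk needs over a month; spreading the data over many disks is what makes the scan practical.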
Today we want to clearly see the communications.
Replication | Partitioning (Sharding) | |
---|---|---|
Look familiar?
Can we run each worker on one partition?
$N$ partitions in, $N$ partitions out
$N$ partitions in, $N$ partitions out
Trick question, just combines partitions!
$N$ partitions in, $1$ partition out
Aggregate needs to process $N$ partitions.
Final aggregate only needs to process $N$ tuples.
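The two-stage aggregate can be sketched as follows. This is a minimal illustration that assumes an average over numeric values; the function names and the sample partitions are hypothetical.

```python
def partial_aggregate(partition):
    """Runs once per partition: reduce the partition to one (count, sum) tuple."""
    return (len(partition), sum(partition))


def final_aggregate(partials):
    """Runs once: combine N small tuples instead of N full partitions."""
    total_count = sum(count for count, _ in partials)
    total_sum = sum(subtotal for _, subtotal in partials)
    return total_sum / total_count


partitions = [[1, 2, 3], [4, 5], [6]]                    # N = 3 partitions
partials = [partial_aggregate(p) for p in partitions]    # N tuples cross the network
average = final_aggregate(partials)
```

Only the `partials` list has to reach the final worker, so the data shipped grows with the number of partitions rather than with their size.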
↓↓↓↓↓
Every partition from one table needs to pair
with every partition from the other.
↓↓↓↓↓↓
$$(R_1 \bowtie S_1) \uplus \ldots \uplus (R_1 \bowtie S_K)$$
$$\ldots\uplus \ldots \uplus \ldots$$
$$(R_N \bowtie S_1) \uplus \ldots \uplus (R_N \bowtie S_K)$$

| | $S_1$ | $S_2$ | $S_3$ | $S_4$ |
|---|---|---|---|---|
| $R_1$ | $R_1\bowtie S_1$ | $R_1\bowtie S_2$ | $R_1\bowtie S_3$ | $R_1\bowtie S_4$ |
| $R_2$ | $R_2\bowtie S_1$ | $R_2\bowtie S_2$ | $R_2\bowtie S_3$ | $R_2\bowtie S_4$ |
| $R_3$ | $R_3\bowtie S_1$ | $R_3\bowtie S_2$ | $R_3\bowtie S_3$ | $R_3\bowtie S_4$ |
| $R_4$ | $R_4\bowtie S_1$ | $R_4\bowtie S_2$ | $R_4\bowtie S_3$ | $R_4\bowtie S_4$ |
$N$ workers get us only $\sqrt{N}$ scaling: each worker still reads a $1/\sqrt{N}$ share of both $R$ and $S$.
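The grid of pairwise joins can be sketched like this. It is a minimal illustration, assuming an equi-join on the first field of each tuple; the nested-loop `join` helper and the sample data are hypothetical.

```python
from itertools import product


def join(r, s):
    """Nested-loop equi-join on the first field of each tuple."""
    return [(rk, rv, sv) for (rk, rv) in r for (sk, sv) in s if rk == sk]


def partitioned_join(r_parts, s_parts):
    """Bag-union of every R_i joined with every S_j: one task per grid cell."""
    out = []
    for r_i, s_j in product(r_parts, s_parts):
        out.extend(join(r_i, s_j))
    return out


R = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
S = [(1, "w"), (2, "x"), (2, "y"), (5, "z")]
r_parts = [R[:2], R[2:]]      # 2 partitions of R
s_parts = [S[:1], S[1:]]      # 2 partitions of S

tasks = len(r_parts) * len(s_parts)          # 2 x 2 grid of join tasks
pieces = partitioned_join(r_parts, s_parts)
whole = join(R, S)
```

Here four tasks each read half of `R` and half of `S`, which is why adding workers shrinks the per-worker input only by the square root of the worker count.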
| | $S_1$ | $S_2$ | $S_3$ | $S_4$ |
|---|---|---|---|---|
| $R_1$ | $R_1\bowtie S_1$ | $R_1\bowtie S_2$ | $R_1\bowtie S_3$ | $R_1\bowtie S_4$ |
| $R_2$ | $R_2\bowtie S_1$ | $R_2\bowtie S_2$ | $R_2\bowtie S_3$ | $R_2\bowtie S_4$ |
| $R_3$ | $R_3\bowtie S_1$ | $R_3\bowtie S_2$ | $R_3\bowtie S_3$ | $R_3\bowtie S_4$ |
| $R_4$ | $R_4\bowtie S_1$ | $R_4\bowtie S_2$ | $R_4\bowtie S_3$ | $R_4\bowtie S_4$ |
Back to $N$ scaling for $N$ workers
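One way to get aligned partitions is to hash-partition both tables on the join attribute, so matching tuples always land in partitions with the same index. The sketch below assumes that scheme (the slides do not say how the alignment is achieved) and uses hypothetical helper names and integer keys.

```python
def hash_partition(rows, n):
    """Assign each row to a partition by hashing its join key (first field)."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[0]) % n].append(row)
    return parts


def join(r, s):
    """Nested-loop equi-join on the first field of each tuple."""
    return [(rk, rv, sv) for (rk, rv) in r for (sk, sv) in s if rk == sk]


def aligned_join(r_rows, s_rows, n):
    """Only the diagonal R_i join S_i is needed: n tasks instead of n * n."""
    out = []
    for r_i, s_i in zip(hash_partition(r_rows, n), hash_partition(s_rows, n)):
        out.extend(join(r_i, s_i))
    return out


R = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
S = [(1, "w"), (2, "x"), (2, "y"), (5, "z"), (8, "q")]
result = aligned_join(R, S, 4)
```

Each worker now reads a $1/N$ share of each table, because equal keys can never sit in partitions with different indices.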
What if the partitions aren't aligned so nicely?
Can we do better?
Focus on $R_1 \bowtie_B S_1$
Problem: All tuples in $R_1$ and $S_1$ need to be
sent to the same worker.
Idea 1: Put the worker on the node that has the data!
Problem: What if the data is on 2 different nodes?
Idea 1.b: Put the worker on one of the nodes with data.
Can we reduce network use more?
Problem: Worker 2 is still sending a lot of data.
Idea: Compress $\pi_B(S_1)$
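Before compressing anything, it may help to see the uncompressed version of this exchange. The sketch below assumes $R_1$ lives on worker 1 and $S_1$ on worker 2, that $B$ is the first field of each tuple, and that worker 2 ships $\pi_B(S_1)$ so worker 1 can drop non-matching tuples. The roles, names, and data are hypothetical.

```python
def project_b(rows):
    """pi_B: the distinct join-key values of a partition."""
    return {row[0] for row in rows}


def keep_matching(rows, keys):
    """Keep only the tuples whose B value can possibly find a join partner."""
    return [row for row in rows if row[0] in keys]


R1 = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]   # stored on worker 1
S1 = [(2, "x"), (4, "y"), (9, "z")]             # stored on worker 2

keys_sent = project_b(S1)                # worker 2 -> worker 1: keys only
r1_reduced = keep_matching(R1, keys_sent)   # worker 1 -> worker 2: matching tuples only
result = [(rk, rv, sv) for (rk, rv) in r1_reduced for (sk, sv) in S1 if rk == sk]
```

The key set is exact here, so its size grows with the number of distinct $B$ values in $S_1$; that cost is what the compression idea targets.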
(not all errors are equal)
User: Is Alice part of the set? | $filter$: Yes |
User: Is Eve part of the set? | $filter$: No |
User: Is Fred part of the set? | $filter$: Yes |
Test always returns Yes if the element is in the set.
Test usually returns No if the element is not in the set.
A Bloom filter is an array of bits.
$M$: Number of bits in the array.
$K$: Number of hash functions.
Each key hashes to a bit vector with $\sim K$ bits set.
$Key_1$ | 00101010 |
$Key_2$ | 01010110 |
$Key_3$ | 10000110 |
$Key_4$ | 01001100 |
Filter ($Key_1$ OR $Key_2$) | 01111110 |
$Key_1 \;\&\; 01111110$ | 00101010 | ✅ |
$Key_3 \;\&\; 01111110$ | 00000110 | ❌ |
$Key_4 \;\&\; 01111110$ | 01001100 | ✅ |
(False positive)
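The bit-vector test above can be sketched in a few lines. This is a minimal illustration, not a production implementation: salting SHA-256 with an index is an assumed stand-in for $K$ independent hash functions, and the class name, method names, and sample members are hypothetical. The 8-bit key vectors are copied from the tables, and the filter 01111110 is the bitwise OR of $Key_1$ and $Key_2$.

```python
import hashlib


class BloomFilter:
    """Bloom filter with m bits and k hash functions (bits kept in one int)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = 0

    def key_vector(self, key):
        """Bit vector for one key: at most k bits set (fewer on collisions)."""
        vector = 0
        for i in range(self.k):
            # Assumption: salting SHA-256 with i stands in for k hash functions.
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            vector |= 1 << (int.from_bytes(digest[:8], "big") % self.m)
        return vector

    def add(self, key):
        self.bits |= self.key_vector(key)

    def might_contain(self, key):
        vector = self.key_vector(key)
        return (self.bits & vector) == vector


# The 8-bit example from the tables, with the key vectors taken as given.
key1, key2, key3, key4 = 0b00101010, 0b01010110, 0b10000110, 0b01001100
slide_filter = key1 | key2                      # 0b01111110
key3_passes = (key3 & slide_filter) == key3     # False: bit 7 is not set
key4_passes = (key4 & slide_filter) == key4     # True: a false positive

bloom = BloomFilter(m=64, k=3)
for name in ["Alice", "Bob"]:
    bloom.add(name)
```

An inserted key always passes the test because its bits were ORed into the filter; a key that was never inserted passes only when other keys happen to cover all of its bits, as $Key_4$ does here.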