Parallel Query Execution

March 18, 2021

Garcia-Molina/Ullman/Widom: Ch. 20.1, 20.3, 20.4

Why Scale?

Scan 1 PB at 300 MB/s (SATA rev. 2)
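
As a rough sanity check (taking $1\ \text{PB} = 10^{15}$ bytes), a single drive needs over a month for one scan:

$$\frac{10^{15}\ \text{bytes}}{300 \times 10^{6}\ \text{bytes/s}} \approx 3.3 \times 10^{6}\ \text{s} \approx 38\ \text{days}$$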

Communication Models

  • Shared Memory
  • Non-Uniform Memory Access (NUMA)
  • Shared Nothing / Message Passing

  • Shared Memory: HDFS, S3, RAM + Modern OSes
  • NUMA: AMD CPUs
  • Shared Nothing: MPP, Actor Model

Today, we want to make the communication between workers explicit.

The Basic Model

Putting Workers Together


No Parallelism


"Pipeline" Parallelism

freesvg.org

"Pipeline" Parallelism

freesvg.org

Together


Distributing Data

  • Replication
  • Partitioning (Sharding)

How to Partition Data?

  • Arbitrarily
  • Range
  • Hash

look familiar?
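
As a rough sketch, the last two schemes boil down to a partition-assignment function; the boundaries and partition count below are made-up examples:

```python
N = 4  # number of partitions (made-up for illustration)

def range_partition(key, boundaries):
    """Range partitioning: the i-th partition holds keys below boundaries[i]."""
    for i, upper in enumerate(boundaries):
        if key < upper:
            return i
    return len(boundaries)  # everything else goes to the last partition

def hash_partition(key, n=N):
    """Hash partitioning: a hash of the key picks the partition."""
    return hash(key) % n    # real systems use a hash that is stable across nodes

boundaries = [25, 50, 75]                # ranges [0,25), [25,50), [50,75), [75,...)
print(range_partition(42, boundaries))   # -> 1
print(hash_partition(42))                # -> some partition id in 0..3
```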

Can we run each worker on one partition?

Filter

Yes

$N$ partitions in, $N$ partitions out

Project

Yes

$N$ partitions in, $N$ partitions out

Union

Yes

Trick question: union just combines the partitions!

(Single-row) Aggregate

No

$N$ partitions in, $1$ partition out

Partial Aggregation

Algebraic Aggregates (Count, Sum, Avg, Min, Max)
Bounded-size intermediate state
Holistic Aggregates (Median, Mode, Count-Distinct)
Unbounded-size intermediate state
$$\Sigma_{SUM} (R_1 \uplus R_2 \uplus \ldots \uplus R_N)$$

The aggregate needs to process every tuple of all $N$ partitions.

$$\Sigma_{SUM} (\Sigma_{SUM}(R_1) \uplus \Sigma_{SUM}(R_2) \uplus \ldots \uplus \Sigma_{SUM}(R_N))$$

Final aggregate only needs to process $N$ tuples.

$$\Sigma_{AVG} (R_1 \uplus R_2 \uplus \ldots \uplus R_N)$$

↓

$$\Sigma_{\frac{SUM(A)}{SUM(B)}} (\Sigma_{A \leftarrow SUM,\; B \leftarrow COUNT}(R_1) \uplus \ldots \uplus \Sigma_{A \leftarrow SUM,\; B \leftarrow COUNT}(R_N))$$
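
For example, with hypothetical partitions $R_1 = \{1, 2\}$ and $R_2 = \{3, 4, 5\}$, each worker ships only a (sum, count) pair:

$$\Sigma_{A \leftarrow SUM,\; B \leftarrow COUNT}(R_1) = (3, 2) \qquad \Sigma_{A \leftarrow SUM,\; B \leftarrow COUNT}(R_2) = (12, 3)$$
$$\frac{SUM(A)}{SUM(B)} = \frac{3 + 12}{2 + 3} = 3 = \Sigma_{AVG}(R_1 \uplus R_2)$$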

Basic Aggregate Pattern

Init
Define a starting value for the accumulator
Fold(Accum, New)
Merge a new value into the accumulator
Finalize(Accum)
Extract the aggregate from the accumulator.
Merge(Accum, Accum)
Merge two accumulators together.
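
A minimal sketch of this pattern for AVG (the class and method names are illustrative, not from any particular engine):

```python
class AvgAggregate:
    """AVG as an algebraic aggregate: the accumulator is a (sum, count) pair."""

    def init(self):
        # Starting value for the accumulator.
        return (0, 0)

    def fold(self, accum, new):
        # Merge one new value into the accumulator.
        s, c = accum
        return (s + new, c + 1)

    def merge(self, a, b):
        # Merge two accumulators (e.g., from two partitions).
        return (a[0] + b[0], a[1] + b[1])

    def finalize(self, accum):
        # Extract the final aggregate from the accumulator.
        s, c = accum
        return s / c if c else None


# Each worker folds over its own partition; one worker merges and finalizes.
agg = AvgAggregate()
partitions = [[1, 2], [3, 4, 5]]
partials = []
for part in partitions:
    acc = agg.init()
    for v in part:
        acc = agg.fold(acc, v)
    partials.append(acc)

final = agg.init()
for p in partials:
    final = agg.merge(final, p)
print(agg.finalize(final))  # -> 3.0
```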

Joins

No

Every partition from one table needs to pair
with every partition from the other.

$$(R_1 \uplus \ldots \uplus R_N) \bowtie (S_1 \uplus \ldots \uplus S_K)$$

↓

$$(R_1 \bowtie S_1) \uplus \ldots \uplus (R_1 \bowtie S_K)$$
$$\uplus \quad \ldots \quad \uplus$$
$$(R_N \bowtie S_1) \uplus \ldots \uplus (R_N \bowtie S_K)$$

      | $S_1$            | $S_2$            | $S_3$            | $S_4$
$R_1$ | $R_1\bowtie S_1$ | $R_1\bowtie S_2$ | $R_1\bowtie S_3$ | $R_1\bowtie S_4$
$R_2$ | $R_2\bowtie S_1$ | $R_2\bowtie S_2$ | $R_2\bowtie S_3$ | $R_2\bowtie S_4$
$R_3$ | $R_3\bowtie S_1$ | $R_3\bowtie S_2$ | $R_3\bowtie S_3$ | $R_3\bowtie S_4$
$R_4$ | $R_4\bowtie S_1$ | $R_4\bowtie S_2$ | $R_4\bowtie S_3$ | $R_4\bowtie S_4$

$N$ workers only get us $\sqrt{N}$ scaling: covering every partition pair with $N$ workers means each relation can be split at most $\sqrt{N}$ ways.

How to Partition Data?

  • Arbitrarily
  • Range
  • Hash
$$R \bowtie_{A} S$$
$$R_i = \sigma_{\texttt{hash}(A) = i}(R) \qquad S_i = \sigma_{\texttt{hash}(A) = i}(S)$$

      | $S_1$            | $S_2$            | $S_3$            | $S_4$
$R_1$ | $R_1\bowtie S_1$ | $R_1\bowtie S_2$ | $R_1\bowtie S_3$ | $R_1\bowtie S_4$
$R_2$ | $R_2\bowtie S_1$ | $R_2\bowtie S_2$ | $R_2\bowtie S_3$ | $R_2\bowtie S_4$
$R_3$ | $R_3\bowtie S_1$ | $R_3\bowtie S_2$ | $R_3\bowtie S_3$ | $R_3\bowtie S_4$
$R_4$ | $R_4\bowtie S_1$ | $R_4\bowtie S_2$ | $R_4\bowtie S_3$ | $R_4\bowtie S_4$

Only the diagonal pairs $R_i \bowtie S_i$ can produce matches (equal join keys hash to the same partition), so we're back to $N$ scaling for $N$ workers.
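
A minimal sketch of the co-partitioned join, assuming tuples are dicts keyed on the join attribute `A` (names like `N_WORKERS`, `partition`, and `local_join` are made up for illustration):

```python
N_WORKERS = 4

def partition(rel, key, n=N_WORKERS):
    """Split a relation into n partitions by hashing the join key."""
    parts = [[] for _ in range(n)]
    for tup in rel:
        parts[hash(tup[key]) % n].append(tup)
    return parts

def local_join(r_part, s_part, key):
    """Ordinary in-memory hash join on one pair of co-partitions."""
    index = {}
    for r in r_part:
        index.setdefault(r[key], []).append(r)
    return [(r, s) for s in s_part for r in index.get(s[key], [])]

R = [{"A": 1, "x": "r1"}, {"A": 2, "x": "r2"}]
S = [{"A": 1, "y": "s1"}, {"A": 3, "y": "s3"}]

R_parts = partition(R, "A")
S_parts = partition(S, "A")

# Worker i only ever sees (R_i, S_i): the off-diagonal pairs are guaranteed empty.
result = []
for i in range(N_WORKERS):
    result += local_join(R_parts[i], S_parts[i], "A")
print(result)  # -> [({'A': 1, 'x': 'r1'}, {'A': 1, 'y': 's1'})]
```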

What if the partitions aren't aligned so nicely?

Can we do better?

Focus on $R_1 \bowtie_B S_1$

Problem: All tuples in $R_1$ and $S_1$ need to be
sent to the same worker.

Data Transfer

  • Limited IO/Network bandwidth
  • Compute needed to receive data

Idea 1: Put the worker on the node that has the data!

Problem: What if the data is on 2 different nodes?

Idea 1.b: Put the worker on one of the nodes with data.

Can we reduce network use more?

Problem: The node holding $S_1$ still has to send all of its data.

Idea: Compress $\pi_B(S_1)$

Lossy Compression

(not all errors are equal)

False Positives
(reporting $b \in \pi_B(S_1)$ when it isn't)
not ideal, but OK: we only ship a few unnecessary tuples
False Negatives
(reporting $b \not\in \pi_B(S_1)$ when it is)
bad: dropped tuples mean a wrong answer!

Bloom Filters

$$filter \leftarrow \texttt{Bloom}(\textbf{Alice}, \textbf{Bob}, \textbf{Carol}, \textbf{Dave})$$
User: Is Alice part of the set? $filter$: Yes
User: Is Eve part of the set? $filter$: No
User: Is Fred part of the set? $filter$: Yes

Bloom Filter

Test always returns Yes if the element is in the set.

Test usually returns No if the element is not in the set.

Bloom Filters

A Bloom filter is an array of bits.

$M$: Number of bits in the array.

$K$: Number of hash functions.

For one record/key

  1. $\forall i \in [M] : filter[i] = 0$
  2. $\forall j \in [K] : filter[\texttt{hash}_j(key)] = 1$

Each bit vector has $\sim K$ bits set.

$Key_1$ = 00101010
$Key_2$ = 01010110
$Key_3$ = 10000110
$Key_4$ = 01001100

Filters are combined by Bitwise-OR

$$Key_1 \;|\; Key_2 = 01111110$$

Test for inclusion by checking for bits

$$Key_i \;\&\; filter = Key_i$$
$Key_1$ = 00101010
$Key_2$ = 01010110
$Key_3$ = 10000110
$Key_4$ = 01001100

$$filter = Key_1 \;|\; Key_2 = 01111110$$

$Key_1 \;\&\; filter$ = 00101010 = $Key_1$: "in the set" (correct)
$Key_3 \;\&\; filter$ = 00000110 $\neq$ $Key_3$: "not in the set" (correct)
$Key_4 \;\&\; filter$ = 01001100 = $Key_4$: "in the set" (False positive!)
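
A minimal Bloom filter sketch along these lines; the `hashlib`-based hash functions and the parameters `m` and `k` are illustrative choices, not a prescription:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=64, k=3):
        self.m, self.k = m, k      # M bits, K hash functions
        self.bits = 0              # the filter, stored as an integer bit array

    def _positions(self, key):
        # Derive K bit positions for a key by prefixing the hash input with j.
        for j in range(self.k):
            h = hashlib.sha256(f"{j}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def union(self, other):
        # Filters built with the same (m, k) combine by bitwise OR.
        merged = BloomFilter(self.m, self.k)
        merged.bits = self.bits | other.bits
        return merged

    def might_contain(self, key):
        # All K bits set -> "maybe in the set"; any bit clear -> definitely not.
        return all(self.bits >> pos & 1 for pos in self._positions(key))


# The worker holding S_1 builds a filter over pi_B(S_1) and ships only the
# filter; the worker holding R_1 drops tuples that cannot possibly join.
f = BloomFilter()
for b in ["Alice", "Bob", "Carol", "Dave"]:
    f.add(b)

r1_keys = ["Alice", "Eve", "Fred"]
candidates = [b for b in r1_keys if f.might_contain(b)]
print(candidates)  # always includes "Alice"; may also include false positives
```

Growing `m` or tuning `k` trades filter size against the false-positive rate; false negatives remain impossible because `add` only ever sets bits.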

Next time: Online Aggregation/AQP