February 18, 2021
Databases are usually IO- or Memory-bound
Core Question: Do we have enough memory to use this operator?
IO measured in:
Figure out the cost of each individual operator.
Only count the number of IOs added by each operator.
So far, we've been pretending that each operator has one algorithm.
Often, there are many algorithms, some of which cover multiple operators.
This is why Spark has a PhysicalPlan
In the suggested Iterator-based approach the Iterators are your PhysicalPlan
Example, assume $R$ is 100 tuples.
How many IOs do we need to compute $Q := R$
How many IOs do we need to compute $Q := \sigma(R)$
Example, assume $R$ is 100 tuples.
How many IOs do we need to compute $Q := \pi(R)$
How many IOs do we need to compute $Q := \pi(\sigma(R))$
Projection and Selection do not add IO.
How do you "reset" $S$?
This can get very expensive
Example, assume $R$ and $S$ are both 100 tuples.
How many IOs do we need to compute $Q := R \cup S$?
Example, assume $R$ is 20 tuples and $S$ is 100 tuples.
How many IOs do we need to compute $Q := R \times S$?
Best Total Cost $100 + 20 + 1900 = 2010$
Example, assume $R$ is 20 tuples and $S$ is 100 tuples,
and $c$ filters out 90% of tuples.
How many IOs do we need to compute $Q := R \times \sigma_c(R \times S)$
Best Total Cost $2010 + 20 + 4000 = 6030$
Can we do better with cartesian product
(and joins)?
Problem: We need to evaluate rhs
iterator
once per record in lhs
Better Solution: Load both lhs
and rhs
records in blocks.
def apply_cross(lhs, rhs):
result = []
while r_block = lhs.take(100):
while s_block = rhs.take(100):
for r in r_block:
for s in s_block:
result += [r + s]
rhs.reset()
return result
(with $\mathcal B$ as the block size for $R$)
(and with caching $S$ to disk)
In-memory caching is a special case of block-nested loop with $\mathcal B = |R|$
Does the block size for $S$ matter?
How big should the blocks be?
As big as possible!
... but more on that later.
Cross product is expensive!
Can we do better?
$\sigma_c(R\times S) \equiv R\bowtie_c S$
Problem: Naively, any tuple matches any other
Solution: First organize the data
$$h(X) \mod N$$
Alternative: Build an in-memory tree (e.g., B+Tree) instead of a hash table!
Why is it important that the hash function is pseudorandom?
What if the data is already organized (e.g., sorted) in a useful way?
Extended Relational Algebra