February 11, 2019
Databases are usually IO- or Memory-bound
Core Question: Do we have enough memory to use this operator?
IO measured in: the number of records (tuples) or blocks (pages) read and written
Figure out the cost of each individual operator.
Only count the number of IOs added by each operator.
If $S$ is itself a query (rather than a stored table), re-evaluating it repeatedly can get very expensive
Optimization 1: Cache $S$ in memory
Optimization 2: Cache $S$ on disk
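A minimal sketch of Optimization 1 (names are illustrative; this assumes rhs is any iterable of records):

def apply_cross_cached(lhs, rhs):
    result = []
    s_cache = list(rhs)    # one pass over rhs, held entirely in memory
    for r in lhs:
        for s in s_cache:  # no further IO for rhs
            result += [r + s]
    return result

Optimization 2 is the same idea with the cache written to a temporary file: it avoids re-running the query for $S$, but still re-reads $|S|$ records from disk for every record of $R$.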
Is there a middle ground?
Problem: We need to evaluate the rhs iterator once per record in lhs
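Concretely, the naive strategy looks like this (a sketch, assuming rhs supports the same reset() used in the code below):

def apply_cross_naive(lhs, rhs):
    result = []
    for r in lhs:
        for s in rhs:      # full re-evaluation of rhs...
            result += [r + s]
        rhs.reset()        # ...once per record of lhs
    return result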
Better Solution: Load both lhs and rhs records in blocks.
def apply_cross(lhs, rhs):
    result = []
    # Read lhs one block (here: 100 records) at a time.
    while r_block := lhs.take(100):
        # One full scan of rhs per *block* of lhs, not per record.
        while s_block := rhs.take(100):
            for r in r_block:
                for s in s_block:
                    result += [r + s]  # concatenate the paired records
        rhs.reset()  # rewind rhs before the next lhs block
    return result
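The take()/reset() API above isn't spelled out in the notes; a minimal stand-in that makes the code runnable might look like:

class BufferedScan:
    # Hypothetical iterator with the take()/reset() API assumed above.
    def __init__(self, records):
        self.records = list(records)
        self.pos = 0
    def take(self, n):
        block = self.records[self.pos:self.pos + n]
        self.pos += len(block)
        return block  # an empty (falsy) list signals end of input
    def reset(self):
        self.pos = 0

# Records as tuples, so r + s concatenates the two records' fields.
R = BufferedScan([(i,) for i in range(250)])
S = BufferedScan([(j,) for j in range(250)])
assert len(apply_cross(R, S)) == 250 * 250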
Total IO cost: $|R| + \left\lceil \frac{|R|}{\mathcal B} \right\rceil \cdot |S|$ (with $\mathcal B$ as the block size for $R$, and with caching $S$ to disk)
In-memory caching is the special case of block-nested loop where the block for $S$ is the entire relation, i.e., a block size of $|S|$
Does the block size for $R$ matter?
How big should the blocks be?
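Both questions have the same answer: bigger is better. A rough illustration with made-up numbers, plugging $|R| = |S| = 10^4$ into the cost formula above:

$$\mathcal B = 1:\ 10^4 + 10^4 \cdot 10^4 \approx 10^8 \qquad \mathcal B = 100:\ 10^4 + 100 \cdot 10^4 \approx 10^6 \qquad \mathcal B = |R|:\ 10^4 + 10^4 = 2 \cdot 10^4$$

Doubling the $R$-block halves the number of passes over $S$, so the block should be as large as available memory allows (the block size for $S$ only affects read granularity, not the total IO count).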
Cross product is expensive!
Can we do better?
$\sigma_c(R\times S) \equiv R\bowtie_c S$
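Read left to right, the equivalence says a join can always be evaluated as a cross product followed by a selection. A direct (and slow) sketch, passing the predicate $c$ as a Python function:

def apply_join_naive(lhs, rhs, c):
    # Correct, but evaluates the predicate on all |R| * |S| pairs.
    return [t for t in apply_cross(lhs, rhs) if c(t)]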
Problem: Naively, any tuple of $R$ might match any tuple of $S$, so every pair must be checked
Solution: First organize the data
Assign each record to one of $N$ buckets by hashing its join attribute: $$h(X) \bmod N$$
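For an equality join, a minimal in-memory sketch of this bucketing idea (the key functions and $N$ are illustrative assumptions, not from the notes):

def hash_join(R, S, key_r, key_s, N=1024):
    # Build: organize S into N buckets by hashing its join key.
    buckets = [[] for _ in range(N)]
    for s in S:
        buckets[hash(key_s(s)) % N].append(s)
    # Probe: each record of R visits only the bucket its key hashes to.
    result = []
    for r in R:
        for s in buckets[hash(key_r(r)) % N]:
            if key_r(r) == key_s(s):  # re-check; a bucket can mix keys
                result += [r + s]
    return result

# e.g. hash_join(R, S, key_r=lambda r: r[0], key_s=lambda s: s[0])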
Alternative: Build an in-memory tree (e.g., B+Tree) instead of a hash table!
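Python's standard library has no B+Tree, so as a stand-in sketch: sorting $S$ once and binary-searching it gives the same logarithmic-lookup behavior:

import bisect

def tree_join(R, S, key_r, key_s):
    S_sorted = sorted(S, key=key_s)             # "build" the tree
    keys = [key_s(s) for s in S_sorted]
    result = []
    for r in R:
        i = bisect.bisect_left(keys, key_r(r))  # logarithmic lookup
        while i < len(keys) and keys[i] == key_r(r):
            result += [r + S_sorted[i]]
            i += 1
    return result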
Why is it important that the hash function is pseudorandom?
What if the data is already organized (e.g., sorted) in a useful way?
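For example, if both inputs are already sorted on the join key, a single coordinated scan finds every equality match; a sketch of that idea:

def merge_join(R, S, key_r, key_s):
    # Assumes R and S are lists already sorted on their join keys.
    result, i = [], 0
    for r in R:
        while i < len(S) and key_s(S[i]) < key_r(r):
            i += 1  # skip S keys smaller than r's key
        j = i
        while j < len(S) and key_s(S[j]) == key_r(r):
            result += [r + S[j]]  # emit each equal-key pair
            j += 1
    return result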