March 2, 2021
Leveraging Organization
(Comparison table: $150 — Index, ToC | $50 — No Index, ToC, Summary)
$\sigma_C(R)$ and $(\ldots \bowtie_C R)$
(Finding records in a table really fast)
$\sigma_{R.A = 7}(R)$
Where is the data for key 7?
Option 1: Linear search
$O(N)$ IOs
Assume: the data is sorted on an attribute of interest (R.A)
Assume: updates are not relevant (for now)
Option 2: Binary Search
$O(\log_2 N)$ IOs
Better, but still not ideal.
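A minimal sketch of the two options, counting page reads (the `pages` layout and names here are illustrative, assuming a file of fixed-size pages sorted on R.A; this is not lecture code):

```python
# Compare a linear scan against binary search over a file sorted on R.A.
# `pages` stands in for the file: a list of pages, each a sorted list of
# (key, record) pairs.

def linear_search(pages, key):
    ios = 0
    for page in pages:                    # O(N) page reads in the worst case
        ios += 1
        for k, rec in page:
            if k == key:
                return rec, ios
    return None, ios

def binary_search(pages, key):
    ios, lo, hi = 0, 0, len(pages) - 1
    while lo <= hi:                       # O(log2 N) page reads
        mid = (lo + hi) // 2
        page = pages[mid]
        ios += 1
        if key < page[0][0]:
            hi = mid - 1
        elif key > page[-1][0]:
            lo = mid + 1
        else:
            for k, rec in page:
                if k == key:
                    return rec, ios
            return None, ios
    return None, ios
```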
Idea: Precompute several layers of the decision tree and store them together.
... but what if we need more than one page?
Add more indirection!
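One way to picture "precompute several layers of the decision tree" is an ISAM-style static index: each inner layer stores the smallest key of each node below it, and lookups read one node per layer. A rough sketch under that assumption (names and fanout are illustrative):

```python
# Build layers of indirection over sorted data pages, then descend on lookup.
def build_index(pages, fanout=4):
    nodes = [(page[0][0], page) for page in pages]   # (min key, data page)
    height = 0
    while len(nodes) > 1:                            # add another layer of indirection
        nodes = [(chunk[0][0], chunk)
                 for chunk in (nodes[i:i + fanout]
                               for i in range(0, len(nodes), fanout))]
        height += 1
    return nodes[0][1], height                       # (root node, number of layers)

def lookup(root, height, key):
    node = root
    for _ in range(height):
        # Pick the rightmost child whose minimum key is <= the search key.
        for min_key, child in reversed(node):
            if min_key <= key:
                node = child
                break
        else:
            node = node[0][1]                        # key smaller than everything
    return [rec for k, rec in node if k == key]      # node is now a data page

# Example: three data pages, one index lookup per layer plus one data page read.
pages = [[(1, 'a'), (4, 'b')], [(7, 'c'), (9, 'd')], [(12, 'e'), (15, 'f')]]
root, height = build_index(pages, fanout=2)
lookup(root, height, 7)    # -> ['c']
```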
Which of the following is better?
It's important that the trees be balanced
... but what if we need to update the tree?
Idea 1: Reserve space for new records
Just maintaining open space won't work forever...
Maintain Invariant: All Nodes ≥ 50% Full
(Exception: The Root)
Deletions reverse this process: a node that falls below 50% full merges with (or borrows from) a sibling.
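A minimal sketch of the split step that maintains the invariant (keys only, hypothetical names; a real B+-tree also handles records, parent pointers, and recursive splits):

```python
CAPACITY = 4   # illustrative page capacity

def insert_into_leaf(leaf, key):
    leaf = sorted(leaf + [key])
    if len(leaf) <= CAPACITY:            # reserved space absorbs the insert
        return leaf, None                # no split needed
    mid = len(leaf) // 2                 # split so both halves are >= 50% full
    left, right = leaf[:mid], leaf[mid:]
    separator = right[0]                 # key pushed up into the parent node
    return left, (separator, right)

# Example: the leaf overflows and splits into [3, 7] and [8, 9, 12],
# and the separator 8 is inserted into the parent.
insert_into_leaf([3, 7, 9, 12], 8)
```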
$\sigma_C(R)$ and $(\ldots \bowtie_C R)$
Original Query: $\pi_A\left(\sigma_{B = 1 \wedge C < 3}(R)\right)$
Possible Implementations:
Sort data on $(A, B, C, \ldots)$
First sort on $A$; $B$ is a tiebreaker for $A$, $C$ is a tiebreaker for $B$, etc.
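Python's tuple ordering gives exactly this lexicographic behavior; a toy illustration (not lecture code):

```python
# Sort rows of R on (A, B, C): B only breaks ties on A, C only breaks ties on B.
rows = [(2, 1, 5), (1, 9, 0), (2, 1, 3), (1, 2, 7)]       # (A, B, C)
rows.sort(key=lambda r: (r[0], r[1], r[2]))
# -> [(1, 2, 7), (1, 9, 0), (2, 1, 3), (2, 1, 5)]
```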
Which one do we pick?
(You need to know the cost of each plan)
These are called "Access Paths"
What if we need multiple sort orders?
A hash function $h(k)$ maps each key $k$ to a deterministic but pseudo-random value
$h(k) \bmod N$ gives you a (pseudo-)random bucket number in $[0, N)$
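A minimal sketch of using $h(k) \bmod N$ to pick a bucket (Python's built-in `hash` stands in for $h$; names are illustrative):

```python
N = 8
buckets = [[] for _ in range(N)]

def insert(key, record):
    buckets[hash(key) % N].append((key, record))   # h(k) mod N picks the bucket

def lookup(key):
    # Only one bucket needs to be read, no matter how many keys exist.
    return [rec for k, rec in buckets[hash(key) % N] if k == key]
```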
Idea: Resize the structure as needed
To keep things simple, let's use $h(k) = k$
(you wouldn't actually do this in practice)
Changing hash functions reallocates everything randomly
Need to keep the entire source and hash table in memory!
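A quick illustration of why: resizing from $N$ to an unrelated $N'$ moves most keys to unpredictable buckets, so the whole table must be rebuilt at once (using $h(k) = k$ as on the slide):

```python
# With h(k) = k, watch where keys land before and after resizing 4 -> 5 buckets.
keys = range(10)
before = {k: k % 4 for k in keys}
after  = {k: k % 5 for k in keys}
moved  = [k for k in keys if before[k] != after[k]]
# moved == [4, 5, 6, 7, 8, 9]: most keys change buckets unpredictably,
# so every bucket has to be read and rewritten.
```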
if $h(k) \bmod N = x$
then $h(k) \bmod 2N$ is either $x$ or $x + N$
Each key is moved (or not) to precisely one of two buckets in the resized hash table.
Never need more than 3 pages in memory at once (the source bucket and its two destination buckets).
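Because a key in bucket $x$ can only land in bucket $x$ or bucket $x + N$, doubling can proceed one bucket at a time. A rough sketch, assuming one bucket per page (illustrative names):

```python
# Double a hash table from N to 2N buckets, one source bucket at a time.
def double_table(buckets, h):
    N = len(buckets)
    new_buckets = [[] for _ in range(2 * N)]
    for x in range(N):
        # Only three "pages" are live here: buckets[x], new bucket x, new bucket x + N.
        for key, rec in buckets[x]:
            target = h(key) % (2 * N)          # guaranteed to be x or x + N
            new_buckets[target].append((key, rec))
    return new_buckets
```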
Changing sizes still requires reading everything!
Idea: Only redistribute buckets that are too big
Add a directory (a level of indirection)
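A rough sketch of the directory idea, in the style of extendible hashing: doubling the directory only copies pointers, and only the bucket that actually overflows gets split and redistributed. Names and simplifications (no overflow chains, naive retry on repeated collisions) are mine, not the lecture's:

```python
CAPACITY = 4            # illustrative: entries per bucket (one disk page)

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth     # how many hash bits this bucket uses
        self.items = []                    # (key, record) pairs

class Directory:
    def __init__(self):
        self.global_depth = 1
        self.dirs = [Bucket(1), Bucket(1)]

    def _bucket(self, key):
        return self.dirs[hash(key) & ((1 << self.global_depth) - 1)]

    def insert(self, key, rec):
        b = self._bucket(key)
        if len(b.items) < CAPACITY:
            b.items.append((key, rec))
            return
        self._split(b)             # redistribute only the bucket that overflowed
        self.insert(key, rec)      # retry (may split again if keys still collide)

    def _split(self, b):
        if b.local_depth == self.global_depth:
            self.dirs = self.dirs + self.dirs   # double the directory: pointers only
            self.global_depth += 1
        b.local_depth += 1
        new = Bucket(b.local_depth)
        bit = 1 << (b.local_depth - 1)
        for i, ptr in enumerate(self.dirs):     # repoint half of b's directory slots
            if ptr is b and (i & bit):
                self.dirs[i] = new
        old, b.items = b.items, []
        for k, r in old:                        # one bucket's worth of rehashing
            self._bucket(k).items.append((k, r))
```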
Next time: LSM Trees and CDF-Indexing