March 4, 2021
Reads Want: Nice sorted, compact list.
Writes Want: Random order.
What happens if we optimize for reads
(and even do away with pages)
Each insert requires $O(N)$ write IOs.
"Write Amplification"
How can we reduce Write Amplification?
Idea: Buffer writes
Pro: With a $\mathcal B$ element buffer, $O(\frac{N}{\mathcal B})$ write amplification (amortized)
Con: Every read now needs to go to two places
Con: $O(\frac{N}{\mathcal B})$ is still linear
Idea: Don't merge!
Pro: No write amplification!
Con: Every read now needs to go to $O(\frac{N}{B})$ places on disk (Read Amplification)
Idea: Combine the two?
In general, level $i$ contains $2^{i-1} \mathcal B$ records.
When writing to a full level, merge and write to next level instead.
Write Amplification: Every record copied $O(\log N)$ times
Read Amplification: At most $O(\log N)$ levels
"The Case for Learned Index Structures"
by Kraska, Beutel, Chi, Dean, Polyzotis
$f(key) \mapsto position$
(not exactly true, but close enough for today)
Simplified Use Case: Static data with "infinite" prep time.
We have infinite prep time, so fit a (tiny) neural network to the CDF.
if
statements are really expensive on modern processors.
SELECT partkey
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate >= DATE(NOW() - '1 Month')
ORDER BY shipdate DESC LIMIT 10;
SELECT suppkey, COUNT(*)
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate >= DATE(NOW() - '1 Month')
GROUP BY suppkey;
SELECT partkey, COUNT(*)
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate > DATE(NOW() - '1 Month')
GROUP BY partkey;
All of these views share the same business logic!
Started as a convenience
CREATE VIEW salesSinceLastMonth AS
SELECT l.*
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate > DATE(NOW() - '1 Month')
SELECT partkey FROM salesSinceLastMonth
ORDER BY shipdate DESC LIMIT 10;
SELECT suppkey, COUNT(*)
FROM salesSinceLastMonth
GROUP BY suppkey;
SELECT partkey, COUNT(*)
FROM salesSinceLastMonth
GROUP BY partkey;
But also useful for performance
CREATE MATERIALIZED VIEW salesSinceLastMonth AS
SELECT l.*
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate > DATE(NOW() - '1 Month')
Materializing the view, or pre-computing and saving the view lets us answer all of the queries on the view faster!
What if the query doesn't use the view?
SELECT l.partkey
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate > DATE(’2015-03-31’)
ORDER BY l.shipdate DESC
LIMIT 10;
Can we detect that a query could be answered with a view?
(sometimes)
View Query | User Query | |
---|---|---|
SELECT $L_v$ FROM $R_v$ WHERE $C_v$
|
SELECT $L_q$ FROM $R_q$ WHERE $C_q$
|
When are we allowed to rewrite this table?
View Query | User Query | |
---|---|---|
SELECT $L_v$ FROM $R_v$ WHERE $C_v$
|
SELECT $L_q$ FROM $R_q$ WHERE $C_q$
|
View Query | User Query | |
---|---|---|
SELECT $L_v$ FROM $R_v$ WHERE $C_v$
|
SELECT $L_q$ FROM $R_q$ WHERE $C_q$
|
SELECT $L_Q$
FROM $(R_Q - R_V)$, view
WHERE $C_Q$
Enumerate all possible plans
... then how do you pick? (more soon)