February 22, 2019
Some filesystems (HDFS, S3, SSDs) don't like updates
You don't update data, you rewrite the entire file (or a large fragment of it).
Idea 1: Buffer updates, periodically write out new blocks to a "log".
Idea 2: Keep data on disk sorted. Buffer updates. Periodically merge-sort buffer into the data.
Idea 3: Keep data on disk sorted, and in multiple "levels". Buffer updates.
Key observation: Level $i$ is $2^{i-1}$ times the size of the buffer (the size of the level doubles with each merge).
Result: Each record copied at most $\log(N)$ times.
"The Case for Learned Index Structures"
by Kraska, Beutel, Chi, Dean, Polyzotis
$f(key) \mapsto position$
(not exactly true, but close enough for today)
Simplified Use Case: Static data with "infinite" prep time.
We have infinite prep time, so fit a (tiny) neural network to the CDF.
if
statements are really expensive on modern processors.$\sigma_C(R)$ and $(\ldots \bowtie_C R)$
Original Query: $\pi_A\left(\sigma_{B = 1 \wedge C < 3}(R)\right)$
Possible Implementations:
Sort data on $(A, B, C, \ldots)$
First sort on $A$, $B$ is a tiebreaker for $A$,
$C$ is a tiebreaker for $B$, etc...
Which one do we pick?
(You need to know the cost of each plan)
These are called "Access Paths"
SELECT partkey
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate >= DATE(NOW() - '1 Month')
ORDER BY shipdate DESC LIMIT 10;
SELECT suppkey, COUNT(*)
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate >= DATE(NOW() - '1 Month')
GROUP BY suppkey;
SELECT partkey, COUNT(*)
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate > DATE(NOW() - '1 Month')
GROUP BY partkey;
All of these views share the same business logic!
Started as a convenience
CREATE VIEW salesSinceLastMonth AS
SELECT l.*
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate > DATE(NOW() - '1 Month')
SELECT partkey FROM salesSinceLastMonth
ORDER BY shipdate DESC LIMIT 10;
SELECT suppkey, COUNT(*)
FROM salesSinceLastMonth
GROUP BY suppkey;
SELECT partkey, COUNT(*)
FROM salesSinceLastMonth
GROUP BY partkey;
But also useful for performance
CREATE MATERIALIZED VIEW salesSinceLastMonth AS
SELECT l.*
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate > DATE(NOW() - '1 Month')
Materializing the view, or pre-computing and saving the view lets us answer all of the queries on the view faster!
What if the query doesn't use the view?
SELECT l.partkey
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate > DATE(’2015-03-31’)
ORDER BY l.shipdate DESC
LIMIT 10;
Can we detect that a query could be answered with a view?
(sometimes)
View Query | User Query | |
---|---|---|
SELECT $L_v$ FROM $R_v$ WHERE $C_v$
|
SELECT $L_q$ FROM $R_q$ WHERE $C_q$
|
When are we allowed to rewrite this table?
View Query | User Query | |
---|---|---|
SELECT $L_v$ FROM $R_v$ WHERE $C_v$
|
SELECT $L_q$ FROM $R_q$ WHERE $C_q$
|
View Query | User Query | |
---|---|---|
SELECT $L_v$ FROM $R_v$ WHERE $C_v$
|
SELECT $L_q$ FROM $R_q$ WHERE $C_q$
|
SELECT $L_Q$
FROM $(R_Q - R_V)$, view
WHERE $C_Q$
Enumerate all possible plans
... then how do you pick? (more next class)