March 11, 2019
What are Databases?
SELECT [DISTINCT] targetlist
FROM relationlist
WHERE condition
This is the least efficient strategy to compute a query! A good optimizer will find more efficient strategies to compute the same answer.
,
) between fields\r\n
) at record endRelational Algebra
Operation | Sym | Meaning |
---|---|---|
Selection | $\sigma$ | Select a subset of the input rows |
Projection | $\pi$ | Delete unwanted columns |
Cross-product | $\times$ | Combine two relations |
Set-difference | $-$ | Tuples in Rel 1, but not Rel 2 |
Union | $\cup$ | Tuples either in Rel 1 or in Rel 2 |
Intersection | $\cap$ | Tuples in both Rel 1 and Rel 2 |
Join | $\bowtie$ | Pairs of tuples matching a specified condition |
Division | $/$ | "Inverse" of cross-product |
Sort | $\tau_A$ | Sort records by attribute(s) $A$ |
Limit | $\texttt{LIMIT}_N$ | Return only the first $N$ records (according to sort order if paired with sort). |
Rule | Notes |
---|---|
$\sigma_{C_1\wedge C_2}(R) \equiv \sigma_{C_1}(\sigma_{C_2}(R))$ | |
$\sigma_{C_1\vee C_2}(R) \equiv \sigma_{C_1}(R) \cup \sigma_{C_2}(R)$ | Only true for set, not bag union |
$\sigma_C(R \times S) \equiv R \bowtie_C S$ | |
$\sigma_C(R \times S) \equiv \sigma_C(R) \times S$ | If $C$ references only $R$'s attributes, also works for joins |
$\pi_{A}(\pi_{A \cup B}(R)) \equiv \pi_{A}(R)$ | |
$\sigma_C(\pi_{A}(R)) \equiv \pi_A(\sigma_C(R))$ | If $A$ contains all of the attributes referenced by $C$ |
$\pi_{A\cup B}(R\times S) \equiv \pi_A(R) \times \pi_B(S)$ | Where $A$ (resp., $B$) contains attributes in $R$ (resp., $S$) |
$R \times (S \times T) \equiv (R \times S) \times T$ | Also works for joins |
$R \times S \equiv S \times R$ | Also works for joins |
$R \cup (S \cup T) \equiv (R \cup S) \cup T$ | Also works for intersection and bag-union |
$R \cup S \equiv S \cup R$ | Also works for intersections and bag-union |
$\sigma_{C}(R \cup S) \equiv \sigma_{C}(R) \cup \sigma_{C}(S)$ | Also works for intersections and bag-union |
$\pi_{A}(R \cup S) \equiv \pi_{A}(R) \cup \pi_{A}(S)$ | Also works for intersections and bag-union |
$\sigma_{C}(\gamma_{A, AGG}(R)) \equiv \gamma_{A, AGG}(\sigma_{C}(R))$ | If $A$ contains all of the attributes referenced by $C$ |
Sort/Merge typically expressed as 3 operators
(2xSort + Merge)
Can partition on data-values to support other types of queries.
Grey et. al. "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals
Indexing
Maintain Invariant: All Nodes ≥ 50% Full
(Exception: The Root)
Cost-Based Optimization
Figure out the cost of each individual operator.
Only count the number of IOs added by each operator.
Operation | RA | IOs Added (#pages) | Memory (#tuples) |
---|---|---|---|
Table Scan | $R$ | $\frac{|R|}{\mathcal P}$ | $O(1)$ |
Projection | $\pi(R)$ | $0$ | $O(1)$ |
Selection | $\sigma(R)$ | $0$ | $O(1)$ |
Union | $R \uplus S$ | $0$ | $O(1)$ |
Sort (In-Mem) | $\tau(R)$ | $0$ | $O(|R|)$ |
Sort (On-Disk) | $\tau(R)$ | $\frac{2 \cdot \lfloor log_{\mathcal B}(|R|) \rfloor}{\mathcal P}$ | $O(\mathcal B)$ |
(B+Tree) Index Scan | $Index(R, c)$ | $\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$ | $O(1)$ |
(Hash) Index Scan | $Index(R, c)$ | $1$ | $O(1)$ |
Operation | RA | IOs Added (#pages) | Memory (#tuples) |
---|---|---|---|
Nested Loop Join (Buffer $S$ in mem) | $R \times S$ | $0$ | $O(|S|)$ |
Nested Loop Join (Buffer $S$ on disk) | $R \times_{disk} S$ | $(1+ |R|) \cdot \frac{|S|}{\mathcal P}$ | $O(1)$ |
1-Pass Hash Join | $R \bowtie_{1PH, c} S$ | $0$ | $O(|S|)$ |
2-Pass Hash Join | $R \bowtie_{2PH, c} S$ | $\frac{2|R| + 2|S|}{\mathcal P}$ | $O(1)$ |
Sort-Merge Join | $R \bowtie_{SM, c} S$ | [Sort] | [Sort] |
(Tree) Index NLJ | $R \bowtie_{INL, c}$ | $|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$ | $O(1)$ |
(Hash) Index NLJ | $R \bowtie_{INL, c}$ | $|R| \cdot 1$ | $O(1)$ |
(In-Mem) Aggregate | $\gamma_A(R)$ | $0$ | $adom(A)$ |
(Sort/Merge) Aggregate | $\gamma_A(R)$ | [Sort] | [Sort] |
Estimating IOs requires Estimating $|Q(R)|$
Operator | RA | Estimated Size |
---|---|---|
Table | $R$ | $|R|$ |
Projection | $\pi(Q)$ | $|Q|$ |
Union | $Q_1 \uplus Q_2$ | $|Q_1| + |Q_2|$ |
Cross Product | $Q_1 \times Q_2$ | $|Q_1| \times |Q_2|$ |
Sort | $\tau(Q)$ | $|Q|$ |
Limit | $\texttt{LIMIT}_N(Q)$ | $N$ |
Selection | $\sigma_c(Q)$ | $|Q| \times \texttt{SEL}(c, Q)$ |
Join | $Q_1 \bowtie_c Q_2$ | $|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1\times Q_2)$ |
Distinct | $\delta_A(Q)$ | $\texttt{UNIQ}(A, Q)$ |
Aggregate | $\gamma_{A, B \leftarrow \Sigma}(Q)$ | $\texttt{UNIQ}(A, Q)$ |
Flips | Score | Probability | E[# Games] |
---|---|---|---|
(👽) | 0 | 0.5 | 2 |
(🐕)(👽) | 1 | 0.25 | 4 |
(🐕)(🐕)(👽) | 2 | 0.125 | 8 |
(🐕)$\times N$ (👽) | $N$ | $\frac{1}{2^{N+1}}$ | $2^{N+1}$ |
If I told you that in a series of games, my best score was $N$, you might expect that I played $2^{N+1}$ games.
To do that, I only need to track my top score!
WINDOW
Operator
SELECT L.state, T.month,
AVG(S.sales) OVER W as movavg
FROM Sales S, Times T, Locations L
WHERE S.timeid = T.timeid
AND S.locid = L.locid
WINDOW W AS (
PARTITION BY L.state
ORDER BY T.month
RANGE BETWEEN INTERVAL ‘1’ MONTH PRECEDING
AND INTERVAL ‘1’ MONTH FOLLOWING
)
WINDOW
joins.