February 27, 2019
Figure out the cost of each individual operator.
Only count the number of IOs added by each operator.
Operation | RA | IOs Added (#pages) | Memory (#tuples) |
---|---|---|---|
Table Scan | $R$ | $\frac{|R|}{\mathcal P}$ | $O(1)$ |
Projection | $\pi(R)$ | $0$ | $O(1)$ |
Selection | $\sigma(R)$ | $0$ | $O(1)$ |
Union | $R \cup S$ | $0$ | $O(1)$ |
Sort (In-Mem) | $\tau(R)$ | $0$ | $O(|R|)$ |
Sort (On-Disk) | $\tau(R)$ | $\frac{2 \cdot |R| \cdot \lceil \log_{\mathcal B}(|R|) \rceil}{\mathcal P}$ | $O(\mathcal B)$ |
(B+Tree) Index Scan | $Index(R, c)$ | $\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$ | $O(1)$ |
(Hash) Index Scan | $Index(R, c)$ | $1$ | $O(1)$ |
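As a sketch, the scan and on-disk sort rows above can be turned into a small IO-cost calculator. The concrete values of $\mathcal P$ (tuples per page) and $\mathcal B$ (sort buffer size in tuples) are made-up parameters for illustration, and the sort cost assumes the standard external merge sort: each pass reads and writes every page, with roughly $\log_{\mathcal B}(|R|)$ passes.

```python
from math import ceil, log

# Hypothetical parameters (not fixed by the notes):
P = 100      # tuples per page
B = 10_000   # sort buffer size, in tuples

def scan_ios(r_tuples):
    """Full table scan: |R| / P page reads."""
    return ceil(r_tuples / P)

def external_sort_ios(r_tuples):
    """On-disk (external merge) sort: every pass reads and writes all
    |R| / P pages, and there are about log_B(|R|) passes, giving
    2 * |R| * log_B(|R|) / P page IOs in total."""
    passes = ceil(log(r_tuples, B))
    return ceil(2 * r_tuples * passes / P)

print(scan_ios(1_000_000))           # 10000 page reads
print(external_sort_ios(1_000_000))  # 2 passes -> 40000 page IOs
```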
Operation | RA | IOs Added (#pages) | Memory (#tuples) |
---|---|---|---|
Nested Loop Join (Buffer $S$ in mem) | $R \times S$ | $0$ | $O(|S|)$ |
Nested Loop Join (Buffer $S$ on disk) | $R \times_{disk} S$ | $(1+ |R|) \cdot \frac{|S|}{\mathcal P}$ | $O(1)$ |
1-Pass Hash Join | $R \bowtie_{1PH, c} S$ | $0$ | $O(|S|)$ |
2-Pass Hash Join | $R \bowtie_{2PH, c} S$ | $\frac{2|R| + 2|S|}{\mathcal P}$ | $O(1)$ |
Sort-Merge Join | $R \bowtie_{SM, c} S$ | [Sort] | [Sort] |
(Tree) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$ | $O(1)$ |
(Hash) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot 1$ | $O(1)$ |
(In-Mem) Aggregate | $\gamma_A(R)$ | $0$ | $O(|adom(A)|)$ |
(Sort/Merge) Aggregate | $\gamma_A(R)$ | [Sort] | [Sort] |
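To see why the 2-pass hash join is usually preferred when $S$ doesn't fit in memory, the two on-disk join rows above can be compared directly. The value of $\mathcal P$ is a made-up parameter for illustration.

```python
from math import ceil

P = 100  # tuples per page (made-up parameter)

def nlj_disk_ios(r, s):
    """Nested loop join, buffering S on disk: write S out once, then
    re-read all of S once per tuple of R: (1 + |R|) * |S| / P page IOs."""
    return ceil((1 + r) * s / P)

def two_pass_hash_ios(r, s):
    """2-pass hash join: write both inputs out in partitioned form,
    then read each partition back: (2|R| + 2|S|) / P page IOs."""
    return ceil((2 * r + 2 * s) / P)

print(nlj_disk_ios(10_000, 10_000))       # 1000100 page IOs
print(two_pass_hash_ios(10_000, 10_000))  # 400 page IOs
```

Even at this modest size, the on-disk nested loop join costs over three orders of magnitude more IOs.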
Idea 1: Pick 100 tuples at random from each input table.
Assume: $\texttt{UNIQ}(A, R) = \texttt{UNIQ}(A, S) = N$
It takes $O(\sqrt{N})$ samples from both $R$ and $S$ to get even one match (a birthday-paradox effect).
To be resumed later in the term when we talk about AQP
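A quick simulation of the pitfall above, with a made-up $N$: draw one random tuple from each side per step, and count the draws until the two samples share a join value. The averages come out near $\sqrt{N}$, not near $N$, but that still means tiny fixed-size samples (like 100 tuples) rarely produce any matches.

```python
import random

def samples_until_match(n_distinct, rng):
    """Draw one value from R's side and one from S's side per step;
    stop when the two sample sets share a value (a join match)."""
    r_seen, s_seen = set(), set()
    draws = 0
    while not (r_seen & s_seen):
        r_seen.add(rng.randrange(n_distinct))
        s_seen.add(rng.randrange(n_distinct))
        draws += 1
    return draws

rng = random.Random(42)
N = 10_000
trials = [samples_until_match(N, rng) for _ in range(200)]
avg = sum(trials) / len(trials)
print(avg)  # typically close to sqrt(N) = 100
```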
How DBs Do It: Instrument queries while running them.
Ideal Case: You have some $$f(x) = \left(\texttt{SELECT COUNT(*) FROM R WHERE A = x}\right)$$ (and similarly for the other aggregates)
Slightly Less Ideal Case: You have some $$f(x) \approx \left(\texttt{SELECT COUNT(*) FROM R WHERE A = x}\right)$$
If this sounds like CDF-based indexing... you're right!
... but we're not going to talk about NNs today
Simpler/Faster Idea: Break $f(x)$ into chunks
Name | YearsEmployed | Role |
---|---|---|
'Alice' | 3 | 1 |
'Bob' | 2 | 2 |
'Carol' | 3 | 1 |
'Dave' | 1 | 3 |
'Eve' | 2 | 2 |
'Fred' | 2 | 3 |
'Gwen' | 4 | 1 |
'Harry' | 2 | 3 |
YearsEmployed | COUNT |
---|---|
1 | 1 |
2 | 4 |
3 | 2 |
4 | 1 |
COUNT(DISTINCT YearsEmployed) | $= 4$ |
MIN(YearsEmployed) | $= 1$ |
MAX(YearsEmployed) | $= 4$ |
COUNT(*) WHERE YearsEmployed = 2 | $= 4$ |
YearsEmployed | COUNT |
---|---|
1-2 | 5 |
3-4 | 3 |
COUNT(DISTINCT YearsEmployed) | $= 4$ |
MIN(YearsEmployed) | $= 1$ |
MAX(YearsEmployed) | $= 4$ |
COUNT(*) WHERE YearsEmployed = 2 | $= \frac{5}{2}$ |
YearsEmployed | COUNT |
---|---|
1-4 | 8 |
COUNT(DISTINCT YearsEmployed) | $= 4$ |
MIN(YearsEmployed) | $= 1$ |
MAX(YearsEmployed) | $= 4$ |
COUNT(*) WHERE YearsEmployed = 2 | $= \frac{8}{4}$ |
Value | COUNT |
---|---|
1-10 | 20 |
11-20 | 0 |
21-30 | 15 |
31-40 | 30 |
41-50 | 22 |
51-60 | 63 |
61-70 | 10 |
71-80 | 10 |
SELECT … WHERE A = 33 |
$= \frac{1}{40-30}\cdot 30 = 3$ |
SELECT … WHERE A > 33 |
$= \frac{40-33}{40-30}\cdot 30 + 22 + 63 + 10 + 10 = 126$ |
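A sketch of this bucket-based estimation, using the histogram above. Values are assumed uniform within each bucket; the buckets here have width 10, matching the $\frac{1}{40-30}$ factor in the slide's arithmetic.

```python
# Each bucket: (low, high, count), values assumed uniform within a bucket.
buckets = [(1, 10, 20), (11, 20, 0), (21, 30, 15), (31, 40, 30),
           (41, 50, 22), (51, 60, 63), (61, 70, 10), (71, 80, 10)]

def est_equals(x):
    """Estimate COUNT(*) WHERE A = x: bucket count / bucket width."""
    for lo, hi, cnt in buckets:
        if lo <= x <= hi:
            return cnt / (hi - lo + 1)
    return 0.0

def est_greater(x):
    """Estimate COUNT(*) WHERE A > x: full buckets above x count
    entirely; the bucket containing x is prorated."""
    total = 0.0
    for lo, hi, cnt in buckets:
        if hi <= x:
            continue          # entirely below or at x
        elif lo > x:
            total += cnt      # entirely above x
        else:
            total += cnt * (hi - x) / (hi - lo + 1)
    return total

print(est_equals(33))   # 3.0
print(est_greater(33))  # 126.0
```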
CREATE TABLE R (
A int,
B int UNIQUE
...
PRIMARY KEY (A)
);
No duplicate values in the column. $$\texttt{COUNT(DISTINCT A)} = \texttt{COUNT(*)}$$
CREATE TABLE S (
B int,
...
FOREIGN KEY (B) REFERENCES R(B)
);
All values in the column appear in another table. $$\pi_{attrs(S)}\left(S \bowtie_B R\right) \subseteq S$$
Not expressible in SQL
One set of columns uniquely determines another.
$\pi_{A}(\delta(\pi_{A, B}(R)))$ has no duplicates and...
$$\pi_{attrs(R)-B}(R) \bowtie_A \delta(\pi_{A, B}(R)) = R$$
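The check above can be sketched directly: a functional dependency $A \rightarrow B$ holds exactly when every $A$ value maps to a single $B$ value. For simplicity, rows here are plain $(A, B)$ pairs.

```python
def holds_fd(rows):
    """Return True iff A -> B holds, i.e. no A value is paired
    with two different B values."""
    mapping = {}
    for a, b in rows:
        # setdefault records the first B seen for this A;
        # any later, different B breaks the dependency.
        if mapping.setdefault(a, b) != b:
            return False
    return True

print(holds_fd([(1, 'x'), (2, 'y'), (1, 'x')]))  # True
print(holds_fd([(1, 'x'), (1, 'y')]))            # False
```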
Foreign keys are like pointers. What happens with broken pointers?
Foreign keys are defined with update triggers ON INSERT [X], ON UPDATE [X], and ON DELETE [X]. Depending on what [X] is, the constraint is enforced differently:
CASCADE
NO ACTION
SET NULL
CASCADE and NO ACTION ensure that the data never has broken pointers, so the optimizer can safely rely on the constraint.
A generalization of keys: One set of attributes that uniquely identifies another.
Two rows with the same As must have the same Bs
(but can still have identical Bs for two different As)
"All functional dependencies should be keys."
(Otherwise you want two separate relations)
(for more details, see CSE 560)
$$R \bowtie_{R.A = S.B} S = \sigma_{R.A = S.B}(R \times S)$$ (and $S.B$ is a foreign key referencing $R.A$)
The (foreign) key constraint gives us two things... $$\texttt{COUNT}(\texttt{DISTINCT } A) \approx \texttt{COUNT}(\texttt{DISTINCT } B)$$ and $$\texttt{COUNT}(\texttt{DISTINCT } A) = |R|$$
Based on the first property the total number of rows is roughly... $$|R| \times |S| \times \frac{1}{\texttt{COUNT}(\texttt{DISTINCT } A)}$$
Then based on the second property... $$ = |R| \times |S| \times \frac{1}{|R|} = |S|$$
(Statistics/Histograms will give you the same outcome... but constraints can be easier to propagate)
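A small check of this estimate on synthetic data (table sizes and values are made up): when $S.B$ is a foreign key referencing the key $R.A$, every row of $S$ joins with exactly one row of $R$, so the estimate $|R| \times |S| \times \frac{1}{|R|} = |S|$ is exact.

```python
import random

rng = random.Random(0)

# R.A is a key: 100 distinct values, one row each.
R = [(a, f"r{a}") for a in range(100)]
# S.B is a foreign key: every value of B appears in R.A.
S = [(rng.randrange(100), i) for i in range(500)]

# Naive nested-loop join R ⋈_{R.A = S.B} S.
join = [(a, ra, b, i) for (a, ra) in R for (b, i) in S if a == b]

print(len(join) == len(S))  # True: |R ⋈ S| = |S|, matching the estimate
```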