CSE-4/562 Spring 2019 - Cost-Based Optimization

### Cost-Based Optimization

#### CSE-4/562 Spring 2019

February 27, 2019

### Remember the Real Goals

1. Accurately rank the plans.
2. Don't spend more time optimizing than you get back.
3. Don't pick a plan that uses more memory than you have.

### Accounting

Figure out the cost of each individual operator.

Only count the number of IOs added by each operator.

Table Scan $R$ $\frac{|R|}{\mathcal P}$ $O(1)$
Projection $\pi(R)$ $0$ $O(1)$
Selection $\sigma(R)$ $0$ $O(1)$
Union $R \cup S$ $0$ $O(1)$
Sort (In-Mem) $\tau(R)$ $0$ $O(|R|)$
Sort (On-Disk) $\tau(R)$ $\frac{2 \cdot \lfloor log_{\mathcal B}(|R|) \rfloor}{\mathcal P}$ $O(\mathcal B)$
(B+Tree) Index Scan $Index(R, c)$ $\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$ $O(1)$
(Hash) Index Scan $Index(R, c)$ $1$ $O(1)$
1. Tuples per Page ($\mathcal P$) – Normally defined per-schema
2. Size of $R$ ($|R|$)
3. Pages of Buffer ($\mathcal B$)
4. Keys per Index Page ($\mathcal I$)
Nested Loop Join (Buffer $S$ in mem) $R \times S$ $0$ $O(|S|)$
Nested Loop Join (Buffer $S$ on disk) $R \times_{disk} S$ $(1+ |R|) \cdot \frac{|S|}{\mathcal P}$ $O(1)$
1-Pass Hash Join $R \bowtie_{1PH, c} S$ $0$ $O(|S|)$
2-Pass Hash Join $R \bowtie_{2PH, c} S$ $\frac{2|R| + 2|S|}{\mathcal P}$ $O(1)$
Sort-Merge Join $R \bowtie_{SM, c} S$ [Sort] [Sort]
(Tree) Index NLJ $R \bowtie_{INL, c}$ $|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$ $O(1)$
(Hash) Index NLJ $R \bowtie_{INL, c}$ $|R| \cdot 1$ $O(1)$
(In-Mem) Aggregate $\gamma_A(R)$ $0$ $adom(A)$
(Sort/Merge) Aggregate $\gamma_A(R)$ [Sort] [Sort]
1. Tuples per Page ($\mathcal P$) – Normally defined per-schema
2. Size of $R$ ($|R|$)
3. Pages of Buffer ($\mathcal B$)
4. Keys per Index Page ($\mathcal I$)
5. Number of distinct values of $A$ ($adom(A)$)

### Cardinality Estimation

#### (The Hard Parts)

$\sigma_c(Q)$ (Cardinality Estimation)
How many tuples will a condition $c$ allow to pass?
$\delta_A(Q)$ (Distinct Values Estimation)
How many distinct values of attribute(s) $A$ exist?

### Remember the Real Goals

1. Accurately rank the plans.
2. Don't spend more time optimizing than you get back.

### (Some) Estimation Techniques

Guess Randomly
Rules of thumb if you have no other options...
Uniform Prior
Use basic statistics to make a very rough guess.
Sampling / History
Small, Quick Sampling Runs (or prior executions of the query).
Histograms
Using more detailed statistics for improved guesses.
Constraints
Using rules about the data for improved guesses.

### (Some) Estimation Techniques

Guess Randomly
Rules of thumb if you have no other options...
Uniform Prior
Use basic statistics to make a very rough guess.
Sampling / History
Small, Quick Sampling Runs (or prior executions of the query).
Histograms
Using more detailed statistics for improved guesses.
Constraints
Using rules about the data for improved guesses.

Idea 1: Pick 100 tuples at random from each input table.

Assume: $\texttt{UNIQ}(A, R) = \texttt{UNIQ}(A, S) = N$

It takes $O(\sqrt{N})$ samples from both $R$ and $S$
to get even one match.

To be resumed later in the term when we talk about AQP

How DBs Do It: Instrument queries while running them.

• The first time you run a query it might be slow.
• The second, third, fourth, etc... times it'll be fast.

### (Some) Estimation Techniques

Guess Randomly
Rules of thumb if you have no other options...
Uniform Prior
Use basic statistics to make a very rough guess.
Sampling / History
Small, Quick Sampling Runs (or prior executions of the query).
Histograms
Using more detailed statistics for improved guesses.
Constraints
Using rules about the data for improved guesses.

### Limitations of Uniform Prior

Don't always have statistics for $Q$
For example, $\pi_{A \leftarrow (B \times C)}(R)$
Don't always have clear rules for $c$
For example, $\sigma_{\texttt{FitsModel}(A, B, C)}(R)$
Attribute values are not always uniformly distributed.
For example, $|\sigma_{SPC\_COMMON = 'pin\ oak'}(T)|$ vs $|\sigma_{SPC\_COMMON = 'honeylocust'}(T)|$
Attribute values are sometimes correlated.
For example, $\sigma_{(stump < 5) \wedge (diam > 3)}(T)$

Ideal Case: You have some $$f(x) = \left(\texttt{SELECT COUNT(*) WHERE A = x}\right)$$ (and similarly for the other aggregates)

Slightly Less Ideal Case: You have some $$f(x) \approx \left(\texttt{SELECT COUNT(*) WHERE A = x}\right)$$

If this sounds like CDF-based indexing... you're right!

... but we're not going to talk about NNs today

Simpler/Faster Idea: Break $f(x)$ into chunks

### Example Data

Name YearsEmployed Role
'Alice' 3 1
'Bob' 2 2
'Carol' 3 1
'Dave' 1 3
'Eve' 2 2
'Fred' 2 3
'Gwen' 4 1
'Harry' 2 3

### Histograms

YearsEmployedCOUNT
1 1
2 4
3 2
4 1
 COUNT(DISTINCT YearsEmployed) $= 4$ MIN(YearsEmployed) $= 1$ MAX(YearsEmplyed) $= 4$ COUNT(*) YearsEmployed = 2 $= 4$

### Histograms

YearsEmployedCOUNT
1-2 5
3-4 3
 COUNT(DISTINCT YearsEmployed) $= 4$ MIN(YearsEmployed) $= 1$ MAX(YearsEmplyed) $= 4$ COUNT(*) YearsEmployed = 2 $= \frac{5}{2}$

### The Extreme Case

YearsEmployedCOUNT
1-4 8
 COUNT(DISTINCT YearsEmployed) $= 4$ MIN(YearsEmployed) $= 1$ MAX(YearsEmplyed) $= 4$ COUNT(*) YearsEmployed = 2 $= \frac{8}{4}$

### More Example Data

Value COUNT
1-10 20
11-20 0
21-30 15
31-40 30
41-50 22
51-60 63
61-70 10
71-80 10
 SELECT … WHERE A = 33 $= \frac{1}{40-30}\cdot 30 = 3$ SELECT … WHERE A > 33 $= \frac{40-33}{40-30}\cdot 30+22$ $\;\;\;+63+10+10$ $= 126$

### (Some) Estimation Techniques

Guess Randomly
Rules of thumb if you have no other options...
Uniform Prior
Use basic statistics to make a very rough guess.
Sampling / History
Small, Quick Sampling Runs (or prior executions of the query).
Histograms
Using more detailed statistics for improved guesses.
Constraints
Using rules about the data for improved guesses.

### Key / Unique Constraints


CREATE TABLE R (
A int,
B int UNIQUE
...
PRIMARY KEY A
);


No duplicate values in the column. $$\texttt{COUNT(DISTINCT A)} = \texttt{COUNT(*)}$$

### Foreign Key Constraints


CREATE TABLE S (
B int,
...
FOREIGN KEY B REFERENCES R.B
);


All values in the column appear in another table. $$\pi_{attrs(S)}\left(S \bowtie_B R\right) \subseteq S$$

### Functional Dependencies


Not expressible in SQL


One set of columns uniquely determines another.
$\pi_{A}(\delta(\pi_{A, B}(R)))$ has no duplicates and... $$\pi_{attrs(R)-A}(R) \bowtie_A \delta(\pi_{A, B}(R)) = R$$

### Constraints

#### The Good

• Sanity check on your data: Inconsistent data triggers failures.
• More opportunities for query optimization.

#### The Not-So Good

• Validating constraints whenever data changes is (usually) expensive.
• Inconsistent data triggers failures.

### Foreign Key Constraints

Foreign keys are like pointers. What happens with broken pointers?

### Foreign Key Enforcement

Foreign keys are defined with update triggers ON INSERT [X], ON UPDATE [X], ON DELETE [X]. Depending on what [X] is, the constraint is enforced differently:

CASCADE
Create/delete rows as needed to avoid invalid foreign keys.
NO ACTION
Abort any transaction that ends with an invalid foreign key reference.
SET NULL
Automatically replace any invalid foreign key references with NULL
.

CASCADE and NO ACTION ensure that the data never has broken pointers, so

$$\pi_{attrs(S)}\left(S \bowtie_B R\right) = S$$

### Functional Dependencies

A generalization of keys: One set of attributes that uniquely identify another.

• SS# uniquely identifies Name.
• Employee uniquely identifies Manager.
• Order number uniquely identifies Customer Address.

Two rows with the same As must have the same Bs

(but can still have identical Bs for two different As)

### Normal Forms

"All functional dependencies should be keys."

(Otherwise you want two separate relations)

(for more details, see CSE 560)

$$P(A = B) = min\left(\frac{1}{\texttt{COUNT}(\texttt{DISTINCT } A)}, \frac{1}{\texttt{COUNT}(\texttt{DISTINCT } B)}\right)$$

$$R \bowtie_{R.A = S.B} S = \sigma_{R.A = S.B}(R \times S)$$ (and $S.B$ is a foreign key referencing $R.A$)

The (foreign) key constraint gives us two things... $$\texttt{COUNT}(\texttt{DISTINCT } A) \approx \texttt{COUNT}(\texttt{DISTINCT } B)$$ and $$\texttt{COUNT}(\texttt{DISTINCT } A) = |R|$$

Based on the first property the total number of rows is roughly... $$|R| \times |S| \times \frac{1}{\texttt{COUNT}(\texttt{DISTINCT } A)}$$

Then based on the second property... $$= |R| \times |S| \times \frac{1}{|R|} = |S|$$

(Statistics/Histograms will give you the same outcome... but constraints can be easier to propagate)