CSE 4/562 - Database Systems

- Accurately **rank** the plans.
- Don't spend more time optimizing than you get back.
- Don't pick a plan that uses more memory than you have.

Figure out the cost of each **individual** operator.

Only count the number of IOs **added** by each operator.

Operation | RA | IOs Added (#pages) | Memory (#tuples)
---|---|---|---
Table Scan | $R$ | $\frac{|R|}{\mathcal P}$ | $O(1)$
Projection | $\pi(R)$ | $0$ | $O(1)$
Selection | $\sigma(R)$ | $0$ | $O(1)$
Union | $R \cup S$ | $0$ | $O(1)$
Sort (In-Mem) | $\tau(R)$ | $0$ | $O(|R|)$
Sort (On-Disk) | $\tau(R)$ | $\frac{2 \cdot |R| \cdot \lfloor \log_{\mathcal B}(|R|) \rfloor}{\mathcal P}$ | $O(\mathcal B)$
(B+Tree) Index Scan | $Index(R, c)$ | $\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$ | $O(1)$
(Hash) Index Scan | $Index(R, c)$ | $1$ | $O(1)$

- Tuples per Page ($\mathcal P$) – Normally defined per-schema
- Size of $R$ in tuples ($|R|$)
- Pages of Buffer ($\mathcal B$)
- Keys per Index Page ($\mathcal I$)
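A quick worked example with hypothetical numbers: if $|R| = 1{,}000{,}000$ tuples and $\mathcal P = 100$ tuples/page, a table scan costs $\frac{1{,}000{,}000}{100} = 10{,}000$ IOs. Layering a selection and a projection on top adds $0$ further IOs, and an in-memory sort also adds $0$ IOs, but only if the $O(|R|)$ tuples of working memory are actually available.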

Operation | RA | IOs Added (#pages) | Memory (#tuples)
---|---|---|---
Nested Loop Join (Buffer $S$ in mem) | $R \times S$ | $0$ | $O(|S|)$
Nested Loop Join (Buffer $S$ on disk) | $R \times_{disk} S$ | $(1+ |R|) \cdot \frac{|S|}{\mathcal P}$ | $O(1)$
1-Pass Hash Join | $R \bowtie_{1PH, c} S$ | $0$ | $O(|S|)$
2-Pass Hash Join | $R \bowtie_{2PH, c} S$ | $\frac{2|R| + 2|S|}{\mathcal P}$ | $O(1)$
Sort-Merge Join | $R \bowtie_{SM, c} S$ | [Sort] | [Sort]
(Tree) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$ | $O(1)$
(Hash) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot 1$ | $O(1)$
(In-Mem) Aggregate | $\gamma_A(R)$ | $0$ | $O(adom(A))$
(Sort/Merge) Aggregate | $\gamma_A(R)$ | [Sort] | [Sort]

- Tuples per Page ($\mathcal P$) – Normally defined per-schema
- Size of $R$ in tuples ($|R|$)
- Pages of Buffer ($\mathcal B$)
- Keys per Index Page ($\mathcal I$)
- Number of distinct values of $A$ ($adom(A)$)
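Continuing the hypothetical numbers above ($\mathcal P = 100$, $|R| = |S| = 1{,}000{,}000$): a 1-pass hash join adds $0$ IOs but must hold all of $S$ in memory, while a 2-pass hash join adds $\frac{2|R| + 2|S|}{\mathcal P} = \frac{4{,}000{,}000}{100} = 40{,}000$ IOs in $O(1)$ memory. The plans trade IO for memory.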

Estimating IOs requires estimating $|Q(R)|$.

Unlike estimating IOs, cardinality estimation doesn't care about the algorithm, so we'll just be working with raw RA.

Also unlike estimating IOs, we care about the cardinality $|Q(R)|$ of the query as a whole, rather than the contribution of each individual operator.

Operator | RA | Estimated Size
---|---|---
Table | $R$ | $|R|$
Projection | $\pi(Q)$ | $|Q|$
Union | $Q_1 \uplus Q_2$ | $|Q_1| + |Q_2|$
Cross Product | $Q_1 \times Q_2$ | $|Q_1| \times |Q_2|$
Sort | $\tau(Q)$ | $|Q|$
Limit | $\texttt{LIMIT}_N(Q)$ | $N$
Selection | $\sigma_c(Q)$ | $|Q| \times \texttt{SEL}(c, Q)$
Join | $Q_1 \bowtie_c Q_2$ | $|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1\times Q_2)$
Distinct | $\delta_A(Q)$ | $\texttt{UNIQ}(A, Q)$
Aggregate | $\gamma_{A, B \leftarrow \Sigma}(Q)$ | $\texttt{UNIQ}(A, Q)$

- $\texttt{SEL}(c, Q)$: Selectivity of $c$ on $Q$, or $\frac{|\sigma_c(Q)|}{|Q|}$
- $\texttt{UNIQ}(A, Q)$: # of distinct values of $A$ in $Q$.
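Estimates compose bottom-up: for example, the estimated size of $\sigma_{c_2}(Q_1 \bowtie_{c_1} Q_2)$ is $|Q_1| \times |Q_2| \times \texttt{SEL}(c_1, Q_1 \times Q_2) \times \texttt{SEL}(c_2, Q_1 \bowtie_{c_1} Q_2)$, with each operator's estimated output cardinality serving as the input cardinality of the operator above it.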

- $\sigma_c(Q)$ (Cardinality Estimation)
  - How many tuples will a condition $c$ allow to pass?
- $\delta_A(Q)$ (Distinct Values Estimation)
  - How many distinct values of attribute(s) $A$ exist?

- Accurately **rank** the plans.
- Don't spend more time optimizing than you get back.

- Guess Randomly
  - Rules of thumb if you have no other options...
- Uniform Prior
  - Use basic statistics to make a very rough guess.
- Sampling / History
  - Small, Quick Sampling Runs (or prior executions of the query).
- Histograms
  - Using more detailed statistics for improved guesses.
- Constraints
  - Using rules about the data for improved guesses.


Every select or distinct operator passes 10% of all rows.

$$\sigma_{A = 1 \wedge B = 2}(R)$$

$$|\sigma_{A = 1 \wedge B = 2}(R)| = 0.1 \cdot |R|$$

$$\sigma_{A = 1}(\sigma_{B = 2}(R))$$

$$|\sigma_{A = 1}(\sigma_{B = 2}(R))| = 0.1 \cdot |\sigma_{B = 2}(R)| = 0.1 \cdot 0.1 \cdot |R|$$

(Queries are typically standardized first)

(The specific % varies by DBMS. E.g., Teradata uses 10% for the first `AND` clause, and 75% for every subsequent clause)

The 10% rule is a fallback when everything else fails.

Usually, databases collect statistics...

- Guess Randomly
  - Rules of thumb if you have no other options...
- Uniform Prior
  - Use basic statistics to make a very rough guess.
- Sampling / History
  - Small, Quick Sampling Runs (or prior executions of the query).
- Histograms
  - Using more detailed statistics for improved guesses.
- Constraints
  - Using rules about the data for improved guesses.

We assume that for $\sigma_c(Q)$ or $\delta_A(Q)$...

1. Basic statistics are known about $Q$: `COUNT(*)`; `COUNT(DISTINCT A)` for each attribute $A$; and `MIN(A)`, `MAX(A)` for each numeric attribute $A$.
2. Attribute values are uniformly distributed.
3. There are no inter-attribute correlations.

If (1) fails, fall back to the 10% rule.

If (2) or (3) fails, it'll often still be a *good enough* estimate.

Estimating $\delta_A(Q)$ requires only `COUNT(DISTINCT A)`

Selectivity is a probability ($\texttt{SEL}(c, Q) = P(c)$)

$$\begin{aligned}
P(A = x_1) &= \frac{1}{\texttt{COUNT(DISTINCT A)}} \\
P(A \in (x_1, x_2, \ldots, x_N)) &= \frac{N}{\texttt{COUNT(DISTINCT A)}} \\
P(A \leq x_1) &= \frac{x_1 - \texttt{MIN(A)}}{\texttt{MAX(A)} - \texttt{MIN(A)}} \\
P(x_1 \leq A \leq x_2) &= \frac{x_2 - x_1}{\texttt{MAX(A)} - \texttt{MIN(A)}} \\
P(A = B) &= \min\left( \frac{1}{\texttt{COUNT(DISTINCT A)}}, \frac{1}{\texttt{COUNT(DISTINCT B)}} \right) \\
P(c_1 \wedge c_2) &= P(c_1) \cdot P(c_2) \\
P(c_1 \vee c_2) &= 1 - (1 - P(c_1)) \cdot (1 - P(c_2))
\end{aligned}$$

(with constants $x_1$, $x_2$, ...)
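These rules translate directly into code. Here is a minimal sketch, assuming the statistics (`COUNT(DISTINCT A)`, `MIN(A)`, `MAX(A)`) were collected ahead of time; the function and parameter names are illustrative, not any particular DBMS's API:

```
# Uniform-prior selectivity estimation (a sketch; names are illustrative).

def sel_equals(n_distinct):
    """P(A = x): every distinct value is assumed equally likely."""
    return 1.0 / n_distinct

def sel_in(n_values, n_distinct):
    """P(A IN (x1, ..., xN))."""
    return min(1.0, n_values / n_distinct)

def sel_range(x1, x2, min_a, max_a):
    """P(x1 <= A <= x2): values assumed uniform over [min_a, max_a]."""
    overlap = max(0.0, min(x2, max_a) - max(x1, min_a))
    return overlap / (max_a - min_a)

def sel_eq_attrs(n_distinct_a, n_distinct_b):
    """P(A = B)."""
    return min(1.0 / n_distinct_a, 1.0 / n_distinct_b)

def sel_and(p1, p2):
    """P(c1 AND c2), assuming independent conditions."""
    return p1 * p2

def sel_or(p1, p2):
    """P(c1 OR c2), assuming independent conditions."""
    return 1.0 - (1.0 - p1) * (1.0 - p2)

# Example: |sigma_{10 <= A <= 20 AND B = 5}(R)| with |R| = 1000,
# A uniform over [0, 100], and 50 distinct values of B:
#   1000 * sel_and(sel_range(10, 20, 0, 100), sel_equals(50))  ->  2.0
```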

- Don't always have statistics for $Q$
  - For example, $\pi_{A \leftarrow (B \times C)}(R)$
- Don't always have clear rules for $c$
  - For example, $\sigma_{\texttt{FitsModel}(A, B, C)}(R)$
- Attribute values are not always uniformly distributed.
  - For example, $|\sigma_{SPC\_COMMON = 'pin\ oak'}(T)|$ vs $|\sigma_{SPC\_COMMON = 'honeylocust'}(T)|$
- Attribute values are sometimes correlated.
  - For example, $\sigma_{(stump < 5) \wedge (diam > 3)}(T)$

- Guess Randomly
  - Rules of thumb if you have no other options...
- Uniform Prior
  - Use basic statistics to make a very rough guess.
- Sampling / History
  - Small, Quick Sampling Runs (or prior executions of the query).
- Histograms
  - Using more detailed statistics for improved guesses.
- Constraints
  - Using rules about the data for improved guesses.

**Idea 1:** Pick 100 tuples at random from each input table.

Assume: $\texttt{UNIQ}(A, R) = \texttt{UNIQ}(A, S) = N$

It takes $O(\sqrt{N})$ samples from both $R$ and $S$ to get even **one match**: with $k$ samples from each side, the expected number of matching pairs is roughly $\frac{k^2}{N}$, so $k$ must be on the order of $\sqrt{N}$ before a single match is expected.

To be resumed later in the term when we talk about AQP (Approximate Query Processing).

**How DBs Do It:** Instrument queries while running them.

- The first time you run a query it *might* be slow.
- The second, third, fourth, etc. times it'll be fast.

- Guess Randomly
  - Rules of thumb if you have no other options...
- Uniform Prior
  - Use basic statistics to make a very rough guess.
- Sampling / History
  - Small, Quick Sampling Runs (or prior executions of the query).
- Histograms
  - Using more detailed statistics for improved guesses.
- Constraints
  - Using rules about the data for improved guesses.


**Ideal Case:** You have some
$$f(x) = \left(\texttt{SELECT COUNT(*) WHERE A = x}\right)$$
(and similarly for the other aggregates)

**Slightly Less Ideal Case:** You have some
$$f(x) \approx \left(\texttt{SELECT COUNT(*) WHERE A = x}\right)$$

If this sounds like CDF-based indexing... you're right!

... but we're not going to talk about NNs today

**Simpler/Faster Idea:** Break $f(x)$ into chunks.

Name | YearsEmployed | Role
---|---|---
'Alice' | 3 | 1
'Bob' | 2 | 2
'Carol' | 3 | 1
'Dave' | 1 | 3
'Eve' | 2 | 2
'Fred' | 2 | 3
'Gwen' | 4 | 1
'Harry' | 2 | 3

YearsEmployed | COUNT
---|---
1 | 1
2 | 4
3 | 2
4 | 1

- `COUNT(DISTINCT YearsEmployed)` $= 4$
- `MIN(YearsEmployed)` $= 1$
- `MAX(YearsEmployed)` $= 4$
- `COUNT(*) WHERE YearsEmployed = 2` $= 4$

YearsEmployed | COUNT
---|---
1-2 | 5
3-4 | 3

- `COUNT(DISTINCT YearsEmployed)` $= 4$
- `MIN(YearsEmployed)` $= 1$
- `MAX(YearsEmployed)` $= 4$
- `COUNT(*) WHERE YearsEmployed = 2` $\approx \frac{5}{2}$ (5 rows spread uniformly over 2 distinct values)

YearsEmployed | COUNT
---|---
1-4 | 8

- `COUNT(DISTINCT YearsEmployed)` $= 4$
- `MIN(YearsEmployed)` $= 1$
- `MAX(YearsEmployed)` $= 4$
- `COUNT(*) WHERE YearsEmployed = 2` $\approx \frac{8}{4}$ (8 rows spread uniformly over 4 distinct values)

Value | COUNT
---|---
1-10 | 20
11-20 | 0
21-30 | 15
31-40 | 30
41-50 | 22
51-60 | 63
61-70 | 10
71-80 | 10

- `SELECT … WHERE A = 33`: $\frac{1}{40-30}\cdot 30 = 3$
- `SELECT … WHERE A > 33`: $\frac{40-33}{40-30}\cdot 30 + 22 + 63 + 10 + 10 = 126$
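As a hypothetical sketch (the bucket layout comes from the table above, but the names and the inclusive-integer-bucket convention are assumptions of this example), both estimates can be computed mechanically:

```
# Histogram buckets from the table above: (low, high, count), inclusive.
BUCKETS = [(1, 10, 20), (11, 20, 0), (21, 30, 15), (31, 40, 30),
           (41, 50, 22), (51, 60, 63), (61, 70, 10), (71, 80, 10)]

def estimate_equals(buckets, x):
    """Estimate COUNT(*) WHERE A = x."""
    for lo, hi, count in buckets:
        if lo <= x <= hi:
            # Spread the bucket's count evenly over its distinct values.
            return count / (hi - lo + 1)
    return 0.0

def estimate_greater(buckets, x):
    """Estimate COUNT(*) WHERE A > x."""
    total = 0.0
    for lo, hi, count in buckets:
        if x < lo:                        # bucket entirely above x
            total += count
        elif x <= hi:                     # bucket straddles x
            total += count * (hi - x) / (hi - lo + 1)
    return total

print(estimate_equals(BUCKETS, 33))    # 3.0
print(estimate_greater(BUCKETS, 33))   # 126.0
```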

- Guess Randomly
  - Rules of thumb if you have no other options...
- Uniform Prior
  - Use basic statistics to make a very rough guess.
- Sampling / History
  - Small, Quick Sampling Runs (or prior executions of the query).
- Histograms
  - Using more detailed statistics for improved guesses.
- Constraints
  - Using rules about the data for improved guesses.


```
CREATE TABLE R (
  A int,
  B int UNIQUE,
  ...
  PRIMARY KEY (A)
);
```

No duplicate values in the column. $$\texttt{COUNT(DISTINCT A)} = \texttt{COUNT(*)}$$

```
CREATE TABLE S (
  B int,
  ...
  FOREIGN KEY (B) REFERENCES R(B)
);
```

All values in the column appear in another table, so the join loses no rows of $S$. $$\pi_{attrs(S)}\left(S \bowtie_B R\right) = S$$

```
Not expressible in SQL
```

One set of columns uniquely determines another.

$\pi_{A}(\delta(\pi_{A, B}(R)))$ has no duplicates and...
$$\pi_{attrs(R)-B}(R) \bowtie_A \delta(\pi_{A, B}(R)) = R$$

**Upsides:**

- Sanity check on your data: inconsistent data triggers failures.
- More opportunities for query optimization.

**Downsides:**

- Validating constraints whenever data changes is (usually) expensive.
- Inconsistent data triggers failures.

Foreign keys are like pointers. What happens with broken pointers?

Foreign keys are defined with referential actions `ON UPDATE [X]` and `ON DELETE [X]`. Depending on what `[X]` is, the constraint is enforced differently:

- `CASCADE`: Create/delete rows as needed to avoid invalid foreign key references.
- `NO ACTION`: Abort any transaction that ends with an invalid foreign key reference.
- `SET NULL`: Automatically replace any invalid foreign key references with `NULL`.

`CASCADE` and `NO ACTION` ensure that the data never has broken pointers, so the optimizer can safely rely on the constraint (e.g., for the join cardinality estimate below).

**A generalization of keys:** One set of attributes uniquely identifies another.

- SS# uniquely identifies Name.
- Employee uniquely identifies Manager.
- Order number uniquely identifies Customer Address.

Two rows with the same As must have the same Bs

(but can still have identical Bs for two different As)

"All functional dependencies should be keys."

(Otherwise you want two separate relations)

(for more details, see CSE 560)

$$R \bowtie_{R.A = S.B} S = \sigma_{R.A = S.B}(R \times S)$$ (and $S.B$ is a foreign key referencing $R.A$)

The (foreign) key constraint gives us two things... $$\texttt{COUNT}(\texttt{DISTINCT } A) \approx \texttt{COUNT}(\texttt{DISTINCT } B)$$ and $$\texttt{COUNT}(\texttt{DISTINCT } A) = |R|$$

Based on the first property, the total number of rows is roughly... $$|R| \times |S| \times \frac{1}{\texttt{COUNT}(\texttt{DISTINCT } A)}$$

Then based on the second property... $$ = |R| \times |S| \times \frac{1}{|R|} = |S|$$
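For example, with hypothetical sizes $|R| = 100$ and $|S| = 10{,}000$: $100 \times 10{,}000 \times \frac{1}{100} = 10{,}000 = |S|$, matching the intuition that each row of $S$ joins with exactly one row of $R$.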

(Statistics/Histograms will give you the same outcome... but constraints can be easier to propagate)

**Next class:** Exam Review