CSE-4/562 Spring 2019 - Cost-Based Optimization

### Cost-Based Optimization

#### CSE-4/562 Spring 2019

February 25, 2019

### General Query Optimizers

1. Apply blind heuristics (e.g., push down selections)
2. Enumerate all possible execution plans by varying (or for a reasonable subset)
• Join/Union Evaluation Order (commutativity, associativity, distributivity)
• Algorithms for Joins, Aggregates, Sort, Distinct, and others
• Data Access Paths
3. Estimate the cost of each execution plan
4. Pick the execution plan with the lowest cost

Idea 1: Run each plan © Paramount Pictures

If we can't get the exact cost of a plan, what can we do?

Idea 2: Run each plan on a small sample of the data.

Idea 3: Analytically estimate the cost of a plan.

### Plan Cost

CPU Time
How much time is spent processing.
# of IOs
How many random reads + writes go to disk.
Memory Required
How much memory do you need. Randal Munroe (cc-by-nc)

### Remember the Real Goals

1. Accurately rank the plans.
2. Don't spend more time optimizing than you get back.
3. Don't pick a plan that uses more memory than you have.

### Accounting

Figure out the cost of each individual operator.

Only count the number of IOs added by each operator.

Table Scan $R$ $\frac{|R|}{\mathcal P}$ $O(1)$
Projection $\pi(R)$ $0$ $O(1)$
Selection $\sigma(R)$ $0$ $O(1)$
Union $R \uplus S$ $0$ $O(1)$
Sort (In-Mem) $\tau(R)$ $0$ $O(|R|)$
Sort (On-Disk) $\tau(R)$ $\frac{2 \cdot \lfloor log_{\mathcal B}(|R|) \rfloor}{\mathcal P}$ $O(\mathcal B)$
(B+Tree) Index Scan $Index(R, c)$ $\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$ $O(1)$
(Hash) Index Scan $Index(R, c)$ $1$ $O(1)$
1. Tuples per Page ($\mathcal P$) – Normally defined per-schema
2. Size of $R$ ($|R|$)
3. Pages of Buffer ($\mathcal B$)
4. Keys per Index Page ($\mathcal I$)
Nested Loop Join (Buffer $S$ in mem) $R \times S$ $0$ $O(|S|)$
Nested Loop Join (Buffer $S$ on disk) $R \times_{disk} S$ $(1+ |R|) \cdot \frac{|S|}{\mathcal P}$ $O(1)$
1-Pass Hash Join $R \bowtie_{1PH, c} S$ $0$ $O(|S|)$
2-Pass Hash Join $R \bowtie_{2PH, c} S$ $\frac{2|R| + 2|S|}{\mathcal P}$ $O(1)$
Sort-Merge Join $R \bowtie_{SM, c} S$ [Sort] [Sort]
(Tree) Index NLJ $R \bowtie_{INL, c}$ $|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$ $O(1)$
(Hash) Index NLJ $R \bowtie_{INL, c}$ $|R| \cdot 1$ $O(1)$
(In-Mem) Aggregate $\gamma_A(R)$ $0$ $adom(A)$
(Sort/Merge) Aggregate $\gamma_A(R)$ [Sort] [Sort]
1. Tuples per Page ($\mathcal P$) – Normally defined per-schema
2. Size of $R$ ($|R|$)
3. Pages of Buffer ($\mathcal B$)
4. Keys per Index Page ($\mathcal I$)
5. Number of distinct values of $A$ ($adom(A)$)
SymbolParameterType
$\mathcal P$Tuples Per Page Fixed ($\frac{|\text{page}|}{|\text{tuple}|}$)
$|R|$Size of $R$ Precomputed$^*$ ($|R|$)
$\mathcal B$Pages of Buffer Configurable Parameter
$\mathcal I$Keys per Index Page Fixed ($\frac{|\text{page}|}{|\text{key+pointer}|}$)
$adom(A)$Number of distinct values of $A$ Precomputed$^*$ ($|\delta_A(R)|$)

* unless $R$ is a query

Estimating IOs requires Estimating $|Q(R)|$, $|\delta_A(Q(R))|$

### Cardinality Estimation

Unlike estimating IOs, cardinality estimation doesn't care about the algorithm, so we'll just be working with raw RA.

Also unlike estimating IOs, we care about the cardinality of $|Q(R)|$ as a whole, rather than the contribution of each individual operator.

Operator RA Estimated Size
Table $R$ $|R|$
Projection $\pi(Q)$ $|Q|$
Union $Q_1 \uplus Q_2$ $|Q_1| + |Q_2|$
Cross Product $Q_1 \times Q_2$ $|Q_1| \times |Q_2|$
Sort $\tau(Q)$ $|Q|$
Limit $\texttt{LIMIT}_N(Q)$ $N$
Selection $\sigma_c(Q)$ $|Q| \times \texttt{SEL}(c, Q)$
Join $Q_1 \bowtie_c Q_2$ $|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1\times Q_2)$
Distinct $\delta_A(Q)$ $\texttt{UNIQ}(A, Q)$
Aggregate $\gamma_{A, B \leftarrow \Sigma}(Q)$ $\texttt{UNIQ}(A, Q)$
• $\texttt{SEL}(c, Q)$: Selectivity of $c$ on $Q$, or $\frac{|\sigma_c(Q)|}{|Q|}$
• $\texttt{UNIQ}(A, Q)$: # of distinct values of $A$ in $Q$.

### Cardinality Estimation

#### (The Hard Parts)

$\sigma_c(Q)$ (Cardinality Estimation)
How many tuples will a condition $c$ allow to pass?
$\delta_A(Q)$ (Distinct Values Estimation)
How many distinct values of attribute(s) $A$ exist?

Idea 1: Assume each selection filters down to 10% of the data. no... really!

### ... there are problems

#### Inconsistent estimation

$|\sigma_{c_1}(\sigma_{c_2}(R))| \neq |\sigma_{c_1 \wedge c_2}(R)|$

#### Too consistent estimation

$|\sigma_{id = 1}(\texttt{STUDENTS})| = |\sigma_{residence = 'NY'}(\texttt{STUDENTS})|$

... but remember that all we need is to rank plans.

Many major databases (Oracle, Postgres, Teradata, etc...) use something like 10% rule if they have nothing better.

(The specific % varies by DBMS. E.g., Teradata uses 10% for the first AND clause, and 75% for every subsequent clause)

### (Some) Estimation Techniques

The 10% rule
Rules of thumb if you have no other options...
Uniform Prior
Use basic statistics to make a very rough guess.
Sampling / History
Small, Quick Sampling Runs (or prior executions of the query).
Histograms
Using more detailed statistics for improved guesses.
Constraints
Using rules about the data for improved guesses.

### Uniform Prior

We assume that for $\sigma_c(Q)$ or $\delta_A(Q)$...

1. Basic statistics are known about $Q$:
• COUNT(*)
• COUNT(DISTINCT A) (for each A)
• MIN(A), MAX(A) (for each numeric A)
2. Attribute values are uniformly distributed.
3. No inter-attribute correlations.

If necessary statistics aren't available (point 1), fall back to the 10% rule.

If statistical assumptions (points 2, 3) aren't perfectly true, we'll still likely be getting a better estimate than the 10% rule.

### COUNT(DISTINCT A)

$\texttt{UNIQ}(A, \pi_{A, \ldots}(R)) = \texttt{UNIQ}(A, R)$

$\texttt{UNIQ}(A, \sigma(R)) \approx \texttt{UNIQ}(A, R)$

$\texttt{UNIQ}(A, R \times S) = \texttt{UNIQ}(A, R)$ or $\texttt{UNIQ}(A, S)$

$$max(\texttt{UNIQ}(A, R), \texttt{UNIQ}(A, S)) \leq\\ \texttt{UNIQ}(A, R \uplus S)\\ \leq \texttt{UNIQ}(A, R) + \texttt{UNIQ}(A, S)$$

### MIN(A), MAX(A)

$min_A(\pi_{A, \ldots}(R)) = min_A(R)$

$min_A(\sigma_{A, \ldots}(R)) \approx min_A(R)$

$min_A(R \times S) = min_A(R)$ or $min_A(S)$

$min_A(R \uplus S) = min(min_A(R), min_A(S))$

Estimating $\delta_A(Q)$ requires only COUNT(DISTINCT A)

### Estimating Selectivity

Selectivity is a probability ($\texttt{SEL}(c, Q) = P(c)$)

 $P(A = x_1)$ $=$ $\frac{1}{\texttt{COUNT(DISTINCT A)}}$ $P(A \in (x_1, x_2, \ldots, x_N))$ $=$ $\frac{N}{\texttt{COUNT(DISTINCT A)}}$ $P(A \leq x_1)$ $=$ $\frac{x_1 - \texttt{MIN(A)}}{\texttt{MAX(A)} - \texttt{MIN(A)}}$ $P(x_1 \leq A \leq x_2)$ $=$ $\frac{x_2 - x_1}{\texttt{MAX(A)} - \texttt{MIN(A)}}$ $P(A = B)$ $=$ $\textbf{min}\left( \frac{1}{\texttt{COUNT(DISTINCT A)}}, \frac{1}{\texttt{COUNT(DISTINCT B)}} \right)$ $P(c_1 \wedge c_2)$ $=$ $P(c_1) \cdot P(c_2)$ $P(c_1 \vee c_2)$ $=$ $1 - (1 - P(c_1)) \cdot (1 - P(c_2))$

(With constants $x_1$, $x_2$, ...)

### Limitations

Don't always have statistics for $Q$
For example, $\pi_{A \leftarrow (B \cdot C)}(R)$
Don't always have clear rules for $c$
For example, $\sigma_{\texttt{FitsModel}(A, B, C)}(R)$
Attribute values are not always uniformly distributed.
For example, $|\sigma_{SPC\_COMMON = 'pin\ oak'}(T)|$ vs $|\sigma_{SPC\_COMMON = 'honeylocust'}(T)|$
Attribute values are sometimes correlated.
For example, $\sigma_{(stump < 5) \wedge (diam > 3)}(T)$

...but handles most usage patterns

... next class more!