February 25, 2019
Idea 1: Run each plan
If we can't get the exact cost of a plan, what can we do?
Idea 2: Run each plan on a small sample of the data.
Idea 3: Analytically estimate the cost of a plan.
Figure out the cost of each individual operator.
Only count the number of IOs added by each operator.
Operation | RA | IOs Added (#pages) | Memory (#tuples) |
---|---|---|---|
Table Scan | $R$ | $\frac{|R|}{\mathcal P}$ | $O(1)$ |
Projection | $\pi(R)$ | $0$ | $O(1)$ |
Selection | $\sigma(R)$ | $0$ | $O(1)$ |
Union | $R \uplus S$ | $0$ | $O(1)$ |
Sort (In-Mem) | $\tau(R)$ | $0$ | $O(|R|)$ |
Sort (On-Disk) | $\tau(R)$ | $\frac{2 \cdot \lfloor \log_{\mathcal B}(|R|) \rfloor \cdot |R|}{\mathcal P}$ | $O(\mathcal B)$ |
(B+Tree) Index Scan | $Index(R, c)$ | $\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$ | $O(1)$ |
(Hash) Index Scan | $Index(R, c)$ | $1$ | $O(1)$ |
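To make these formulas concrete, a quick worked example with hypothetical numbers ($|R| = 1{,}000{,}000$ tuples, $\mathcal P = 100$, $\mathcal I = 200$, and a condition $c$ matching $10{,}000$ tuples):

$$\text{Table Scan: } \frac{|R|}{\mathcal P} = \frac{1{,}000{,}000}{100} = 10{,}000 \text{ IOs} \qquad \text{(B+Tree) Index Scan: } \log_{200}(1{,}000{,}000) + \frac{10{,}000}{100} \approx 2.6 + 100 \approx 103 \text{ IOs}$$

When the selection is this narrow, the index scan wins by roughly two orders of magnitude.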
Operation | RA | IOs Added (#pages) | Memory (#tuples) |
---|---|---|---|
Nested Loop Join (Buffer $S$ in mem) | $R \times S$ | $0$ | $O(|S|)$ |
Nested Loop Join (Buffer $S$ on disk) | $R \times_{disk} S$ | $(1+ |R|) \cdot \frac{|S|}{\mathcal P}$ | $O(1)$ |
1-Pass Hash Join | $R \bowtie_{1PH, c} S$ | $0$ | $O(|S|)$ |
2-Pass Hash Join | $R \bowtie_{2PH, c} S$ | $\frac{2|R| + 2|S|}{\mathcal P}$ | $O(1)$ |
Sort-Merge Join | $R \bowtie_{SM, c} S$ | [Sort] | [Sort] |
(Tree) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$ | $O(1)$ |
(Hash) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot 1$ | $O(1)$ |
(In-Mem) Aggregate | $\gamma_A(R)$ | $0$ | $adom(A)$ |
(Sort/Merge) Aggregate | $\gamma_A(R)$ | [Sort] | [Sort] |
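The same back-of-the-envelope arithmetic (hypothetical numbers again: $|R| = |S| = 1{,}000{,}000$, $\mathcal P = 100$) shows why the choice of join algorithm dominates the cost:

$$\text{2-Pass Hash Join: } \frac{2|R| + 2|S|}{\mathcal P} = 40{,}000 \text{ IOs} \qquad \text{NLJ (buffer } S \text{ on disk): } (1 + |R|) \cdot \frac{|S|}{\mathcal P} \approx 10^{10} \text{ IOs}$$

Even a fairly loose estimate is enough to rule out the nested loop join here.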
Symbol | Parameter | Type |
---|---|---|
$\mathcal P$ | Tuples Per Page | Fixed ($\frac{|\text{page}|}{|\text{tuple}|}$) |
$|R|$ | Size of $R$ | Precomputed$^*$ ($|R|$) |
$\mathcal B$ | Pages of Buffer | Configurable Parameter |
$\mathcal I$ | Keys per Index Page | Fixed ($\frac{|\text{page}|}{|\text{key+pointer}|}$) |
$adom(A)$ | Number of distinct values of $A$ | Precomputed$^*$ ($|\delta_A(R)|$) |
* unless $R$ is a query
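As a minimal sketch (not from the lecture) of how the per-operator formulas combine, the snippet below costs the plan $(\textrm{scan } R) \bowtie_{2PH} (\textrm{scan } S)$. The parameter values and table sizes are assumptions chosen only to make the arithmetic concrete.

```python
# A minimal sketch of an added-IO cost model built from the tables above.
# All values below (P, table sizes) are hypothetical.

P = 100   # tuples per page

def table_scan_ios(card):
    """Table Scan: |R| / P pages read."""
    return card / P

def two_pass_hash_join_ios(card_r, card_s):
    """2-Pass Hash Join: write and re-read both inputs, (2|R| + 2|S|) / P."""
    return (2 * card_r + 2 * card_s) / P

# Added IOs for the plan  (scan R) 2-pass-hash-join (scan S)
R, S = 1_000_000, 500_000
total = table_scan_ios(R) + table_scan_ios(S) + two_pass_hash_join_ios(R, S)
print(f"Estimated added IOs: {total:,.0f}")   # 45,000
```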
Estimating IOs requires estimating $|Q(R)|$ and $|\delta_A(Q(R))|$.
Unlike estimating IOs, cardinality estimation doesn't care about the algorithm, so we'll just be working with raw RA.
Also unlike estimating IOs, we care about the cardinality of $Q(R)$ as a whole, rather than the contribution of each individual operator.
Operator | RA | Estimated Size |
---|---|---|
Table | $R$ | $|R|$ |
Projection | $\pi(Q)$ | $|Q|$ |
Union | $Q_1 \uplus Q_2$ | $|Q_1| + |Q_2|$ |
Cross Product | $Q_1 \times Q_2$ | $|Q_1| \times |Q_2|$ |
Sort | $\tau(Q)$ | $|Q|$ |
Limit | $\texttt{LIMIT}_N(Q)$ | $N$ |
Selection | $\sigma_c(Q)$ | $|Q| \times \texttt{SEL}(c, Q)$ |
Join | $Q_1 \bowtie_c Q_2$ | $|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1\times Q_2)$ |
Distinct | $\delta_A(Q)$ | $\texttt{UNIQ}(A, Q)$ |
Aggregate | $\gamma_{A, B \leftarrow \Sigma}(Q)$ | $\texttt{UNIQ}(A, Q)$ |
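The table reduces cardinality estimation to two quantities, $\texttt{SEL}(c, Q)$ and $\texttt{UNIQ}(A, Q)$, which the rest of these notes estimate. Below is a minimal sketch (not from the lecture) of the recursion itself, treating those two as black boxes and using an assumed tuple encoding for plans.

```python
# A minimal sketch of the recursive size estimator in the table above.
# SEL and UNIQ are passed in as black boxes; the plan encoding (nested tuples)
# is an assumption made for illustration.

def estimate(plan, sel, uniq):
    """Return the estimated cardinality |Q| of an RA plan.

    plan: nested tuples, e.g. ("select", cond, child) or ("table", name, size)
    sel(cond, child):  estimated selectivity of `cond` over `child`
    uniq(attr, child): estimated COUNT(DISTINCT attr) over `child`
    """
    op = plan[0]
    if op == "table":                       # R             -> |R|
        return plan[2]
    if op in ("project", "sort"):           # pi(Q), tau(Q) -> |Q|
        return estimate(plan[1], sel, uniq)
    if op == "union":                       # Q1 u+ Q2      -> |Q1| + |Q2|
        return estimate(plan[1], sel, uniq) + estimate(plan[2], sel, uniq)
    if op == "cross":                       # Q1 x Q2       -> |Q1| * |Q2|
        return estimate(plan[1], sel, uniq) * estimate(plan[2], sel, uniq)
    if op == "limit":                       # LIMIT_N(Q)    -> N
        return plan[1]
    if op == "select":                      # sigma_c(Q)    -> |Q| * SEL(c, Q)
        return estimate(plan[2], sel, uniq) * sel(plan[1], plan[2])
    if op == "join":                        # Q1 join_c Q2  -> |Q1|*|Q2|*SEL(c, Q1 x Q2)
        q1 = estimate(plan[2], sel, uniq)
        q2 = estimate(plan[3], sel, uniq)
        return q1 * q2 * sel(plan[1], ("cross", plan[2], plan[3]))
    if op in ("distinct", "aggregate"):     # delta_A(Q), gamma_A(Q) -> UNIQ(A, Q)
        return uniq(plan[1], plan[2])
    raise ValueError(f"unknown operator: {op}")

# e.g. estimate(("select", "residence = 'NY'", ("table", "STUDENTS", 40_000)),
#               sel=lambda c, q: 0.1, uniq=None)   ->  4000.0
```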
Idea 1: Assume each selection filters down to 10% of the data.
no... really!
Under the 10% rule, equivalent queries get different estimates:
$|\sigma_{c_1}(\sigma_{c_2}(R))| \neq |\sigma_{c_1 \wedge c_2}(R)|$ (estimated at 1% vs. 10% of $|R|$)
...and very different predicates get identical estimates:
$|\sigma_{id = 1}(\texttt{STUDENTS})| = |\sigma_{residence = 'NY'}(\texttt{STUDENTS})|$ (both estimated at 10% of $|\texttt{STUDENTS}|$)
... but remember that all we need is to rank plans.
Many major databases (Oracle, Postgres, Teradata, etc.) use something like the 10% rule if they have nothing better.
(The specific % varies by DBMS; e.g., Teradata uses 10% for the first AND clause, and 75% for every subsequent clause.)
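A quick illustration of both problems under the plain 10% rule, with an assumed table of 40,000 rows (all numbers hypothetical):

```python
# The 10% rule applied to the two problem cases above.
BASE = 40_000   # hypothetical |STUDENTS|
SEL = 0.10

# Equivalent queries, different estimates:
nested = BASE * SEL * SEL   # |sigma_c1(sigma_c2(R))| estimated at 10% of 10% ->   400
merged = BASE * SEL         # |sigma_{c1 AND c2}(R)|  estimated at 10%        -> 4,000

# Very different predicates, identical estimates:
by_id        = BASE * SEL   # sigma_{id = 1}           -> 4,000 (really ~1 row)
by_residence = BASE * SEL   # sigma_{residence = 'NY'} -> 4,000 (could be far more)
print(nested, merged, by_id, by_residence)
```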
We assume that for $\sigma_c(Q)$ or $\delta_A(Q)$...
1. We have the following statistics precomputed for $Q$: COUNT(*); COUNT(DISTINCT A) (for each attribute A); MIN(A), MAX(A) (for each numeric attribute A).
2. Attribute values are distributed uniformly.
3. Attributes (and conditions over them) are independent of one another.
If the necessary statistics aren't available (point 1), fall back to the 10% rule.
If the statistical assumptions (points 2, 3) aren't perfectly true, we'll still likely get a better estimate than the 10% rule.
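A minimal sketch (not from the lecture) of precomputing the point-1 statistics with plain SQL through Python's sqlite3; the table, columns, and rows are made up for illustration.

```python
# Precompute COUNT(*), COUNT(DISTINCT A), MIN(A), MAX(A) for a hypothetical table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, residence TEXT, gpa REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [(1, "NY", 3.5), (2, "NY", 3.9), (3, "CA", 2.8)])

stats = {"COUNT(*)": conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]}
for col in ("id", "residence", "gpa"):                     # for each attribute A
    stats[f"COUNT(DISTINCT {col})"] = conn.execute(
        f"SELECT COUNT(DISTINCT {col}) FROM students").fetchone()[0]
for col in ("id", "gpa"):                                  # numeric attributes only
    lo, hi = conn.execute(f"SELECT MIN({col}), MAX({col}) FROM students").fetchone()
    stats[f"MIN({col})"], stats[f"MAX({col})"] = lo, hi

print(stats)
```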
$\texttt{UNIQ}(A, \pi_{A, \ldots}(R)) = \texttt{UNIQ}(A, R)$
$\texttt{UNIQ}(A, \sigma(R)) \approx \texttt{UNIQ}(A, R)$
$\texttt{UNIQ}(A, R \times S) = \texttt{UNIQ}(A, R)$ or $\texttt{UNIQ}(A, S)$
$\max(\texttt{UNIQ}(A, R), \texttt{UNIQ}(A, S)) \leq \texttt{UNIQ}(A, R \uplus S) \leq \texttt{UNIQ}(A, R) + \texttt{UNIQ}(A, S)$
$\min_A(\pi_{A, \ldots}(R)) = \min_A(R)$
$\min_A(\sigma(R)) \approx \min_A(R)$
$\min_A(R \times S) = \min_A(R)$ or $\min_A(S)$
$\min_A(R \uplus S) = \min(\min_A(R), \min_A(S))$
Estimating $\delta_A(Q)$ requires only COUNT(DISTINCT A)
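These propagation rules can be applied mechanically. Below is a minimal sketch (not from the lecture), mapping each attribute to (COUNT(DISTINCT A), MIN(A)); the plan encoding and base statistics are assumptions made for illustration.

```python
# Propagate per-attribute statistics up through a plan using the rules above.
# Each node's statistics are {attr: (uniq, min)}.

def propagate(plan, base_stats):
    op = plan[0]
    if op == "table":                        # precomputed base statistics
        return dict(base_stats[plan[1]])
    if op in ("project", "select", "sort"):  # unchanged (approximate for selection)
        return propagate(plan[-1], base_stats)
    if op == "cross":                        # each attribute comes from one input
        return {**propagate(plan[1], base_stats),
                **propagate(plan[2], base_stats)}
    if op == "union":                        # UNIQ: upper bound; MIN: smaller of the two
        left = propagate(plan[1], base_stats)
        right = propagate(plan[2], base_stats)
        return {a: (left[a][0] + right[a][0], min(left[a][1], right[a][1]))
                for a in left}
    raise ValueError(f"unknown operator: {op}")

# e.g.  base_stats = {"R": {"a": (100, 0)}, "S": {"b": (20, 5)}}
#       propagate(("cross", ("table", "R"), ("table", "S")), base_stats)
#       -> {"a": (100, 0), "b": (20, 5)}
```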
Selectivity is a probability ($\texttt{SEL}(c, Q) = P(c)$)
$P(A = x_1) = \frac{1}{\texttt{COUNT(DISTINCT A)}}$
$P(A \in (x_1, x_2, \ldots, x_N)) = \frac{N}{\texttt{COUNT(DISTINCT A)}}$
$P(A \leq x_1) = \frac{x_1 - \texttt{MIN(A)}}{\texttt{MAX(A)} - \texttt{MIN(A)}}$
$P(x_1 \leq A \leq x_2) = \frac{x_2 - x_1}{\texttt{MAX(A)} - \texttt{MIN(A)}}$
$P(A = B) = \min\left( \frac{1}{\texttt{COUNT(DISTINCT A)}}, \frac{1}{\texttt{COUNT(DISTINCT B)}} \right)$
$P(c_1 \wedge c_2) = P(c_1) \cdot P(c_2)$
$P(c_1 \vee c_2) = 1 - (1 - P(c_1)) \cdot (1 - P(c_2))$
(With constants $x_1$, $x_2$, ...)
These formulas don't cover every possible condition, but they handle most common usage patterns.
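Putting the formulas together, a minimal sketch (not from the lecture) of $\texttt{SEL}(c, Q)$; the condition encoding and statistics dictionary are assumptions made for illustration, and anything unrecognized falls back to the 10% rule.

```python
# Estimate SEL(c, Q) from precomputed per-attribute statistics.

def sel(cond, stats):
    """stats: {attr: {"distinct": n, "min": lo, "max": hi}} for the input Q."""
    kind = cond[0]
    if kind == "eq_const":     # A = x1   (the constant itself doesn't matter)
        return 1 / stats[cond[1]]["distinct"]
    if kind == "in_const":     # A IN (x1, ..., xN)
        return len(cond[2]) / stats[cond[1]]["distinct"]
    if kind == "range":        # x1 <= A <= x2   (uniformity assumption)
        a = stats[cond[1]]
        return (cond[3] - cond[2]) / (a["max"] - a["min"])
    if kind == "eq_attr":      # A = B
        return min(1 / stats[cond[1]]["distinct"], 1 / stats[cond[2]]["distinct"])
    if kind == "and":          # independence assumption
        return sel(cond[1], stats) * sel(cond[2], stats)
    if kind == "or":
        return 1 - (1 - sel(cond[1], stats)) * (1 - sel(cond[2], stats))
    return 0.1                 # anything else: fall back to the 10% rule

# e.g.  sel(("and", ("eq_const", "residence", "NY"), ("range", "gpa", 3.0, 4.0)),
#           {"residence": {"distinct": 50}, "gpa": {"min": 0.0, "max": 4.0}})
#       -> (1/50) * (1.0/4.0) = 0.005
```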