April 10, 2019
CREATE TABLEs issued
Both phases in-memory
CREATE TABLEs will include PRIMARY KEY and other suggested indexes.
Not universal, every system is different.
$> java -Xrunhprof:cpu=samples dubstep.Main
or
$> java -Xrunhprof:cpu=time dubstep.Main
Sampling is faster; Time is more accurate
Output ends up in ./java.hprof.txt
...
... lots of gobbledygook ...
...
CPU SAMPLES BEGIN (total = 126) Fri Oct 22 12:12:14 2004
rank self accum count trace method
1 53.17% 53.17% 67 300027 java.util.zip.ZipFile.getEntry
2 17.46% 70.63% 22 300135 java.util.zip.ZipFile.getNextEntry
3 5.56% 76.19% 7 300111 java.lang.ClassLoader.defineClass2
4 3.97% 80.16% 5 300140 java.io.UnixFileSystem.list
5 2.38% 82.54% 3 300149 java.lang.Shutdown.halt0
6 1.59% 84.13% 2 300136 java.util.zip.ZipEntry.initFields
7 1.59% 85.71% 2 300138 java.lang.String.substring
8 1.59% 87.30% 2 300026 java.util.zip.ZipFile.open
9 0.79% 88.10% 1 300118 com.sun.tools.javac.code.Type$ErrorType.<init>
10 0.79% 88.89% 1 300134 java.util.zip.ZipFile.ensureOpen
'self' column shows how much time is spent in that method.
'count' column shows the number of samples/invocations.
'trace' column links back to a stack trace of the sample/invocation
TRACE 300027:
java.util.zip.ZipFile.getEntry(ZipFile.java:Unknown line)
java.util.zip.ZipFile.getEntry(ZipFile.java:253)
java.util.jar.JarFile.getEntry(JarFile.java:197)
java.util.jar.JarFile.getJarEntry(JarFile.java:180)
Main Goal: Figure out where your code is spending its time, then have it spend less time there.
Idea: Minimize number of "intermediate" tuples
Estimating IOs requires estimating $|Q(R)|$ and $|\delta_A(Q(R))|$
Operator | RA | Estimated Size |
---|---|---|
Table | $R$ | $|R|$ |
Projection | $\pi(Q)$ | $|Q|$ |
Union | $Q_1 \uplus Q_2$ | $|Q_1| + |Q_2|$ |
Cross Product | $Q_1 \times Q_2$ | $|Q_1| \times |Q_2|$ |
Sort | $\tau(Q)$ | $|Q|$ |
Limit | $\texttt{LIMIT}_N(Q)$ | $N$ |
Selection | $\sigma_c(Q)$ | $|Q| \times \texttt{SEL}(c, Q)$ |
Join | $Q_1 \bowtie_c Q_2$ | $|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1\times Q_2)$ |
Distinct | $\delta_A(Q)$ | $\texttt{UNIQ}(A, Q)$ |
Aggregate | $\gamma_{A, B \leftarrow \Sigma}(Q)$ | $\texttt{UNIQ}(A, Q)$ |
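These estimates compose bottom-up through a plan. For example, with hypothetical inputs $|R| = 10000$ and $|S| = 500$, a join selectivity of $\frac{1}{500}$, and a filter selectivity of $0.1$ (numbers chosen purely for illustration):

$$|\sigma_c(R \bowtie_{c'} S)| \approx 10000 \times 500 \times \tfrac{1}{500} \times 0.1 = 1000$$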
Many major databases (Oracle, Postgres, Teradata, etc.) use something like a 10% rule if they have nothing better.
(The specific % varies by DBMS; e.g., Teradata uses 10% for the first AND clause, and 75% for every subsequent clause.)
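Under a flat 10% rule, each predicate simply scales the estimate by $0.1$. For a hypothetical $|R| = 10000$:

$$|\sigma_{c_1 \wedge c_2}(R)| \approx 10000 \times 0.1 \times 0.1 = 100$$

(Under the Teradata variant above: $10000 \times 0.1 \times 0.75 = 750$.)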
We assume that for $\sigma_c(Q)$ or $\delta_A(Q)$:
1. The necessary statistics are available.
2. Attribute values are distributed uniformly.
3. Conditions (and attributes) are independent of one another.
The statistics needed:
COUNT(*)
COUNT(DISTINCT A) (for each A)
MIN(A), MAX(A) (for each numeric A)
If the necessary statistics aren't available (point 1), fall back to the 10% rule.
If the statistical assumptions (points 2, 3) aren't perfectly true, we'll still likely get a better estimate than the 10% rule.
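Collecting these statistics takes a single pass over each table. Below is a minimal sketch, assuming a simplified all-integer row representation; the ColumnStats class and its layout are illustrative only, not part of any provided API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: assumes each row is a List<Integer>; real tuples, NULLs, and
// non-numeric types would need more care.
class ColumnStats {
  long rowCount = 0;                                     // COUNT(*)
  Map<Integer, Set<Integer>> distinct = new HashMap<>(); // col -> distinct values seen
  Map<Integer, Integer> min = new HashMap<>();           // col -> MIN(A)
  Map<Integer, Integer> max = new HashMap<>();           // col -> MAX(A)

  void addRow(List<Integer> row) {
    rowCount++;
    for (int col = 0; col < row.size(); col++) {
      int v = row.get(col);
      distinct.computeIfAbsent(col, k -> new HashSet<>()).add(v);
      min.merge(col, v, Math::min);
      max.merge(col, v, Math::max);
    }
  }

  long countDistinct(int col) {
    return distinct.getOrDefault(col, Set.of()).size();
  }
}
```

A real system would avoid materializing full distinct-value sets (e.g., by using an approximate counting structure), but the statistics gathered are the same.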
$\texttt{UNIQ}(A, \pi_{A, \ldots}(R)) = \texttt{UNIQ}(A, R)$
$\texttt{UNIQ}(A, \sigma(R)) \approx \texttt{UNIQ}(A, R)$
$\texttt{UNIQ}(A, R \times S) = \texttt{UNIQ}(A, R)$ or $\texttt{UNIQ}(A, S)$ (depending on whether $A$ comes from $R$ or $S$)
$$max(\texttt{UNIQ}(A, R), \texttt{UNIQ}(A, S)) \leq\\ \texttt{UNIQ}(A, R \uplus S)\\ \leq \texttt{UNIQ}(A, R) + \texttt{UNIQ}(A, S)$$
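For example, with hypothetical counts $\texttt{UNIQ}(A, R) = 10$ and $\texttt{UNIQ}(A, S) = 8$, the bound gives $10 \leq \texttt{UNIQ}(A, R \uplus S) \leq 18$: at least as many distinct values as the larger input, and at most the sum (if the two inputs share none).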
$min_A(\pi_{A, \ldots}(R)) = min_A(R)$
$min_A(\sigma(R)) \approx min_A(R)$
$min_A(R \times S) = min_A(R)$ or $min_A(S)$ (again, depending on which input provides $A$)
$min_A(R \uplus S) = min(min_A(R), min_A(S))$
Estimating $\delta_A(Q)$ requires only COUNT(DISTINCT A)
Selectivity is a probability ($\texttt{SEL}(c, Q) = P(c)$)
Condition | | Estimated Selectivity |
---|---|---|
$P(A = x_1)$ | $=$ | $\frac{1}{\texttt{COUNT(DISTINCT A)}}$ |
$P(A \in (x_1, x_2, \ldots, x_N))$ | $=$ | $\frac{N}{\texttt{COUNT(DISTINCT A)}}$ |
$P(A \leq x_1)$ | $=$ | $\frac{x_1 - \texttt{MIN(A)}}{\texttt{MAX(A)} - \texttt{MIN(A)}}$ |
$P(x_1 \leq A \leq x_2)$ | $=$ | $\frac{x_2 - x_1}{\texttt{MAX(A)} - \texttt{MIN(A)}}$ |
$P(A = B)$ | $=$ | $\textbf{min}\left( \frac{1}{\texttt{COUNT(DISTINCT A)}}, \frac{1}{\texttt{COUNT(DISTINCT B)}} \right)$ |
$P(c_1 \wedge c_2)$ | $=$ | $P(c_1) \cdot P(c_2)$ |
$P(c_1 \vee c_2)$ | $=$ | $1 - (1 - P(c_1)) \cdot (1 - P(c_2))$ |
(With constants $x_1$, $x_2$, ...)
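As a sketch of how these rules might turn into code, assuming the statistics gathered earlier are on hand (all class and method names here are illustrative, not any particular system's API):

```java
// Selectivity rules from the table above, expressed as simple helper functions.
class SelectivityEstimator {
  // P(A = x) = 1 / COUNT(DISTINCT A)
  static double equalsConstant(long countDistinctA) {
    return 1.0 / countDistinctA;
  }

  // P(x1 <= A <= x2) = (x2 - x1) / (MAX(A) - MIN(A)), clamped to [0, 1]
  static double range(double x1, double x2, double minA, double maxA) {
    if (maxA == minA) return 1.0;                  // degenerate column: one value
    double sel = (x2 - x1) / (maxA - minA);
    return Math.max(0.0, Math.min(1.0, sel));
  }

  // P(A = B) = min(1 / COUNT(DISTINCT A), 1 / COUNT(DISTINCT B))
  static double equiJoin(long countDistinctA, long countDistinctB) {
    return 1.0 / Math.max(countDistinctA, countDistinctB);
  }

  // P(c1 AND c2) = P(c1) * P(c2), assuming independence
  static double and(double selC1, double selC2) {
    return selC1 * selC2;
  }

  // P(c1 OR c2) = 1 - (1 - P(c1)) * (1 - P(c2)), assuming independence
  static double or(double selC1, double selC2) {
    return 1.0 - (1.0 - selC1) * (1.0 - selC2);
  }
}
```

For example, with hypothetical statistics $\texttt{COUNT(DISTINCT A)} = 20$, $\texttt{MIN(B)} = 0$, and $\texttt{MAX(B)} = 100$, the condition $A = 5 \wedge B \leq 10$ gets selectivity $\frac{1}{20} \times \frac{10}{100} = 0.005$.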