November 25
What is your age?
☐ 18-34
☐ 35-49
☐ 50-64
☐ 65+
☑ Prefer not to answer
Problem: You want to analyze a new dataset where specific values are missing or clearly wrong.
SELECT * FROM R WHERE X IS NOT NULL
$X_i = \begin{cases} X_i &\textbf{if } \exists X_i\\ \frac{X_{i+1}+X_{i-1}}{2}& \textbf{otherwise}\end{cases}$
Replace invalid values with the last valid value
Replace invalid values with the column mean
Replace invalid values with the column mean
Which to use?
Idea: Make the query optimizer aware of imputation needs.
Two new operators enforce COLUMN NOT NULL constraints.
Basic query optimization problem!
The user manually identifies "dirty" input table columns.
Intuition: An attribute is dirty if it contains nulls that must be repaired.
Constraint: A dirty attribute must be repaired before it is used (e.g., by a selection or aggregate).
Question: Which columns are dirty?
$$\textbf{Dirty}(R) = \text{The dirty columns of R}$$
$$\textbf{Dirty}(\pi_C(q)) = \textbf{Dirty}(q) \cap C$$
$$\textbf{Dirty}(\sigma(q)) = \textbf{Dirty}(q)$$
$$\textbf{Dirty}(q_1 \bowtie q_2) = \textbf{Dirty}(q_1) \cup \textbf{Dirty}(q_2)$$
$$\textbf{Dirty}(\delta_C(q)) = \textbf{Dirty}(\mu_C(q)) = \textbf{Dirty}(q) - C$$
Cost Model
↓
Original Query → Optimizer → Optimized Query
When a query subtree matches a specific pattern...
... (optionally) replace it with a different pattern.
Question: How to inject repair operators?
Challenge: Changes require re-evaluating the cost model.
Idea: Restrict variations in columns being repaired.
Before each operator...
($2\times 2 = 4$ options for each operator)
Optimize the query normally.
Try every plan that satisfies the constraint on dirty columns.
Optimize for precision and time.
Need to predict the computational cost of the plan.
Need to predict the accuracy of the result.
For both, we need histograms.
Propagating per-column histograms is a standard part of query optimizers
How do $\delta$ and $\mu$ propagate histograms.
Conservative assumption of full overlap between NULLs
Assume distribution of values unchanged
Given: Penalty function $P(\mu_C(q)) \in [0,1]$
0 = perfect replacement
1 = as bad as dropping all columns in $C$
Total penalties judged relative to a plan that drops all attributes.
Handled as normal by the query optimizer.
Later imputation means more opportunities to drop tuples naturally, but some joins produce larger output relations.
Larger data is sometimes better (more opportunities to find correlations), but not always.
Penalty function is measured per-operator rather than per-column.