Alice spends weeks cleaning her data before using it.
Can we start with automation and work our way up?
Here's a problem with my data. Fix it.
$Var(\ldots)$ constructs new variables
Variables can't be evaluated until they are bound.
So, we allow arbitrary expressions to be values.
A lazy value without variables is deterministic
The $Var()$ operator can be inlined into SQL
SELECT A, VAR('X', B)+2 AS C FROM R;
A | B |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
A | C |
---|---|
1 | $X_2+2$ |
3 | $X_4+2$ |
5 | $X_6+2$ |
Selects on $Var()$ need to be deferred too...
SELECT A FROM R WHERE VAR('X', B) > 2;
A | B |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
A | $\phi$ |
---|---|
1 | $X_2>2$ |
3 | $X_4>2$ |
5 | $X_6>2$ |
When evaluating the table, rows where $\phi = \bot$ are dropped.
Lenses implement PC-Tables
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
AS
clause defines source data.USING
clause requests repairs.
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
CREATE VIEW PRODUCTS
AS SELECT ID, NAME, ...,
CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
END AS DEPARTMENT
FROM PRODUCTS_RAW;
ID | Name | ... | Department |
---|---|---|---|
123 | Apple 6s, White | ... | Phone |
34234 | Dell, Intel 4 core | ... | Computer |
34235 | HP, AMD 2 core | ... | $Prod.Dept_3$ |
... | ... | ... | ... |
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
SELECT * FROM PRODUCTS_RAW;
An estimator for $PRODUCTS.DEPARTMENT_{ROWID}$
SELECT NAME, DEPARTMENT FROM PRODUCTS;
Name | Department |
---|---|
Apple 6s, White | Phone |
Dell, Intel 4 core | Computer |
HP, AMD 2 core | Computer |
... | ... |
Simple UI: Highlight values that are based on guesses.
SELECT NAME, DEPARTMENT FROM PRODUCTS;
Name | Department |
---|---|
Apple 6s, White | Phone |
Dell, Intel 4 core | Computer |
HP, AMD 2 core | Computer |
... | ... |
Allow users to EXPLAIN
uncertain outputs
Explanations include reasons given in English
$PRODUCTS.DEPARTMENT_{3}$
"I guessed 'Computer' for 'Department' on Row '3'"
(Generalized) C-Tables are a form of lineage.
SELECT NAME FROM PRODUCTS
WHERE DEPARTMENT='PHONE'
AND ( VENDOR='APPLE'
OR PLATFORM='ANDROID' )
Recall, row-level uncertainty is a boolean formula $\phi$.
For this query, $\phi$ can be as complex as: $$DEPT_{ROWID}='P\ldots' \wedge \left( VEND_{ROWID}='Ap\ldots' \vee PLAT_{ROWID} = 'An\ldots' \right)$$
Too many variables! Which is the most important?
Data Cleaning
The ones that keep us from knowing everything
$$D_{ROWID}='P' \wedge \left( V_{ROWID}='Ap' \vee PLAT_{ROWID} = 'An' \right)$$
$$A \wedge (B \vee C)$$
Consider a game between a database and an impartial oracle.
Naive Algorithm: Pick all variables!
Less Naive Algorithm: Minimize $E\left[\sum c_v\right]$.
$$\phi = A \wedge (B \vee C)$$
Entropy is intuitive:
$H = 1$ means we know nothing,
$H = 0$ means we know everything.
$$\mathcal I_{A \leftarrow \top} (\phi) = H\left[\phi\right] - H\left[\phi(A \leftarrow \top)\right]$$
Information gain of $v$: The reduction in entropy from knowing the truth value of a variable $v$.
$$\mathcal I_{A} (\phi) = \left(p(A)\cdot \mathcal I_{A\leftarrow \top}(\phi)\right) + \left(p(\neg A)\cdot \mathcal I_{A\leftarrow \bot}(\phi)\right)$$
Expected information gain of $v$: The probability-weighted average of the information gain for $v$ and $\neg v$.
Combine Information Gain and Cost
$$f(\mathcal I_{A}(\phi), c_A)$$
For example: $EG2(\mathcal I_{A}(\phi), c_A) = \frac{2^{\mathcal I_{A}(\phi)} - 1}{c_A}$
Greedy Algorithm: Minimize $f(\mathcal I_{A}(\phi), c_A)$ at each step
Simulate an analyst trying to manually explore correlations.
EG2: Greedy Cost/Value Ordering
NMETC: Naive Minimal Expected Total Cost
Random: Completely Random Order
EG2: Greedy Cost/Value Ordering
NMETC: Naive Minimal Expected Total Cost
Random: Completely Random Order
UB: Ying Yang, Niccolo Meneghetti,
Arindam Nandi, Vinayak Karuppasamy
Oracle: Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick