SELECT
on a raw CSV FileState of the art: External Table Defn + "Manually" edit CSV
UNION
two data sourcesState of the art: Manually map schema
SELECT
on JSON or a Doc Store{ A: "Bob", B: "Alice" }
)State of the art: DataGuide, Wrangler, etc...
Alice spends weeks cleaning her data before using it.
Add structure, curation effort On-Demand
My phone is guessing, but is letting me know that it did
Microsoft Image ID tells you something (and how sure it is)
Easy interactions to accept, reject, or explain uncertainty
Here's a problem with my data. Fix it.
Lenses introduce uncertainty
SELECT NAME, DEPARTMENT FROM PRODUCTS;
Name | Department |
---|---|
Apple 6s, White | Phone |
Dell, Intel 4 core | Computer |
HP, AMD 2 core | Computer |
... | ... |
Simple UI: Highlight values that are based on guesses.
SELECT NAME, DEPARTMENT FROM PRODUCTS;
Name | Department |
---|---|
Apple 6s, White | Phone |
Dell, Intel 4 core | Computer |
HP, AMD 2 core | Computer |
... | ... |
Allow users to EXPLAIN
uncertain outputs
Explanations include reasons given in English
$PRODUCTS.DEPARTMENT_{3}$
"I guessed 'Computer' for 'Department' on Row '3'"
Mimir virtualizes uncertainty
$Var(\ldots)$ constructs new variables
Variables can't be evaluated until they are bound.
So, we allow arbitrary expressions to represent data.
A lazy value without variables is deterministic
Mimir SQL allows the $Var()$ operator to inlined
SELECT A, VAR('X', B)+2 AS C FROM R;
A | B |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
A | C |
---|---|
1 | $X_2+2$ |
3 | $X_4+2$ |
5 | $X_6+2$ |
Selects on $Var()$ need to be deferred too...
SELECT A FROM R WHERE VAR('X', B) > 2;
A | B |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
A | $\phi$ |
---|---|
1 | $X_2>2$ |
3 | $X_4>2$ |
5 | $X_6>2$ |
When evaluating the table, rows where $\phi = \bot$ are dropped.
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
is (almost) the same as the query...
CREATE VIEW PRODUCTS
AS SELECT ID, NAME, ...,
CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
END AS DEPARTMENT
FROM PRODUCTS_RAW;
ID | Name | ... | Department |
---|---|---|---|
123 | Apple 6s, White | ... | Phone |
34234 | Dell, Intel 4 core | ... | Computer |
34235 | HP, AMD 2 core | ... | $Prod.Dept_3$ |
... | ... | ... | ... |
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
Behind the scenes, a lens also creates a model...
SELECT * FROM PRODUCTS_RAW;
An estimator for $PRODUCTS.DEPARTMENT_{ROWID}$
SELECT A, VAR('X', B)+2 AS C FROM R;
Mimir dispatches this query to the DB:
SELECT A, B FROM R;
And for each row of the result, evaluates:
SELECT A, VAR('X', B)+2 AS C FROM RESULT;
All uncertainty comes from labeled nulls in the expressions that Mimir evaluates for each row of the output.
VAR('X', B)+2
.VAR('X', B)
.
SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;
Mimir dispatches this query to the DB:
SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2 FROM R, S;
And for each row of the result, evaluates:
SELECT A, C FROM RESULT WHERE VAR('X', TEMP_1) = TEMP_2;
Helper views allow the DB to interpret labeled nulls
SELECT R.A, S.C FROM R, S
WHERE S.B = (SELECT VALUE FROM VARIABLE_X WHERE KEY = R.B);
... but we lose the ability to explain outputs
SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;
Mimir dispatches this query to the DB:
SELECT R.A, S.C,
R.ROWID AS ID_1, S.ROWID AS ID_2
WHERE S.B = (SELECT VALUE FROM VARIABLE_X WHERE KEY = R.B);
Then to explain, Mimir dispatches the query:
SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2
WHERE R.ROWID = ID_1 AND S.ROWID = ID_2
TPC-H Data, but replace 0.1% of FK references with NULL. Ask Mimir to fix.
(a worst case from a performance standpoint)
Mimir over SQLite in 4 different execution modes.
100% = Zero overhead
Questions?
Lenses implement PC-Tables
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
AS
clause defines source data.USING
clause requests repairs.
SELECT NAME FROM PRODUCTS
WHERE DEPARTMENT='PHONE'
AND ( VENDOR='APPLE'
OR PLATFORM='ANDROID' )
Recall, row-level uncertainty is a boolean formula $\phi$.
For this query, $\phi$ can be as complex as: $$DEPT_{ROWID}='P\ldots' \wedge \left( VEND_{ROWID}='Ap\ldots' \vee PLAT_{ROWID} = 'An\ldots' \right)$$
Too many variables! Which is the most important?
Data Cleaning
The ones that keep us from knowing everything
$$D_{ROWID}='P' \wedge \left( V_{ROWID}='Ap' \vee PLAT_{ROWID} = 'An' \right)$$
$$A \wedge (B \vee C)$$
Consider a game between a database and an impartial oracle.
Naive Algorithm: Pick all variables!
Less Naive Algorithm: Minimize $E\left[\sum c_v\right]$.
$$\phi = A \wedge (B \vee C)$$
Entropy is intuitive:
$H = 1$ means we know nothing,
$H = 0$ means we know everything.
$$\mathcal I_{A \leftarrow \top} (\phi) = H\left[\phi\right] - H\left[\phi(A \leftarrow \top)\right]$$
Information gain of $v$: The reduction in entropy from knowing the truth value of a variable $v$.
$$\mathcal I_{A} (\phi) = \left(p(A)\cdot \mathcal I_{A\leftarrow \top}(\phi)\right) + \left(p(\neg A)\cdot \mathcal I_{A\leftarrow \bot}(\phi)\right)$$
Expected information gain of $v$: The probability-weighted average of the information gain for $v$ and $\neg v$.
Combine Information Gain and Cost
$$f(\mathcal I_{A}(\phi), c_A)$$
For example: $EG2(\mathcal I_{A}(\phi), c_A) = \frac{2^{\mathcal I_{A}(\phi)} - 1}{c_A}$
Greedy Algorithm: Minimize $f(\mathcal I_{A}(\phi), c_A)$ at each step
Simulate an analyst trying to manually explore correlations.
EG2: Greedy Cost/Value Ordering
NMETC: Naive Minimal Expected Total Cost
Random: Completely Random Order
EG2: Greedy Cost/Value Ordering
NMETC: Naive Minimal Expected Total Cost
Random: Completely Random Order