Students: Poonam, Will, Aaron, Shivang, Olivia

Alumni: Ying, Niccolò, Arindam, Dev, Mike

External Collaborators: Dieter Gawlick (Oracle), Zhen Hua Liu (Oracle), Ronny Fehling (Airbus), Beda Hammerschmidt (Oracle), Boris Glavic (IIT), Su Feng (IIT), Juliana Freire (NYU), Wolfgang Gatterbauer (NEU), Heiko Mueller (NYU), Remi Rampin (NYU)
SELECT on a raw CSV file
State of the art: external table definition + "manually" edit the CSV

UNION two data sources
State of the art: manually map the schema

SELECT on JSON or a doc store (e.g., { A: "Bob", B: "Alice" })
State of the art: DataGuide, Wrangler, etc...
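As a concrete illustration of the CSV status quo, here is a Hive-style external table definition; the schema, column names, and file path are assumptions for this sketch, and every type has to be declared by hand before the first SELECT can run.

-- Sketch only: a Hive-style external table over a raw CSV directory.
-- The schema (pid, rating, review_ct) and the path are assumed for illustration.
CREATE EXTERNAL TABLE ratings_raw (
  pid        STRING,
  rating     DOUBLE,
  review_ct  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/ratings/';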
Loading requires curation...
Alice spends weeks curating her data before using it.
The data needs...
This is all required upfront. Before asking a single question.
Relational DBs are useless in the early stages of curation.
There are tons of good heuristics available for guessing how to clean data.
Thou shalt not give the user a wrong answer.
My phone is guessing, but it lets me know that it did.
Good Explanations, Alternatives, and Feedback Vectors
Incomplete and Probabilistic Databases
have existed since the 1980s
We've gotten good at query processing on uncertain data.
But not sourcing uncertain data
... or communicating results to humans.
A small shift in how we think about PDBs addresses all three points.
Time | Sensor Reading | Temp Around Sensor |
---|---|---|
1 | 31.6 | Roughly 31.6˚C |
2 | -999 | Around 30˚C? |
3 | 28.1 | Roughly 28.1˚C? |
4 | 32.2 | Roughly 32.2˚C |
The reading is deterministic
... but what we care about is what the reading measures
Insight: Treat data as 100% deterministic.
Instead, queries propose alternative interpretations.
ratings1.rating matches either ratings2.num_ratings or ratings2.evaluation.
SELECT pid, rating FROM ratings1 UNION ALL
SELECT pid, num_ratings AS rating FROM ratings2;
or
SELECT pid, rating FROM ratings1 UNION ALL
SELECT pid, evaluation AS rating FROM ratings2;
Repair missing values in rating
SELECT pid, CASE WHEN rating IS NULL
THEN interpolate(...) ELSE rating END AS rating
FROM ratings;
or
SELECT pid, CASE WHEN rating IS NULL
THEN classifier(...) ELSE rating END AS rating
FROM ratings;
or ...
Non-deterministic queries reference an external configuration.
A $VGTerm(\ldots)$ references configuration parameters
(aka "variables").
$VGTerm()$s can be used like normal expressions
SELECT A, VGTerm('X', B) AS C FROM R;
R:
A | B |
---|---|
1 | 2 |
3 | 4 |
5 | 4 |

Q(R):
A | C |
---|---|
1 | $X_2$ |
3 | $X_4$ |
5 | $X_4$ |
... variables are identified by a family (here, $'X'$) and optional indexes (here, $B$).
SELECT
pid,
CASE VGTerm('MATCH_RATING')
WHEN 'NUM_RATINGS' THEN num_ratings
WHEN 'EVALUATION' THEN evaluation
ELSE null
END AS rating
FROM ratings2;
One global configuration variable decides which column gets mapped to "rating".
SELECT
pid,
CASE WHEN rating IS NULL
THEN VGTerm('RATING', ROWID)
ELSE rating
END AS rating,
review_ct
FROM ratings1;
A family of variables indexed by ROWID provides one variable for each imputed value.
Models designate one "best-guess" configuration.
Lenses instantiate/train a model and wrap a query: one big query that gets you to a clean dataset.
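As a sketch of what a lens declaration can look like (the CREATE LENS syntax below is patterned on the Mimir papers; the lens name and the choice of the MISSING_VALUE model are assumptions for this example):

-- Sketch: declare a lens that trains an imputation model for 'rating'
-- and exposes the cleaned result as a queryable relation.
CREATE LENS ratings1_cleaned
  AS SELECT * FROM ratings1
  WITH MISSING_VALUE('rating');

-- Querying the lens looks like querying any other relation:
SELECT pid, rating FROM ratings1_cleaned;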
Challenge: Designing ETL pipelines can be a full-time job.
Mimir starts with the default "guess" configuration.
As users explore, they validate or refine guesses for configuration variables as necessary.
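Concretely, validation can be thought of as pinning a configuration variable to a user-supplied value. The UPDATE below is only a hypothetical sketch of that idea, not Mimir's actual interface; the table and column names are invented for illustration.

-- Hypothetical sketch (not Mimir's real syntax): pin the imputed rating
-- for one row to the value Alice verified by hand, and mark it validated.
UPDATE configuration
   SET value     = 4.5,
       validated = TRUE
 WHERE family = 'RATING'
   AND idx    = 17;   -- the ROWID of the row Alice checked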
How much of my query result is affected by unvalidated variables?
Idea: Mark values in query results that depend on unvalidated variables.
SELECT A, VGTerm('X', ROWID) AS B FROM R;
↓ ↓ ↓ ↓
SELECT A, VGTerm('X', ROWID) AS B,
FALSE AS ROW_TAINTED,
FALSE AS A_TAINTED,
TRUE AS B_TAINTED
FROM R;
The Mimir compiler adds *_TAINTED fields to each row.
SELECT A, CASE WHEN B IS NULL
THEN VGTerm('X', ROWID)
ELSE B END AS B
FROM R;
↓ ↓ ↓ ↓
SELECT A, CASE WHEN B IS NULL
THEN VGTerm('X', ROWID)
ELSE B END AS B,
FALSE AS ROW_TAINTED, FALSE AS A_TAINTED,
(B IS NULL) AS B_TAINTED
FROM R;
Expressions with VGTerms can be conditionally tainted.
CREATE VIEW R_CLEANED AS
SELECT A, CASE WHEN B IS NULL
THEN VGTerm('X', ROWID)
ELSE B END AS B
FROM R;
SELECT A, SUM(B) AS B FROM R_CLEANED GROUP BY A;
↓ ↓ ↓ ↓
SELECT A, SUM(B) AS B,
  FALSE AS A_TAINTED,
  GROUP_OR(B_TAINTED OR ROW_TAINTED)
    OR (SELECT GROUP_OR(A_TAINTED) FROM R_CLEANED) AS B_TAINTED,
  GROUP_AND(A_TAINTED OR ROW_TAINTED) AS ROW_TAINTED
FROM R_CLEANED
GROUP BY A;
Aggregates work too!
In spite of this coarseness, taint works well in practice.
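A side note on the aggregates used above: GROUP_OR and GROUP_AND are not standard SQL. On an engine like PostgreSQL they can be expressed with the built-in BOOL_OR and BOOL_AND aggregates; the sketch below assumes the compiled form of R_CLEANED (carrying the *_TAINTED columns) is available under the assumed name R_CLEANED_COMPILED.

-- Sketch: the aggregate taint rewrite using PostgreSQL's BOOL_OR / BOOL_AND.
SELECT A, SUM(B) AS B,
       FALSE AS A_TAINTED,
       BOOL_OR(B_TAINTED OR ROW_TAINTED)
         OR (SELECT BOOL_OR(A_TAINTED) FROM R_CLEANED_COMPILED) AS B_TAINTED,
       BOOL_AND(A_TAINTED OR ROW_TAINTED) AS ROW_TAINTED
FROM R_CLEANED_COMPILED
GROUP BY A;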
Which variables affect my query results?
Idea: Static dependency analysis produces a list of variable families and queries to generate all relevant indexes.
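For example, run against the imputation query over ratings1 from earlier, the analysis would report the 'RATING' family together with a query that enumerates the indexes that can actually reach the VGTerm (the exact output format shown here is assumed):

-- Family reported: 'RATING'
-- Generated query for its relevant indexes: only NULL ratings hit the VGTerm.
SELECT ROWID FROM ratings1 WHERE rating IS NULL;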
How bad is the situation?
Idea: Sample from the space of alternatives to...
Evaluate the query $N$ times.
Plug in samples instead of best guesses.
Merge evaluation to mitigate redundancy.
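A minimal sketch of the merged evaluation, assuming a hypothetical sample_of(family, index, seed) function that draws one sample per seed: instead of evaluating the imputation query $N$ times, a single pass computes one column per sample.

-- Sketch for N = 3 samples; sample_of(...) is an assumed UDF that samples
-- from the variable's distribution using a fixed per-sample seed.
SELECT pid,
       CASE WHEN rating IS NULL THEN sample_of('RATING', ROWID, 1) ELSE rating END AS rating_1,
       CASE WHEN rating IS NULL THEN sample_of('RATING', ROWID, 2) ELSE rating END AS rating_2,
       CASE WHEN rating IS NULL THEN sample_of('RATING', ROWID, 3) ELSE rating END AS rating_3
FROM ratings1;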
Which one to use?
Mimir isn't committed to one fixed data representation.
(optimization is a work in progress)
Thanks!