Alice wants to analyze two unaligned time series.
Time | Reading |
---|---|
1575731001 | 0 |
1575731014 | 0 |
1575731030 | 0 |
1575731035 | 0 |
... | |
1575731219 | 1 |
1575731229 | 1 |
1575731240 | 1 |
Time | Reading |
---|---|
1575731011 | 0 |
1575731020 | 0 |
1575731031 | 0 |
1575731039 | 0 |
... | |
1575731218 | 1 |
1575731228 | 1 |
1575731237 | 1 |
Step 1: Line up the readings
Lots of active research efforts! |
... but Alice is trying to to GSD! |
INSERT INTO series_one_buckets
SELECT CAST(time / 10 AS int) AS bucket,
FIRST(reading)
FROM series_one
GROUP BY bucket;
Interpolate missing values
Hand tune around the switchover as-needed
Time taken: < 30 minutes
Similar analysis...
... different data
Can Bob re-use Alice's prep+analytics workflow?
... and even then, some manual effort is needed!
Bob needs to know Alice's assumptions
(and how to use the workflow)?
Carol gets a dataset from Dave
Dave adds new data to the dataset!
Can Carol re-use her workflow?
Carol needs to remember her assumptions about the data and trust that the new data is like the old data
Eve needs to load a CSV file
→
I'm sorry, I can't do that, Eve.
You have a non-numerical value at position 1252538:24.
Load Successful!
(btw, 175326 records didn't load)
Heuristics only work most of the time.
Data science is nuanced.
Assumptions can't be avoided!
It's easy to miss an assumption when re-using work.
... this is what Bob saw:
... this is what Carol saw:
⚠ |
The data included an unexpected value: 'Non-Hispanic White' The most similar known value is 'White Non-Hispanic' |
Annotate data with warnings.
If you use this value/record,
here's what you need to know!
... this is what Eve saw:
A brief digression...
One database $D$
Each query gets one answer $R \leftarrow Q(D)$
Multiple possible databases $D \in \mathcal D$
(possible worlds)
Queries get a set of possible answers $\mathcal R \leftarrow \{\; Q(D) \;|\; D \in \mathcal D\;\}$
Certain tuples exist in all possible worlds. $$certain(\mathcal R) = \bigcap_{R \in \mathcal R} R$$
Uncertain tuples exist in at least one,
but not all possible worlds. $$uncertain(\mathcal R) = \bigcup_{R \in \mathcal R} R - certain(\mathcal R)$$
(not limited to set semantics)
A caveat is an assumption tied to one or more data elements (cells or rows).
If the assumption is wrong, so is the element.
An element has a caveat → The element is uncertain.
... and btw, here's why.
SELECT setting_1, setting_2, estimate
FROM Simulation;
We want to indicate that the estimate column is only accurate if (for example) P ≠ NP.
caveat(value, assumption)
returns value, annotated with assumption.
SELECT setting_1, setting_2,
caveat(estimate, 'Only correct if P ≠ NP')
AS estimate
FROM Simulation;
annotation is just a human-readable string.
caveat() creates 2 sets of possible worlds:
Mark multi-valued buckets (key repair).
SELECT bucket,
CASE WHEN bucket_size > 1 THEN
caveat(reading, 'Picked between two bucket values.')
ELSE reading END AS reading
FROM (
SELECT CAST(time / 10 AS int) AS bucket,
FIRST(reading) AS reading
COUNT(*) AS bucket_size
FROM sensor
GROUP BY bucket;
)
Interpolation is more complex... but similar.
Mark unexpected values the model wasn't trained on.
SELECT
CASE WHEN race_ethnicity
IN ('White Non-Hispanic', 'Black Non-Hispanic', /* ... */)
THEN race_ethnicity
ELSE caveat(race_ethnicity,
'Unexpected race_ethnicity: ' & race_ethnicity)
END, /* ... */
FROM R
This check can be automated.
SELECT /* ... */,
CASE WHEN CAST(salary AS float) IS NULL THEN
caveat(NULL, 'Could not cast [ '&salary&' ] to float.')
ELSE CAST(salary AS float) END AS salary
FROM raw_csv_data;
Another brief digression...
Provenance in Databases: Why, How, and Where
James Cheney, Laura Chiticariu and Wang-Chiew Tan
MONDRIAN: Annotating and Querying Databases through Colors and Blocks.
Floris Geerts, Anastasios Kementsietsidis, Diego Milano
and more...
CREATE VIEW Q AS
SELECT R.A AS X,
R.B+R.C AS Y
FROM R
$$annot(\texttt{Q.X}[i]) \leftarrow annot(\texttt{R.A}[i])$$
$$annot(\texttt{Q.Y}[i]) \leftarrow annot(\texttt{R.B}[i]) \cup annot(\texttt{R.C}[i])$$
CREATE VIEW Q AS
SELECT R.A AS X,
SUM(R.B) AS Y
FROM R
$$annot(\texttt{Q.X}[i]) \leftarrow \bigcup_{j\;:\;\texttt{R.A}[j] = Q.A[i]} annot(\texttt{R.A}[j])$$
$$annot(\texttt{Q.Y}[i]) \leftarrow \bigcup_{j\;:\;\texttt{R.B}[j] = Q.B[i]} annot(\texttt{R.B}[j])$$
... not the semantics we want
Caveats on $\texttt{R.A}$ also affect $\texttt{Q.B}$.
Certain Data Elements: Elements guaranteed to be in the result in all possible worlds.
... i.e., elements unaffected by the choice of possible world.
If a caveatted element can't affect an output element, don't propagate its caveats!
Propagate caveats to any data elements that could be affected by a change in assumptions.
Challenge: How do we propagate caveats
without penalizing query evaluation?
Don't!
≅ computing certain answers! (CoNP-Complete)
Correctness of SQL Queries on Databases with Nulls.
Paolo Guagliardo, Leonid Libkin
Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers
Su Feng, Aaron Huber, Boris Glavic, Oliver Kennedy
Add and maintain a binary "has caveat"
column for each row/column.
CREATE VIEW survey_responses AS
SELECT language,
CASE WHEN CAST(salary AS float) IS NULL THEN
caveat(NULL, 'Could not cast [ '&salary&' ] to float.')
ELSE CAST(salary AS float) END AS salary
FROM raw_csv_data;
CREATE VIEW survey_responses AS
SELECT language, CAST(salary AS float) AS salary,
FALSE AS _caveat_field_language,
CAST(salary as float) IS NULL AS _caveat_field_salary
FALSE AS _caveat_row
FROM raw_csv_data;
SELECT salary
FROM survey_responses
WHERE language = 'Scala'
SELECT salary,
_caveat_field_salary AS _caveat_field_salary,
_caveat_row AND _caveat_field_language AS _caveat_row
FROM survey_responses
WHERE language = 'Scala'
SELECT AVG(salary) AS salary
FROM survey_responses
SELECT AVG(salary),
GROUP_OR(_caveat_field_salary
OR _caveat_row) AS _caveat_field_salary,
FALSE AS _caveat_row
FROM survey_responses
SELECT language, AVG(salary) AS salary
FROM survey_responses
GROUP BY language
SELECT GROUP_OR(_caveat_field_language)
FROM survey_responses
Can often be evaluated statically.
SELECT language, AVG(salary) AS salary
FALSE AS _caveat_field_language
TRUE AS _caveat_field_salary
GROUP_AND(_caveat_field_language OR
_caveat_row) AS _caveat_row
FROM by_language
GROUP BY language
SELECT language, AVG(salary) AS salary
FALSE AS _caveat_field_language
GROUP_OR(_caveat_field_salary,
_caveat_row) AS _caveat_field_salary
GROUP_AND(_caveat_row) AS _caveat_row
FROM by_language
GROUP BY language
What calls to caveat() appear in the derivation of the specified element?
Analogous to program slicing.
Eliminate lines of code not relevant
to computing a specific value.
This is exactly what a database optimizer does.
Lookup: Caveats on $\texttt{R.A}[i]$
SELECT A
FROM R
WHERE ROWID = i
All calls to caveat() surviving optimization
(probably) affect the target.
WITH data_source AS
SELECT caveat(A, 'valid if '& B &' is within tolerances.') AS A,
C, D, E
FROM R
SELECT C, D, E FROM data_source WHERE ROWID = i
becomes
SELECT 'valid if '& B &' is within tolerances.'
AS caveat_message
FROM R WHERE ROWID = i
✔ | SQL |
✔ | R (sort of) |
🗶 | Spreadsheets |
🗶 | Python |
The Exception That Improves The Rule
Juliana Freire, Boris Glavic, Oliver Kennedy, Heiko Mueller
Spreadsheet Operations → SQL DDL / SQL DML
Ok... so we have an edit history in DDL/DML.
Using Reenactment to Retroactively Capture Provenance for Transactions
Bahareh Sadat Arab, Dieter Gawlick, Vasudha Krishnaswamy, Venkatesh Radhakrishnan, Boris Glavic
Graceful database schema evolution: the PRISM workbench
Carlo Curino, Hyun Jin Moon, Carlo Zaniolo
UPDATE R SET A = 'foo' WHERE ROWID = 3;
becomes
SELECT CASE ROWID
WHEN 3 THEN 'foo'
ELSE A END AS A,
B, C, /* ... */
FROM R
INSERT INTO R() VALUES ();
becomes
SELECT * FROM R
UNION ALL
SELECT NULL AS A, NULL AS B,
NULL AS C, /* ... */
ALTER TABLE R ADD COLUMN `bar`;
becomes
SELECT *, NULL as `bar` FROM R;
$> pip3 install --user vizier-webapi
$> vizier
Students | ||||
---|---|---|---|---|
Poonam |
Will |
Aaron |
Dev |
---|
Mike |
Alumni | ||||||
---|---|---|---|---|---|---|
Ying |
Niccolò |
Arindam |
Shivang |
Olivia |
Gourab |
External Collaborators | |||||
---|---|---|---|---|---|
Zhen Hua Liu (Oracle) |
Ying Lu (Oracle) |
Beda Hammerschmidt (Oracle) |
Boris Glavic (IIT) |
Su Feng (IIT) |
Juliana Freire (NYU) |
Heiko Mueller (NYU) |
Sonia Castelo Quispe (NYU) |
Carlos Bautista (NYU) |
Remi Rampin (NYU) |
Vizier is supported by NSF Awards ACI-1640864 and #IIS-1750460 and gifts from Oracle