Embracing Uncertainty

Don't Wrangle, Guess Instead

with

Students
Poonam (PhD-3Y)	Will (PhD-2Y)	Aaron (PhD-3Y)	Shivang (MS-2Y)	Olivia (BS-Sr)

Alumni
Ying (PhD 2017)	Niccolò (PhD 2016)	Arindam (MS 2016)

Dev
Mike (Sr. Rsrch. Dev.)

External Collaborators
Dieter Gawlick (Oracle)	Zhen Hua Liu (Oracle)	Ronny Fehling (Airbus)	Beda Hammerschmidt (Oracle)

Boris Glavic
(IIT)

Su Feng
(IIT)

Juliana Freire
(NYU)

Wolfgang Gatterbauer
(NEU)

Heiko Mueller
(NYU)

Remi Rampin
(NYU)

A Big Data Fairy Tale

Meet Alice

(OpenClipArt.org)

Alice has a Store

(OpenClipArt.org)

→

Alice's store collects sales data

(OpenClipArt.org)

Alice wants to use her sales data to run a promotion

(OpenClipArt.org)

→

So Alice loads up her sales data in her trusty database/hadoop/spark/etc... server.

(OpenClipArt.org)

+ ?

... asks her question ...

(OpenClipArt.org)

+ ? →

... and basks in the limitless possibilities of big data.

(OpenClipArt.org)

Why is this a fairy tale?

→

It's never this easy...

CSV Import

Run a `SELECT` on a raw CSV File

File may not have column headers
CSV does not provide "types"
Lines may be missing fields
Fields may be mistyped (typo, missing comma)
Comment text can be inlined into the file

State of the art: External Table Defn + "Manually" edit CSV

Merge Two Datasets

`UNION` two data sources

Schema matching
Deduplication
Format alignment (GIS coordinates, $ vs €)
Precision alignment (State vs County)

State of the art: Manually map schema

JSON Shredding

Run a `SELECT` on JSON or a Doc Store

Separating fields and record sets:
(e.g., { A: "Bob", B: "Alice" })
Missing fields (Records with no 'address')
Type alignment (Records with 'address' as an array)
Schema matching$^2$

State of the art: DataGuide, Wrangler, etc...

Loading requires curation...

Data Curation is Hard!

State of the Art

(skilledup.com)

Alice spends weeks curating her data before using it.

Relational databases make this worse...

The data needs...

... a complete schema (e.g., Tables, Columns, Types, ...).
... to satisfy constraints (e.g., NOT NULL, Key, F-Key).

This is all required upfront. Before asking a single question.

Relational DBs are useless in early stages of curation.

Why?

In the name of Codd,
thou shalt not give the user a wrong answer.

There are tons of good heuristics available for guessing how to clean data.

Thou shalt not give the user a wrong answer.

... but what if we did?

What would it take for that to be ok?

Industry says...

My phone is guessing, but is letting me know that it did

Apple iOS 10; Phone App

Good Explanations, Alternatives, and Feedback Vectors

Bing Translate (c.a. 2016)

Communication

What data is uncertain?
Why is my data uncertain?
How bad is it?
What can I do about it?

What if a database did the same?

(they can)

On representing incomplete information in a relational data base

T. Imielinski & W. Lipski Jr.(VLDB 1981)

Incomplete and Probabilistic Databases
have existed since the 1980s

We've gotten good at query processing on uncertain data.
But not sourcing uncertain data ... or communicating results to humans.

Challenges

Where do probabilities/possible worlds come from?
How do humans use the output of probabilistic queries?
Probablistic DB queries are sloooooow.

A small shift in how we think about PDBs addresses all three points.

It's not the data that's uncertain,
it's the interpretation

Time	Sensor Reading	Temp Around Sensor
1	31.6	Roughly 31.6˚C
2	-999	Around 30˚C?
3	28.1	Roughly 28.1˚C?
4	32.2	Roughly 32.2˚C

The reading is deterministic

... but what we care about is what the reading measures

Insight: Treat data as 100% deterministic.

Instead, queries propose alternative interpretations.

ratings1.rating matches either ratings2.numratings or ratings2.evaluation.


	SELECT pid, rating FROM ratings1 UNION ALL
	SELECT pid, num_ratings AS rating FROM ratings2;


	SELECT pid, rating FROM ratings1 UNION ALL
	SELECT pid, evaluation AS rating FROM ratings2;

Repair missing values in rating


 SELECT pid, CASE WHEN rating is null 
             THEN interpolate(...) ELSE rating END AS rating 
 FROM ratings;


 SELECT pid, CASE WHEN rating is null 
             THEN classifier(...) ELSE rating END AS rating 
 FROM ratings;

or ...

Effects

It's clear where uncertainty comes from.
Results can be communicated through provenance.
Query evaluation is decoupled from physical layout.

Non-Deterministic Queries

Non-deterministic queries reference an external configuration.

(OpenClipArt.org)

VGTerms

A $VGTerm(\ldots)$ references configuration parameters
(aka "variables").

$VGTerm('X')$ references the variable $X$
$VGTerm('X', 1)$ references the variable $X_{1}$
$VGTerm('X', B)$ evaluates $B$ and then references the variable $X_{B}$

Lenses: An On-Demand Approach to ETL; Yang et. al.; VLDB 2015

$VGTerm()$s can be used like normal expressions


                  SELECT A, VGTerm('X', B) AS C FROM R;

R	A	B
	1	2
	3	4
	5	4

Q(R)	A	C
	1	$X_2$
	3	$X_4$
	5	$X_4$

... variables are identified by a family (i.e. $'X'$),
and optional indexes (i.e., $B$).

Schema Matching

$$ratings2(pid, num\_ratings, evaluation) \rightarrow (pid, rating)$$


 SELECT 
    pid, 
    CASE VGTerm('MATCH_RATING') 
      WHEN 'NUM_RATINGS' THEN num_ratings
      WHEN 'EVALUATION'  THEN evaluation
      ELSE null
    END AS rating
 FROM ratings2;

One global configuration variable decides which column gets mapped to "rating".

Missing Value Imputation

$$ratings1(pid, rating, review\_ct) \text{ s.t. } rating \text{ is not NULL}$$


 SELECT 
    pid, 
    CASE WHEN rating IS NULL
         THEN VGTerm('RATING', ROWID) 
         ELSE rating
      END AS rating,
    review_ct
 FROM ratings1;

A family of variables indexed by ROWID represent each imputed value.

Defining Configurations

Models designate one "best-guess" configuration.

Example Models

Imputation using a SparkML classifier
Heuristic detection of order-by columns for interpolation
Schema matching based on edit-distance
MayBMS-style probabilistic repair-key
And more...

Convenience Operators: Lenses

Lenses instantiate/train a model and wrap a query

Domain Constraint Repair / Missing Value Imputation †
Schema Matching †
Sequence Repair
Key Repair
Arbitrary Choice
Type Detection *
Header Detection *
JSON Shredder *

†Lenses: An On-Demand Approach to ETL; Yang et. al.; VLDB 2015
*Adaptive Schema Databases; Spoth et. al.; CIDR 2017

Probabilistic ETL

ETL: Extract/Transform/Load

One big query that gets you to a clean dataset

Challenge: Designing ETL pipelines can be a full-time job.

Mimir starts with the default "guess" configuration.

As users explore, they validate or refine guesses for configuration variables as necessary.

Useful Provenance Questions

How much of my query result is affected by unvalidated variables?
Which variables affect my query results?
How bad is the situation?

Provenance Question 1

How much of my query result is affected by unvalidated variables?

Idea: Mark values in query results that depend on unvalidated variables.

Communicating Data Quality in On-Demand Curation; Kumari et. al.; QDB 2016

Non-Determinism Taint


 SELECT A, VGTerm('X', ROWID) AS B FROM R;

↓ ↓ ↓ ↓


 SELECT A, VGTerm('X', ROWID) AS B, 
        FALSE AS ROW_TAINTED,
        FALSE AS A_TAINTED,
        TRUE AS B_TAINTED
  FROM R;

The Mimir compiler adds *_TAINTED fields to each row.

Non-Determinism Taint

A row is untainted if...: ... we can guarantee that it (or a counterpart) appears in the result regardless of configuration.
A cell is untainted if...: ... we can guarantee that its value in the result is independent of the configuration.

Non-Determinism Taint


 SELECT A, CASE WHEN B IS NULL 
                THEN VGTerm('X', ROWID) 
                ELSE B END AS B 
 FROM R;

↓ ↓ ↓ ↓


 SELECT A, CASE WHEN B IS NULL 
                THEN VGTerm('X', ROWID) 
                ELSE B END AS B, 
        FALSE AS ROW_TAINTED, FALSE AS A_TAINTED,
        (B IS NULL) AS B_TAINTED
  FROM R;

Expressions with VGTerms can be conditionally tainted.


 CREATE VIEW R_CLEANED AS
    SELECT A, CASE WHEN B IS NULL 
                   THEN VGTerm('X', ROWID) 
                   ELSE B END AS B 
    FROM R;

 SELECT A, SUM(B) AS B FROM R_CLEANED GROUP BY A;

↓ ↓ ↓ ↓


 SELECT A, SUM(B) AS B, 
        FALSE AS A_TAINTED,
        GROUP_OR(B_TAINTED OR ROW_TAINTED) 
          OR (SELECT GROUP_OR(A_TAINTED) FROM R_CLEANED) AS B_TAINTED
        GROUP_AND(A_TAINTED OR ROW_TAINTED) AS ROW_TAINTED
 FROM R_CLEANED;

Aggregates work too!

Taint Benefits

Much faster than classical Probabilistic DBs
(comparable to deterministic queries).
At-a-glance visual of how bad your data is.
Can help to focus subsequent analysis.

Taint Limitations

Taint is (probably *) C-Sound, but (usually *) not C-Complete.
Taint on group-by aggregates can be misleading.
Taint does not work well with set difference.

In spite of this, taint works well in practice.

*Ongong work w/ Su Feng, Aaron Huber, Boris Glavic

Provenance Question 2

Which variables affect my query results?

Idea: Static dependency analysis produces a list of variable families and queries to generate all relevant indexes.

Mimir: Bringing CTables into Practice; Nandi et. al.; ArXiV

Provenance Question 3

How bad is the situation?

Idea: Sample from the space of alternatives to...

Estimate error, expectations, or other statistical measures.
Highlight other possible query results.
Compute sensitivity (Kanagal & Deshpande; SIGMOD 2011)

Sampling is slooooow

Trivial Sampling

Evaluate the query $N$ times.
Plug in samples instead of best guesses.

Better Solutions

Merge evaluation to mitigate redundancy.

Sparse Encoding

$R_1$	A	B
	1	2
	3	4

$R_2$	A	B
	1	5

➔

A	B	S#
1	2	1
3	4	1
1	5	2

Tuple Bundles

$R_1$	A	B
	1	2
	3	4

$R_2$	A	B
	1	5

➔

$R_{bundle}$	A	B	$\phi$
	1	[2,5]	[T,T]
	3	4	[T,F]

Tuple Bundles faster on Aggregates
Sparse Evaluation faster on Non-Deterministic Joins.

Which one to use?

Either!

Mimir isn't committed to one fixed data representation.

(optimization is a work in progress)

Demo

http://mimirdb.info

It's not the data that's uncertain, it's the interpretation.
Tagged best-guess evaluation is faster and easier to understand.
Not committing to one representation allows faster query processing.

Thanks!

Don't Wrangle, Guess Instead

with

A Big Data Fairy Tale

Meet Alice

Alice has a Store

Alice's store collects sales data

Alice wants to use her sales data to run a promotion

So Alice loads up her sales data in her trusty database/hadoop/spark/etc... server.

... asks her question ...

... and basks in the limitless possibilities of big data.

Why is this a fairy tale?

It's never this easy...

CSV Import

Run a SELECT on a raw CSV File

Merge Two Datasets

UNION two data sources

JSON Shredding

Run a SELECT on JSON or a Doc Store

Data Curation is Hard!

State of the Art

Relational databases make this worse...

Why?

In the name of Codd,thou shalt not give the user a wrong answer.

... but what if we did?

What would it take for that to be ok?

Industry says...

Communication

What if a database did the same?

(they can)

On representing incomplete information in a relational data base

T. Imielinski & W. Lipski Jr.(VLDB 1981)

Challenges

It's not the data that's uncertain,it's the interpretation

Effects

Non-Deterministic Queries

VGTerms

Schema Matching

Missing Value Imputation

Defining Configurations

Example Models

Convenience Operators: Lenses

Probabilistic ETL

ETL: Extract/Transform/Load

Useful Provenance Questions

Provenance Question 1

Non-Determinism Taint

Non-Determinism Taint

Non-Determinism Taint

Taint Benefits

Taint Limitations

Provenance Question 2

Provenance Question 3

Sampling is slooooow

Trivial Sampling

Better Solutions

Sparse Encoding

Tuple Bundles

Either!

Demo

Run a `SELECT` on a raw CSV File

`UNION` two data sources

Run a `SELECT` on JSON or a Doc Store

In the name of Codd,
thou shalt not give the user a wrong answer.

It's not the data that's uncertain,
it's the interpretation