Embracing uncertainty with

Joint work with:

Ying Yang, Niccolo Meneghetti, Arindam Nandi,
Vinayak Karuppasamy (now at Bloomberg), Ronny Fehling (Airbus),
Zhen-Hua Liu (Oracle), Dieter Gawlick (Oracle),
Boris Glavic (IIT), Wolfgang Gatterbauer (CMU)
(and soon Juliana Freire (NYU))

A Big Data Fairy Tale

Meet Alice

(OpenClipArt.org)

Alice has a Store

(OpenClipArt.org)

→

Alice's store collects sales data

(OpenClipArt.org)

Alice wants to use her sales data to run a promotion

(OpenClipArt.org)

→

So Alice loads up her sales data in her trusty database/hadoop/spark/etc... server.

(OpenClipArt.org)

+ ?

... asks her question ...

(OpenClipArt.org)

+ ? →

... and basks in the limitless possibilities of big data.

(OpenClipArt.org)

Why is this a fairy tale?

→

It's never this easy...

CSV Import

Run a `SELECT` on a raw CSV File

File may not have column headers
CSV does not provide "types"
Lines may be missing fields
Fields may be mistyped (typo, missing comma)
Comment text can be inlined into the file

State of the art: External Table Defn + "Manually" edit CSV

Merge Two Datasets

`UNION` two data sources

Schema matching
Deduplication
Format alignment (GIS coordinates, $ vs €)
Precision alignment (State vs County)

State of the art: Manually map schema

JSON Shredding

Run a `SELECT` on JSON or a Doc Store

Separating fields and record sets:
(e.g., { A: "Bob", B: "Alice" })
Missing fields (Records with no 'address')
Type alignment (Records with 'address' as an array)
Schema matching$^2$

State of the art: DataGuide, Wrangler, etc...

Data Cleaning is Hard!

State of the Art

(skilledup.com)

Alice spends weeks cleaning her data before using it.

Newer State of the Art

(azure.microsoft.com)

(timoelliott.com)

Structure is hard!

Structured models (RelDBs) force curation during loading.
- Problem: All curation costs are upfront.
Unstructured models (NoSQL) force curation into queries.
- Problem: Complexity/redundancy blowup in queries.

Add structure, curation effort On-Demand

But... you still need some sort of structure?!?

Let the database make a guess!

In the name of Codd,
thou shalt not give the user a wrong answer.

... but what if we did?

What would it take for that to be ok?

Industry says...

My phone is guessing, but is letting me know that it did

Microsoft Image ID tells you something (and how sure it is)

Easy interactions to accept, reject, or explain uncertainty

Communication

Why is my data uncertain?
How bad is it?
What can I do about it?

What if a database did the same?

A: Standard SQL.
B: Annotated Output.
C: Lens Diagram.
D: Result Explanations.

Mimir

Lenses: Generic, best-guess data curation operators.
Explanations: How certain is my data?
Provenance: What issues still need to be fixed?

Lenses

Here's a problem with my data. Fix it.

What type is this column? (majority vote)
How do the columns of these relations line up? (pick your favorite schema matching paper)
How do I query heterogeneous JSON objects? (see above)
What should these missing values be? (learning-based interpolation)

Lenses introduce uncertainty

(OpenClipArt.org)

The User's View


  SELECT NAME, DEPARTMENT FROM PRODUCTS;

Name	Department
Apple 6s, White	Phone
Dell, Intel 4 core	Computer
HP, AMD 2 core	Computer
...	...

Simple UI: Highlight values that are based on guesses.


  SELECT NAME, DEPARTMENT FROM PRODUCTS;

Name	Department
Apple 6s, White	Phone
Dell, Intel 4 core	Computer
HP, AMD 2 core	Computer
...	...

Allow users to EXPLAIN uncertain outputs

Explanations include reasons given in English

$PRODUCTS.DEPARTMENT_{3}$

⬍

"I guessed 'Computer' for 'Department' on Row '3'"

Explanations

Mark uncertain data and results.
Upon request, provide more detail:
- Why is my data uncertain? (provenance)
- How bad is it? (confidence, entropy, bounds)
- What are other possibile answers? (samples)
- What can I do to fix it? (repairs)

Mimir is a DB Overlay

Mimir virtualizes uncertainty (OpenClipArt.org)

How?

Labeled Nulls

$Var(\ldots)$ constructs new variables

$Var('X')$ constructs a new variable $X$
$Var('X', 1)$ constructs a new variable $X_{1}$
$Var('X', ROWID)$ evaluates $ROWID$ and then constructs a new variable $X_{ROWID}$

Lazy Evaluation

Variables can't be evaluated until they are bound.
So, we allow arbitrary expressions to represent data.

$X$ is a legitimate data value.
$X+1$ is a legitimate data value.
$1+1$ is a legitimate data value, but can be reduced to $2$.

A lazy value without variables is deterministic

Mimir SQL allows the $Var()$ operator to inlined


                  SELECT A, VAR('X', B)+2 AS C FROM R;

A	B
1	2
3	4
5	6

A	C
1	$X_2+2$
3	$X_4+2$
5	$X_6+2$

Selects on $Var()$ need to be deferred too...


                  SELECT A FROM R WHERE VAR('X', B) > 2;

A	B
1	2
3	4
5	6

A	$\phi$
1	$X_2>2$
3	$X_4>2$
5	$X_6>2$

When evaluating the table, rows where $\phi = \bot$ are dropped.

C-Tables

Original Formulation [Imielinski, Lipski 1981]
PC-Tables [Green, Tannen 2006]
Systems
- Orchestra [Green, Karvounarakis, Taylor, Biton, Ives, Tannen 2007]
- MayBMS [Huang, Antova, Koch, Olteanu 2009]
- Pip [Kennedy, Koch 2009]
- Sprout [Fink, Hogue, Olteanu, Rath 2011]
Generalized PC-Tables [Kennedy, Koch 2009]

Labeled nulls capture a lens' uncertainty


  CREATE LENS PRODUCTS 
     AS SELECT * FROM PRODUCTS_RAW
     USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);

is (almost) the same as the query...


  CREATE VIEW PRODUCTS 
     AS SELECT ID, NAME, ...,
          CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
               ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
          END AS DEPARTMENT
     FROM PRODUCTS_RAW;

ID	Name	...	Department
123	Apple 6s, White	...	Phone
34234	Dell, Intel 4 core	...	Computer
34235	HP, AMD 2 core	...	$Prod.Dept_3$
...	...	...	...


  CREATE LENS PRODUCTS 
     AS SELECT * FROM PRODUCTS_RAW
     USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);

Behind the scenes, a lens also creates a model...


                      SELECT * FROM PRODUCTS_RAW;

↓

An estimator for $PRODUCTS.DEPARTMENT_{ROWID}$

... but databases don't support labeled nulls

Labeled Nulls Percolate Up


                  SELECT A, VAR('X', B)+2 AS C FROM R;

Mimir dispatches this query to the DB:


                          SELECT A, B FROM R;

And for each row of the result, evaluates:


               SELECT A, VAR('X', B)+2 AS C FROM RESULT;

Generating Explanations

All uncertainty comes from labeled nulls in the expressions that Mimir evaluates for each row of the output.

Why is the data uncertain?: All relevant lenses referenced in VAR('X', B)+2.
How uncertain?: Estimate by sampling from VAR('X', B).
How do I fix it?: Each lens fixes one well-defined type of error.

Lazy evaluation can cause problems


        SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;

Mimir dispatches this query to the DB:


      SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2 FROM R, S;

And for each row of the result, evaluates:


      SELECT A, C FROM RESULT WHERE VAR('X', TEMP_1) = TEMP_2;

Helper views allow the DB to interpret labeled nulls


      SELECT R.A, S.C FROM R, S
      WHERE S.B = (SELECT VALUE FROM VARIABLE_X WHERE KEY = R.B);

... but we lose the ability to explain outputs

Provenance Recovers Explanations


        SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;

Mimir dispatches this query to the DB:


      SELECT R.A, S.C, 
             R.ROWID AS ID_1, S.ROWID AS ID_2
      WHERE S.B = (SELECT VALUE FROM VARIABLE_X WHERE KEY = R.B);

Then to explain, Mimir dispatches the query:


      SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2
      WHERE R.ROWID = ID_1 AND S.ROWID = ID_2

Performance

TPC-H Data, but replace 0.1% of FK references with NULL. Ask Mimir to fix.

(a worst case from a performance standpoint)

Query 1: Table scan. Overhead for a no-op.
Query 3: 3-way join on an FK chain.
Query 5: 6-way join on an FK tree.
Query 9: 6-way join with cycles.

Mimir over SQLite in 4 different execution modes.
100% = Zero overhead

On-Demand Data Curation makes data exploration easier.
"Best-Guess" results streamline analytics.
... if the DB communicates the resulting uncertainty.

Questions?

Backup Slides

C-Tables

Original Formulation [Imielinski, Lipski 1981]
PC-Tables [Green, Tannen 2006]
Systems
- Orchestra [Green, Karvounarakis, Taylor, Biton, Ives, Tannen 2007]
- MayBMS [Huang, Antova, Koch, Olteanu 2009]
- Pip [Kennedy, Koch 2009]
- Sprout [Fink, Hogue, Olteanu, Rath 2011]
Generalized PC-Tables [Kennedy, Koch 2009]

Lenses

A VG-RA Expression
A 'Model' that defines for each variable...
- A sampling process
- A best guess estimator
- A human-readable description

Lenses implement PC-Tables


  CREATE LENS PRODUCTS 
     AS SELECT * FROM PRODUCTS_RAW
     USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);

AS clause defines source data.
USING clause requests repairs.

Selection (Filtering)


                  SELECT NAME FROM PRODUCTS
                  WHERE DEPARTMENT='PHONE' 
                     AND ( VENDOR='APPLE' 
                           OR PLATFORM='ANDROID' )

Recall, row-level uncertainty is a boolean formula $\phi$.

For this query, $\phi$ can be as complex as: $$DEPT_{ROWID}='P\ldots' \wedge \left( VEND_{ROWID}='Ap\ldots' \vee PLAT_{ROWID} = 'An\ldots' \right)$$

Too many variables! Which is the most important?

What is important?

Data Cleaning

Which variables are important?

The ones that keep us from knowing everything

$$D_{ROWID}='P' \wedge \left( V_{ROWID}='Ap' \vee PLAT_{ROWID} = 'An' \right)$$

⬍

$$A \wedge (B \vee C)$$

Naive Approach

Consider a game between a database and an impartial oracle.

The DB picks a variable $v$ in $\phi$ and pays a cost $c_v$.
The Oracle reveals the truth value of $v$.
The DB updates $\phi$ accordingly and repeats until $\phi$ is deterministic.

Naive Algorithm: Pick all variables!

Less Naive Algorithm: Minimize $E\left[\sum c_v\right]$.

Exponential Time Bad!

The Value of What We Don't Know

$$\phi = A \wedge (B \vee C)$$

Generate Samples for $A$, $B$, $C$
Estimate $p(\phi)$
Compute $H[\phi] = -\log\left(p(\phi) \cdot (1-p(\phi))\right)$

Entropy is intuitive:
$H = 1$ means we know nothing,
$H = 0$ means we know everything.

Information Gain

$$\mathcal I_{A \leftarrow \top} (\phi) = H\left[\phi\right] - H\left[\phi(A \leftarrow \top)\right]$$

Information gain of $v$: The reduction in entropy from knowing the truth value of a variable $v$.

Expected Information Gain

$$\mathcal I_{A} (\phi) = \left(p(A)\cdot \mathcal I_{A\leftarrow \top}(\phi)\right) + \left(p(\neg A)\cdot \mathcal I_{A\leftarrow \bot}(\phi)\right)$$

Expected information gain of $v$: The probability-weighted average of the information gain for $v$ and $\neg v$.

The Cost of Perfect Information

Combine Information Gain and Cost

$$f(\mathcal I_{A}(\phi), c_A)$$

For example: $EG2(\mathcal I_{A}(\phi), c_A) = \frac{2^{\mathcal I_{A}(\phi)} - 1}{c_A}$

Greedy Algorithm: Minimize $f(\mathcal I_{A}(\phi), c_A)$ at each step

Experimental Data

Start with a large dataset.
Delete random fields (~50%).

Experimental Queries

Simulate an analyst trying to manually explore correlations.

Train a tree-classifier on the base data.
Convert the decision tree to a query for all rows where the tree predicts a specific value.

Cost vs Entropy: Credit Data

EG2: Greedy Cost/Value Ordering
NMETC: Naive Minimal Expected Total Cost
Random: Completely Random Order

Cost vs Entropy: Product Data

EG2: Greedy Cost/Value Ordering
NMETC: Naive Minimal Expected Total Cost
Random: Completely Random Order

Embracing uncertainty with

Ying Yang, Niccolo Meneghetti, Arindam Nandi, Vinayak Karuppasamy (now at Bloomberg), Ronny Fehling (Airbus), Zhen-Hua Liu (Oracle), Dieter Gawlick (Oracle), Boris Glavic (IIT), Wolfgang Gatterbauer (CMU) (and soon Juliana Freire (NYU))

A Big Data Fairy Tale

Meet Alice

Alice has a Store

Alice's store collects sales data

Alice wants to use her sales data to run a promotion

So Alice loads up her sales data in her trusty database/hadoop/spark/etc... server.

... asks her question ...

... and basks in the limitless possibilities of big data.

Why is this a fairy tale?

It's never this easy...

CSV Import

Run a SELECT on a raw CSV File

Merge Two Datasets

UNION two data sources

JSON Shredding

Run a SELECT on JSON or a Doc Store

Data Cleaning is Hard!

State of the Art

Newer State of the Art

Structure is hard!

But... you still need some sort of structure?!?

Let the database make a guess!

In the name of Codd,thou shalt not give the user a wrong answer.

... but what if we did?

What would it take for that to be ok?

Industry says...

Communication

What if a database did the same?

Mimir

Lenses

The User's View

Explanations

Mimir is a DB Overlay

How?

Labeled Nulls

Lazy Evaluation

C-Tables

Labeled nulls capture a lens' uncertainty

... but databases don't support labeled nulls

Labeled Nulls Percolate Up

Generating Explanations

Lazy evaluation can cause problems

Provenance Recovers Explanations

Performance

Backup Slides

C-Tables

Lenses

Selection (Filtering)

What is important?

Which variables are important?

Naive Approach

Exponential Time Bad!

The Value of What We Don't Know

Information Gain

Expected Information Gain

The Cost of Perfect Information

Experimental Data

Experimental Queries

Cost vs Entropy: Credit Data

Cost vs Entropy: Product Data

Ying Yang, Niccolo Meneghetti, Arindam Nandi,
Vinayak Karuppasamy (now at Bloomberg), Ronny Fehling (Airbus),
Zhen-Hua Liu (Oracle), Dieter Gawlick (Oracle),
Boris Glavic (IIT), Wolfgang Gatterbauer (CMU)
(and soon Juliana Freire (NYU))

Run a `SELECT` on a raw CSV File

`UNION` two data sources

Run a `SELECT` on JSON or a Doc Store

In the name of Codd,
thou shalt not give the user a wrong answer.