Don't Wrangle, Guess

Don't Wrangle, Guess Instead

with

Students

Poonam
(PhD-3Y)

Will
(PhD-2Y)

Aaron
(PhD-3Y)

Shivang
(MS-2Y)

Olivia
(BS-Sr)

Alumni

Ying
(PhD 2017)

Niccolò
(PhD 2016)

Arindam
(MS 2016)

Dev

Mike
(Sr. Rsrch. Dev.)

External Collaborators
Dieter Gawlick
(Oracle)
Zhen Hua Liu
(Oracle)
Ronny Fehling
(Airbus)
Beda Hammerschmidt
(Oracle)
Boris Glavic
(IIT)
Su Feng
(IIT)
Juliana Freire
(NYU)
Wolfgang Gatterbauer
(NEU)
Heiko Mueller
(NYU)
Remi Rampin
(NYU)

A Big Data Fairy Tale

Meet Alice

(OpenClipArt.org)

Alice has a Store

(OpenClipArt.org)

Alice's store collects sales data

(OpenClipArt.org)
+ =

Alice wants to use her sales data to run a promotion

(OpenClipArt.org)

So Alice loads up her sales data in her trusty database/hadoop/spark/etc... server.

(OpenClipArt.org)
+ ?

... asks her question ...

(OpenClipArt.org)
+ ? →

... and basks in the limitless possibilities of big data.

(OpenClipArt.org)

Why is this a fairy tale?

It's never this easy...

CSV Import

Run a SELECT on a raw CSV File

  • File may not have column headers
  • CSV does not provide "types"
  • Lines may be missing fields
  • Fields may be mistyped (typo, missing comma)
  • Comment text can be inlined into the file

State of the art: External Table Defn + "Manually" edit CSV

Merge Two Datasets

UNION two data sources

  • Schema matching
  • Deduplication
  • Format alignment (GIS coordinates, $ vs €)
  • Precision alignment (State vs County)

State of the art: Manually map schema

JSON Shredding

Run a SELECT on JSON or a Doc Store

  • Separating fields and record sets:
    (e.g., { A: "Bob", B: "Alice" })
  • Missing fields (Records with no 'address')
  • Type alignment (Records with 'address' as an array)
  • Schema matching$^2$

State of the art: DataGuide, Wrangler, etc...

Loading requires curation...

Data Curation is Hard!

State of the Art

(skilledup.com)

Alice spends weeks curating her data before using it.

Relational databases make this worse...

The data needs...

  • ... a complete schema (e.g., Tables, Columns, Types, ...).
  • ... to satisfy constraints (e.g., NOT NULL, Key, F-Key).

This is all required upfront. Before asking a single question.

Relational DBs are useless in early stages of curation.

Why?

In the name of Codd,
thou shalt not give the user a wrong answer.

There are tons of good heuristics available for guessing how to clean data.

Thou shalt not give the user a wrong answer.

... but what if we did?

What would it take for that to be ok?

Industry says...

            

My phone is guessing, but is letting me know that it did

Apple iOS 10; Phone App

Good Explanations, Alternatives, and Feedback Vectors

Bing Translate (c.a. 2016)

Communication

  • What data is uncertain?
  • Why is my data uncertain?
  • How bad is it?
  • What can I do about it?

What if a database did the same?

(they can)

On representing incomplete information in a relational data base

T. Imielinski & W. Lipski Jr.(VLDB 1981)

Incomplete and Probabilistic Databases
have existed since the 1980s

Q(D) Q(D) Q(D) Q(D) ? Probability Expectation Variance Histogram

We've gotten good at query processing on uncertain data.
But not sourcing uncertain data ... or communicating results to humans.

Challenges

  • Where do probabilities/possible worlds come from?
  • How do humans use the output of probabilistic queries?
  • Probablistic DB queries are sloooooow.

A small shift in how we think about PDBs addresses all three points.

It's not the data that's uncertain,
it's the interpretation

TimeSensor ReadingTemp Around Sensor
131.6Roughly 31.6˚C
2-999Around 30˚C?
328.1Roughly 28.1˚C?
432.2Roughly 32.2˚C

The reading is deterministic

... but what we care about is what the reading measures

Q1(D) Q2(D) Q3(D) Q4(D)

Insight: Treat data as 100% deterministic.

Instead, queries propose alternative interpretations.

ratings1.rating matches either ratings2.numratings or ratings2.evaluation.


	SELECT pid, rating FROM ratings1 UNION ALL
	SELECT pid, num_ratings AS rating FROM ratings2;
					
or

	SELECT pid, rating FROM ratings1 UNION ALL
	SELECT pid, evaluation AS rating FROM ratings2;
					

Repair missing values in rating


 SELECT pid, CASE WHEN rating is null 
             THEN interpolate(...) ELSE rating END AS rating 
 FROM ratings;
					
or

 SELECT pid, CASE WHEN rating is null 
             THEN classifier(...) ELSE rating END AS rating 
 FROM ratings;
					
or ...

Effects

  1. It's clear where uncertainty comes from.
  2. Results can be communicated through provenance.
  3. Query evaluation is decoupled from physical layout.

Non-Deterministic Queries

Q(D) Q1(D) 1 Q2(D) 2 Q1(D) 1

Non-deterministic queries reference an external configuration.

(OpenClipArt.org)

VGTerms

A $VGTerm(\ldots)$ references configuration parameters
(aka "variables").

  • $VGTerm('X')$ references the variable $X$
  • $VGTerm('X', 1)$ references the variable $X_{1}$
  • $VGTerm('X', B)$ evaluates $B$ and then references the variable $X_{B}$
Lenses: An On-Demand Approach to ETL; Yang et. al.; VLDB 2015

$VGTerm()$s can be used like normal expressions


                  SELECT A, VGTerm('X', B) AS C FROM R;
					
RAB
12
34
54
Q(R)AC
1$X_2$
3$X_4$
5$X_4$
 

... variables are identified by a family (i.e. $'X'$),
and optional indexes (i.e., $B$).

Schema Matching

$$ratings2(pid, num\_ratings, evaluation) \rightarrow (pid, rating)$$

 SELECT 
    pid, 
    CASE VGTerm('MATCH_RATING') 
      WHEN 'NUM_RATINGS' THEN num_ratings
      WHEN 'EVALUATION'  THEN evaluation
      ELSE null
    END AS rating
 FROM ratings2;
					

One global configuration variable decides which column gets mapped to "rating".

Missing Value Imputation

$$ratings1(pid, rating, review\_ct) \text{ s.t. } rating \text{ is not NULL}$$

 SELECT 
    pid, 
    CASE WHEN rating IS NULL
         THEN VGTerm('RATING', ROWID) 
         ELSE rating
      END AS rating,
    review_ct
 FROM ratings1;
					

A family of variables indexed by ROWID represent each imputed value.

Defining Configurations

Config. Model Model Model All assignments for one family. Description of the family in English. Other feasible assignments. Config. Config. (Best)

Models designate one "best-guess" configuration.

Example Models

  • Imputation using a SparkML classifier
  • Heuristic detection of order-by columns for interpolation
  • Schema matching based on edit-distance
  • MayBMS-style probabilistic repair-key
  • And more...

Convenience Operators: Lenses

Lenses instantiate/train a model and wrap a query

  • Domain Constraint Repair / Missing Value Imputation †
  • Schema Matching †
  • Sequence Repair
  • Key Repair
  • Arbitrary Choice
  • Type Detection *
  • Header Detection *
  • JSON Shredder *
†Lenses: An On-Demand Approach to ETL; Yang et. al.; VLDB 2015
*Adaptive Schema Databases; Spoth et. al.; CIDR 2017

Probabilistic ETL

ETL: Extract/Transform/Load

One big query that gets you to a clean dataset

Challenge: Designing ETL pipelines can be a full-time job.

Mimir starts with the default "guess" configuration.

As users explore, they validate or refine guesses for configuration variables as necessary.

Useful Provenance Questions

  1. How much of my query result is affected by unvalidated variables?
  2. Which variables affect my query results?
  3. How bad is the situation?

Provenance Question 1

How much of my query result is affected by unvalidated variables?

Idea: Mark values in query results that depend on unvalidated variables.

Communicating Data Quality in On-Demand Curation; Kumari et. al.; QDB 2016

Non-Determinism Taint


 SELECT A, VGTerm('X', ROWID) AS B FROM R;
					
↓    ↓    ↓    ↓

 SELECT A, VGTerm('X', ROWID) AS B, 
        FALSE AS ROW_TAINTED,
        FALSE AS A_TAINTED,
        TRUE AS B_TAINTED
  FROM R;
					

The Mimir compiler adds *_TAINTED fields to each row.

Non-Determinism Taint

A row is untainted if...
... we can guarantee that it (or a counterpart) appears in the result regardless of configuration.
A cell is untainted if...
... we can guarantee that its value in the result is independent of the configuration.

Non-Determinism Taint


 SELECT A, CASE WHEN B IS NULL 
                THEN VGTerm('X', ROWID) 
                ELSE B END AS B 
 FROM R;
					
↓    ↓    ↓    ↓

 SELECT A, CASE WHEN B IS NULL 
                THEN VGTerm('X', ROWID) 
                ELSE B END AS B, 
        FALSE AS ROW_TAINTED, FALSE AS A_TAINTED,
        (B IS NULL) AS B_TAINTED
  FROM R;
					

Expressions with VGTerms can be conditionally tainted.


 CREATE VIEW R_CLEANED AS
    SELECT A, CASE WHEN B IS NULL 
                   THEN VGTerm('X', ROWID) 
                   ELSE B END AS B 
    FROM R;

 SELECT A, SUM(B) AS B FROM R_CLEANED GROUP BY A;
					
↓    ↓    ↓    ↓

 SELECT A, SUM(B) AS B, 
        FALSE AS A_TAINTED,
        GROUP_OR(B_TAINTED OR ROW_TAINTED) 
          OR (SELECT GROUP_OR(A_TAINTED) FROM R_CLEANED) AS B_TAINTED
        GROUP_AND(A_TAINTED OR ROW_TAINTED) AS ROW_TAINTED
 FROM R_CLEANED;
					

Aggregates work too!

Taint Benefits

  • Much faster than classical Probabilistic DBs
    (comparable to deterministic queries).
  • At-a-glance visual of how bad your data is.
  • Can help to focus subsequent analysis.

Taint Limitations

  • Taint is (probably *) C-Sound, but (usually *) not C-Complete.
  • Taint on group-by aggregates can be misleading.
  • Taint does not work well with set difference.

In spite of this, taint works well in practice.

*Ongong work w/ Su Feng, Aaron Huber, Boris Glavic

Provenance Question 2

Which variables affect my query results?

Idea: Static dependency analysis produces a list of variable families and queries to generate all relevant indexes.

Mimir: Bringing CTables into Practice; Nandi et. al.; ArXiV

Provenance Question 3

How bad is the situation?

Idea: Sample from the space of alternatives to...

  • Estimate error, expectations, or other statistical measures.
  • Highlight other possible query results.
  • Compute sensitivity (Kanagal & Deshpande; SIGMOD 2011)

Sampling is slooooow

Trivial Sampling

Evaluate the query $N$ times.
Plug in samples instead of best guesses.

Better Solutions

Merge evaluation to mitigate redundancy.

Sparse Encoding

$R_1$AB
12
34
$R_2$AB
15
$R_{sparse}$ABS#
121
341
152

Tuple Bundles

$R_1$AB
12
34
$R_2$AB
15
$R_{bundle}$AB$\phi$
1[2,5][T,T]
34[T,F]
  • Tuple Bundles faster on Aggregates
  • Sparse Evaluation faster on Non-Deterministic Joins.

Which one to use?

Either!

Mimir isn't committed to one fixed data representation.

(optimization is a work in progress)

Demo


http://mimirdb.info

  • It's not the data that's uncertain, it's the interpretation.
  • Tagged best-guess evaluation is faster and easier to understand.
  • Not committing to one representation allows faster query processing.

Thanks!