Embracing Uncertainty

Exploring **O**nline **D**ata **In**teractions

Arindam Nandi, Vinayak Karuppasamy

(UB, Oracle)

- Validating and Fixing Outliers
- Handling Missing Data
- Matching Schemas
- Fixing Schemas
- Managing Stale Data
- Deduplicating Records
- ... and lots more

Alice spends weeks cleaning her data before using it.

Can we start with automation and work our way up?

Thou shalt not give the user a wrong answer.

- Automate educated guesses for fast cleaning
  - **Lenses**: A family of simple data-cleaning operators
- ... but what if the guesses are wrong? Annotate 'best guess' relations with the guesses
  - **Virtual C-Tables**: A lineage model based on views, labeled nulls, and lazy evaluation
- ... so now the user needs to interpret your guesses? Rank guesses by their impact on result uncertainty
  - **CPI**: A greedy heuristic for ranking sources of uncertainty

Here's a problem with my data. Fix it.

- What type is this column? (majority vote)
- How do the columns of these relations line up? (pick your favorite schema matching paper)
- How do I query heterogeneous JSON objects? (see above)
- What should these missing values be? (learning-based interpolation)
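The first of these tasks is easy to make concrete. Here is a minimal Python sketch of majority-vote type detection; the function name and candidate types are illustrative, not any system's actual implementation:

```python
# A sketch (not the actual implementation) of majority-vote type
# detection: try to parse each cell under candidate types and keep
# the type most cells agree on.

def detect_type(cells):
    votes = {'int': 0, 'float': 0, 'str': 0}
    for cell in cells:
        try:
            int(cell)
            votes['int'] += 1
            continue
        except ValueError:
            pass
        try:
            float(cell)
            votes['float'] += 1
            continue
        except ValueError:
            pass
        votes['str'] += 1   # fallback: anything parses as a string
    return max(votes, key=votes.get)

print(detect_type(['1', '2', 'oops', '4']))   # prints: int
```

The single `'oops'` cell is outvoted: a few dirty values do not change the guessed type, which is exactly the point of voting.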

**VG-RA** (Variable-Generating Relational Algebra)

- Relational Algebra
- Labeled Nulls
- Lazy Evaluation

$Var(\ldots)$ constructs new variables

- $Var('X')$ constructs a new variable $X$
- $Var('X', 1)$ constructs a new variable $X_{1}$
- $Var('X', ROWID)$ evaluates $ROWID$ and then constructs a new variable $X_{ROWID}$

Variables can't be evaluated until they are bound.

So, we allow arbitrary expressions to be values.

- $X$ is a legitimate data value.
- $X+1$ is a legitimate data value.
- $1+1$ is a legitimate data value, but can be reduced to $2$.

A lazy value without variables is **deterministic**
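These ideas fit in a few lines of Python. The `Var` and `Expr` classes below are an illustrative sketch of labeled nulls and lazy values, not Mimir's actual code:

```python
class Var:
    """A labeled null: a variable that cannot be evaluated until bound."""
    def __init__(self, name, label=None):
        self.name = name if label is None else f"{name}_{label}"
    def __add__(self, other):
        return Expr(self, '+', other)
    def __repr__(self):
        return self.name

class Expr:
    """A lazy value: an arbitrary expression kept unevaluated."""
    def __init__(self, left, op, right):
        self.left, self.op, self.right = left, op, right
    def __repr__(self):
        return f"{self.left}{self.op}{self.right}"

def reduce_value(value):
    """A lazy value without variables is deterministic: fold it down."""
    if isinstance(value, Expr):
        left, right = reduce_value(value.left), reduce_value(value.right)
        if isinstance(left, int) and isinstance(right, int) and value.op == '+':
            return left + right
        return Expr(left, value.op, right)
    return value

print(Var('X') + 1)                   # prints: X+1  (stays symbolic)
print(reduce_value(Expr(1, '+', 1)))  # prints: 2    (no variables, reduces)
```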

The $Var()$ operator can be inlined into SQL

```
SELECT A, VAR('X', B)+2 AS C FROM R;
```

Input `R`:

| A | B |
|---|---|
| 1 | 2 |
| 3 | 4 |
| 5 | 6 |

Result:

| A | C |
|---|---|
| 1 | $X_2+2$ |
| 3 | $X_4+2$ |
| 5 | $X_6+2$ |
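This projection can be mimicked in Python; encoding $X_B+2$ as a string stands in for a real expression tree (a simplification for brevity):

```python
# SELECT A, VAR('X', B)+2 AS C FROM R: for every row, evaluate B,
# mint the labeled null X_B, and leave the +2 unevaluated.

R = [{'A': 1, 'B': 2}, {'A': 3, 'B': 4}, {'A': 5, 'B': 6}]

def project_with_var(rows):
    return [{'A': r['A'], 'C': f"X_{r['B']}+2"} for r in rows]

for row in project_with_var(R):
    print(row)
# prints rows with C = X_2+2, X_4+2, X_6+2
```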

Selects on $Var()$ need to be deferred too...

```
SELECT A FROM R WHERE VAR('X', B) > 2;
```

Input `R`:

| A | B |
|---|---|
| 1 | 2 |
| 3 | 4 |
| 5 | 6 |

Result:

| A | $\phi$ |
|---|---|
| 1 | $X_2>2$ |
| 3 | $X_4>2$ |
| 5 | $X_6>2$ |

When evaluating the table, rows where $\phi = \bot$ are dropped.
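A sketch of this deferred selection in Python (the representation of $\phi$ is simplified to just the variable's label):

```python
# SELECT A FROM R WHERE VAR('X', B) > 2: the predicate cannot be
# decided yet, so each row carries its condition phi = (X_B > 2).

R = [{'A': 1, 'B': 2}, {'A': 3, 'B': 4}, {'A': 5, 'B': 6}]

def select_deferred(rows):
    return [{'A': r['A'], 'phi_var': r['B']} for r in rows]

def evaluate(rows, binding):
    # Once every X_i is bound, phi is deterministic; rows where
    # phi evaluates to false are dropped.
    return [r['A'] for r in rows if binding[r['phi_var']] > 2]

table = select_deferred(R)
print(evaluate(table, {2: 1, 4: 10, 6: 3}))   # prints: [3, 5]
```

Under the binding $X_2=1, X_4=10, X_6=3$, the first row's $\phi$ is false and it is dropped.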

- Original Formulation [Imielinski, Lipski 1981]
- PC-Tables [Green, Tannen 2006]
- Systems
  - Orchestra [Green, Karvounarakis, Taylor, Biton, Ives, Tannen 2007]
  - MayBMS [Huang, Antova, Koch, Olteanu 2009]
  - Pip [Kennedy, Koch 2009]
  - Sprout [Fink, Hogue, Olteanu, Rath 2011]

- Generalized PC-Tables [Kennedy, Koch 2009]

- A VG-RA Expression
- A 'Model' that defines, for each variable...
  - A sampling process
  - A best guess estimator
  - A human-readable description
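A hypothetical model for the department variables might look like the sketch below; the class name and heuristics are illustrative, not Mimir's actual code:

```python
import random

class DepartmentModel:
    """One model instance backs all PRODUCTS.DEPARTMENT_rowid variables."""
    def __init__(self, observed):
        self.observed = observed    # departments seen in clean rows

    def sample(self, rowid, rng):
        """Sampling process: draw from the observed distribution."""
        return rng.choice(self.observed)

    def best_guess(self, rowid):
        """Best guess estimator: the most frequent observed value."""
        return max(set(self.observed), key=self.observed.count)

    def reason(self, rowid):
        """Human-readable description of the guess."""
        return f"I guessed '{self.best_guess(rowid)}' for 'Department' on Row '{rowid}'"

model = DepartmentModel(['Phone', 'Computer', 'Computer'])
print(model.best_guess(3))   # prints: Computer
print(model.reason(3))
```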

**Lenses implement PC-Tables**

```
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
```

The `AS` clause defines the source data; the `USING` clause requests repairs.

```
CREATE VIEW PRODUCTS
AS SELECT ID, NAME, ...,
CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
END AS DEPARTMENT
FROM PRODUCTS_RAW;
```
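A Python sketch of what this view computes, using rows drawn from the example table (names are illustrative):

```python
raw = [
    {'rowid': 1, 'name': 'Apple 6s, White', 'department': 'Phone'},
    {'rowid': 2, 'name': 'Dell, Intel 4 core', 'department': 'Computer'},
    {'rowid': 3, 'name': 'HP, AMD 2 core', 'department': None},
]

def domain_repair(rows):
    # CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
    # ELSE VAR('PRODUCTS.DEPARTMENT', ROWID) END
    return [
        dict(r, department=r['department']
             if r['department'] is not None
             else f"PRODUCTS.DEPARTMENT_{r['rowid']}")
        for r in rows
    ]

for r in domain_repair(raw):
    print(r['name'], '|', r['department'])
# the NULL department becomes the labeled null PRODUCTS.DEPARTMENT_3
```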

| ID | Name | ... | Department |
|---|---|---|---|
| 123 | Apple 6s, White | ... | Phone |
| 34234 | Dell, Intel 4 core | ... | Computer |
| 34235 | HP, AMD 2 core | ... | $Prod.Dept_3$ |
| ... | ... | ... | ... |

The lens definition compiles into two parts: a query over the raw data...

```
SELECT * FROM PRODUCTS_RAW;
```

...and an estimator for $PRODUCTS.DEPARTMENT_{ROWID}$.

```
SELECT NAME, DEPARTMENT FROM PRODUCTS;
```

| Name | Department |
|---|---|
| Apple 6s, White | Phone |
| Dell, Intel 4 core | Computer |
| HP, AMD 2 core | Computer |
| ... | ... |

**Simple UI:** Highlight values that are based on guesses.

| Name | Department |
|---|---|
| Apple 6s, White | Phone |
| Dell, Intel 4 core | Computer |
| HP, AMD 2 core | **Computer** (guessed) |
| ... | ... |

Allow users to `EXPLAIN` uncertain outputs. Explanations include reasons given in English.

$PRODUCTS.DEPARTMENT_{3}$

⬍

"I guessed 'Computer' for 'Department' on Row '3'"

**(Generalized) C-Tables are a form of lineage.**

```
SELECT NAME FROM PRODUCTS
WHERE DEPARTMENT='PHONE'
AND ( VENDOR='APPLE'
OR PLATFORM='ANDROID' )
```

Recall, row-level uncertainty is a boolean formula $\phi$.

For this query, $\phi$ can be as complex as: $$DEPT_{ROWID}='P\ldots' \wedge \left( VEND_{ROWID}='Ap\ldots' \vee PLAT_{ROWID} = 'An\ldots' \right)$$

**Too many variables! Which is the most important?**

**For data cleaning:** the ones that keep us from knowing everything.

$$D_{ROWID}='P' \wedge \left( V_{ROWID}='Ap' \vee PLAT_{ROWID} = 'An' \right)$$

⬍

$$A \wedge (B \vee C)$$

Consider a game between a database and an impartial oracle.

- The DB picks a variable $v$ in $\phi$ and pays a cost $c_v$.
- The Oracle reveals the truth value of $v$.
- The DB updates $\phi$ accordingly and repeats until $\phi$ is deterministic.

**Naive Algorithm:** Pick all variables!

**Less Naive Algorithm:** Minimize $E\left[\sum c_v\right]$.
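The game can be simulated in a few lines of Python; the costs and the truth assignment below are hypothetical:

```python
# phi = A and (B or C). The DB reveals variables in some order,
# paying each variable's cost, and stops once phi is determined.

def determined(known):
    """True once phi = A and (B or C) is fixed by the known values."""
    if known.get('A') is False:
        return True                                   # phi is False
    if {'B', 'C'} <= known.keys() and not (known['B'] or known['C']):
        return True                                   # phi is False
    if known.get('A') and (known.get('B') or known.get('C')):
        return True                                   # phi is True
    return False

def play(order, costs, truth):
    """Reveal variables in the given order; return the total cost paid."""
    known, paid = {}, 0.0
    for v in order:
        if determined(known):
            break
        paid += costs[v]
        known[v] = truth[v]
    return paid

costs = {'A': 2.0, 'B': 1.0, 'C': 3.0}
truth = {'A': False, 'B': True, 'C': True}
print(play(['A', 'B', 'C'], costs, truth))   # prints: 2.0 (A=False settles phi)
print(play(['B', 'C', 'A'], costs, truth))   # prints: 6.0 (pays for everything)
```

The ordering matters: here, asking about $A$ first resolves $\phi$ immediately, while asking about $B$ and $C$ first wastes their costs.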

$$\phi = A \wedge (B \vee C)$$

- Generate samples for $A$, $B$, $C$
- Estimate $p(\phi)$
- Compute $H[\phi] = -p(\phi)\log_2 p(\phi) - \left(1-p(\phi)\right)\log_2\left(1-p(\phi)\right)$

**Entropy is intuitive:** $H = 1$ means we know nothing; $H = 0$ means we know everything.
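A sketch of this estimate for $\phi = A \wedge (B \vee C)$ with independent fair-coin variables (the sampling setup is an illustrative assumption):

```python
import math
import random

def binary_entropy(p):
    """Entropy of a boolean with probability p of being true."""
    if p in (0.0, 1.0):
        return 0.0                      # outcome fully determined
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

rng = random.Random(42)
hits = 0
trials = 20000
for _ in range(trials):
    a, b, c = (rng.random() < 0.5 for _ in range(3))
    hits += a and (b or c)
p = hits / trials

# With independent fair coins, p(phi) = 1/2 * 3/4 = 0.375,
# so the entropy estimate lands near binary_entropy(0.375) ~ 0.95.
print(round(binary_entropy(p), 2))
```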

$$\mathcal I_{A \leftarrow \top} (\phi) = H\left[\phi\right] - H\left[\phi(A \leftarrow \top)\right]$$

**Information gain of** $v$: The reduction in entropy from knowing the truth value of a variable $v$.

$$\mathcal I_{A} (\phi) = \left(p(A)\cdot \mathcal I_{A\leftarrow \top}(\phi)\right) + \left(p(\neg A)\cdot \mathcal I_{A\leftarrow \bot}(\phi)\right)$$

**Expected information gain of** $v$: The probability-weighted average of the information gain for $v$ and $\neg v$.
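Worked concretely for $\phi = A \wedge (B \vee C)$ with independent fair-coin variables (the closed-form probability is an assumption made for the example):

```python
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# phi = A and (B or C), independent variables: closed-form p(phi).
def p_phi(p_a, p_b, p_c):
    return p_a * (p_b + p_c - p_b * p_c)

p_a = p_b = p_c = 0.5
h = binary_entropy(p_phi(p_a, p_b, p_c))

# I_{A<-T}: entropy drop after substituting A = True (phi becomes B or C);
# I_{A<-F}: after A = False, phi is False and the entropy drops to zero.
gain_true = h - binary_entropy(p_phi(1.0, p_b, p_c))
gain_false = h - binary_entropy(p_phi(0.0, p_b, p_c))

# Expected information gain: probability-weighted average of the two.
expected_gain_A = p_a * gain_true + (1 - p_a) * gain_false
print(round(expected_gain_A, 3))   # prints: 0.549
```

Learning $A = \bot$ settles $\phi$ entirely, which is why `gain_false` dominates the average.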

Combine Information Gain and Cost

$$f(\mathcal I_{A}(\phi), c_A)$$

**For example:** $EG2(\mathcal I_{A}(\phi), c_A) = \frac{2^{\mathcal I_{A}(\phi)} - 1}{c_A}$

**Greedy Algorithm:** Maximize $f(\mathcal I_{A}(\phi), c_A)$ at each step.
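A sketch of one ranking step; the gains and costs below are made-up numbers for illustration, not output of the actual system:

```python
# Rank variables by EG2 = (2^gain - 1) / cost and resolve the
# best-scoring one first.

gains = {'A': 0.549, 'B': 0.143, 'C': 0.143}   # expected info gains
costs = {'A': 2.0, 'B': 1.0, 'C': 3.0}         # per-variable costs

def eg2(gain, cost):
    return (2 ** gain - 1) / cost

scores = {v: eg2(gains[v], costs[v]) for v in gains}
best = max(scores, key=scores.get)
print(best)   # prints: A  (highest gain per unit cost)
```

Note how cost breaks ties: $B$ and $C$ have equal gain, but $B$ is cheaper and therefore outranks $C$.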

- Start with a large dataset.
- Delete random fields (~50%).

Simulate an analyst trying to manually explore correlations.

- Train a tree-classifier on the base data.
- Convert the decision tree to a query for all rows where the tree predicts a specific value.

**EG2:** Greedy Cost/Value Ordering

**NMETC:** Naive Minimal Expected Total Cost

**Random:** Completely Random Order


**UB**: Ying Yang, Niccolo Meneghetti, Arindam Nandi, Vinayak Karuppasamy

**Oracle**: Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick