CSE-4/562 Spring 2019 - Provenance

Provenance

CSE-4/562 Spring 2019

April 24, 2019

Textbook: Readings Only
  • Why is this tuple in my query result?
  • Why is this tuple not in my query result?
  • Which datasets were used to create this value?
  • How does this input affect my query output?

Provenance

How does the input data relate to a query output.

Types of Provenance

Why Provenance (Lineage)
What's the smallest fragment of my input needed to produce some row
Why-Not Provenance
What's the least I can add to my input to get a desired row
How Provenance
An execution trace of the result; How were the tuples combined?
Where Provenance
Which cell(s) was a given output value taken from
Taint
Was the output affected by any "tainted" input cell/row
RAB
112
213
323
424
SBC
125
226
336

Why is $\left<1\right>$ in $\pi_A (R \bowtie S)$?

RAB
112
SBC
125

... but that's not the only reason

RAB
112
213
323
424
SBC
125
226
336

Why is $\left<1\right>$ in $\pi_A (R \bowtie S)$?

$\left\{ R_1, S_1 \right\}$, $\left\{ R_1, S_2 \right\}$, $\left\{ R_2, S_3 \right\}$

Witness: Any subset of the original database that still produces the same result.
(Generally we only want 'minimal' witnesses)

RAB
112
213
323
424
SBC
125
226
336

How is $\left<1\right>$ derived in $\pi_A (R \bowtie S)$?

$(R_1 \bowtie S_1) \oplus (R_1 \bowtie S_2) \oplus (R_2 \bowtie S_3)$

Outline

  • Datalog
  • Reasoning about provenance
  • Implementing provenance queries
  • Cool things to do with provenance

Datalog

[head] :- [body]

$$Q(A) :-~~ R(A, B), S(B, C)$$

like SELECT A FROM R NATURAL JOIN S

A
Head Variable (appears in the head and body)
B, C
Existential Variables (appear only in the body)

Stop thinking about relations as collections of records, and instead think of them as collections of facts

RAB
112
213
323
424

The fact $R(1, 2)$ is true.

The fact $R(2, 1)$ is false (or unknown).

A table contains all facts that are provably true.

$$Q(A) :-~~ R(A, B), S(B, C)$$

For any $A$, the fact $Q(A)$ is true if...
  • there is some $B$ and $C$ for which...
  • the fact $R(A, B)$ is true, and...
  • the fact $S(B, C)$ is true.

$\forall A : \big( \exists B, C : R(A, B) \wedge S(B, C) \big) \rightarrow Q(A)$

$$Q(A) :-~~ R(A, B), S(B, C)$$ $$Q(A) :-~~ R(A, B), R(B, C)$$

Treat multiple rules as a disjunction.
($Q(A)$ is true if any rule is satisfied)

As powerful as Set-RA

Projection
$Q := \pi_A(R)$
$Q(A) :-~~ R(A, \ldots)$
Union
$Q := R \cup S$
$Q(\ldots) :-~~ R(\ldots)$
$Q(\ldots) :-~~ S(\ldots)$
Join
$Q := R \bowtie S$
$Q(\ldots) :-~~ R(\ldots), S(\ldots)$
Selection (Equality)
$Q := \sigma_{R.A = R.B}(R)$
$Q(A) :-~~ R(A, A)$
Selection (Equality')
$Q := \sigma_{R.A = 1}(R)$
$Q(B) :-~~ R(1, B)$
Selection (Other)
$Q := \sigma_{A > B}(R)$
$Q(A,B) :-~~ R(A, B), [[ A > B ]]$
$[[ A > B ]]$AB
10
20
30
...
21
...

Relations are Sets of Facts. We can have a relation consisting of all pairs $A, B$ where $A$ is bigger.

A Finite Relation
... declares a finite number of true facts
An Infinite Relation
... declares an infinite number of true facts

Safety Property: Every variable must appear in at least one finite relation in a rule body.

Recursion

Recursive datalog: The body can reference the head atom

$$Q(A, B) :-~~ R(A, B)$$ $$Q(A, C) :-~~ Q(A, B), R(B, C)$$

(~Dijkstra's algorithm)

Datalog, Top-Down

$$Q(A, C) :-~~ R(A, B), S(B, C)$$

... is like a very large number of queries with no head variables

$$Q_{1, 1}() :-~~ R(1, B), S(B, 1)$$ $$Q_{1, 2}() :-~~ R(1, B), S(B, 2)$$ $$Q_{1, 3}() :-~~ R(1, B), S(B, 3)$$

...

The fact $Q(1, 1)$ is true if $\exists B : R(1, B) \wedge S(B, 1)$

Think of the relation as a function from potential facts to their truthiness.

RAB
112→ T
213→ T
323→ T
424→ T
511→ F
6...→ F

Every row not explicitly listed is mapped to False