April 26, 2019
Why does a tuple appear in a query output?
The answer can be summarized by a polynomial.
|
|
$Q(1) = ae + af + bg$
$Q(1) = ae + af + bg$
$a = b = e = f = g = T \rightarrow$ $T$
$Q(1) = ae + af + bg$
$\{a = 3; b = 1; e = 1; f = 1; g = 3 \} \rightarrow$ $12$
$Q(1) = ae + af + bg$
$\{a = t_1; b = t_2; e = t_5; f = t_6; g = t_7 \}\rightarrow$
$\{\{t_1, t_5\}, \{t_1, t_6\}, \{t_2, t_7\}\}$
With $\left< [0, 5], \min, \max, 0, 5 \right>$
(0 = minimum security, 5 = top secret)
$Q(1) = ae + af + bg$
$\{a = 5; b = 1; e = 2; f = 3; g = 2 \}\rightarrow$$2$
Design | GProM | SMOKE |
---|---|---|
Representation | Relational | Specialized |
Implementation | Rewriting | Instrumentation |
Generation | Lazy + Eager | Eager + Hybrid |
|
|
|
|
Extra columns store annotations
PROVENANCE OF SELECT A FROM R NATURAL JOIN S
We'll be using bag-relational algebra.
$R \rightarrow \pi_{R.*, \phi \leftarrow \texttt{ROWID}}(R)$
$\pi_A(R) \rightarrow ?$
$\pi_{A, \phi}(R)$
$\sigma_c(R) \rightarrow ?$
$\sigma_c(R)$
$R \times S \rightarrow ?$
$\rho_{\phi \rightarrow \phi_1} (R) \times \rho_{\phi \rightarrow \phi_2} (S)$
|
|
$R \bowtie S$ | A | B | C | $\phi_1$ | $\phi_2$ |
---|---|---|---|---|---|
1 | 2 | 1 | $R_1$ | $S_1$ | |
1 | 2 | 2 | $R_1$ | $S_2$ | |
1 | 3 | 3 | $R_2$ | $S_3$ | |
2 | 3 | 3 | $R_3$ | $S_3$ |
$\pi_A(R) \rightarrow \pi_{A, \phi_1, \ldots, \phi_N}(R)$
$R \times S \rightarrow ?$
$R$ (resp., $S$) has $M$ (resp., $N$) annotation columns.
$R \times \rho_{\phi_{1} \rightarrow \phi_{M+1}, \ldots, \phi_{N} \rightarrow \phi_{M+N}} (S)$
(Rename $S$'s columns, appending to $R$'s)
$R \uplus S \rightarrow ?$
If $R$ has $M$ annotation columns
and $S$ has $N > M$ columns:
$\pi_{R.*, \phi_{M+1} \leftarrow \texttt{NULL}, \ldots, \phi_{N} \leftarrow \texttt{NULL}} (R) \uplus S$
(Pad the narrower relation with $\texttt{NULL}$s)
|
| ||||||||||||||||||||||||||||||||||||
|
$(R \bowtie S) \uplus T$ | A | B | C | $\phi_1$ | $\phi_2$ |
---|---|---|---|---|---|
1 | 2 | 1 | $R_1$ | $S_1$ | |
1 | 2 | 2 | $R_1$ | $S_2$ | |
1 | 3 | 3 | $R_2$ | $S_3$ | |
2 | 3 | 3 | $R_3$ | $S_3$ | |
2 | 1 | 4 | $T_1$ | NULL | |
4 | 5 | 5 | $T_2$ | NULL |
Many-to-one operators
Idea: Construct provenance by combining multiple rows.
$Q_1$ | A | B | C | $\phi_1$ | $\phi_2$ |
---|---|---|---|---|---|
1 | 2 | 1 | $R_1$ | $S_1$ | |
1 | 2 | 2 | $R_1$ | $S_2$ | |
1 | 3 | 3 | $R_2$ | $S_3$ | |
2 | 3 | 3 | $R_3$ | $S_3$ | |
2 | 1 | 4 | $T_1$ | NULL | |
4 | 5 | 5 | $T_2$ | NULL |
$\delta\big(\pi_{A}(Q_1)\big)$?
$\delta\big(\pi_{A}(Q_1)\big)$ | A | $\phi_1$ | $\phi_2$ |
---|---|---|---|
1 | $R_1$ | $S_1$ | |
1 | $R_1$ | $S_2$ | |
1 | $R_2$ | $S_3$ | |
2 | $R_3$ | $S_3$ | |
2 | $T_1$ | NULL | |
4 | $T_2$ | NULL |
$\delta R \rightarrow ?$
$R$
$_A\gamma_{SUM(B)}(R) \rightarrow ?$
$\big(_A\gamma_{SUM(B)}(R)\big) \bowtie_{A} \big( \pi_{A, \phi_1, \ldots, \phi_{M}} R\big)$
$_A\gamma_{COUNT}(Q_1)$ (lhs only) | A | COUNT |
---|---|---|
1 | 3 | |
2 | 2 | |
4 | 1 |
$_A\gamma_{COUNT}(Q_1)$ | A | COUNT | $\phi_1$ | $\phi_2$ |
---|---|---|---|---|
1 | 3 | $R_1$ | $S_1$ | |
1 | 3 | $R_1$ | $S_2$ | |
1 | 3 | $R_2$ | $S_3$ | |
2 | 2 | $R_3$ | $S_3$ | |
2 | 2 | $T_1$ | NULL | |
4 | 1 | $T_2$ | NULL |
Does this work? ... not quite
Is it a true provenance polynomial?
Plug in "F" for $R_1$ (i.e., simulate removing $R_1$ from the input)
$Q_1'$ | A | B | C | |
---|---|---|---|---|
1 | 2 | 1 | → $R_1 \wedge S_1$ | |
1 | 2 | 2 | → $R_1 \wedge S_2$ | |
1 | 3 | 3 | → $R_2 \wedge S_3$ | |
2 | 3 | 3 | → $R_3 \wedge S_3$ | |
2 | 1 | 4 | → $T_1$ | |
4 | 5 | 5 | → $T_2$ |
Plug in "F" for $R_1$ (i.e., simulate removing $R_1$ from the input)
$Q_1'$ | A | B | C | |
---|---|---|---|---|
1 | 2 | 1 | → $F \wedge S_1$ | |
1 | 2 | 2 | → $F \wedge S_2$ | |
1 | 3 | 3 | → $T \wedge S_3$ | |
2 | 3 | 3 | → $T \wedge S_3$ | |
2 | 1 | 4 | → $T$ | |
4 | 5 | 5 | → $T$ |
Plug in "F" for $R_1$ (i.e., simulate removing $R_1$ from the input)
$Q_1'$ | A | B | C | |
---|---|---|---|---|
1 | 2 | 1 | → $F$ | |
1 | 2 | 2 | → $F$ | |
1 | 3 | 3 | → $T$ | |
2 | 3 | 3 | → $T$ | |
2 | 1 | 4 | → $T$ | |
4 | 5 | 5 | → $T$ |
$Q_1'$ | A | B | C | $\phi_1$ | $\phi_2$ |
---|---|---|---|---|---|
1 | 3 | 3 | $R_2$ | $S_3$ | |
2 | 3 | 3 | $R_3$ | $S_3$ | |
2 | 1 | 4 | $T_1$ | NULL | |
4 | 5 | 5 | $T_2$ | NULL |
$_A\gamma_{COUNT}(Q_1')$ | A | COUNT | ||
---|---|---|---|---|
1 | 1 | $R_2$ | $S_3$ | |
2 | 2 | $R_3$ | $S_3$ | |
2 | 2 | $T_1$ | NULL | |
4 | 1 | $T_2$ | NULL |
vs
$_A\gamma_{COUNT}(Q_1)$ | A | COUNT | ||
---|---|---|---|---|
1 | 3 | $R_2$ | $S_3$ | |
2 | 2 | $R_3$ | $S_3$ | |
2 | 2 | $T_1$ | NULL | |
4 | 1 | $T_2$ | NULL |
Not quite a provenance polynomial...
...but still correct for lineage queries
(and there are other solutions)
Pro: Standardized, relational representation
Con: Huge blow-up in data size (aggregation/distinct)
... but it works better for lazy evaluation
SELECT DISTINCT A, COUNT
FROM (PROVENANCE OF Q_1)
WHERE phi_1 = 'R_1'
Observation 1: predicate on $\phi$ can be pushed down.
Observation 2: DISTINCT
would be a no-op (group-by attributes are always a key) if not for provenance additions.
This query can be made to run very fast!
The power of the relational representation is that it can be queried and optimized using an existing database. Everything is just a query.
Goal 1: Find the input tuples that produced a given output tuple (backward query).
Goal 2: Find the output tuples that result from a given input tuple (forward query).
Combine for interactive visualizations
Basic Idea: Each operator emits two structures:
$\textbf{id}_{in} \rightarrow \{ \textbf{id}_{out} \}$ and $\textbf{id}_{out} \rightarrow \{ \textbf{id}_{in} \}$
A "Forward index" and a "Backward index"
Again, we'll be using bag-relational algebra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
$_A\gamma_{COUNT}\big((R \bowtie S) \bowtie T\big) \rightarrow \\ \textbf{fwd}_{\bowtie_1}, \textbf{fwd}_{\bowtie_2}, \textbf{fwd}_{\gamma}, \textbf{back}_{\bowtie_1}, \textbf{back}_{\bowtie_2}, \textbf{back}_{\gamma}$
What tuples went in to $\left< A: 1, COUNT: 3 \right>$
(output tuple 1)?
$$\pi_{\textbf{back}_{\bowtie_2}.in}\big(\big( (\textbf{back}_{\gamma} \bowtie_{\textbf{back}_{\gamma}.in = \textbf{back}_{\bowtie_2}.out} \textbf{back}_{\bowtie_2})$$ $$\bowtie_{\textbf{back}_{\bowtie_2}.in = \textbf{back}_{\bowtie_1}.out} \textbf{back}_{\bowtie_1}\big)\big)$$