IVM Ordering Constraints Aren't Transitive
This has been simmering for a few months now, but there's a really weird phenomenon that Victoria and I noticed about ordering constraints in distributed IVM that deserves to be mentioned somewhere (whilst we plot a full publication). In a distributed IVM system, "happens before" constraints are not transitive. A slightly less bombastic way of phrasing this might be that in a distributed IVM system, all transitive "happens before" constraints are already exposed as pairwise constraints.
Transactions
Let's start by being a bit clearer about what problem we're talking about. When reasoning about distributed transaction processing, it's customary to abstract away the precise nature of the data. Instead of talking about tables, rows, variables, etc., you simply think of abstract values ($A$, $B$, $C$, etc.). Similarly, each transaction $T$ represents some abstract computation; all we know about it is the set of values it reads (the "read set" $R(T)$) and the set of values it writes (the "write set" $W(T)$). Values are versioned, typically with some timestamp assigned to the transaction. So if $A$ is in $T_{42}$'s write set, after $T_{42}$ finishes, we have $A$ version 42, or $A_{42}$. If $A$ isn't in $T_{42}$'s write set, then it inherits its value from the prior version (i.e., $A_{42} = A_{41}$).
Typically, we think of transactions as running sequentially. Given some collection of transactions $T_1, \ldots, T_N$, we execute them one at a time. I'll write this as $$(A_{i+1}, B_{i+1}, \ldots) = T_{i+1}(A_i, B_i, \ldots)$$ Of course this is boring, and not very distributed... but it gives us a baseline for correctness that we call a serial ordering. We have the flexibility to execute the transactions however we want, as long as we can guarantee that we'll end up with the same final state $A_N, B_N, \ldots$.
The interesting thing to us today is parallel execution: That is, let's say that we have two transactions: $T_{41}$ and $T_{42}$. Normally, $T_{42}$ would be able to see all of the values written by $T_{41}$ (aka the "effects" of $T_{41}$). However, if we run the two in parallel, $T_{42}$ starts before $T_{41}$ has had a chance to do anything. That is, instead of computing: $$(A_{41}, B_{41}, \ldots) = T_{41}(A_{40}, B_{40}, \ldots)$$ $$(A_{42}, B_{42}, \ldots) = T_{42}(A_{41}, B_{41}, \ldots)$$ ... we instead compute (note the inputs to $T_{42}$): $$(A_{41}, B_{41}, \ldots) = T_{41}(A_{40}, B_{40}, \ldots)$$ $$(A_{42}, B_{42}, \ldots) = T_{42}(A_{40}, B_{40}, \ldots)$$ Here we're discarding the outputs of $T_{41}$, but we can fix that by copying values in the write set (i.e., $W(T_{41})$) over from version 41 to version 42 after the two transactions have finished. This gives us a final result (at version 42) that is equivalent to what we would have gotten in the serial order if two things are true:
- $W(T_{41}) \cap W(T_{42}) = \emptyset$ (neither transaction overwrites the effects of the other).
- $W(T_{41}) \cap R(T_{42}) = \emptyset$ (the earlier transaction doesn't write anything that the latter one would have read).

Usually, we don't care about the order in which transactions are executed relative to each other, so we also have one more dimension of freedom: We can reorder the transactions by swapping $T_{41}$ and $T_{42}$. This reordering relaxes the second condition: we're still fine if instead $W(T_{42}) \cap R(T_{41}) = \emptyset$. So... if the two transactions do not write to the same values, and at least one can be executed without reading the effects of the other, we're good to go.
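The two conditions (plus the reordering trick) can be sketched as a small check over read/write sets. The function name here is my own invention, not part of any real system:

```python
def can_run_in_parallel(r1, w1, r2, w2):
    """True if two transactions with read sets r1, r2 and write sets
    w1, w2 can safely run in parallel against the same snapshot."""
    if w1 & w2:
        return False  # write/write conflict: never safe
    # Serial order (T1, T2) works if T2 reads nothing T1 writes;
    # the swapped order works if T1 reads nothing T2 writes.
    return not (w1 & r2) or not (w2 & r1)

# T1: B += C (reads B, C; writes B).  T2: A += B (reads A, B; writes A).
print(can_run_in_parallel({"B", "C"}, {"B"}, {"A", "B"}, {"A"}))  # True
```

Note that the check is symmetric in the two transactions, since we're free to pick whichever serial order works.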
Let's see an example. Say that we have two transactions:
```python
from types import SimpleNamespace

def T1(state):
    state.B += state.C

def T2(state):
    state.A += state.B

state = SimpleNamespace(A=10, B=20, C=30)
```
Here T1 reads B and C and writes to B, while T2 reads A and B and writes to just A.
If we run these in parallel, T1 sees B = 20 and C = 30, and so produces a final state where B = 50.
Meanwhile, T2 sees A = 10 and B = 20, and so produces a final state where A = 30.
The write sets are disjoint, so we can combine the final states to get A = 30 and B = 50.
This is not the same as running the transactions in order, but is equivalent to running T1 on the output of T2, and so we're satisfied that we can safely run the two transactions in parallel.
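The parallel run above can be written out end-to-end (repeating the transaction definitions so the block is self-contained): each transaction gets a private snapshot of the starting state, and the disjoint write sets are merged afterwards.

```python
from types import SimpleNamespace

def T1(state):
    state.B += state.C

def T2(state):
    state.A += state.B

initial = SimpleNamespace(A=10, B=20, C=30)

# Each transaction runs against its own snapshot of the starting state.
snap1 = SimpleNamespace(**vars(initial))
snap2 = SimpleNamespace(**vars(initial))
T1(snap1)  # sees B = 20, C = 30 -> writes B = 50
T2(snap2)  # sees A = 10, B = 20 -> writes A = 30

# The write sets are disjoint, so copy each one into the final state.
final = SimpleNamespace(A=snap2.A, B=snap1.B, C=initial.C)
print(final.A, final.B, final.C)  # 30 50 30
```

Running T2 serially and then T1 on its output gives the same A = 30, B = 50, which is why we accept this result as serializable.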
If the read set of one transaction contains values in the write set of another, we say that the former (reading) transaction "happens before" the latter (writing) transaction, or sometimes that there is a conflict from the former to the latter. A conflict does not necessarily indicate a problem, but it does force us to think about the ordering of the two transactions when trying to figure out an equivalent serial order. This problem gets tricky if we introduce a third transaction:
```python
def T3(state):
    state.C += state.A
```
Now we have a problem: We can run any two of these transactions in parallel, but if we try to run all three in parallel, we end up with a situation where:
- $T_1$ happens before $T_3$ ($T_1$ reads $C$, which $T_3$ writes)
- $T_2$ happens before $T_1$ ($T_2$ reads $B$, which $T_1$ writes)
- $T_3$ happens before $T_2$ ($T_3$ reads $A$, which $T_2$ writes)

There is no way that we can order all three transactions that makes this happens-before relationship make sense. So... we cannot run all three of these transactions in parallel.
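The cycle falls out mechanically from the read/write sets. A small sketch that builds the happens-before edges (reader before writer) and follows them:

```python
# "u happens before v" whenever u reads a value that v writes.
reads  = {"T1": {"B", "C"}, "T2": {"A", "B"}, "T3": {"C", "A"}}
writes = {"T1": {"B"},      "T2": {"A"},      "T3": {"C"}}

edges = {(u, v)
         for u in reads for v in writes
         if u != v and reads[u] & writes[v]}
print(sorted(edges))  # [('T1', 'T3'), ('T2', 'T1'), ('T3', 'T2')]

# Each transaction here has exactly one successor, so we can just walk
# the edges; coming back to the start means no serial order exists.
succ = dict(edges)
node, seen = "T1", []
while node not in seen:
    seen.append(node)
    node = succ[node]
print(seen, "-> back to", node)  # ['T1', 'T3', 'T2'] -> back to T1
```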
IVM
For our purposes here, incremental view maintenance is a class of database workload where we react to a change in a database by updating a set of materialized views that depend on the updated tables. There's a ton of work on this type of workload, including our own DBToaster, and Draupnir, and I won't get into the details here. At a high level, each change to the input database triggers a transaction that (i) updates a materialized view, and (ii) updates any supplemental state. For example, let's say we want to maintain the query $Q = R \bowtie S \bowtie T$ with respect to a record newly inserted into $R$. We obviously need to maintain (and update) $Q$, but then might also want to keep a supplemental materialized view that precomputes $S \bowtie T$ to make it more efficient to compute updates to $Q$. So now, transactions triggered by updates to $R$ read from the supplemental view, and write to $Q$. Meanwhile, transactions triggered by updates to $S$ write to $Q$ as well as the supplemental view.
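As a toy illustration of the supplemental-view idea (the schemas and names here are entirely made up): take $R(a,b)$, $S(b,c)$, $T(c,d)$, joining on $b$ and $c$, and precompute $S \bowtie T$ indexed by $b$ so an insert into $R$ needs only one lookup.

```python
from collections import defaultdict

S = [("b1", "c1"), ("b1", "c2")]
T = [("c1", "d1")]

# Supplemental view: S join T, indexed by the join key b.
ST = defaultdict(list)
for (b, c) in S:
    for (c2, d) in T:
        if c == c2:
            ST[b].append((c, d))

Q = []  # the materialized view R join S join T

def on_insert_R(a, b):
    """React to a new R row: extend Q using the precomputed ST."""
    for (c, d) in ST[b]:
        Q.append((a, b, c, d))

on_insert_R("a1", "b1")
print(Q)  # [('a1', 'b1', 'c1', 'd1')]
```

An insert into $S$, by contrast, would have to update both `ST` and `Q`, which is exactly the read/write asymmetry described above.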
A second quirk that shows up in this setting: Updates are algebraic. That is to say, IVM systems like DBToaster update tables by producing collections of additive deltas. In other words, even if we have updates that come from two different transactions to the same resource, DBToaster and similar systems give us a way to merge those updates together. Thus, in this class of workload, we are only concerned with transactions that read values potentially written by another transaction (read-write conflicts).
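In other words, even when two transactions both touch the same view, their updates can be reconciled by addition. A minimal sketch, with deltas modeled as per-key dicts (the key names are hypothetical):

```python
def merge(d1, d2):
    """Merge two additive deltas by summing per-key changes."""
    out = dict(d1)
    for k, v in d2.items():
        out[k] = out.get(k, 0) + v
    return out

delta_from_r = {"Q[k1]": +2, "Q[k2]": -1}  # hypothetical delta from an update to R
delta_from_s = {"Q[k1]": +1, "Q[k3]": +4}  # hypothetical delta from an update to S

print(merge(delta_from_r, delta_from_s))
# {'Q[k1]': 3, 'Q[k2]': -1, 'Q[k3]': 4}
```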
The final feature of IVM workloads that we care about is that these transactions are, by design, order independent. The materialized view and all of the intermediate state are necessarily defined as a deterministic function over the raw data. Thus, after applying the effects of an update to R and an update to S, the order in which these two updates were applied really does not matter. It's critical that these updates be applied in a way that ensures consistency pairwise, but the final effects are order-independent.
To summarize, IVM workloads have the following properties:
- Writes can be applied in any order (write/write conflicts never happen)
- Transaction effects are order-independent
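The two properties above can be checked directly: applying two additive deltas to the same base state yields the same result in either order. (The state and delta contents here are invented for illustration.)

```python
def apply(state, delta):
    """Apply an additive delta to a state, key by key."""
    return {k: state.get(k, 0) + delta.get(k, 0)
            for k in state.keys() | delta.keys()}

base    = {"Q": 100}
delta_r = {"Q": +5}             # hypothetical delta from an update to R
delta_s = {"Q": -2, "ST": +1}   # an update to S touches Q and the view ST

one_order   = apply(apply(base, delta_r), delta_s)
other_order = apply(apply(base, delta_s), delta_r)
print(one_order == other_order)  # True
```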
The neat observation here is that any transactional workload with these properties can only ever have pairwise conflicts. Consider the following, with three state variables ($A$, $B$, $C$), each written by exactly one transaction (writing $x$ for components a transaction leaves untouched):
$$(A_3, x, x) = T_1(A_0, B_0, C_0)$$ $$(x, B_3, x) = T_2(A_0, B_0, C_0)$$ $$(x, x, C_3) = T_3(A_0, B_0, C_0)$$
So now we want to show that there is a way that we can order these that produces a result equivalent to a serial ordering over the transactions. In other words, if we apply the transactions in increasing order, we should get:
$$(A_3, x, x) = T_1(A_0, B_0, C_0)$$ $$(x, B_3, x) = T_2(T_1(A_0, B_0, C_0))$$ $$(x, x, C_3) = T_3(T_2(T_1(A_0, B_0, C_0)))$$
So first off, observe that $T_1$ and $T_2$ can be run safely in parallel, since $T_1$ reads values written by $T_2$, but not vice versa (i.e., $T_1$ "happens before" $T_2$).
$$(x, B_3, x) = T_2(T_1(A_0, B_0, C_0)) = T_2(A_0, B_0, C_0)$$
The neat thing is that we can take this a step further. Because the result is order independent, we can actually treat this final result as arising from the two operations in reverse order! That is, if we were to run $T_2$ and then $T_1$, we would also get exactly the same $A_3$, and $B_3$! So in other words, we get:
$$(x, x, C_3) = T_3(T_2(T_1(A_0, B_0, C_0))) = T_3(T_1(T_2(A_0, B_0, C_0)))$$
Now we're free to reorder $T_3$ with the other two operations, and the entire computation is legitimate!
Discussion
This feels a bit nonsensical: We're able to show that a set of operations produces a result equivalent to a serial order that cannot exist. Our suspicion (not yet proven) is that order-independent transactions, such as those you see with IVM, cannot create these sorts of cycles. That is, if any set of three transactions is going to have a conflict together, then we should be able to detect this conflict solely by examining them pairwise.
Even so, this result is neat! For any two transactions that conflict, we don't need to do expensive cycle detection to get a guaranteed optimal schedule! More to come...