The ODIn Lab - The Viewlet Transform (Part 2: The Naive Viewlet Transform)

Last week, I introduced you to how deltas work in AGCA. To recap

∂T(A,B,C,…)       = {+/- 1} * (A ^= {X}) * (A ^= {Y}) * (A ^= {Z}) * …

∂S(…)             = {0}

∂(Q1 + Q2)        = ∂Q1 + ∂Q2

∂(Q1 * Q2)        = (Q1 * ∂Q2) + (∂Q1 * Q2) + (∂Q1 * ∂Q2)

∂(AggSum([…], Q)) = AggSum([…], ∂Q)

∂({…})            = {0}

∂(V ^= {…})       = {0} (for the simplified Lift operation only)

Now, we've been down in the nitty gritty of AGCA for a while now. Let's pop our heads up for a moment to remember where we're going with all of this. We have a query (let's call it Q) and we want to be able to incrementally maintain it. That is, we want to store a copy of the results of evaluating Q on a database (stored on disk, in memory, anywhere really), and every time the database changes in some way, we want to update the stored results to match.

Applying the Delta Transform

That's fairly easy to do somewhat efficiently if we have these deltas. Let's say we have the query

Q := AggSum([], R(A,B) * S(B,C) * A)

or in SQL

SELECT SUM(A) AS Q FROM R, S WHERE R.B = S.B;

Let's say R contains

___A__B______#__

 < 1, 1 > -> 1

 < 1, 2 > -> 1

 < 2, 2 > -> 1

and S contains

___B__C______#__

 < 1, 1 > -> 2

 < 2, 2 > -> 1

For this data, Q = 5. Let's say we insert a new row: S(2,1). We could certainly re-evaluate the entire query from scratch and discover that the new result is Q = 8, but this would be pretty inefficient. Even in the best case, where everything fits in memory, re-evaluating the join requires O(|R| + |S|) work. That's where the deltas come in. The delta of Q tells us how the query results change with respect to a change in the table. So let's take the delta of Q with respect to an insertion of tuple <@Y,@Z> into S.

∂Q := ∂(AggSum([], R(A,B) * S(B,C) * {A})

   := AggSum([], ∂(R(A,B) * S(B,C) * {A}))

   := AggSum([], R(A,B) * ∂(S(B,C) * {A})

                 + ∂R(A,B) * (S(B,C) * {A})

                 + ∂R(A,B) * ∂(S(B,C) * {A}))

   := AggSum([], R(A,B) * ∂(S(B,C) * {A)

                 + {0} * (S(B,C) * {A})

                 + {0} * ∂(S(B,C) * {A}))

   := AggSum([], R(A,B) * ∂(S(B,C) * {A}))

   := AggSum([], R(A,B) * (S(B,C) * ∂{A}

                           + ∂S(B,C) * {A}

                           + ∂S(B,C) * ∂{A}))

   := AggSum([], R(A,B) * (S(B,C) * {0}

                           + ((B ^= {@Y}) * (C ^= {@Z}) * {A})

                           + ∂S(B,C) * {0}))

   := AggSum([], R(A,B) * (B ^= {@Y}) * (C ^= {@Z}) * {A})

I'm not going to get into the details of optimizing AGCA expressions (yet), but trust me for now that the following (simpler) query is equivalent

∂Q := AggSum([], R(A,@Y) * {A})

or in SQL

SELECT SUM(A) FROM R WHERE R.B = @Y;

Note, by the way, that @Y is a parameter to this delta query. When you evaluate a delta query (for example our delta for insertions into S), these parameters take their value from the tuple being modified (so when you insert <2,1> into S, then @Y = 2). That said, @Y is just a normal variable/column. There's nothing special about it (other than the @ in the name). The delta query tells us how the query results change. If we insert <2,1> into S, then we evaluate the delta query for insertions into S (∂Q above), setting @Y to 2, and @Z to 1.

AggSum([], R(A, 2) * {A})

… which, for our initial dataset above, gives us ∂Q = 3. To figure out what Q will give us on the modified database (after inserting <2,1> into S), we just add ∂Q to our initial result (5 + 3 = 8).

Parameters and AggSums

I keep saying that parameters just normal variables, and that there's nothing special about them. That's mostly true. I actually oversimplified a bit on the delta rules. We want these parameters to be visible from the outside so that evaluating ∂Q for a specific insertion (or deletion) essentially amounts to selecting a single row from the output of ∂Q. In other words, the AggSums need to be rewritten slightly so that the parameters appear in the group-by variables (where appropriate). That is, the correct delta with respect to an insertion into S : +S(@Y,@Z) is

AggSum([@Y], R(A, @Y) * {A})

or in SQL

SELECT R.B, SUM(R.A) FROM R GROUP BY R.B;

That little hiccup out of the way, let's get to the actual viewlet transform

Auxiliary Views

The delta query is an improvement over evaluating the entire query from scratch. For this particular example though, we still need to scan over multiple rows of R (even if it is only a small subset of R). We can do even better. Right now, every time the database changes we update Q with ∂Q.

ON +S(@X, @Y)

  Q += AggSum([@Y], R(A,@Y) * {A})

But recall that for any AGCA query Q, ∂Q is just an ordinary, simple, unexceptional AGCA query (no funny business is introduced by the ∂). If we can store ('materialize' to use the technical term) the results of Q, what's to stop us from storing the results of ∂Q? Nothing! Let's say we had another view materialized (let's call it M_S), this time with a group by variable:

M_S[Y] := AggSum([Y], R(A, Y) * {A})

For the initial dataset above, this would contain

___Y______#__

 < 1 > -> 1

 < 2 > -> 3

This view can help us substantially when we need to update Q after an insertion into S. Expressing this update as a trigger:

ON +S(@Y, @Z)

  Q += M_S[@Y]

In other words, to update Q, we just need to look up one row of M_S. The update can be done in constant time! That said, we now have an extra view that we need to maintain. Fortunately, M_S is simpler than Q, and has only one table, in this case R. Whenever R changes, we need to update M_S. Since M_S is defined in terms of a normal, ordinary AGCA expression, we update it in exactly the same way that we update Q, using the delta of M_S. For an insertion of <@X,@Y> into R, this would be:

∂M_S[Y] := AggSum([Y], R(A,Y) * {A})

        := AggSum([@X, @Y, Y], (A ^= {@X}) * (Y ^= {@Y}) * {A})

        := (Y ^= {@Y}) * {@X}

Or, expressed as a trigger

ON +R(@X, @Y)

  M_S[Y] += (Y ^= {@Y}) * {@X}

This can be simplified a bit, since the update operation only produces one row

ON +R(@X, @Y)

  M_S[@Y] += {@X}

(also a constant time operation) Recursive Deltas What's happening here is that we're saving the results of Q and maintaining them with (several instantiations of) ∂Q. The key idea of the viewlet transform is that we can also save the results of ∂Q and maintain them with (several instantiations of) ∂(∂Q). This process repeats recursively, giving us ∂(∂(∂Q)), ∂(∂(∂(∂Q))), and so on. Every time we add another ∂, another table drops out, making it the delta query simpler. After enough repetitions, we end up with a query that doesn't depend on the database at all (e.g., ∂M_S above). At this point, we can stop, since the query can be evaluated in constant time. This is the viewlet transform. Start by materializing the original query, and then alternate computing its delta(s), and recursively materializing the delta(s). I've obscured the issue a bit by not subscripting my ∂s, remember that each delta is taken with respect to a particular event. Q has four deltas: for both insertion and deletion of both R and S (∂+R, ∂-R, ∂+S, ∂-S). Similarly, M_S has two deltas: for both the insertion and deletion of R and S. The viewlet transform of Q produces five views: one for Q, and one for each of the first-tier deltas (the second-tier deltas are all constants). As it turns out, we can be even more efficient than that! Furthermore, the procedure I describe above runs into problems with special tables (as a bit of a teaser for this, take a close look at the delta of Q with respect to insertions into R). This naive viewlet transform is insufficient. Next week, I'll start discussing some changes to the naive viewlet transform that make it more practical and efficient.