The ODIn Lab - Those Marvelous Lifts and Exists (Part 2)

Last week, we started talking about using the Lift operation to express nested subqueries. I ended on a bit of a cliffhanger:

∂(X ^= A) = (X ^= A + ∂A) - (X ^= A)

There's something horribly wrong with this delta rule. The expression A appears intact, in its entirety in the delta rule (it actually appears not once, but twice). The delta of a lift is NOT simpler than the original.

Admittedly, for simple lifts, this isn't a problem. In particular, when ∂A = 0, then we get

∂(X ^= A) = (X ^= A) - (X ^= A) = 0

Which is, in fact simpler. But, once we start putting relation terms into the expression being lifted, we get something nasty. For example, let's say we wanted to compute the SQL query:

SELECT COUNT(*) FROM R WHERE (SELECT COUNT(*) FROM S) = R.A;

This translates to the following AGCA expression.

AggSum([], R(A) * (X ^= AggSum([], S(B))) * {X = A})

If we take the delta of this query with respect to the insertion S(1), we get:

AggSum([], R(A) * ( (X ^= AggSum([], S(B) + (B ^= 1))) - (X ^= AggSum([], S(B))) ) * {X = A})

Messy... and it really doesn't help us much. We could materialize this expression, but since the deltas aren't simpler, if we repeat the process recursively, we'll end up with an infinite number of materialized expressions. Not good.

We deal with this problem using partial materialization. First a little reorganization. The lifts commute with the relation term:

AggSum([], ( (X ^= AggSum([], S(B) + (B ^= 1))) - (X ^= AggSum([], S(B))) ) * R(A) * {X = A})

Now, rather than materializing the entire thing, we materialize the lifts separately. More precisely, rather than materializing the lifts, we materialize the expression being lifted.

AggSum([], ( (X ^= M(AggSum([], S(B) + (B ^= 1)))) - (X ^= M(AggSum([], S(B)))) ) * M(R(A) * {X = A}))

Remember our materialization operator M(). Let's call these new datastructures Q1[] (= AggSum([], S(B) + (B ^= 1)), Q2[] (= AggSum([], S(B))), and Q3[A] (= AggSum([A], R(A))). This gives us the expression

AggSum([], ((X ^= Q1[]) - (X ^= Q2[])) * Q3[A] * {X = A})

Of course, we can do better. If we were to applying the standard materialization optimization rules that we discussed several weeks ago to the expression AggSum([], S(B) + (B ^= 1)), we actually get two simpler expressions, one of which is constant, and the other of which is equivalent to Q2. Thus, our full materialization decision becomes

AggSum([], ((X ^= Q2[] + 1) - (X ^= Q2[])) * Q3[A] * {X = A})

And applying polynomial expansion, equality lifting and lift unification, we get the absolute simplest expression:

AggSum([], (X ^= Q2[] + 1) * Q3[X]) - AggSum([], (X ^= Q2[]) * Q3[X])

So that's it. Don't materialize deltas of lift expressions in their entirety. There are however, two corner cases that need to be considered. First, it's often more efficient to recompute the entire expression from scratch than it is to compute the delta. The precise definition of these cases is a bit subtle and nuanced, but basically, in any situation where there are no correlated variables (i.e., the example above), you're essentially computing the entire expression from scratch... twice. In these situations, it's entirely reasonable just to recompute the entire expression from scratch, but just once. If you make appropriate materialization decisions, it may still be possible to compute this in constant time.

Second, in some situations, it actually pays to materialize the delta along with the rest of the expression. For example, consider the query (note the inequality predicate):

SELECT COUNT(*) FROM R, T WHERE (SELECT COUNT(*) FROM S) < R.A AND R.C = T.C;

Or in its AGCA form:

AggSum([], R(A,C) * T(C) * (X ^= AggSum([], S(B))) * {X < A})

Consider the delta with respect to T(dC).

AggSum([], R(A,dC) * (X ^= AggSum([], S(B))) * {X < A})

You could materialize R and the S separately, but you'd end up needing to compute a full iteration over all of the elements of R (to evaluate the aggregate over an inequality predicate) on every insertion into T. Conversely, putting them together creates a new map that you need to maintain, but the new maps add only a constant factor cost to the time complexity of the existing maintenance costs.

Next week, I wrap up with my discussion of lifts, with some thoughts on a related operator (and the last operator in AGCA): The exists predicate.