The ODIn Lab - The Viewlet Transform (Part 3: Input Variables and Partial Materialization)

For the past few weeks, I've been discussing the viewlet transform. The key idea of this process is that because the delta transform is closed over AGCA (that is, it doesn't add any funny business to the query), it's possible to materialize and incrementally maintain the deltas of a query just as easily as we can maintain the original query. Because the deltas of a query are materialized as their own views, practically no processing is required to incrementally maintain the query; we just read from the delta view and update the original query accordingly.

This process continues recursively. The delta queries each have their own deltas -- we'll call these second-order deltas of the original query. The second-order deltas, of course, each have third-order deltas, and so forth. This continues, building up a hierarchy of auxiliary views, each used to efficiently maintain their parents in the hierarchy. Even though this hierarchy has many views in it, it's still typically possible to maintain them efficiently.

Of course, there are always corner cases. This week, I'm going to discuss one of them: Input Variables. Let's have a look at a relatively straightforward query:

Q := AggSum([], R(A,B) * S(C,D) * {A < C})

Or in SQL

SELECT COUNT(*) AS Q FROM R, S WHERE R.A < S.C

Innocuous as this query is, when we take its delta, we run into a problem. Let's take the delta with respect to R (the delta with respect to S is nearly the same).

dR(<X,Y>) Q := AggSum([], S(C,D) * {X < C})

In other words, whenever we insert a tuple (row) into R, we need to run the above query, substituting X and Y with values from the tuple being inserted into R. dR Q is certainly simpler, but there's a problem. Recall that special tables (like {A < C} or {X < C}) have an infinite number of rows. This was fine in the original query Q, because the query had terms that limited the number of distinct values of A and C that we were interested in. The special table might have had an infinite number of rows, but the overall query did not.

In the delta query however, there's no term limiting the number of distinct values of X that we're interested in. It's ok when we actually evaluate the delta query because we have a value of X (from the tuple being inserted into R), but until we get that value we can't actually compute a value for it. In other words, we can't store the results of the query.

There's actually a term for this in programming languages and query processing: X is known as an unbound or unsafe variable (or sometimes a range restricted variable). AGCA calls it an input variable (or parameter) of dR Q. We don't know what it is, so we can't evaluate the expression.

There are a few things we can do in this situation. If you're particularly familiar with query processing techniques, you might look at this query and say "But wait, we can actually materialize this using a range tree (or similar index structure)." And you'd be right. Of course, then I could give you a more complex query (e.g., replace the inequality with an arbitrary black box function f(A, B)) and we'd be right back where we started. For now, let's assume that it's simply impossible to materialize the entire expression in one go.

So what else is left? Well, if we can't materialize the entire thing, then what about materializing it in bits? We can create one or more views that can be stored efficiently, and then do some (but not all) of the heavy lifting afterwards, once we actually know what X is.

Let's make this a bit clearer and procedural. We have a query

AggSum([], S(C,D) * {X < C})

Now I'm going to introduce an extra little bit of syntax into AGCA. We'll call it the materialization operator M. Everything in the materialization operator is going to get materialized. Everything outside of the materialization operator is going to be evaluated when the query results need to be accessed. An AGCA query with a materialization operator in it is called a materialization decision.

We arrive at a final materialization decision by starting with the default (naive) decision where we materialize everything.

M(AggSum([], S(C,D) * {X < C}))

… and then iteratively refining the decision until we arrive at a satisfactory one. As I've been saying, a materialization decision with an input variable is not valid, so we need to rewrite it. Input variables only appear in these special tables (the ones in the curly braces), so the basic idea is actually pretty easy. We'll start by pushing the materialization operator inside the AggSum:

AggSum([], M(S(C, D) * {X < C}))

Now we're looking at a materialization decision applied to a product. We can split the materialization operator across the elements of the product so that only the parts without input variables get materialized

AggSum([], M(S(C,D)) * {X < C})

And we're there. This materialization decision is valid, but not quite as efficient as it could be.

Specifically, look at what we're storing: S(C,D). We care about the individual values of C (because they get applied to the predicate {X < C}), but D is never used, and will actually get aggregated away. We can save ourselves a little trouble when we need to evaluate the delta query by storing only an aggregated value. In other words, We can tack on an extra aggsum.

AggSum([], M(AggSum([C], S(C,D))) * {X < C})

Note that C is a group-by variable of this AggSum because it is needed by the predicate.

Alright. Hopefully that gives you a bit of the flavor of rewriting queries to support input variables. The details of this process are actually quite messy, but I'll see if I can cover them in detail next week. For the impatient, our VLDB paper "DBToaster: Higher-Order Delta Processing for Dynamic, Frequently Fresh Views" gives a reasonable overview of the process.