The Viewlet Transform (Part 4: Input Variables and Partial Materialization Continued)

Last week, I talked about a variation on the core viewlet transform idea.  The delta operation introduces input variables into a query, which can not be properly materialized.  Often, these input variables can be eliminated through variable unification (something I'll start getting into in a week or two), but not always.  In these cases, it is necessary to materialize the delta query in parts.  

We do this by splitting a query Q into two (or more) parts Qmain, and Q1, Q2, … etc.  We materialize Q1, Q2, … etc, and then whenever we need to evaluate Q, we compute Qmain(Q1, Q2, …).  We express this partitioning by a special materialization operator M, and recur through the query expression to find the exact bits we can materialize.

Before I actually get into the partial materialization process, let me quickly introduce four bits of nomenclature regarding queries (that I'll define more thoroughly next week): inputs, outputs, scope, and schema.  The inputs and outputs of an expression are the unbound and bound (respectively) variables appearing in the expression.  The scope of an expression is the set of variables that are bound when the expression is evaluated.  The schema of an expression is the set of variables that an expression is expected to bind.  Note that while the inputs and outputs of an expression are uniquely identified by the expression, scope and schema are contextual, and can change depending on how the expression is evaluated.  Also note that any inputs must be in the scope, and that anything in the schema must be in the outputs of a query.

That said, we eliminate input variables through a recursive process that starts with a fully materialized query


and recursively descend into the expression using the following rules, until the expression chosen to be materialized (the materialization decision) has no inputs.


Materializing AggSums

M(AggSum([…], Q))

If we have an AggSum that we need to materialize, one of two things can happen.  If the AggSum has no inputs, we're done.  If the AggSum has input, then we need to recur, and push the materialization operator inside the AggSum. 

AggSum([…], M(Q))

As I mentioned last week, we can actually do better. As we push the materialization operator down into the AggSum, we keep track of the variables used by the AggSum.  Note that the schema of the query nested inside the AggSum (Q, that is) must be identical to the group-by variables of the AggSum.  As we push the materialization operator down into the query, we keep track of its schema.  When we finally settle on a location for the materialization operator, we look at both its schema and its outputs.  If there are more outputs than the schema calls for, we add an additional AggSum to trim the unnecessary outputs away.  

Materializing Relations, Value Expressions, and Comparison Predicates

M(R(A, B, …))
M({f(A, B, …)})
M({f(A, B, …) θ g(A, B, …)})

Relations never have inputs and are always materialized.  Value expressions and comparisons are exclusively inputs, and are never materialized alone (i.e., unless bound by a relation)

Materializing Unions and Joins

M(A + B + …)
M(A * B * …)

As with AggSums, unions or joins with no inputs are always materialized in their entirety.  For both unions and joins, we first partition the expression into a subset of the expression with no inputs (A, B, …), and multiple subsets with inputs (C, D, E, …).  We materialize the input-free bit as is, and recursively descend into the remaining components

M(A + B + …) + M(C) + M(D) + …
M(A * B * …) * M(C) * M(D) * …

There are a few caveats for materializing joins.  Specifically, there can be ordering constraints (which I'll get into next week) over the terms of a join (they don't quite commute).  It may sometimes be necessary to partition the expression into multiple subsets for materialization if there is a term (let's call it C) that must occur to the right of (A*B), and to the left of (D*E), then we would choose to materialize it as

M(A * B) * M(C) * M(D * E)

Materializing Lifts

M(A ^= {B})

A lift is tricky in that it involves both input and output variables.  Typically, the lift will get unified away (again, something I'll talk about in a few weeks).  Other times, it may be possible to include the lift in an input-free query if another relation binds the inputs of the lift (in which case it'll get caught by the Union/Join case).  In either case, if we've gotten to this point, the best we can do is to not materialize anything.

When I get into nested subqueries, and we start using lifts for more complex things, this rule will need to be changed.


Alright.  I know this week was short (and a bit cheap), but the CIDR deadline's this weekend.  Next week, I'll get back to some more interesting stuff, with optimization techniques for AGCA.  Until then, cheers.