The ODIn Lab - Optimizing AGCA (Part 3: Unification)

Last week we covered equality lifting, the first half of a two-part process for simplifying expressions. The second part is commonly known in PL circles as Unification. In some expressions, it's possible to eliminate a lift by inlining the expression being lifted into a variable.

For example, let's say you have the following expression:

AggSum([], (A ^= B) * A)

For all practical purposes, that lift doesn't need to be there. Instead, we can rewrite this expression as

AggSum([], B)

Much simpler (and to make things even better, we can get rid of the AggSum too, since the inner expression now has no output variables).

Also, keep in mind that if the expression being lifted has already been fully evaluated (down to a simple numeric value), unification might allow us to do even more evaluation down the line.

Fundamentally, that's all there is to this week's theme. Take lifts and propagate their values through the expression. Unfortunately, as with many things, the devil is in the details. There are a number of situations where unification is simply not possible, and some situations where it's possible, but only with a bit of a hack. So, let's get to it. What do you need to be aware of when unifying lifts in AGCA?

Syntactic Restrictions

Simple lifts like (A ^= B) can be unified anywhere. However, as you may have noticed, more complex expressions can appear on the right-hand side of a lift. For example, in the expression

AggSum([], (A ^= B+1) * R(A))

The syntax of AGCA doesn't allow us to write an expression like

AggSum([], R(B+1))

Admittedly, this is a somewhat trivial case, but as you see when we get to nested subqueries, there's a good reason for this.

Respecting the Scope and Schema of the Complete Expression

Recall the definitions of the scope and schema of an expression being evaluated. The scope is the set of variables that are already bound when the expression is evaluated, and the schema is the set of output variables that we're expecting the expression to bind. If the variable being lifted into appears in either the scope or the schema, it can not be unified. For example, in the expression

AggSum([], R(A) * ((A ^= B)+(A ^= C)))

We can't eliminate the lifts, because by the time we get to the two lifts, the variable A is in scope already. That said, we can do a little bit of rearrangement. For example, the expression

AggSum([], ((A ^= B) * R(A)) + ((A ^= C) * R(A)))

is a legitimate rewriting of the first that can be unified. I'll get into some of these rewritings next week, but most of them are really quite trivial. Perhaps more challenging is when the variable is in the schema of an expression. For example:

AggSum([A], R(B) * (A ^= B))

Now, in this case, we're not allowed to unify A away because it's part of the scope (A must appear in the output). Yet, there's still a possible simplification of this expression:

AggSum([A], R(A))

Note, by the way, that simply replacing the lift with an equality and relying on equality lifting to resolve the issue won't work, since A is already out of scope -- we're not allowed to replace it with an equality. Instead we need a special case to handle this. If the lifted expression is a simple variable that's not in the scope then we have a chance!

We start with the product expression that the lift is a part of. In this case:

(R(B) * (A ^= B))

From here, we backtrack, exactly like we do with equality lifting, until the lifted variable (B) falls out of scope, and then we can attempt to replace all instances of the lifted variable with the variable being lifted into (i.e., replacing all Bs with As).

Respecting the Scope and Schema in which the Complete Expression is Evaluated

Of course, even with these rewritings, it's possible that neither of these conditions will be satisfied due to external forces. When an AGCA expression is evaluated, the caller can provide an external scope, or an expected schema. The most trivial case of this is an expression like

(A ^= B)

If this is the entire expression being evaluated, the caller must (through external methods) provide a B. Similarly, the caller expects to read out a result containing a single column: A.

Because all of this is dependent on the caller, there's very little that we can do about this inside the AGCA framework. One technique that we've had success with is to use the standard transformations that I'll discuss next week to propagate all the lifts to the head (or as close to the head as possible) of the expression being evaluated, and then explicitly to pick out all the lift expressions that rename variables appearing in the schema.

For example, if we were preparing to evaluate the above expression with an externally defined scope containing only B, then we would note the presence of the rewriting (A ^= B) at the head of the expression. We would eliminate this lift from the expression, and replace every instance of B with A. Then, when evaluating the expression, we would bind A to the value that we would have previously bound to B.

And that's it for this week. Next week, I'll be going over several simple rewrite rules that allow us to minimize the use of AggSum, and other forms of nesting in AGCA.