The ODIn Lab - Optimizing AGCA (Part 2: Lifting Equalities)

I'm going to turn, this week, back to optimization of AGCA expressions, and in particular, one pair of optimizations that combine to substantially simplify AGCA expressions: Lifting Equalities, and Equality Unification.

Recall the four sets of variables that we work with when evaluating any AGCA expression:

Scope variables are variables that are bound (assigned values) by the time the AGCA expression is evaluated (either earlier in the expression, or outside of it).
Schema variables are variables that something outside of the expression being evaluated expects to be bound by this expression.
Input variables are variables that are not bound in the expression we're evaluating (every input variable must be in the scope when the expression is evaluated, but not every variable in the scope must be an input variable)
Output variables are variables that are bound in the expression we're evaluating (when evaluating the expression, every schema variable must be an output variable; if an output variable is in the scope, the expression is treated as a lookup or join)

Now, note that because any output variable may be in the scope when an expression is evaluated, the following three expressions are more/less equivalent. All three have the same input and output variables, and react identically to scope/schema changes from the outside.

R(A,B) * {A = B}

R(A,B) * (A ^= {B})

R(A,B) * (B ^= {A})

Note, by the way, that this is only possible due to the R(A,B). In the following lift operation, B is an input variable and A is an output variable.

(A ^= {B})

If we were to look at only the lift/comparison operation (without anything that binds both A and B), then A and B would be input variables for the equality comparison, and one of them would be an output variable in either lift. In other words, the only difference between these three is which variables are bound and when.

Now, in general (and in one specific way that you'll see momentarilly), output variables are good. We like output variables, so when it comes to equality predicates, we want to transform them to lifts whenever possible.

Let's look at an example:

R(A) * S(B) * {A = B}

This expression is a simple, straightforward equi-join, but is somewhat inefficient. For every row of R, we'll loop over every row of S, and then pick out only the pairs of A x B where the two variables are identical. In other words, this is effectively a nested loop join. Now, consider, consider the following (equivalent) expression:

R(A) * (B ^= {A}) * S(B)

We've gotten rid of the nested loop. Now, every row in R is extended with a new column B with the same value as the A column, and we do a lookup on that one row of B. This is effectively a hash join (assuming we have a hash index already built over S).

Equality Lifting

So how do we generalize from this example? Let's start with products of simple (relation, comparison, arithmetic expression, and lift) terms. Our goal is to get an expression Q of this type into the form

Q := X * Y * {A = B}

Where X and Y are both individual expressions with the following properties (this example works just as well if you swap A and B):

A is bound in X, and may also be bound in Y
B is bound in Y but not X
(and for reasons that will become apparent next week) If possible, B should not be in the schema with which we evaluate Q.

We can replace this equality constraint with a lift in either direction: (A ^= {B}) or (B ^= {A}), but given the first two constraints, it makes the most sense to replace it with (B ^= {A}). That way, we can commute the lift all the way to the left and get

Q := X * (B ^= {A}) * Y

As it turns out, working with simple products is not all that restrictive. We can treat all the remaining operators as if they were simple terms, and use a handful of other transformations (that I'll get to in two weeks), to deal with the nesting structures inside them (e.g., each term in a sum, or the expression being aggregated). In other words, this is pretty much the algorithm. Partition (if possible) each expression into two independent subexpressions that each bind one of the variables on either side of the equality, and then substitute the equality with the relevant lift term.

One other note: When dealing with an equality comparison with a more complicated expression in it, you might have additional restrictions on what you can lift. For example, you might have:

R(A) * S(B,C) * {A = B * C}

In which case, you could only substitute in (A ^= {B * C}). Fortunately, in this case, we can also commute the earlier terms in the expression into an amenable form:

S(B, C) * (A ^= {B * C}) * R(A)

Alrighty. Next week, we cover an optimization designed to interact with equality lifting: Unification.