The ODIn Lab - Optimizing AGCA (Part 1: Ringing in the optimizations)

Although AGCA is designed primarily for incremental query evaluation, it is a fully fledged query language (albeit only for non-aggregate queries and certain kinds of aggregates). As such, it's useful to have a strategy for optimizing arbitrary query expressions. As it turns out, optimization is relevant, even in the incremental case, as it can often produce simpler expressions that are easier to incrementally maintain. Over the next few weeks, I'll discuss several techniques that we've developed for optimizing, simplifying, and generally reducing the cost of evaluating AGCA queries.

But before I get into any of that, let me quickly bring up one point that I've been glossing over, mostly as a point of convenience.

By default, AGCA expressions are evaluated as the English language is read: Left to Right

This has two consequences. First, ordering has an impact on query evaluation performance. We'll be returning to that before long. For now though, the important feature is that information flows left-to-right in AGCA as well. Specifically, consider the following expression:

R(A) * {A}

or in SQL

SELECT SUM(R.A) FROM R

In the SQL query, you can think of information as flowing from the R table to the SUM operator. This notion of information flow is pretty common in programming languages, and AGCA incorporates it as well. I mentioned the idea of binding patterns when I first introduced the special tables used for value expressions (i.e., {A}). The term R(A) binds the A variable, which is then used by the term {A}. In short, information is flowing through the product operation from R(A) to {A}. In AGCA, this information flow is always left to right. This is more than just a matter of convenience. It makes it possible to identify binding patterns in a single scan of the query, rather than an exponential search, which in turn makes many of the optimizations that I will discuss tractable.

Unfortunately, this also has the side effect of making certain expressions (sometimes) invalid. For example, the expression

{A} * R(A)

could not be evaluated in isolation. However, the expression

S(A) * {A} * R(A)

is be perfectly valid. In other words, AGCA's product operation is not generally commutative (A * B ≠ B * A). Many terms do commute (e.g., R(A) and S(A)), but commutativity is not always possible. Worse still, the commutativity of two terms can not be determined locally. As in the above example, {A} and R(A) in isolation do not commute. However, if the variable A is bound outside of those two terms, then commutativity is possible. That said, commutativity can be determined locally if the scope in which an expression is evaluated is known (and this information is typically available during optimization).

From now on, I'll be assume that commutativity can be easily determined.

As I present each of these rules, I'll briefly discuss the core ideas of each optimization, comment on how the rule interacts with both incremental and batch query evaluation, and then summarize the rules as a set of transformations over AGCA expressions.

Pre-evaluation

A relatively straightforward, practically braindead optimization that appears in nearly all compilers is constant folding. If an expression such as 1+2 is fed into the compiler, the compiler silently turns that into a constant 3. DBToaster does this, but there are several nuances that arise in the DBToaster case.

Firstly, as I've mentioned before, AGCA is a ring. Thus, the constants 1 and 0 have some useful properties. When an expression is multiplied by 1 or added to 0, the result is unchanged, so we can eliminate any appearance of {1} in a product, or {0} in a sum. Furthermore, whenever {0} appears in a product, the entire product term can be replaced by {0}. Although it might seem unlikely that such a query would ever arise, {0} terms actually appear quite often in the incremental processing case. The delta rule produces a {0} for a large number of different terms, and queries with nested subqueries (which I'll get to one of these days) often produce queries with a ({1} + {-1}) term.

A second nuance is value expressions themselves. Practically speaking, the following two expressions are identical

{A}+{2} == {A+2}

For a number of reasons, the latter form is considerably more efficient most of the time. I'll get into the details in two weeks when I talk about a materialization optimization called Hypergraph Factorization that I'll talk about next week, but essentially, we (almost) always want to put value terms together.

The Rules

{0} + Q => Q

{0} * Q => {0}

{1} * Q => Q

{X} + {Y} => {X + Y} (see caveat for Hypergraph Factorization)

{X} * {Y} => {X * Y} (see caveat for Hypergraph Factorization)

{f(X, Y, Z, …)} => {eval(f(X, Y, Z, …))}

Polynomial Factorization

One additional feature of rings is that the product operation distributes over union. In other words

A * (B + C) <=> (A * B) + (A * C)

This operation goes both ways, and we can take advantage of that to reduce our workload. Whenever we encounter an expression like (A * B) + (A * C), we can rewrite it as A * (B + C), and save ourselves a re-evaluation of A. This is particularly important for incremental processing, as the delta operation frequently produces expressions of this form (and better still, B and C are typically value terms that can be further optimized by pre-evaluation).

This optimization bears similarity to both a programming language technique called common subexpression elimination, as well as tradition arithmetic factorization (hence then name). That said, there's a nuance in factorizing AGCA expressions that doesn't arise in arithmetic factorization: commutativity. Let's say we have an expression of the form

(A * B) + (C * A)

If this were an arithmetic polynomial, clearly we could factorize out the A. Unfortunately, in AGCA, terms don't always commute. This leaves us with two possibilities: Either we can commute the A with the C and produce the following factorized expression:

A * (B + C)

Or we can commute the A with the B and produce the following factorized expression:

(B + C) * A

Note that in the latter case, the A appears after the factorized terms in the expression.

In general, factorization is a hard problem. In any given polynomial, there might be several terms that could be factorized out of the expression. For example

(A * B) + (A * C) + (D * B)

This expression can be factorized out to one of the two following expressions

(A * (B + C)) + (D* B)

((A + D) * B) + (A * C)

It's not always clear which of these will be more efficient to evaluate. A simple heuristic is to factorize out terms that are guaranteed to involve a cost (e.g., table terms), but a cost-based optimizer is typically the most effective. We'll get to that in a few weeks.

The Rules

(… * A * …) + (… * A * …) + … => A * (… * {1} * …) + (… * {1} * …) + …) (if A commutes to the head of each term)

(… * A * …) + (… * A * …) + … => (… * {1} * …) + (… * {1} * …) + …) * A (if A commutes to the tail of each term)

That's it for this week. Next week, I'll jump back to the optimized viewlet transform with Hypergraph Factorization.