The ODIn Lab - The Viewlet Transform (Part 5: Hypergraph Partitioning)

I've been talking for several weeks now about tools and techniques related to AGCA and the viewlet transform. Most recently, I've been talking about optimization techniques for AGCA, but I'm going to take a quick detour this week and provide a quick overview of another technique: Hypergraph Partitioning. In general, this technique is most suited for optimizing the materialization process, but there are applications to the optimization of aggregate computations as well.

The Query Hypergraph

Before I get into the technique though, we need to discuss an alternate representation of AGCA expressions (one that's actually used pretty frequently in query optimization): the query hypergraph (basically a graph where an edge can connect any number of nodes. This kind of hypergraph can be created for any product of terms (in the trivial case, we have a product of just one term). Each node in the hypergraph is a variable/column of the query (both output and input variables are treated identically for this purpose). Each hyperedge corresponds to one term in the product, and each edge connects all variables that appear in the term corresponding to the edge (regardless of whether they appear as inputs or outputs).

Hypergraph Partitioning

Remember that the product operator corresponds to the natural join (and that comparisons are implemented as relations). As a consequence, any disconnected components in the graph effectively correspond to cross products (a natural join with no shared columns). For example, consider the following trivial example.

R(A) * S(B)

R(A) is a hyperedge touching only A. S(B) is a hyperedge touching only B. Thus A and B are separate disconnected components. Note, by the way, that there are no comparisons between A and B in this query. This product is a pure cartesian cross-product. The following query would not be:

R(A) * S(B) * {A < B}

In this query, the term { A < B } connects both A and B.

Now, if we have disconnected components, it typically pays to materialize them separately. For example, going with R and S above, we could materialize them as

M( R(A) * S(B) )

But now we have to store |R| * |S| entries (where |R| is the number of tuples in R). Worse, if we need to update the materialized view, it will cost us |S| after an update to R, and |R| after an update to S. On the other hand, we could materialize as

M(R(A)) * M(S(B))

Now we only store |R| + |S| tuples (between the two materialized views), and updating either can be done in constant time. Better still, we lose nothing with this representation. It costs us O(|R|*|S|) to iterate over every element of either materialization of the expression.

You might say that this is a crazy corner case -- people almost never compute cross products. That's usually true, but in DBToaster, this situation crops up quite frequently. For example, consider the three way join query:

R(A) * S(A,B) * T(B)

The (optimized) delta of this query with respect to +S(dA, dB) is

R(dA) * T(dB)

Because each delta essentially removes a hyperedge in the query hypergraph, partitioned components are created extremely frequently.

Partitioning and Trigger Parameters

There's also one more situation where this is beneficial. Consider the following query.

R(A) * S(A) * T(A)

And its delta with respect to the insertion +S(dA)

R(dA) * T(dA)

Even though dA is touched by both R and T, we lose nothing if we materialize them separately (as before, evaluation is O(1) either way), and materializing them separately results in more efficient maintenance. In this case, dA is a trigger parameter -- one of the variables drawn from the relation being modified. These trigger parameter variables can be excluded from the query hyper graph.

Applications to Query Optimization

In general, when computing aggregates, hypergraph partitioning can be used to select a more efficient computation order. Each materialized component gets scanned independently, and the resulting aggregate can be computed.

And that's about it for now. Next week, we return to AGCA optimization with a discussion of the interplay between equality and lifts, and how to optimize expressions of this form.