The ODIn Lab - Minnowbrook / Aggregation Trees

As part of the price of attendance at Kris Micinski's Minnowbrook Datalog Seminar, he asked everyone to produce a blog post. It's now over two months since the seminar, and I'm out of excuses to produce something, so here goes a random thought that came out of a discussion with Thomas Gilray: Heaps are just specialized Aggregation Trees.

The legend of Max and the DBToaster

First, a bit of background. Our story starts about 14 years ago, with a system called DBToaster. DBToaster is what's called an incremental view maintenance system. That is, you give it one or more queries, and it produces some highly optimized code that will dynamically recompute the results of those queries as the inputs change. What makes the system incremental is that instead of recomputing each query from scratch, every time the input changes, it only needs to recompute how the result changes.

As a very simplified example, say we have a collection of integers (R = {| 1, 3, 2, 1 |}). The sum of this collection (SUM(R)) is 7. Now say we insert a new number into the collection: 3. We have two options:

Insert 3 into R and then compute SUM(R) = 1 + 3 + 2 + 1 + 3.
Compute the update SUM(R) += 3 and then insert 3 into R.

The latter operation is an option because addition (over the integers, reals, etc...) is a commutative, associative operation. It's also far far far faster, since we don't need to revisit each and every element of R to compute an updated value. The neat thing is that this still works when we try to delete an element. For example, if we want to remove the 2, we again have 2 options:

Remove the 2 from R and then compute SUM(R) = 1 + 3 + 1 + 3.
Computer the update SUM(R) -= 2 and then remove 2 from R.

This works because addition (over the integers, etc...) has an inverse operation. That is, for every A, there's a value -A such that A + B + (-A) = B. This is really neat, because it means that we can store only the sum, and don't need any other additional information to maintain SUM(R) as R changes. The problem is that, while this works for addition-based aggregates (e.g. SUM, COUNT, AVERAGE), not every aggregate has an inverse. For example, in general, given the value max(A, B), there's nothing you can combine via max to recover B after "removing" A.

The typical 'fix' for this problem is to maintain more information. Specifically, for max (or min) we could maintain the entire collection of integers in R so that we could remove individual elements. However, every time the underlying collection changes, we need to recompute the max value, something that, in general, would take us O(|R|) work (one 'step' per element in R). That is... not ideal.

The Heap

Fortunately, there's a data structure that works out well with max (and min): The Heap. A heap is a neat little tree-shaped data structure (there are some fancy ways of encding a heap into an array that I won't get into here). A heap stores a collection, with each node of the tree storing exactly one collection element (repetition allowed). There are a few quirks to how a heap is constructed, but the basic requirement is that the value stored at every node of the tree is greater than all of its descendents. For example, the following heap stores the collection {| 1, 1, 2, 2, 3, 5 |}:

Notice how every node's value is greater than that of the nodes below it. Now, the key trick of the heap is that:

You can always read out the greatest element of the collection in constant time by looking at the root of the tree.
If you remove an element from the collection, you can restore the heap in a logarithmic (O(log |R|)) number of steps.
You can insert an element into the heap in a logarithmic (O(log |R|)) number of steps.

Note that this is much better than having to perform a linear number of steps to update the maximum value naively.

Getting back to DBToaster, about a decade ago, I was thinking about how to add support for max, min, and more. Obviously we could use a heap structure to solve this problem, but heap only gets us max (and min). I was wracking my brain looking for ways to generalize the heap to other aggregates. It wasn't until a chat with Thomas Gilray that the solution hit me: Aggregation trees.

Aggregation Trees

An aggregation tree (which goes by other names as well) is a tree-shaped data structure that encodes a collection. In an aggreation tree, (i) every leaf node stores an element of the collection; and (ii) every inner node stores the aggregate value of every collection element it contains. Using our example heap collection, and the SUM aggregate, we might get the tree:

        14
      /    \
    10      \
   /   \     \
  3     7     4
 / \   / \   / \
1   2 2   5 3   1

As with the heap (assuming we use some balanced tree):

We can read out the overall aggregate value in a constant number of steps from the root.
We can insert new elements into the tree and update all inner nodes in a logarithmic number of steps.
We can remove elements from the tree and update all inner nodes in a logarithmic number of steps.

The Insight

Now let's see what happens when we use an aggregation tree with the max aggregate.

         5
      /    \
     5      \
   /   \     \
  2     5     3
 / \   / \   / \
1   2 2   5 3   1

If you look closely, each inner node stores an exact copy of one of the leaf nodes. This is a bit redundant. What if we just stored one copy of each at the point where the value is stored:

Hey! We're back to the heap!

That's it. That's the magical insight: A heap is just a highly specialized aggregation tree. To be precise, * An aggregation tree can be used for any commutative monoid * If the commutative monoid is a group, we only need the root element * If the commutative monoid is absorptive, we can use a heap.

Until next time!