The ODIn Lab - AGCA, The language of change (Part 4: The table special)

For several weeks now, I've been describing AGCA, a language for incremental processing.

In part 1, I covered the core idea behind AGCA: Instead of storing data as a list of rows, AGCA keeps only one copy of each unique row, and tags it with the number of times it appears. In other words, data in AGCA is a function that maps each row to the number of times that it appears in the data (the row's multiplicity).

In part 2, I showed you how AGCA handles unions (it sums multiplicities, so we call it +) and joins (it multiplies multiplicities, so we call it *).

Most recently, in part 3, I showed you how AGCA handles the COUNT(*) aggregate, and how COUNT(*) and projection are actually the same thing in AGCA. After columns are projected away, the results might end up containing duplicate rows -- so the multiplicities of these rows get added together to produce the final result.

This week, I plan to talk about the SUM() aggregate, and selection (filtering) in AGCA. Let's start with SUM(). An interesting thing to note about COUNT(*) is that it's a special instance of SUM(). Specifically

SELECT COUNT(*) FROM R;

is completely equivalent to

SELECT SUM(1) FROM R;

In other words, for every row that appears in the result, we add the numerical value '1' to the final result. Straightforward enough, right? But what if we wanted to add a more complex/interesting value to the result? Say we wanted to compute the following SQL query

SELECT SUM(2) FROM R;

In other words, for every copy of every row that appears in the result, we add the numerical value '2' to the final result. Just to recap, the AGCA for the COUNT(*) of R is:

AggSum([], R(A,B))

This query takes the multiplicity of every row of R, and sums them, giving us SUM(1) of R. In order to compute SUM(2), we'd have to multiply those multiplicities by 2.

As I pointed out in week 2, joins multiply multiplicities. So what we need is a special kind of table. We need a table with one row with a multiplicity of '2', which matches with every row of R. We need a table that looks like this:

________#__

 < > -> 2

Because there are no columns in this special table, it joins with every row in any table/query you can come up with. We can call this special table {2} when writing down AGCA queries:

R(A,B) * {2}

Which says "Give me two copies of every row of R". Incidentally, as I mentioned before, AGCA forms an algebraic structure called a ring. One of the properties of a ring is distributivity. That is,

R(A,B) * {2} = R(A,B) * ({1}+{1}) = R(A,B) + R(A,B)

Or in other words, two copies of everything in R is identical to the union of R with itself (which is, of course, true). So in summary, the AGCA query to compute the SUM(2) of R is

AggSum([], R(A,B) * {2})

What's cool about this is that one of these special tables can be created for any number that we want to sum up. AGCA is not just limited to positive integers, so there's nothing wrong with the AGCA query:

R(A,B) * {-1}

or even the query

R(A,B) * {3.14159265}

Let that sink in for a moment, because if this doesn't weird out out you're probably missing something. I've been talking about multiplicities as the number of times a row occurs in a data set. AGCA plays multiplicities a lot more fast and loose. An AGCA query result can include fractions of rows, or even rows with negative multiplicities -- a sort of anti-row:

R(A,B) + R(A,B) * {-1} = R(A,B) * ({1} + {-1}) = R(A,B) * {0}

Negative multiplicities will turn out to be super-useful for incremental maintenance, as you'll see.

Moving on, the ability to multiply multiplicities by constants is useful, but what we really need is the ability to compute sums with variables in them. For example:

SELECT SUM(A) FROM R;

In other words, for every copy of every row that appears in the result, we want to take the value of column 'A' and add it to the final result. AGCA actually has a special table for this as well. If we call that table {A}, the AGCA for SUM(A) of R is

AggSum([], R(A,B) * {A})

So what does {A} look like? Well, let's say we have some example data in R:

___A__B______#__

 < 1, 1 > -> 2

 < 2, 2 > -> 1

 < 2, 5 > -> 3

We want to turn this into:

___A__B______________#__

 < 1, 1 > -> 2 * 1 = 2

 < 2, 2 > -> 1 * 2 = 2

 < 2, 5 > -> 3 * 2 = 6

{A}, when joined with R(A,B) should multiply the rows where A is 1 by 1, where A is 2 by 2, and so forth… In other words, we want a table that can be joined with R(A,B) on A. A table with one row for every possible value of A in its 'A' column, and the same value as each row's multiplicity.

___A______#__

 < 1 > -> 1

 < 2 > -> 2

 < 3 > -> 3

...

Unlike the special table for constants, the special table for variables actually has an infinite number of rows in it. As I've mentioned before, technically, all AGCA queries produce an infinite number of results, but only a (relatively) smaller number of 'interesting' ones (in technical terms, the query results have 'finite support'). That's not the case with this kind of special table. There are an infinite number of rows. So a query like:

{A}

Would produce an infinite number of rows. Of course, such a query doesn't really make sense either. You're always going to join this kind of special table with a table that doesn't produce an infinite number of (interesting) rows, and that has 'A' as a column. When you do that, all but a few of the rows of {A} will get zeroed out. Try it yourself on the SUM(A) example if you're not convinced.

By the way, what I am describing is identical to a concept in programming languages/logic called bound and free variables (also known as safe/unsafe variables in query processing). 'A' is a free variable in {A}, and a bound variable in R(A,B). Just like in SQL, a query with any free/unsafe variables in it has to have an external assignment of values to those variables before it can be evaluated. More on that later.

On a related note, AGCA treats columns more as variables than anything else. For this reason, I'll be using the terms 'column' and 'variable' interchangeably from now on.

One last thing before I move on to selection. It's possible for AGCA to represent even more complex special tables. For example, if I wanted wanted to compute the following SQL query

SELECT SUM(exp(2, A) + 2*B) FROM R;

Any computable expression can be turned into one of these special tables, regardless of how many variables appear in it. To compute the the SQL query above I could use

AggSum([], R(A,B) * {exp(2, A) + 2 * B})

Which is equivalent to

AggSum([], R(A,B) * {exp(2, A)} + {2 * B}) = AggSum([], R(A,B) * {exp(2, A)} + {2} * {B})

Regardless, every variable that appears in the formula defining a special table will be one of the special table's columns.

AGCA uses the very same idea to implement selection (filtering). Let's say we wanted to compute:

SELECT COUNT(*) FROM R WHERE A < B;

In other words, we want to filter out some rows of R (the ones where A >= B), and keep others (where A < B). Again, AGCA deals entirely in multiplicities. Filtering out a row is equivalent to setting its multiplicity to 0. Keeping a row means leaving its multiplicity unchanged.

___A__B______________#__

 < 1, 1 > -> 2 * 0 = 0

 < 2, 2 > -> 1 * 0 = 0

 < 2, 5 > -> 3 * 1 = 3

Just like a special table exists for any expression, any boolean predicate can be turned into a special table where each row has a multiplicity of either 0 or 1. For example, for A < B:

___A__B______#__

 < 1, 1 > -> 0

 < 1, 2 > -> 1

 < 1, 3 > -> 1

...

 < 2, 1 > -> 0

 < 2, 2 > -> 0

 < 2, 3 > -> 1

...

And there you have it. Computing the filtered count of R is as simple as

AggSum([], R(A,B) * {A < B})

Note that this makes the special tables {1} and {0} equivalent to the booleans TRUE and FALSE respectively. Also note that while * is equivalent to a boolean AND, + is not equivalent to a boolean OR ({1} + {1} = {2}). More on how we handle this later.

With that, I've covered all of the basics of AGCA. Next week, a quick reference summary of everything that I've covered so far, and the week after that, I'll dive into the viewlet transform itself.

By the by, I'm ignoring two features of AGCA for now, one needed to support nested aggregates, and one needed to support existential quantification (as well as certain kinds of nested aggregates). I'll get to these once I've covered the viewlet transform and can better describe the challenges that nested aggregates create.

And to anyone who reads this before tomorrow: Keep an eye on http://www.dbtoaster.org for the official DBToaster release on July 9.