The ODIn Lab - AGCA, The language of change (Part 1: Everything is Multiplicities)

Before I can go into detail on the viewlet transform, I first need to talk about a language called the AGgregate CAlculus. Over the coming weeks, I'll try to give a high-level overview of the language from a practical perspective, and hopefully give you some insight into why it is important.

Although the name "Aggregate Calculus" might sound imposing, AGCA is actually very close to SQL. Its key feature (one of the reasons that it is crucial for the viewlet transform) is that it separates values that can be maintained incrementally from those that cannot. For reasons that will become clear, we'll call incrementally maintainable values multiplicities, and non-incrementally maintainable values variables.

The core mantra… the one thing that everyone who I've ever tried to teach AGCA to has struggled to wrap their head around (at first, anyway), is that Everything is Multiplicities. Remember this phrase. Everything is Multiplicities.

I'm being a little overly general. If I wanted to be precise, I'd say that "Everything is a mapping from tuples of variables to multiplicities." That's a bit of a mouthful (and I like short catch phrases), so let's stick with Everything is Multiplicities.

And just to be sure you're following along: Everything is Multiplicities.

So, how do we write queries in AGCA? Well, to start, we need some way to refer to our input data. In spite of trends in the corporate world, AGCA works with relational data. So, all all of our inputs will be tables (or if you want to get fancy, "relations"). If we want to refer to a table, we write down its name and give each of its columns a name. For example, if we write down:

R(A,B,…)

We mean "all of the rows (or tuples) of R, and we'll name first column of R 'A', name the second column 'B', and so forth…" This is pretty much the simplest possible SQL query you can think of:

SELECT A, B, … FROM R;

Simple, right? Well, ok, there's actually a little twist. See, in SQL, you're allowed to have several identical rows in a table (by default anyway, keys have to be added explicitly). To use the technical term, SQL works with what are called multisets (also known as "bags"). So, we're going to do something clever in AGCA. Let's say you have a table of your customer's first names. If you write the AGCA expression:

CUSTOMER(FIRSTNAME)

You're not going to get one row for every customer. Instead, you're going to get something like this:

__FIRSTNAME______#__

< John      > -> 8

< Joe       > -> 3

< Steve     > -> 5

< Alphonse  > -> 1

That is to say, you'll get one output row every distinct row in your table, together with the number of times that this row occurs in your table (that is, you get its multiplicity). In the above example, you have 8 customers named John, but only 1 customer named Alphonse. This is the core idea of AGCA: Everything is multiplicities. If it helps, you can think of every AGCA expression as having an implicit group-by COUNT(*) aggregate around it.

CUSTOMER(FIRSTNAME)

is effectively the SQL query:

SELECT *, COUNT(*) FROM CUSTOMER GROUP BY CUSTOMER.*

By the way, there's a little bit of weirdness here. AGCA doesn't have precisely the same semantics for COUNT as SQL. See, every AGCA expression describes an infinite number of rows. Even CUSTOMER(FIRSTNAME) effectively has an infinite number of rows: There aren't any customers named Zardoz, but the expression does technically contain the row < Zardoz > -> 0. That said, in general, we're only interested in a few of those rows (the ones that aren't 0). In the interest of keeping things intuitive, I'm going to sweep this issue under the rug for now and come back to it in a future post.

That's it for now. Tune in next week for: JOINs and UNIONs in AGCA.

(If you want to learn more now, and have a good understanding of algebraic structures, have a look at the PODS2010 paper on AGCA: "Incremental Query Evaluation in a Ring of Databases")