The ODIn Lab - AGCA, The language of change (Part 3: The Sum Project)

For a few weeks now, I've been talking about the AGCA query language, a language for incremental computation. If you haven't already done so, you should probably read Part 1 and Part 2 before continuing on with this post.

Just to recap, in AGCA, queries are written down as algebraic formulas. The most basic term in the language is the table (Relations, if you want to be fancy). Multiplication is the natural join, and addition is bag union.

And of course, the most important thing about AGCA is that Everything is Multiplicities. Unlike SQL, where the result of a query is simply a list of output rows, a query result in AGCA is more like a lookup table. Each row of the output is associated with a multiplicity (loosely speaking, the number of times that the row occurs in the result). Because the rows are unique, you can use the row to look up its multiplicity in the query result. To use the technical terms, query outputs are Maps (a.k.a., Hashes, Dictionaries, HashMaps, etc…), each row is a key in the map, and multiplicities are the mapped values.

Note by the way, that this doesn't stop us from talking about empty rows. Look at the following SQL query:

SELECT

FROM ACTORS;

Or equivalently, in English, "give me an empty row for every actor." There's a good chance your favorite SQL system won't approve of this query. You can also certainly make a good argument that this query isn't especially useful. That said, the query does have a meaning. Give me one row (with nothing in it) for every actor.

Recall one more thing from last week: How addition/union works in AGCA. "Duplicate" rows on either side are merged, and their multiplicities are added together. If a row occurs twice on one side of the union, and three times on the other, then the final unioned output has five copies of the row (or as AGCA would put it, the row has a multiplicity of 5 in the output).

Where am I going with this? Well, empty rows are all identical. So, if you have a result that contains only empty rows, the result is guaranteed to have exactly one row (or zero rows, that is, one row with multiplicity 0). Let's see an example on this table of actor first names.

__FIRSTNAME______#__

< Steve     > -> 2

< Jim       > -> 1

< John      > -> 3

Let's say we get rid of the FIRSTNAME column (project it away, to use the technical term). We end up with

________#__

<  > -> 2

<  > -> 1

<  > -> 3

But that's wrong. Every row is supposed to be unique. All those duplicate empty rows need to be merged together. So, just like we merge together rows when computing a UNION, we add up the multiplicities of these empty rows.

________#__

<  > -> 6

What exactly just happened here? Well, by projecting away the FIRSTNAME column, we've essentially computed the COUNT(*) of the number of rows in the input. Recall that when I first described AGCA, I mentioned that every query has an implicit COUNT(*) around it. Instead of

SELECT FROM ACTORS;

what AGCA actually computes is

SELECT COUNT(*) FROM (SELECT FROM ACTORS);

or, put more simply

SELECT COUNT(*) FROM ACTORS;

The same idea can actually be taken a bit further. Let's say you have the following table:

__FIRSTNAME__LASTNAME_________#__

< Steve ,    Carell      > -> 1

< Steve ,    Coogan      > -> 1

< Jim   ,    Carrey      > -> 1

< John  ,    Depp        > -> 1

< John  ,    Galecki     > -> 1

< John  ,    Rhys-Davies > -> 1

What happens if we project away just the LASTNAME column? We get 2 Steves, 1 Jim, and 3 Johns, exactly the same table that we started with. In other words, by projecting away just the LASTNAME column, we end up computing a group-by COUNT(*) aggregate:

SELECT FIRSTNAME, COUNT(*)

FROM ACTORS

GROUP BY FIRSTNAME;

This technique of using projection to compute the COUNT(*) aggregate also lets us compute group-by aggregates. Projection and the COUNT(*) aggregate are the same thing in AGCA. AGCA uses a special operator called AggSum to represent this operation. For example, the above group-by COUNT(*) aggregate is written as:

AggSum([FIRSTNAME], ACTORS(FIRSTNAME,LASTNAME))

Or in general:

AggSum([{group by var 1}, {group by var 2}, …], {aggregated AGCA expression})

And there you have it: How AGCA handles projection and the COUNT(*) aggregate. Next week, the SUM() aggregate, and conditions/selection in AGCA.