Last week, I started talking about the AGCA (Aggregate Calculus) query language. AGCA is a query languages that makes an explicit distinction between the parts of data that can be easily managed incrementally, and the parts that can not. This makes it incredibly useful for incremental computation techniques like the viewlet transform.
At the heart of AGCA is a simple mantra that I harped on extensively last week: Everything is multiplicities. If I write down a query, such as:
CUSTOMER(FIRSTNAME)
What I will get back is a table with one row for each distinct customer first name.
__FIRSTNAME______#__
< John > -> 8
< Alphonse > -> 1
The table has two columns: the FIRSTNAME column, and a count, or multiplicity for each row. In the example above, I have 8 customers named John, and 1 named Alphonse.
There's actually another way of looking at CUSTOMER(FIRSTNAME). You can think of it as a function (for those who are interested, it's actually something called a monad). If I give it a value for FIRSTNAME, it gives me back the number of customers who have that particular first name.
And this leads me to this week's topic: JOINs (definition here) and UNIONs (specifically what's known as a "Bag Union", defined briefly here). As we'll see, there's a nice way of looking at these two common database operations.
For those unfamiliar with the concept, a JOIN between two tables matches up rows of each table based on some rule. For example, I might have two tables with information about bicycles available for purchase: FRAMES(COLOR, TIRESIZE) and TIRES(TIRESIZE, TIRE). Here's some sample data (along with multiplicities… ignore those for now):
__COLOR____TIRESIZE______#__
< Blue , 26" > -> 1
< Red , 26" > -> 3
< Red , 20" > -> 1
< Black, 20" > -> 2
__TIRESIZE__TIRETYPE______#
< 26" , Mountain > -> 1
< 26" , Road > -> 1
< 20" , Road > -> 2
Let's say I'm interested in all the possible options I have for a new bike. In this case, I need to pick out a frame and a tire type. Clearly, I want to make sure that the tire I get is appropriate for the frame: The TIRESIZE of the tire has to match up with the TIRESIZE of the bike I get. So, to enumerate all the possible options, I can compute a JOIN between these two tables on the condition that the TIRESIZEs are identical (again, ignore the multiplicities for now):
__COLOR____TIRESIZE__TIRETYPE______#__
< Blue , 26" , Mountain > -> 1
< Blue , 26" , Road > -> 1
< Red , 26" , Mountain > -> 3
< Red , 26" , Road > -> 3
< Red , 20" , Road > -> 2
< Black , 20" , Road > -> 4
This is actually a special kind of join (condition) called a natural join; A natural join matches up rows based on columns that share the same name (in this case, the TIRESIZE column). As we'll see in a bit AGCA uses only natural joins. But, how exactly do joins work in AGCA?
Well, let's start by looking at those multiplicities in our input data. There is only a single type of Blue bike with 26" tires available for purchase, but two different types of 26" tires (Mountain and Road). Together these produce 2 different options for purchase (the first 2 rows of the output).
The multiplicity of the Red, 26" bike row is 3. What this means in practical terms is that there are 3 different types of Red, 26" bikes available. If I wanted one of those, I would actually have 6 different options (2 types of tire as before, and now 3 different types of bike).
The same thing happens with the 20" bikes. There are two different types of 20" tire (this time, both are road tires). There are also 2 different types of Black, 20" bikes. So, we have 4 different purchase options.
You've probably seen the pattern by now. When we JOIN two rows together, the multiplicities of the JOINed rows multiply. Because of this, JOINs in AGCA are written down as products. For example, we'd write down the above join as:
FRAMES(COLOR, TIRESIZE) * TIRES(TIRESIZE, TIRETYPE)
Note that the join condition is not explicitly specified in this query. This is because all joins in AGCA are natural joins. In this case, TIRESIZE appears in both FRAMES and TIRES, so we know that the query is only asking to match up the TIRESIZEs of both inputs.
Remember that I mentioned earlier that you can think of a table as a function (monad). This holds for any AGCA expression. The example join defines a function (monad) with three parameters: COLOR, TIRESIZE, and TIRETYPE. Let's say I wanted to evaluate it with the parameters (Black, 20", Road):
FRAMES(Black, 20") * TIRES(20", Road) = 2 * 2 = 4
Cool, right?
By the way, if there are no overlapping column names, then the natural join is just a cartesian cross product.
Ok, so, what about UNIONs? Well, let's say we have tables representing several different purveyors of coffee beans: ORENS(ROAST), CTB(ROAST) (respectively):
__ROAST_________#__
< Espresso > -> 2
< Light > -> 1
__ROAST______#__
< Medium > -> 2
< Light > -> 3
If I wanted to know all of my coffee-purchasing options, I could compute the union of these two tables. Recall that we're working with multisets/bags, so to keep things simple let's assume that there's no overlap between the offerings of both stores. If that's the case, then we can just merge the two tables together, adding together the multiplicities of rows that appear in both inputs:
__ROAST_________#__
< Espresso > -> 2
< Medium > -> 2
< Light > -> 4
In short, UNION adds row multiplicities together. Just like AGCA uses product to represent joins, sum represents unions. So our coffee shop query (let's call it Q) is written down as
Q(ROAST) := ORENS(ROAST) + CTB(ROAST)
Again, we can look at this query as a function. For example:
Q(Light) = ORENS(Light) + CTB(Light) = 1 + 3 = 4
There are some caveats here. First off, the column names of the things being UNIONed together typically have to be the same. ORENS(ROAST) + FRAMES(COLOR, TIRESIZE) doesn't really make much sense. I'll return to this assumption before long, but be aware of it for now.
Second, as I briefly mentioned last week, tables (and expressions in general) contain an infinite number of rows -- although only a small number have non-zero multiplicities. This is something to keep in mind when thinking about AGCA expressions as functions. For example:
Q(Espresso) = ORENS(Espresso) + CTB(Espresso) = 2 + 0 = 2
Or, going back to our bike options example:
Q(Black, 20", Mountain) = FRAMES(Black, 20") * TIRES(20", Mountain) = 2 * 0 = 0
Note how the zeroes propagate through joins. This makes sense if you think about it. JOINing against a row that doesn't exist doesn't produce any output rows (ignoring outer joins for the moment).
For those interested in algebraic structures, AGCA actually forms something called a Ring, with product and sum defined as above. I'll talk about constants in a few weeks, but you can think of the "zero" and "one" values of the ring as special tables ZERO() and ONE() with no columns, and only a single row each: <> -> 0, and <> -> 1, respectively. The set underlying the ring is a specific incarnation of the monads that I've been alluding to.
That's it for now. Next week: Two typically unrelated operations, brought together by AGCA in a somewhat surprising way: PROJECTion and the COUNT aggregate.