
Checkpoint 3

April 13, 2021

Goal

Support Aggregation

Catalyzer provides...

  • Fold operators for all standard aggregation functions.
  • Aggregate logical plan operator.

So where's the challenge?

Test queries will be posted tonight...

Try them!

Sparkisms

  • New Placeholders
  • Interpreting the Aggregate operator
  • Interfacing with Spark's fold operators

New Placeholders

UnresolvedFunction

(e.g., REGEXP_EXTRACT(target, "1(3{2})7", 1))


      UnresolvedFunction(
        name = FunctionIdentifier("REGEXP_EXTRACT"), 
        arguments = Seq(
          UnresolvedAttribute(Seq("target")),
          Literal("1(3{2})7", StringType),
          Literal(1, IntegerType)
        ),
        isDistinct = false,
        filter = None,
        ignoreNulls = false
      )
    

Replace it, just like UnresolvedAlias and UnresolvedAttribute

... but with what?

FunctionRegistry


      case UnresolvedFunction(name, arguments, isDistinct, filter, ignoreNulls) =>
      {
        val builder = 
          FunctionRegistry.builtin
            .lookupFunctionBuilder(name)
            .getOrElse {
              throw new RuntimeException(
                s"Unable to resolve function `${name}`"
              )
            }
        builder(arguments) // returns the replacement expression node.
      }
    

Functions


      val builder = FunctionRegistry.builtin
                      .lookupFunctionBuilder("REGEXP_EXTRACT").get
      builder(
        Attribute("target"),
        Literal("1(3{2})7", StringType),
        Literal(1, IntegerType)
      )
    


      RegExpExtract(
        Attribute("target"),
        Literal("1(3{2})7", StringType),
        Literal(1, IntegerType)
      )
    
RegExpExtract

Functions


      val builder = FunctionRegistry.builtin
                      .lookupFunctionBuilder("SUM").get
      builder(
        Attribute("target")
      )
    


      Sum(
        Attribute("target")
      )
    
Sum

Aggregates

Any expression subclassing AggregateFunction.

New Placeholders

Parsing SQL


  SELECT ... FROM R
  WHERE ...

→ Project (or Aggregate?)


  SELECT ... FROM R
  GROUP BY ...

→ Aggregate

      SELECT REGEXP_EXTRACT(...) FROM R
    

vs


      SELECT SUM(...) FROM R
    

How does the parser distinguish these cases?

It doesn't


      SELECT SUM(A) FROM R
    


      Project(Seq(
        UnresolvedFunction("SUM", Seq(
          UnresolvedAttribute("A")
        ))
      ), ...)
    

After resolution:


      Project(Seq(
        Sum(Attribute("A"))
      ), ...)
    

Now you can tell it's an aggregate.

Basic Guideline: If any expression is an AggregateFunction, the entire Project node should be an Aggregate instead.


      Project(targets, child) => 
        Aggregate(Seq(), targets, child)
    
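The guideline above can be sketched as a rewrite rule. These are simplified stand-in classes for illustration, not Spark's actual Catalyst types:

```scala
// Simplified stand-ins for Catalyst classes (hypothetical, not Spark's real API).
sealed trait Expression
case class Attribute(name: String) extends Expression
case class Sum(child: Expression) extends Expression            // an AggregateFunction
case class Alias(child: Expression, name: String) extends Expression

sealed trait LogicalPlan
case class Table(name: String) extends LogicalPlan
case class Project(targets: Seq[Expression], child: LogicalPlan) extends LogicalPlan
case class Aggregate(groupingExpressions: Seq[Expression],
                     aggregateExpressions: Seq[Expression],
                     child: LogicalPlan) extends LogicalPlan

// Does this expression contain an aggregate function anywhere in its tree?
def containsAggregate(e: Expression): Boolean = e match {
  case Sum(_)      => true
  case Alias(c, _) => containsAggregate(c)
  case _           => false
}

// If any target expression aggregates, the whole Project becomes an Aggregate
// with an empty grouping list (no GROUP BY clause).
def rewrite(plan: LogicalPlan): LogicalPlan = plan match {
  case Project(targets, child) if targets.exists(containsAggregate) =>
    Aggregate(Seq.empty, targets, child)
  case other => other
}
```

A real implementation needs a case per AggregateFunction subclass (or a check against the common superclass); the shape of the rewrite is the point here.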

      Aggregate(
        groupingExpressions: Seq[Expression], 
        aggregateExpressions: Seq[NamedExpression], 
        child: LogicalPlan
      )
    
groupingExpressions: the GROUP BY expressions
aggregateExpressions: the SELECT target expressions (may include GROUP BY expressions)
child: as usual

  Field                  Spark            This Project
  groupingExpressions    Any Expression   Just Attributes
  aggregateExpressions   Any Expression   Attribute OR Alias(AggregateFunction(...))

Supporting everything Spark supports
will be a lot more work.

  1. Assign input tuple to a group based on groupingExpressions
  2. Accumulate for each aggregate in aggregateExpressions
  3. Repeat for all input tuples
  4. "Render" result based on aggregateExpressions
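The four steps above can be sketched as a minimal hash-aggregation loop. Rows are plain Seq[Any], grouping is by column index, and AggFn is a hypothetical (init, fold, finish) triple, not a Spark class:

```scala
// Each aggregate is a starting accumulator, a fold step, and a finalizer.
case class AggFn(init: Any, fold: (Any, Seq[Any]) => Any, finish: Any => Any)

def aggregate(rows: Seq[Seq[Any]],
              groupingCols: Seq[Int],
              aggs: Seq[AggFn]): Seq[Seq[Any]] = {
  // Steps 1 and 3: assign each input tuple to a group, over all tuples.
  val buffers = scala.collection.mutable.Map[Seq[Any], Seq[Any]]()
  for (row <- rows) {
    val key  = groupingCols.map(row)               // project out the grouping columns
    val accs = buffers.getOrElse(key, aggs.map(_.init))
    // Step 2: accumulate the row into every aggregate's accumulator.
    buffers(key) = aggs.zip(accs).map { case (a, acc) => a.fold(acc, row) }
  }
  // Step 4: render one output row per group.
  buffers.toSeq.map { case (key, accs) =>
    key ++ aggs.zip(accs).map { case (a, acc) => a.finish(acc) }
  }
}

// Example: SUM over column 1, grouped by column 0.
val sum = AggFn(0, (acc, row) => acc.asInstanceOf[Int] + row(1).asInstanceOf[Int], identity)
```

Output order is whatever the hash map produces; a real operator would also handle the empty-input and no-grouping cases.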

AggregateFunction

COUNT(*)

Init
$0$
Fold(Accum, New)
$Accum + 1$

SUM(A)

Init
$0$
Fold(Accum, New)
$Accum + New$

AVG(A)

Init
$\{ sum = 0, count = 0 \}$
Fold(Accum, New)
$\{ sum = Accum.sum + New,\; count = Accum.count + 1 \}$
Finalize(Accum)
$\frac{Accum.sum}{Accum.count}$

Basic Aggregate Pattern

Init
Define a starting value for the accumulator
Fold(Accum, New)
Merge a new value into the accumulator
Finalize(Accum)
Extract the aggregate from the accumulator.
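For example, AVG under this pattern, with a hypothetical accumulator class (the names are illustrative, not Spark's):

```scala
// AVG(A): the accumulator carries a running sum and count.
case class AvgAcc(sum: Double, count: Long)

val init = AvgAcc(0.0, 0L)                              // Init
def fold(acc: AvgAcc, a: Double): AvgAcc =              // Fold(Accum, New)
  AvgAcc(acc.sum + a, acc.count + 1)
def result(acc: AvgAcc): Double = acc.sum / acc.count   // Finalize(Accum)

val avg = result(Seq(1.0, 2.0, 3.0).foldLeft(init)(fold))
```

COUNT and SUM fit the same pattern with an identity Finalize.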

What does the accumulator look like for each aggregate?

Aggregation Buffers

AggregateFunction.aggBufferAttributes

The attributes that the aggregation function is requesting.

Allocate an InternalRow
with this schema for each function.
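A sketch of that allocation, with Array[Any] standing in for InternalRow (hypothetical types):

```scala
// Each function requests a schema via aggBufferAttributes; allocate one
// untyped row of that width per function (Array[Any] stands in for InternalRow).
case class BufferAttribute(name: String)

def allocateBuffer(aggBufferAttributes: Seq[BufferAttribute]): Array[Any] =
  new Array[Any](aggBufferAttributes.length)
```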

Aggregation Buffers

  • One Buffer Per Aggregate, Per Group (to be discussed)
  • One Buffer Per Group (Aggregates Share) (what Spark does)

DeclarativeAggregates

Everything is an Expression

initialValues: Seq[Expression]
Evaluate these expressions without a row to get initial values for the buffer
updateExpressions: Seq[Expression]
Evaluate these expressions on the buffer and input row together to get new buffer values
evaluateExpression: Expression
Evaluate this expression on the buffer to get the final aggregate result
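A toy illustration of the three pieces, using a minimal expression evaluator and a declarative SUM. All classes here are simplified stand-ins for Spark's:

```scala
// A tiny expression language: literals, bound slot references, addition.
sealed trait Expr
case class Lit(v: Int) extends Expr
case class BoundRef(slot: Int) extends Expr   // reads slot i of the evaluation row
case class Add(l: Expr, r: Expr) extends Expr

def eval(e: Expr, row: Seq[Int]): Int = e match {
  case Lit(v)      => v
  case BoundRef(i) => row(i)
  case Add(l, r)   => eval(l, row) + eval(r, row)
}

// Declarative SUM over a one-column input. The buffer has one slot (slot 0);
// during update, the evaluation row is [buffer ++ inputRow], so the input
// attribute sits at slot 1.
val initialValues      = Seq(Lit(0))                        // buffer starts at 0
val updateExpressions  = Seq(Add(BoundRef(0), BoundRef(1))) // sum := sum + input
val evaluateExpression = BoundRef(0)                        // final result is the sum

def runSum(input: Seq[Seq[Int]]): Int = {
  var buffer = initialValues.map(eval(_, Seq.empty))        // init: no row needed
  for (row <- input)
    buffer = updateExpressions.map(eval(_, buffer ++ row))  // update on [buffer, row]
  eval(evaluateExpression, buffer)                          // finalize on the buffer
}
```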

DeclarativeAggregates

updateExpressions

How do we resolve the UnresolvedAttributes in these expressions?

Input
[Buffer, InputRow]
Schema (for resolution)
agg.aggBufferAttributes
   ++ child.output

Adjust based on how you implemented your aggregation buffer.
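Resolution itself can be sketched as a name-to-slot lookup over the concatenated schema (a hypothetical helper):

```scala
// updateExpressions see the row [Buffer, InputRow], so an attribute name
// resolves to its position in aggBufferAttributes ++ child.output.
def resolveSlot(name: String,
                aggBufferAttributes: Seq[String],
                childOutput: Seq[String]): Int = {
  val schema = aggBufferAttributes ++ childOutput
  val slot   = schema.indexOf(name)
  if (slot < 0) throw new RuntimeException(s"Unable to resolve attribute `$name`")
  slot
}
```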

DeclarativeAggregates

evaluateExpression

Input
Buffer
Schema (for resolution)
agg.aggBufferAttributes

Next Class

Return to Transactions