April 13, 2021
Support Aggregation
So where's the challenge?
Test queries will be posted tonight...
Try them!
(e.g., REGEXP_EXTRACT(target, "1(3{2})7", 1))
UnresolvedFunction(
  name = FunctionIdentifier("REGEXP_EXTRACT"),
  arguments = Seq(
    UnresolvedAttribute(Seq("target")),
    Literal("1(3{2})7", StringType),
    Literal(1, IntegerType)
  ),
  isDistinct = false,
  filter = None,
  ignoreNulls = false
)
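If you want to see this tree for yourself, a quick sketch (assuming Spark's Catalyst parser is on your classpath) is to parse just the expression:

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// The result is an UnresolvedFunction wrapping the three (also unresolved)
// argument expressions shown above.
val expr = CatalystSqlParser.parseExpression("""REGEXP_EXTRACT(target, "1(3{2})7", 1)""")
println(expr)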
Replace it, just as with UnresolvedAlias and UnresolvedAttribute
... but with what?
case UnresolvedFunction(name, arguments, isDistinct, filter, ignoreNulls) =>
{
  val builder =
    FunctionRegistry.builtin
      .lookupFunctionBuilder(name)
      .getOrElse {
        throw new RuntimeException(
          s"Unable to resolve function `${name}`"
        )
      }
  builder(arguments) // returns the replacement expression node.
}
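One possible place for this case to live is inside a full-plan expression rewrite. This is only a sketch: resolveFunctions is a made-up name, and it assumes Catalyst's transformAllExpressions helper; if you already have your own recursive rewrite from earlier checkpoints, use that instead.

import org.apache.spark.sql.catalyst.analysis.{FunctionRegistry, UnresolvedFunction}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Walk every expression in every operator of the plan and swap out
// UnresolvedFunction nodes for whatever the registered builder returns.
def resolveFunctions(plan: LogicalPlan): LogicalPlan =
  plan.transformAllExpressions {
    case UnresolvedFunction(name, arguments, isDistinct, filter, ignoreNulls) =>
      FunctionRegistry.builtin
        .lookupFunctionBuilder(name)
        .getOrElse {
          throw new RuntimeException(s"Unable to resolve function `${name}`")
        }
        .apply(arguments)
  }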
val builder = FunctionRegistry.builtin
  .lookupFunctionBuilder("REGEXP_EXTRACT").get
builder(Seq(
  Attribute("target"),
  Literal("1(3{2})7", StringType),
  Literal(1, IntegerType)
))
↓
RegExpExtract(
  Attribute("target"),
  Literal("1(3{2})7", StringType),
  Literal(1, IntegerType)
)
val builder = FunctionRegistry.builtin
  .lookupFunctionBuilder("SUM").get
builder(Seq(
  Attribute("target")
))
↓
Sum(
  Attribute("target")
)
An expression subclassing...
...just Expression (e.g., RegExpExtract) → goes in a Project (or Aggregate?)
...AggregateFunction (e.g., Sum) → goes in an Aggregate
SELECT REGEXP_EXTRACT(...) FROM R
vs
SELECT SUM(...) FROM R
How does the parser distinguish these cases?
It doesn't
SELECT SUM(A) FROM R
↓
Project(Seq(
  UnresolvedFunction("SUM", Seq(
    UnresolvedAttribute("A")
  ))
), ...)
Project(Seq(
  UnresolvedFunction("SUM", Seq(
    UnresolvedAttribute("A")
  ))
), ...)
↓
Project(Seq(
  Sum(Attribute("A"))
), ...)
Now you can tell it's an aggregate.
Basic Guideline: If any expression is an AggregateFunction, the entire Project node should be an Aggregate instead.
Project(targets, child) =>
Aggregate(Seq(), targets, child)
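A minimal sketch of that guideline (hasAggregate and liftAggregates are made-up names; the check assumes Catalyst's AggregateFunction base class and the TreeNode transform/collectFirst helpers):

import org.apache.spark.sql.catalyst.expressions.NamedExpression
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Project}

// Does any target expression contain an AggregateFunction anywhere in its tree?
def hasAggregate(targets: Seq[NamedExpression]): Boolean =
  targets.exists { target =>
    target.collectFirst { case _: AggregateFunction => () }.isDefined
  }

// Rewrite qualifying Projects into Aggregates with an empty grouping list
// (i.e., one group covering the entire input).
def liftAggregates(plan: LogicalPlan): LogicalPlan =
  plan.transform {
    case Project(targets, child) if hasAggregate(targets) =>
      Aggregate(Seq(), targets, child)
  }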
Aggregate(
  groupingExpressions: Seq[Expression],
  aggregateExpressions: Seq[NamedExpression],
  child: LogicalPlan
)
Field | Spark | This Project |
---|---|---|
groupingExpressions | Any Expression | Just Attributes |
aggregateExpressions | Any Expression | Attribute OR Alias(AggregateFunction(...)) |
Supporting everything Spark supports
will be a lot more work.
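For example, under those restrictions, SELECT B, SUM(A) FROM R GROUP BY B should end up shaped roughly like this (a hand-built sketch, before attribute resolution):

import org.apache.spark.sql.catalyst.analysis.{UnresolvedAttribute, UnresolvedRelation}
import org.apache.spark.sql.catalyst.expressions.Alias
import org.apache.spark.sql.catalyst.expressions.aggregate.Sum
import org.apache.spark.sql.catalyst.plans.logical.Aggregate

// Grouping expressions are plain attributes; each aggregate expression is
// either an attribute or an Alias wrapping an AggregateFunction.
val example = Aggregate(
  groupingExpressions  = Seq(UnresolvedAttribute("B")),
  aggregateExpressions = Seq(
    UnresolvedAttribute("B"),
    Alias(Sum(UnresolvedAttribute("A")), "SUM(A)")()
  ),
  child = UnresolvedRelation(Seq("R"))
)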
What does the accumulator look like for each aggregate?
AggregateFunction.aggBufferAttributes
The attributes that the aggregation function is requesting.
Allocate an InternalRow with this schema for each function.
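One minimal way to do that allocation (a sketch using Catalyst's SpecificInternalRow; allocateBuffer is a made-up name, and how you initialize the buffer depends on your evaluator):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction

// One mutable buffer row per aggregate function, with one slot per requested
// buffer attribute (Sum, for example, asks for a single running-sum slot).
def allocateBuffer(agg: AggregateFunction): InternalRow =
  new SpecificInternalRow(agg.aggBufferAttributes.map { _.dataType })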
↙ To be Discussed ↘
↖ What Spark Does ↗
Everything is an Expression
How do you manage UnresolvedAttributes?
Adjust based on how you implemented your aggregation buffer.
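One possible way to handle the resolution step (a sketch; resolveAttributes is a made-up name, and it only covers lookups against the child operator's output; anything that refers to your aggregation buffer instead is up to your own layout):

import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}

// Replace UnresolvedAttributes by name with the matching attribute from the
// child's output, the same way Project targets get resolved.
def resolveAttributes(expr: Expression, childOutput: Seq[Attribute]): Expression =
  expr.transform {
    case UnresolvedAttribute(nameParts) =>
      childOutput
        .find { _.name.equalsIgnoreCase(nameParts.last) }
        .getOrElse {
          throw new RuntimeException(s"Unknown attribute: ${nameParts.mkString(".")}")
        }
  }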
Return to Transactions