February 16, 2021
resolvers += "MimirDB" at "https://maven.mimirdb.info/"
libraryDependencies += "edu.buffalo.cse.odin" %% "catalyzer" % "3.0"
Docs: https://doc.odin.cse.buffalo.edu/catalyzer/
Code: https://gitlab.odin.cse.buffalo.edu/okennedy/catalyzer
CREATE TABLE PLAYERS(
  ID string,
  FIRSTNAME string,
  LASTNAME string,
  FIRSTSEASON int,
  LASTSEASON int,
  WEIGHT int,
  BIRTHDATE date
);
SELECT FIRSTNAME, LASTNAME, WEIGHT, BIRTHDATE
FROM PLAYERS WHERE WEIGHT>200;
import org.apache.spark.sql.execution.SparkSqlParser
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
...
val parser = new SparkSqlParser()
def parseSql(sql: String): LogicalPlan =
  parser.parsePlan(sql)
...
import org.apache.spark.sql.catalyst.plans.logical.CreateTableStatement
...
plan match {
  case c: CreateTableStatement =>
    /* do something with c.tableName, c.tableSchema, c.location */
  case _ =>
    /* Interpret plan like a query */
}
...
CREATE TABLE PLAYERS(
  ID string,
  FIRSTNAME string,
  LASTNAME string,
  FIRSTSEASON int,
  LASTSEASON int,
  WEIGHT int,
  BIRTHDATE date
) USING csv OPTIONS(path 'data/PLAYERS.data', delimiter '|');
↓
There is a table named "PLAYERS"...
ABDELAL01|Alaa|Abdelnaby|1990|1994|240|1968-06-24
ABDULKA01|Kareem|Abdul-jabbar|1969|1988|225|1947-04-16
ABDULMA01|Mahmo|Abdul-rauf|1990|2000|162|1969-03-09
ABDULTA01|Tariq|Abdul-wahad|1997|2002|223|1974-11-03
ABDURSH01|Shareef|Abdur-rahim|1996|2007|225|1976-12-11
ABERNTO01|Tom|Abernethy|1976|1980|220|1954-05-06
ABRAMJO01|John|Abramovic|1946|1947|195|1919-02-09
ACKERAL01|Alex|Acker|2005|2008|185|1983-01-21
ACKERDO01|Donald|Ackerman|1953|1953|183|1930-09-04
ACRESMA01|Mark|Acres|1987|1992|220|1962-11-15
ACTONCH01|Charles|Acton|1967|1967|210|1942-01-11
...
case class UnresolvedRelation(
  multipartIdentifier: Seq[String],
  options: CaseInsensitiveStringMap,
  isStreaming: Boolean
)
Separation of concerns: The parser doesn't know what tables have been defined.
println(
  parser.parsePlan("SELECT * FROM R").treeString
)
↓
'Project [*]
+- 'UnresolvedRelation [R], [], false
Project(Seq(UnresolvedStar(None)),
  UnresolvedRelation(Seq("R"), CaseInsensitiveStringMap.empty(), false)
)
The interesting thing here is the multipartIdentifier field
(Seq("R") above).
It is a sequence in order to handle multipart names
(e.g., source.table → Seq("source", "table"))
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
...
plan.transform {
  case UnresolvedRelation(multipartIdentifier, _, _) => ???
}
By the way, ??? is valid Scala.
It means "I haven't implemented this yet": it compiles at any type, and throws a NotImplementedError if evaluated.
So what goes there?
import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.catalyst.expressions.Attribute

class Table( /* parameters */ ) extends LeafNode
{
  def output: Seq[Attribute] = ???
}
output is a plain Seq[Attribute]: one Attribute per column of the table.
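Filled in, a hypothetical concrete Table might look like this (the parameters are my own choice, not prescribed by Spark; AttributeReference is covered next):

import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
import org.apache.spark.sql.types.StructType

// A sketch: a table is a name plus the schema from the CREATE TABLE.
case class Table(name: String, schema: StructType) extends LeafNode
{
  // One AttributeReference per column; the empty second parameter
  // list assigns each attribute a fresh ExprId.
  val output: Seq[Attribute] =
    schema.fields.map { f => AttributeReference(f.name, f.dataType)() }.toSeq
}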
org.apache.spark.sql.catalyst.analysis.UnresolvedStar
↓
Seq[AttributeReference]
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
↓
AttributeReference
case class AttributeReference(
  name: String,
  dataType: DataType,
  nullable: Boolean = true,
  override val metadata: Metadata = Metadata.empty
)(
  val exprId: ExprId = NamedExpression.newExprId,
  val qualifier: Seq[String] = Seq.empty[String]
) extends Attribute with Unevaluable { ... }
Simple Constructor: AttributeReference(name, dt)()
ExprId Constructor: AttributeReference(name, dt)(id)
AttributeReference("a", IntType)().equals(
AttributeReference("a", IntType)())
returns false
Spark uses exprId equivalence to check whether two attributes are the same.
a#1 ≠ a#2
val id = NamedExpression.newExprId
AttributeReference("a", IntegerType)(id).equals(
  AttributeReference("a", IntegerType)(id))
returns true
AttributeReference("foo", IntType)(
qualifier = Seq("bar")
)
represents bar.foo
You don't need to use qualifiers yourself, but Spark already does, and they help during analysis.
import org.apache.spark.sql.catalyst.InternalRow
...
def eval(input: InternalRow): Any = ???
Input: An InternalRow
Output: The result of evaluating the expression
eval is implemented for most existing Spark Expression nodes.
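For example, a quick sanity check using two real Catalyst expression nodes, Literal and Add:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Add, Literal}

// 1 + 2 as an expression tree; no input columns needed.
val onePlusTwo = Add(Literal(1), Literal(2))
onePlusTwo.eval(InternalRow.empty)   // returns 3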
case class AttributeReference(
  name: String,
  dataType: DataType,
  nullable: Boolean = true,
  override val metadata: Metadata = Metadata.empty
)(
  val exprId: ExprId = NamedExpression.newExprId,
  val qualifier: Seq[String] = Seq.empty[String]
) extends Attribute with Unevaluable { ... }
Unevaluable, huh?
abstract class InternalRow extends SpecializedGetters with Serializable {
  // getters inherited from SpecializedGetters (shown in Scala syntax)
  def getBoolean(ordinal: Int): Boolean
  def getByte(ordinal: Int): Byte
  def getShort(ordinal: Int): Short
  def getInt(ordinal: Int): Int
  def getLong(ordinal: Int): Long
  def getFloat(ordinal: Int): Float
  def getDouble(ordinal: Int): Double
  ...
}
InternalRow is basically just an Array[Any]
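You can see this by building one directly (InternalRow.apply wraps its arguments in a GenericInternalRow):

import org.apache.spark.sql.catalyst.InternalRow

val row = InternalRow(1, 2)  // backed by an Array[Any]
row.getInt(0)                // 1
row.getInt(1)                // 2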
AttributeReference doesn't know which position the attribute will be at.
Make your own RowLookup subclass of Expression
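A minimal sketch of what that might look like (the class and its fields are assumptions; only Expression's machinery, LeafExpression, and CodegenFallback come from Spark):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.LeafExpression
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.DataType

// Unlike AttributeReference, RowLookup knows exactly which position
// to read. CodegenFallback lets us skip implementing doGenCode.
case class RowLookup(position: Int, dataType: DataType)
  extends LeafExpression with CodegenFallback
{
  def nullable: Boolean = true

  override def eval(input: InternalRow): Any =
    input.get(position, dataType)
}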
Why not use this class instead of AttributeReference in the first place?
Once we start optimizing, optimization rules (e.g., Projection Pushdown) can change an attribute's position.
Separation of concerns: Look at each LogicalPlan node individually.
Naive approach: Compute the full result.
def eval(plan: LogicalPlan): Seq[InternalRow] =
  plan match {
    case Project(targets, child) =>
      evalProject(targets, eval(child))
    case Filter(condition, child) =>
      evalFilter(condition, eval(child))
    ...
  }
def evalProject(targets: Seq[Expression],
                table: Seq[InternalRow]): Seq[InternalRow] = ???
def evalFilter(condition: Expression,
               table: Seq[InternalRow]): Seq[InternalRow] = ???
...
r = readCSVFile("R")
s = readCSVFile("S")
temp1 = evalCrossProduct(r, s)
temp2 = evalFilter({R.B=S.B}, temp1)
temp3 = evalFilter({S.C=10}, temp2)
result = evalProject(Seq( {R.A} ),
temp2)
$$\sigma_{A \neq 3} R$$
A | B |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
def evalFilter(condition: Expression, input: Seq[InternalRow]) =
  input.filter { row =>
    condition.eval(row).asInstanceOf[Boolean]
  }
(All-At-Once)
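A matching all-at-once evalProject might look like this (a sketch; it assumes the target expressions have already been rewritten to read positions, e.g., with RowLookup, so that eval works):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Expression

def evalProject(targets: Seq[Expression],
                input: Seq[InternalRow]): Seq[InternalRow] =
  input.map { row =>
    // Evaluate every target on this row; pack the results into a new row.
    InternalRow.fromSeq(targets.map { _.eval(row) })
  }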
Problem: A "table" can get very very big.
All "functions" can be implemented as iterators that use constant memory
$$\sigma_{A \neq 3} R$$

Call | Rows consumed from input | Result |
---|---|---|
getNext() | (1, 2) passes A ≠ 3 | return (1, 2) |
getNext() | (3, 4) fails; (5, 6) passes | return (5, 6) |
getNext() | input exhausted | return None |
Hint: Scala makes working with iterators very easy
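For example, Scala's Iterator.filter is already lazy and constant-memory, so σ is nearly free to write (a sketch):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Expression

// Returns rows one at a time; nothing is materialized.
def filterIterator(condition: Expression,
                   input: Iterator[InternalRow]): Iterator[InternalRow] =
  input.filter { row => condition.eval(row).asInstanceOf[Boolean] }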
for r in R:
  for s in S:
    emit(merge(r, s))
Project challenge: Implement this as an iterator
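One possible shape (a sketch, with merge left as ???; the key subtlety is that a one-shot Iterator over S is consumed by the first outer row, so the inner iterator must be re-creatable):

import org.apache.spark.sql.catalyst.InternalRow

// merge: concatenate the fields of two rows (left as an exercise)
def merge(r: InternalRow, s: InternalRow): InternalRow = ???

def crossProduct(
  r: Iterator[InternalRow],
  s: () => Iterator[InternalRow]  // a fresh iterator per outer row
): Iterator[InternalRow] =
  r.flatMap { rRow => s().map { sRow => merge(rRow, sRow) } }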
SQL Type | Spark Type | Scala Type |
---|---|---|
string | StringType | UTF8String |
int | IntegerType | Integer |
float | FloatType | Float |
decimal | DoubleType | Double |
date | DateType | Integer (days since epoch) |
That's org.apache.spark.unsafe.types.UTF8String, not java.lang.String.
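So, for example, when building rows for string columns:

import org.apache.spark.unsafe.types.UTF8String

// Wrap Java/Scala Strings before putting them in an InternalRow:
val name = UTF8String.fromString("Alaa")
name.toString  // back to a java.lang.String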
Algorithms for Basic RA