
Checkpoint 1

CSE-4/562 Spring 2021

February 16, 2021

Garcia-Molina/Ullman/Widom: Ch. 16.1

News

  • Checkpoint 0 due tonight
  • Homework 1 will be posted tomorrow night

Checkpoint 1

sif$ scalac -cp build:Catalyzer.jar -d submission.jar {all .scala files}
sif$ ls data/
R.data      S.data     T.data
sif$ head -n 2 data/R.data
1|3|5
2|9|1
sif$ scala -cp submission.jar:Catalyzer.jar microbase.Microbase
$> CREATE TABLE R(A int, B int, C int) USING csv OPTIONS(path 'data/R.data', delimiter '|');
$> SELECT A, C FROM R WHERE B < 5;
1|5
...
$>

Checkpoint 1

  • Your code is compiled just the same as in Checkpoint 0.
  • Print a prompt '$>' at the start and after each command.
  • Read one command per line.
  • CREATE TABLE statements tell you the schema of each table.
  • Data lives in a '|'-separated file in 'data/[tablename].data'
  • Print query results '|'-separated
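
A minimal sketch of that command loop, assuming a hypothetical handleCommand helper that parses and evaluates one statement (exact prompt formatting is up to you):

  package microbase

  import scala.io.StdIn

  object Microbase {
    def main(args: Array[String]): Unit = {
      var running = true
      while (running) {
        print("$> ")
        val line = StdIn.readLine()
        if (line == null) { running = false }              // end of input
        else if (line.trim.nonEmpty) { handleCommand(line) }
      }
    }

    // Hypothetical: parse the statement, evaluate it, print '|'-separated rows.
    def handleCommand(sql: String): Unit = ???
  }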

Setup

Add this to your build.sbt (modify as appropriate for your IDE)

  resolvers += "MimirDB" at "https://maven.mimirdb.info/"
  libraryDependencies += "edu.buffalo.cse.odin" %% "catalyzer" % "3.0"
    

Docs: https://doc.odin.cse.buffalo.edu/catalyzer/

Code: https://gitlab.odin.cse.buffalo.edu/okennedy/catalyzer

CREATE TABLE PLAYERS(
  ID string, 
  FIRSTNAME string, 
  LASTNAME string, 
  FIRSTSEASON int, 
  LASTSEASON int, 
  WEIGHT int, 
  BIRTHDATE date
);

SELECT FIRSTNAME, LASTNAME, WEIGHT, BIRTHDATE 
FROM PLAYERS WHERE WEIGHT>200;

import org.apache.spark.sql.execution.SparkSqlParser
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

...
  val parser = new SparkSqlParser()

  def parseSql(sql: String): LogicalPlan = 
    parser.parsePlan(sql)
...
    

import org.apache.spark.sql.catalyst.plans.logical.CreateTableStatement
...
  plan match {
    case c:CreateTableStatement => 
      /* do something with c.tableName, c.tableSchema, c.location */
    case _ => 
      /* Interpret plan like a query */
  }
...
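
For the CREATE TABLE case, one option is to record the schema and data file path somewhere you can look them up later. A minimal sketch (the catalog map and handleCreateTable are hypothetical names; depending on how the statement is phrased the path may arrive in c.location or in the OPTIONS map, so falling back to the data/[tablename].data convention is a reasonable default):

  import scala.collection.mutable
  import org.apache.spark.sql.catalyst.plans.logical.CreateTableStatement
  import org.apache.spark.sql.types.StructType

  // Hypothetical in-memory catalog: table name -> (schema, path to its data file)
  val catalog = mutable.Map[String, (StructType, String)]()

  def handleCreateTable(c: CreateTableStatement): Unit = {
    val name = c.tableName.last                        // e.g., Seq("R") -> "R"
    val path = c.location.getOrElse(s"data/$name.data")
    catalog(name) = (c.tableSchema, path)
  }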
    
CREATE TABLE PLAYERS(
  ID string, 
  FIRSTNAME string, 
  LASTNAME string, 
  FIRSTSEASON int, 
  LASTSEASON int, 
  WEIGHT int, 
  BIRTHDATE date
) USING csv OPTIONS(path 'data/PLAYERS.data', delimiter '|');

There is a table named "PLAYERS"...

  • ... with 7 attributes
  • ... whose attributes have the given types
  • ... with data in the file "data/PLAYERS.data"

ABDELAL01|Alaa|Abdelnaby|1990|1994|240|1968-06-24
ABDULKA01|Kareem|Abdul-jabbar|1969|1988|225|1947-04-16
ABDULMA01|Mahmo|Abdul-rauf|1990|2000|162|1969-03-09
ABDULTA01|Tariq|Abdul-wahad|1997|2002|223|1974-11-03
ABDURSH01|Shareef|Abdur-rahim|1996|2007|225|1976-12-11
ABERNTO01|Tom|Abernethy|1976|1980|220|1954-05-06
ABRAMJO01|John|Abramovic|1946|1947|195|1919-02-09
ACKERAL01|Alex|Acker|2005|2008|185|1983-01-21
ACKERDO01|Donald|Ackerman|1953|1953|183|1930-09-04
ACRESMA01|Mark|Acres|1987|1992|220|1962-11-15
ACTONCH01|Charles|Acton|1967|1967|210|1942-01-11
...   

Example Queries

  1. SELECT A, B, ... FROM R (Project)
  2. SELECT A, B, ... FROM R WHERE ... (Project+Filter)
  3. SELECT A+B AS C, ... FROM R (Map)
  4. SELECT A+B AS C, ... FROM R WHERE ... (Map+Filter)
  5. SELECT SUM(A+B) AS C, ... FROM R (Aggregate)
  6. SELECT SUM(A+B) AS C, ... FROM R WHERE ... (Aggregate+Filter)

Spark's Workflow

  1. Analysis
  2. Optimization
  3. Physical Planning
  4. Code Generation
  5. Execution
[Pipeline diagram: SQL → CREATE TABLE → Saved Schema → .data → SELECT → LogicalPlan → Analyzed Plan → Iterators → Results]

Analysis

Resolution
  • Replace placeholder values from parsing.
  • "Wire up" attributes between operators.
Validation
  • Ensure all of the types line up.

Placeholders


  case class UnresolvedRelation(
    nameElements: Seq[String], 
    options: CaseInsensitiveStringMap, 
    isStreaming: Boolean
  )
    

Separation of concerns: The parser doesn't know what tables have been defined.

Try It


    println(
      parser.parsePlan("SELECT * FROM R").treeString
    )
  

'Project [*]
+- 'UnresolvedRelation [R], [], false
  

  Project(Seq(UnresolvedStar(None)), 
    UnresolvedRelation(Seq("R"), CaseInsensitiveStringMap.empty, false)
  )
    

The interesting thing here is the nameElements field
(Seq("R") above)

This is a sequence to handle multipart names
(e.g., source.table becomes Seq("source", "table"))

Replacing Placeholders


  import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
  ...
  plan.transform { 
    case UnresolvedRelation(nameElements, _, _) => ???
  }
    

By the way, ??? is valid Scala.
It means "I haven't implemented this yet".

So what goes there?

Suggested Approach


  import org.apache.spark.sql.catalyst.plans.logical.LeafNode
  import org.apache.spark.sql.catalyst.expressions.Attribute

  class Table( /* parameters */ ) extends LeafNode
  {
    def output: Seq[Attribute] = ???
  }
    

output is declared as a Seq[Attribute]; concretely, each element will usually be an AttributeReference.
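
A sketch of what such a Table node could look like, together with the transform that swaps it in for UnresolvedRelation. The Table fields and the catalog map (from the earlier CREATE TABLE sketch) are hypothetical; only the Spark classes are real:

  import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
  import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
  import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}
  import org.apache.spark.sql.types.StructType

  // Hypothetical leaf node: one per table registered by CREATE TABLE.
  case class Table(name: String, schema: StructType, path: String) extends LeafNode
  {
    // One AttributeReference per column, qualified by the table name.
    val output: Seq[Attribute] =
      schema.map { f =>
        AttributeReference(f.name, f.dataType, f.nullable)(qualifier = Seq(name))
      }
  }

  // Resolution step: replace every UnresolvedRelation with the matching Table.
  def resolveRelations(plan: LogicalPlan): LogicalPlan =
    plan.transform {
      case UnresolvedRelation(nameElements, _, _) =>
        val name = nameElements.last
        val (schema, path) = catalog(name)             // hypothetical catalog map
        Table(name, schema, path)
    }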

Expressions

Expression placeholders get resolved the same way:

  • org.apache.spark.sql.catalyst.analysis.UnresolvedStar
    → Seq(AttributeReference)
  • org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
    → AttributeReference
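
One way to wire up UnresolvedAttribute placeholders is to look them up by name in the children's output. A rough sketch (it ignores qualifiers and ambiguous names, which real resolution has to deal with):

  import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

  def resolveAttributes(plan: LogicalPlan): LogicalPlan =
    plan.transform { case operator =>
      val childAttrs = operator.children.flatMap { _.output }
      operator.transformExpressions {
        case UnresolvedAttribute(nameParts) =>
          childAttrs.find { _.name.equalsIgnoreCase(nameParts.last) }
                    .getOrElse { throw new RuntimeException(s"Unknown attribute $nameParts") }
        // UnresolvedStar is different: it expands to *many* expressions, so it
        // needs special handling in the operator that contains it (e.g., Project).
      }
    }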


  case class AttributeReference(
      name: String,
      dataType: DataType,
      nullable: Boolean = true,
      override val metadata: Metadata = Metadata.empty
    )(
      val exprId: ExprId = NamedExpression.newExprId,
      val qualifier: Seq[String] = Seq.empty[String]
    ) extends Attribute with Unevaluable { ... }
    

Simple Constructor: AttributeReference(name, dt)()

ExprId Constructor: AttributeReference(name, dt)(id)

exprId


    AttributeReference("a", IntegerType)().equals(
      AttributeReference("a", IntegerType)())
  

returns false

Spark uses exprId equivalence to check whether two attributes are the same.

a#1 $\neq$ a#2

exprId


    val id = NamedExpression.newExprId
    AttributeReference("a", IntegerType)(id).equals(
      AttributeReference("a", IntegerType)(id))
  

returns true

qualifiers


    AttributeReference("foo", IntegerType)(
      qualifier = Seq("bar")
    )
    

represents bar.foo

You don't need to use this, but Spark already uses it, and it helps during analysis.

Why Do Analysis?

  • Makes output work automagically on all existing LogicalPlan nodes.
  • Makes dataType work automagically on all existing Expression nodes.
  • Makes it easier to support eval on all existing Expression nodes.

Evaluating Expressions


  import org.apache.spark.sql.catalyst.InternalRow
  ...
  def eval(input: InternalRow): Any = ???  
    

Input: An InternalRow

Output: The result of evaluating the expression

eval is implemented for most existing Spark Expression nodes.
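
As a quick sanity check (not from the slides), an expression built purely from literals can be evaluated against an empty row:

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.{Add, Literal}

  // 1 + 2, evaluated with no input attributes needed.
  val three = Add(Literal(1), Literal(2)).eval(InternalRow.empty)   // == 3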


  case class AttributeReference(
      name: String,
      dataType: DataType,
      nullable: Boolean = true,
      override val metadata: Metadata = Metadata.empty
    )(
      val exprId: ExprId = NamedExpression.newExprId,
      val qualifier: Seq[String] = Seq.empty[String]
    ) extends Attribute with Unevaluable { ... }
    

Unevaluable, huh?


  abstract class InternalRow extends SpecializedGetters with Serializable {
    def getBoolean(ordinal: Int): Boolean
    def getByte(ordinal: Int): Byte
    def getShort(ordinal: Int): Short
    def getInt(ordinal: Int): Int
    def getLong(ordinal: Int): Long
    def getFloat(ordinal: Int): Float
    def getDouble(ordinal: Int): Double
    ...
  }
    

InternalRow is basically just an Array[Any]

AttributeReference doesn't know which position the attribute will be at.

Suggested Approach

Make your own RowLookup subclass of Expression

Why not use this class instead of AttributeReference in the first place?

Once we start optimizing, optimization rules (e.g., Projection Pushdown) can change an attribute's position.
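
One possible shape for that class (a sketch; the name and fields are placeholders, and mixing in CodegenFallback lets you skip code generation and just implement eval):

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.LeafExpression
  import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
  import org.apache.spark.sql.types.DataType

  // Hypothetical replacement for AttributeReference once positions are known:
  // "read column `position` of the incoming row".
  case class RowLookup(name: String, dataType: DataType, position: Int)
    extends LeafExpression with CodegenFallback
  {
    val nullable = true
    def eval(input: InternalRow): Any = input.get(position, dataType)
  }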

Evaluating LogicalPlan nodes

Separation of concerns: Look at each LogicalPlan node individually.

Naive approach: Compute the full result.


  import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}

  def eval(plan: LogicalPlan): Seq[InternalRow] =
    plan match { 
      case Project(targets, child) => 
        evalProject(targets, eval(child))
      case Filter(condition, child) => 
        evalFilter(condition, eval(child))
      ...
    }

  def evalProject(targets: Seq[Expression], 
                  table: Seq[InternalRow]): Seq[InternalRow] = ???
  def evalFilter(condition: Expression, 
                 table: Seq[InternalRow]): Seq[InternalRow] = ???
  ...
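
For example, evalProject can be a one-liner under this scheme (a sketch; it assumes each target expression has already been resolved, e.g., rewritten into positional lookups, so that eval works):

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.Expression

  // Evaluate every target expression against every input row.
  def evalProject(targets: Seq[Expression],
                  table: Seq[InternalRow]): Seq[InternalRow] =
    table.map { row => InternalRow.fromSeq(targets.map { _.eval(row) }) }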
    

Basic Mindset


  r = readCSVFile("R")
  s = readCSVFile("S")
  
  temp1 = evalCrossProduct(r, s)
  temp2 = evalFilter({R.B=S.B}, temp1)
  
  temp3 = evalFilter({S.C=10}, temp2)
  
  result = evalProject(Seq( {R.A} ), 
                       temp3)
    

Filter

$$\sigma_{A \neq 3} R$$

  A | B
  --+--
  1 | 2
  3 | 4
  5 | 6

Filter


  def evalFilter(condition: Expression, input: Seq[InternalRow]) =
      input.filter { row =>
        condition.eval(row).asInstanceOf[Boolean] 
      }
    

(All-At-Once)

Problem: A "table" can get very very big.

Better Idea: Iterators

  • hasNext(): Returns true if there are more rows to return
  • next(): Returns the next row
  • reset(): Resets the iterator back to the first row

All "functions" can be implemented as iterators that use constant memory
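
For instance, the all-at-once evalFilter from above needs almost no changes to stream: take and return an Iterator instead of a Seq (a sketch; Scala's built-in Iterator provides hasNext/next but not reset, so a resettable version would need to re-open its input):

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.Expression

  // Streaming filter: pulls one row at a time, so memory use stays constant.
  def filterIterator(condition: Expression,
                     input: Iterator[InternalRow]): Iterator[InternalRow] =
    input.filter { row => condition.eval(row).asInstanceOf[Boolean] }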

Select

$$\sigma_{A \neq 3} R$$

[Animation: repeated getNext() calls scan the input table (A, B) = (1, 2), (3, 4), (5, 6); rows (1, 2) and (5, 6) satisfy $A \neq 3$ and are returned, (3, 4) is skipped, and the final call returns None.]

Hint: Scala makes working with iterators very easy

Joins

Example: Join (Naive)


      for r in R:
        for s in S:
          emit(merge(r, s))
    

Project challenge: Implement this as an iterator

'|'-separated Value File Suggestions

  • Use scala.io.Source's getLines() method to read the file lazily (see the sketch after the type table below)
  • Use String's split() to separate fields.
  • Parse everything upfront
  • Use InternalRow.fromSeq to create rows.
  • For Codd's sake, don't store entire tables in memory

  SQL Type   Spark Type    Scala Type
  string     StringType    UTF8String*
  int        IntegerType   Integer
  float      FloatType     Float
  decimal    DoubleType    Double
  date       DateType      java.sql.Date

  * it's org.apache.spark.unsafe.types.UTF8String
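
Putting those suggestions together, a table scan might look roughly like this (a sketch; scanTable is a placeholder name, only the string and int conversions from the table above are handled, and the path/schema are whatever you recorded at CREATE TABLE time):

  import scala.io.Source
  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
  import org.apache.spark.unsafe.types.UTF8String

  // Stream one InternalRow per line of a '|'-separated data file.
  def scanTable(path: String, schema: StructType): Iterator[InternalRow] =
    Source.fromFile(path).getLines.map { line =>
      val fields = line.split('|')
      InternalRow.fromSeq(
        fields.toSeq.zip(schema.map { _.dataType }).map {
          case (raw, StringType)  => UTF8String.fromString(raw)
          case (raw, IntegerType) => raw.toInt
          case (raw, t) => throw new RuntimeException(s"Unsupported type: $t")
        }
      )
    }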

Questions?

Next time...

Algorithms for Basic RA