
Checkpoint 1

CSE-4/562 Spring 2021

February 16, 2021

Garcia-Molina/Ullman/Widom: Ch. 16.1

News

  • Checkpoint 0 due tonight
  • Homework 1 will be posted tomorrow night

Checkpoint 1

sif$ scalac -cp build:Catalyzer.jar -d submission.jar {all .scala files}
sif$ ls data/
R.data      S.data     T.data
sif$ head -n 2 data/R.data
1|3|5
2|9|1
sif$ scala -cp submission.jar:Catalyzer.jar microbase.Microbase
$> CREATE TABLE R(A int, B int, C int) USING csv OPTIONS(path 'data/R.data', delimiter '|');
$> SELECT A, C FROM R WHERE B < 5;
1|5
...
$>

Checkpoint 1

  • Your code is compiled just the same as in Checkpoint 0.
  • Print a prompt '$>' at the start and after each command.
  • Read one command per line.
  • CREATE TABLE statements tell you the schema of each table.
  • Data lives in a '|'-separated file in 'data/[tablename].data'
  • Print query results '|'-separated
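
A minimal sketch of that command loop, assuming a hypothetical handleCommand helper that parses and evaluates one statement (exact prompt formatting is up to you):

  package microbase

  import scala.io.StdIn

  object Microbase {
    def main(args: Array[String]): Unit = {
      var running = true
      while (running) {
        print("$> ")
        val line = StdIn.readLine()
        if (line == null) { running = false }              // end of input
        else if (line.trim.nonEmpty) { handleCommand(line) }
      }
    }

    // Hypothetical: parse the statement, evaluate it, print '|'-separated rows.
    def handleCommand(sql: String): Unit = ???
  }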

Setup

Add this to your build.sbt (modify as appropriate for your IDE)

  resolvers += "MimirDB" at "https://maven.mimirdb.info/"
  libraryDependencies += "edu.buffalo.cse.odin" %% "catalyzer" % "3.0"
    

Docs: https://doc.odin.cse.buffalo.edu/catalyzer/

Code: https://gitlab.odin.cse.buffalo.edu/okennedy/catalyzer

CREATE TABLE PLAYERS(
  ID string, 
  FIRSTNAME string, 
  LASTNAME string, 
  FIRSTSEASON int, 
  LASTSEASON int, 
  WEIGHT int, 
  BIRTHDATE date
);

SELECT FIRSTNAME, LASTNAME, WEIGHT, BIRTHDATE 
FROM PLAYERS WHERE WEIGHT>200;

import org.apache.spark.sql.execution.SparkSqlParser
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

...
  val parser = new SparkSqlParser()

  def parseSql(sql: String): LogicalPlan = 
    parser.parsePlan(sql)
...
    

import org.apache.spark.sql.catalyst.plans.logical.CreateTableStatement
...
  plan match {
    case c:CreateTableStatement => 
      /* do something with c.tableName, c.tableSchema, c.location */
    case _ => 
      /* Interpret plan like a query */
  }
...
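
For the CREATE TABLE case, one option is to record the schema and data file path somewhere you can look them up later. A minimal sketch (the catalog map and handleCreateTable are hypothetical names; depending on how the statement is phrased the path may arrive in c.location or in the OPTIONS map, so falling back to the data/[tablename].data convention is a reasonable default):

  import scala.collection.mutable
  import org.apache.spark.sql.catalyst.plans.logical.CreateTableStatement
  import org.apache.spark.sql.types.StructType

  // Hypothetical in-memory catalog: table name -> (schema, path to its data file)
  val catalog = mutable.Map[String, (StructType, String)]()

  def handleCreateTable(c: CreateTableStatement): Unit = {
    val name = c.tableName.last                        // e.g., Seq("R") -> "R"
    val path = c.location.getOrElse(s"data/$name.data")
    catalog(name) = (c.tableSchema, path)
  }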
    
CREATE TABLE PLAYERS(
  ID string, 
  FIRSTNAME string, 
  LASTNAME string, 
  FIRSTSEASON int, 
  LASTSEASON int, 
  WEIGHT int, 
  BIRTHDATE date
) USING csv OPTIONS(path 'data/PLAYERS.data', delimiter '|');

There is a table named "PLAYERS"...

  • ... with 7 attributes
  • ... whose attributes have the given types
  • ... with data in the file "data/PLAYERS.data"

ABDELAL01|Alaa|Abdelnaby|1990|1994|240|1968-06-24
ABDULKA01|Kareem|Abdul-jabbar|1969|1988|225|1947-04-16
ABDULMA01|Mahmo|Abdul-rauf|1990|2000|162|1969-03-09
ABDULTA01|Tariq|Abdul-wahad|1997|2002|223|1974-11-03
ABDURSH01|Shareef|Abdur-rahim|1996|2007|225|1976-12-11
ABERNTO01|Tom|Abernethy|1976|1980|220|1954-05-06
ABRAMJO01|John|Abramovic|1946|1947|195|1919-02-09
ACKERAL01|Alex|Acker|2005|2008|185|1983-01-21
ACKERDO01|Donald|Ackerman|1953|1953|183|1930-09-04
ACRESMA01|Mark|Acres|1987|1992|220|1962-11-15
ACTONCH01|Charles|Acton|1967|1967|210|1942-01-11
...   

Example Queries

  1. SELECT A, B, ... FROM R (Project)
  2. SELECT A, B, ... FROM R WHERE ... (Project+Filter)
  3. SELECT A+B AS C, ... FROM R (Map)
  4. SELECT A+B AS C, ... FROM R WHERE ... (Map+Filter)
  5. SELECT SUM(A+B) AS C, ... FROM R (Aggregate)
  6. SELECT SUM(A+B) AS C, ... FROM R WHERE ... (Aggregate+Filter)

Spark's Workflow

  1. Analysis
  2. Optimization
  3. Physical Planning
  4. Code Generation
  5. Execution
[Pipeline diagram: SQL → CREATE TABLE → Saved Schema → .data → SELECT → LogicalPlan → Analyzed Plan → Iterators → Results]

Analysis

Resolution
  • Replace placeholder values from parsing.
  • "Wire up" attributes between operators.
Validation
  • Ensure all of the types line up.

Placeholders


  case class UnresolvedRelation(
    nameElements: Seq[String], 
    options: CaseInsensitiveStringMap, 
    isStreaming: Boolean
  )
    

Separation of concerns: The parser doesn't know what tables have been defined.

Try It


    println(
      parser.parsePlan("SELECT * FROM R").treeString
    )
  

'Project [*]
+- 'UnresolvedRelation [R], [], false
  

  Project(Seq(UnresolvedStar(None)), 
    UnresolvedRelation(Seq("R"), CaseInsensitiveStringMap.empty, false)
  )
    

The interesting thing here is the nameElements field
(Seq("R") above)

This is a sequence to handle multipart names
(e.g., source.table becomes Seq("source", "table"))

Replacing Placeholders


  import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
  ...
  plan.transform { 
    case UnresolvedRelation(nameElements, _, _) => ???
  }
    

By the way, ??? is valid Scala.
It means "I haven't implemented this yet".

So what goes there?

Suggested Approach


  import org.apache.spark.sql.catalyst.plans.logical.LeafNode
  import org.apache.spark.sql.catalyst.expressions.Attribute

  class Table( /* parameters */ ) extends LeafNode
  {
    def output: Seq[Attribute] = ???
  }
    

output is declared as a Seq[Attribute]; concretely, each element will usually be an AttributeReference.
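
A sketch of what such a Table node could look like, together with the transform that swaps it in for UnresolvedRelation. The Table fields and the catalog map (from the earlier CREATE TABLE sketch) are hypothetical; only the Spark classes are real:

  import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
  import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
  import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}
  import org.apache.spark.sql.types.StructType

  // Hypothetical leaf node: one per table registered by CREATE TABLE.
  case class Table(name: String, schema: StructType, path: String) extends LeafNode
  {
    // One AttributeReference per column, qualified by the table name.
    val output: Seq[Attribute] =
      schema.map { f =>
        AttributeReference(f.name, f.dataType, f.nullable)(qualifier = Seq(name))
      }
  }

  // Resolution step: replace every UnresolvedRelation with the matching Table.
  def resolveRelations(plan: LogicalPlan): LogicalPlan =
    plan.transform {
      case UnresolvedRelation(nameElements, _, _) =>
        val name = nameElements.last
        val (schema, path) = catalog(name)             // hypothetical catalog map
        Table(name, schema, path)
    }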

Expressions

Expression placeholders get resolved the same way:

  • org.apache.spark.sql.catalyst.analysis.UnresolvedStar
    → Seq(AttributeReference)
  • org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
    → AttributeReference
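
One way to wire up UnresolvedAttribute placeholders is to look them up by name in the children's output. A rough sketch (it ignores qualifiers and ambiguous names, which real resolution has to deal with):

  import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

  def resolveAttributes(plan: LogicalPlan): LogicalPlan =
    plan.transform { case operator =>
      val childAttrs = operator.children.flatMap { _.output }
      operator.transformExpressions {
        case UnresolvedAttribute(nameParts) =>
          childAttrs.find { _.name.equalsIgnoreCase(nameParts.last) }
                    .getOrElse { throw new RuntimeException(s"Unknown attribute $nameParts") }
        // UnresolvedStar is different: it expands to *many* expressions, so it
        // needs special handling in the operator that contains it (e.g., Project).
      }
    }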


  case class AttributeReference(
      name: String,
      dataType: DataType,
      nullable: Boolean = true,
      override val metadata: Metadata = Metadata.empty
    )(
      val exprId: ExprId = NamedExpression.newExprId,
      val qualifier: Seq[String] = Seq.empty[String]
    ) extends Attribute with Unevaluable { ... }
    

Simple Constructor: AttributeReference(name, dt)()

ExprId Constructor: AttributeReference(name, dt)(id)

exprId


    AttributeReference("a", IntegerType)().equals(
      AttributeReference("a", IntegerType)())
  

returns false

Spark uses exprId equivalence to check whether two attributes are the same.

a#1 $\neq$ a#2

exprId


    val id = NamedExpression.newExprId
    AttributeReference("a", IntegerType)(id).equals(
      AttributeReference("a", IntegerType)(id))
  

returns true

qualifiers


    AttributeReference("foo", IntegerType)(
      qualifier = Seq("bar")
    )
    

represents bar.foo

You don't need to use this, but Spark already uses it, and it helps during analysis.

Why Do Analysis?

  • Makes output work automagically on all existing LogicalPlan nodes.
  • Makes dataType work automagically on all existing Expression nodes.
  • Makes it easier to support eval on all existing Expression nodes.

Evaluating Expressions


  import org.apache.spark.sql.catalyst.InternalRow
  ...
  def eval(input: InternalRow): Any = ???  
    

Input: An InternalRow

Output: The result of evaluating the expression

eval is implemented for most existing Spark Expression nodes.
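
As a quick sanity check (not from the slides), an expression built purely from literals can be evaluated against an empty row:

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.{Add, Literal}

  // 1 + 2, evaluated with no input attributes needed.
  val three = Add(Literal(1), Literal(2)).eval(InternalRow.empty)   // == 3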


  case class AttributeReference(
      name: String,
      dataType: DataType,
      nullable: Boolean = true,
      override val metadata: Metadata = Metadata.empty
    )(
      val exprId: ExprId = NamedExpression.newExprId,
      val qualifier: Seq[String] = Seq.empty[String]
    ) extends Attribute with Unevaluable { ... }
    

Unevaluable, huh?


  abstract class InternalRow extends SpecializedGetters with Serializable {
    def getBoolean(ordinal: Int): Boolean
    def getByte(ordinal: Int): Byte
    def getShort(ordinal: Int): Short
    def getInt(ordinal: Int): Int
    def getLong(ordinal: Int): Long
    def getFloat(ordinal: Int): Float
    def getDouble(ordinal: Int): Double
    ...
  }
    

InternalRow is basically just an Array[Any]

AttributeReference doesn't know which position the attribute will be at.

Suggested Approach

Make your own RowLookup subclass of Expression

Why not use this class instead of AttributeReference in the first place?

Once we start optimizing, optimization rules (e.g., Projection Pushdown) can change an attribute's position.
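
One possible shape for that class (a sketch; the name and fields are placeholders, and mixing in CodegenFallback lets you skip code generation and just implement eval):

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.LeafExpression
  import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
  import org.apache.spark.sql.types.DataType

  // Hypothetical replacement for AttributeReference once positions are known:
  // "read column `position` of the incoming row".
  case class RowLookup(name: String, dataType: DataType, position: Int)
    extends LeafExpression with CodegenFallback
  {
    val nullable = true
    def eval(input: InternalRow): Any = input.get(position, dataType)
  }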

Evaluating LogicalPlan nodes

Separation of concerns: Look at each LogicalPlan node individually.

Naive approach: Compute the full result.


  import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}

  def eval(plan: LogicalPlan): Seq[InternalRow] =
    plan match { 
      case Project(targets, child) => 
        evalProject(targets, eval(child))
      case Filter(condition, child) => 
        evalFilter(condition, eval(child))
      ...
    }

  def evalProject(targets: Seq[Expression], 
                  table: Seq[InternalRow]): Seq[InternalRow] = ???
  def evalFilter(condition: Expression, 
                 table: Seq[InternalRow]): Seq[InternalRow] = ???
  ...
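
For example, evalProject can be a one-liner under this scheme (a sketch; it assumes each target expression has already been resolved, e.g., rewritten into positional lookups, so that eval works):

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.Expression

  // Evaluate every target expression against every input row.
  def evalProject(targets: Seq[Expression],
                  table: Seq[InternalRow]): Seq[InternalRow] =
    table.map { row => InternalRow.fromSeq(targets.map { _.eval(row) }) }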
    

Basic Mindset


  r = readCSVFile("R")
  s = readCSVFile("S")
  
  temp1 = evalCrossProduct(r, s)
  temp2 = evalFilter({R.B=S.B}, temp1)
  
  temp3 = evalFilter({S.C=10}, temp2)
  
  result = evalProject(Seq( {R.A} ), 
                       temp3)
    

Filter

$$\sigma_{A \neq 3} R$$

  A | B
  --+--
  1 | 2
  3 | 4
  5 | 6

Filter


  def evalFilter(condition: Expression, input: Seq[InternalRow]) =
      input.filter { row =>
        condition.eval(row).asInstanceOf[Boolean] 
      }
    

(All-At-Once)

Problem: A "table" can get very very big.

Better Idea: Iterators

  • hasNext(): Returns true if there are more rows to return
  • next(): Returns the next row
  • reset(): Resets the iterator back to the first row

All "functions" can be implemented as iterators that use constant memory
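
For instance, the all-at-once evalFilter from above needs almost no changes to stream: take and return an Iterator instead of a Seq (a sketch; Scala's built-in Iterator provides hasNext/next but not reset, so a resettable version would need to re-open its input):

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.Expression

  // Streaming filter: pulls one row at a time, so memory use stays constant.
  def filterIterator(condition: Expression,
                     input: Iterator[InternalRow]): Iterator[InternalRow] =
    input.filter { row => condition.eval(row).asInstanceOf[Boolean] }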

Select

$$\sigma_{A \neq 3} R$$

[Animation: repeated getNext() calls scan the input table (A, B) = (1, 2), (3, 4), (5, 6); rows (1, 2) and (5, 6) satisfy $A \neq 3$ and are returned, (3, 4) is skipped, and the final call returns None.]

Hint: Scala makes working with iterators very easy

Joins

Example: Join (Naive)


      for r in R:
        for s in S:
          emit(merge(r, s))
    

Project challenge: Implement this as an iterator

'|'-separated Value File Suggestions

  • Use scala.io.Source's getLines() method to read the file lazily (see the sketch after the type table below)
  • Use String's split() to separate fields.
  • Parse everything upfront
  • Use InternalRow.fromSeq to create rows.
  • For Codd's sake, don't store entire tables in memory

  SQL Type   Spark Type    Scala Type
  string     StringType    UTF8String*
  int        IntegerType   Integer
  float      FloatType     Float
  decimal    DoubleType    Double
  date       DateType      java.sql.Date

  * it's org.apache.spark.unsafe.types.UTF8String
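
Putting those suggestions together, a table scan might look roughly like this (a sketch; scanTable is a placeholder name, only the string and int conversions from the table above are handled, and the path/schema are whatever you recorded at CREATE TABLE time):

  import scala.io.Source
  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
  import org.apache.spark.unsafe.types.UTF8String

  // Stream one InternalRow per line of a '|'-separated data file.
  def scanTable(path: String, schema: StructType): Iterator[InternalRow] =
    Source.fromFile(path).getLines.map { line =>
      val fields = line.split('|')
      InternalRow.fromSeq(
        fields.toSeq.zip(schema.map { _.dataType }).map {
          case (raw, StringType)  => UTF8String.fromString(raw)
          case (raw, IntegerType) => raw.toInt
          case (raw, t) => throw new RuntimeException(s"Unsupported type: $t")
        }
      )
    }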

Questions?

Next time...

Algorithms for Basic RA