- Relational Algebra

Relational Algebra

February 9, 2021

Garcia-Molina/Ullman/Widom: Ch. 2.4, 5.1

The running theme...

Replace [Thing A] with better, but equivalent [Thing B].

  1. How can we tell if [Thing B] is equivalent to [Thing A]?
  2. How can we tell if [Thing B] is better than [Thing A]?

Replace [Query A] with better, but equivalent [Query B].

  1. How can we tell if [Query B] is equivalent to [Query A]?
  2. How can we tell if [Query B] is better than [Query A]?

... but first let's remind ourselves of few definitions.

Relational Data

Relation (Table)
A collection of Tuples of Primitive Values
All tuples have the same set of attributes, or schema

Relational Data

Types of relations...

Set

SPC_COMMONBORONAME
'honeylocust''Brooklyn'
'American linden''Brooklyn'
'London planetree''Brooklyn'
  
  

(Unique Only)

Bag

SPC_COMMONBORONAME
'honeylocust''Brooklyn'
'London planetree''Brooklyn'
'American linden''Brooklyn'
'London planetree''Brooklyn'
'honeylocust''Brooklyn'

List

SPC_COMMONBORONAME
'American linden''Brooklyn'
'honeylocust''Brooklyn'
'honeylocust''Brooklyn'
'London planetree''Brooklyn'
'London planetree''Brooklyn'

(Order Matters)

Declarative Languages

DeclarativeImperative
Say what you want Say how you want to get it
"Get me the TPS Reports"
Look at every T report 
For each week:
  Sum up the sprocket count
  Find that week's S report
  ... 
SQL, RA, R, … C, Scala, Java, Python, …

Declarative languages make it easier to explore equivalent computations to find the best one.

SQL


                      SELECT R.A FROM R, S 
                      WHERE R.B = S.B AND S.C = 10
    

Python


          with open_csv("R.csv") as R:
            with open_csv("S.csv") as S:
              for r_row in R:
                for s_row in S:
                  if r_row[1] == s_row[0] and s_row[1] == "10":
                    print(r_row[0])
    

Declarative Queries

SQL
Readability: Focus on what and where from
Relational Algebra
Data Flow: Focus on transformations
Relational Calculus
Minimality: Focus on attribute relationships

Relational Algebra

$$\pi_{R.A}(\sigma_{S.C=10}(\bowtie_{R.B = S.B}(\textbf{R}, \textbf{S})))$$

Relational Calculus

$$Q(A) :- R(A, B), S(B, 10)$$

Declarative Queries

SQL
Readability: Focus on what and where from
Relational Algebra
Data Flow: Focus on transformations
Relational Calculus
Minimality: Focus on attribute relationships

Preliminaries

We start with a database instance with a fixed schema

Queries are applied to Relations $$Q(\textbf{Trees}, \textbf{SpeciesInfo})$$

Queries are also Relations! $$Q_2(\textbf{SpeciesInfo}, Q_1(\textbf{Trees}))$$ (Relational Algebra is Closed)

Relational Algebra

OperationSymMeaning
Selection$\sigma$Select a subset of the input rows
Projection$\pi$Delete unwanted columns
Cross-product$\times$Combine two relations
Set-difference$-$Tuples in Rel 1, but not Rel 2
Union$\cup$Tuples either in Rel 1 or in Rel 2

Also: Intersection, Join, Division, Renaming

(Not essential, but can be useful)

Input Query Language Output
Sets of Tuples $\rightarrow$ Set Relational Algebra $\rightarrow$ Set of Tuples
Bags of Tuples $\rightarrow$ Bag Relational Algebra $\rightarrow$ Bag of Tuples
Lists of Tuples $\rightarrow$ Extended Relational Algebra $\rightarrow$ List of Tuples

First we focus on sets and bags.

Selection ($\sigma_{c}$)

Delete rows that fail the condition $c$.

$$\sigma_{(\textbf{BORONAME} = \texttt{'Brooklyn'})} \textbf{Trees}$$
TREE_IDSPC_COMMONBORONAME...
204026'honeylocust''Brooklyn'...
204337'honeylocust''Brooklyn'...
189565'American linden''Brooklyn'...
192755'London planetree''Brooklyn'...
189465'London planetree''Brooklyn'...
... and 177287 more

          SELECT * FROM Trees [[ WHERE BORONAME = 'Brooklyn' ]]
      

Projection ($\pi_{A}$)

Delete attributes not in the projection list $A$.

$$\pi_{\textbf{BORONAME}}(\textbf{Trees})$$
BORONAME
Queens
Brooklyn
Manhattan
Bronx
Staten Island

Only 5 results... not 683788?

Set and Bag Projection are different


              SELECT [[ DISTINCT BORONAME ]] FROM Trees;
      

Reminder: Queries are Relations

What are these queries' schemas?

$$\pi_{\textit{TREEID},\ \textit{SPC_COMMON},\ \textit{BORONAME}} \textbf{Trees}$$
$$\sigma_{(\textit{BORONAME} = \texttt{'Brooklyn'})} \textbf{Trees}$$
$$\sigma_{(\textit{BORONAME} = \texttt{'Brooklyn'})}(\pi_{\textit{TREEID},\ \textit{SPC_COMMON},\ \textit{BORONAME}} \textbf{Trees})$$

Union ($\cup$)

Takes two relations that are union-compatible...

(Both relations have the same number of fields with the same types)

... and returns all tuples appearing in either relation

$$(\sigma_{(BORONAME=\texttt{'Brooklyn'})} \textbf{Trees}) \cup (\sigma_{(BORONAME=\texttt{'Manhattan'})} \textbf{Trees})$$

We use $\uplus$ if we explicitly mean bag union

Intersection ($\cap$)

Return all tuples appearing in both
of two union-compatible relations

$$(\pi_{SPC\_COMMON} ( \sigma_{(BORONAME=\texttt{'Brooklyn'})} \textbf{Trees}))\\ ~~~~~~~~~\cap (\pi_{SPC\_COMMON} (\sigma_{(BORONAME=\texttt{'Manhattan'})} \textbf{Trees}))$$

What is this query asking?

Set Difference

Return all tuples appearing in the first, but not the second
of two union-compatible relations

$$(\pi_{SPC\_COMMON} (\sigma_{(BORONAME=\texttt{'Brooklyn'})} \textbf{Trees})) \\ ~~~~~~~~~- (\sigma_{(BORONAME=\texttt{'Brooklyn'})} (\sigma_{(BORONAME=\texttt{'Manhattan'})} \textbf{Trees}))$$

What is this query asking?

Union, Intersection, Set Difference

What is the schema of the result of any of these operators?

Cross (Cartesian) Product ($\times$)

Create all pairs of tuples.

$$\pi_{SPC\_COMMON, BORONAME} (\textbf{Trees}) \times \pi_{SPC\_COMMON, AVG\_HEIGHT} (\textbf{TreeInfo})$$

TreeInfo

SPC_COMMONAVG_HEIGHT
cedar elm60
lacebark elm45
... and more
SPC_COMMONBORONAMESPC_COMMONAVG_HEIGHT
honeylocustBrooklyncedar elm60
honeylocustBrooklyncedar elm60
American lindenBrooklyncedar elm60
London planetreeManhattancedar elm60
London planetreeManhattancedar elm60
...
honeylocustBrooklynlacebark elm45
honeylocustBrooklynlacebark elm45
American lindenBrooklynlacebark elm45
London planetreeManhattanlacebark elm45
London planetreeManhattanlacebark elm45
... and more

Cross (Cartesian) Product ($\times$)

$$\pi_{SPC\_COMMON, BORONAME} (\textbf{Trees}) \times \pi_{SPC\_COMMON, AVG\_HEIGHT} (\textbf{TreeInfo})$$

What is the schema of the resulting relation?

The relation has a naming conflict
(two attributes with the same name)

Renaming ($\rho$)

$$\rho_{TNAME,\ BORO,\ INAME,\ HEIGHT}\left( \pi_{SPC\_COMMON,\ BORONAME} (\textbf{Trees}) \times \pi_{SPC\_COMMON,\ AVG\_HEIGHT} (\textbf{TreeInfo})\right)$$

What is the schema of the resulting relation?

When writing cross-products,
I will use implicit renaming

Join ($\bowtie_c$)

Pair tuples according to a condition c.

$$\pi_{SPC\_COMMON,\ BORONAME} (\textbf{Trees}) \bowtie_{T.SPC\_COMMON = TI.SPC\_COMMON} \\\pi_{SPC\_COMMON,\ AVG\_HEIGHT} (\textbf{TreeInfo})$$
Identical to... $$\sigma_{T.SPC\_COMMON = TI.SPC\_COMMON}\left(\pi_{SPC\_COMMON,\ BORONAME} (\textbf{Trees}) \times \pi_{SPC\_COMMON,\ AVG\_HEIGHT} (\textbf{TreeInfo})\right)$$
$$R \bowtie_c S \equiv \sigma_c(R \times S)$$

Join Shorthands

Equi-joins are joins with only equality tests in the condition.

Join on attribute(s)
$R \bowtie_{A} S \equiv R \bowtie_{R.A = S.A} S$
Same values on the listed attributes
Natural Join
$R \bowtie S \equiv R \bowtie_{attrs(R) \cap attrs(S)} S$
Same values on all shared attributes

Join Shorthands

Formally: The output schema of a natural or equi-join still has both copies of each attribute.

Informally: I'll write things on the board assuming only one copy passes the join.

Which operators can create duplicates?

(Which operators behave differently in Set- and Bag-RA?)

Operator Symbol Creates Duplicates?
Selection $\sigma$ No
Projection $\pi$ Yes
Cross-product $\times$ No
Set-difference$-$ No
Union $\cup$ Yes
Join $\bowtie$No

Group Work

Using Set-Relational Algebra, find the BORONAMEs of all boroughs that do have tree species with an average height of below 45 inches

TreeInfo

SPC_COMMONAVG_HEIGHT
cedar elm60
lacebark elm45
... and more

Trees

SPC_COMMONBORONAME
'honeylocust''Brooklyn'
'American linden''Brooklyn'
'London planetree''Manhattan'
... and more

Respond on Sli.Do

$$\pi_{BORONAME}(\sigma_{AVG\_HEIGHT < 45}(\textbf{Trees}\bowtie\textbf{TreeInfo}))$$
$$\pi_{BORONAME}(\textbf{Trees}\bowtie\sigma_{AVG\_HEIGHT < 45}(\textbf{TreeInfo}))$$

Which of these queries is "faster"?

Division ($/$)

Not typically supported as a primitive operator,
but useful for expressing queries like:

Find species that appear in all boroughs

$$\pi_{BORONAME,\ SPC\_COMMON}(\textbf{Trees}) \;\;/\;\;\pi_{SPC\_COMMON}(\textbf{Trees})$$ (using set relational algebra)

$$R / S \equiv \{\; \left<\vec t\right> \;|\; \forall \left<\vec s\right> \in S, \left< \vec t \vec s \right> \in R \;\}$$

Division ($/$)

BORO SPC_COMMON
Brooklyn honeylocust
Brooklyn American linden
Brooklyn London planetree
Manhattan honeylocust
Manhattan American linden
Manhattan pin oak
Queens honeylocust
Queens American linden
Bronx honeylocust
/
SPC_COMMON
honeylocust
=
BORO
Brooklyn
Manhattan
Queens
Bronx
/
SPC_COMMON
honeylocust
American linden
=
BORO
Brooklyn
Manhattan
Queens
/
SPC_COMMON
honeylocust
American linden
pin oak
=
BORO
Manhattan

Group Work

If time permits: Implement division using other operators.

Abstract Syntax Trees (ASTs)

A tree-shaped representation of a computation.

RA is a model for SQL ASTs

Spark ASTs

org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
"Relational Algebra" (plus things)

SparkSQL uses a combination of Bag and Extended RA.

Expression


  abstract class Expression extends TreeNode[Expression] {
    ...
    def nullable: Boolean
    ...
    def references: AttributeSet = ...
    ...
    def eval(input: InternalRow = null): Any
    ...
    def dataType: DataType
    ...
  }
    

More on Expressions later, but...

Spark on GitHub: Expression.scala

Expression


  abstract class Expression extends TreeNode[Expression] {
    

↑↑
TreeNode[Expression]


  abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
    ...
    def children: Seq[BaseType]
    ...
    def foreach(f: BaseType => Unit): Unit
    ...
    def map[A](f: BaseType => A): Seq[A]
    ...
    def transform(rule: PartialFunction[BaseType, BaseType]): BaseType
    ...
    final def treeString: String = treeString(verbose = true)
    ...
  }
    
Spark on GitHub: TreeNode.scala

PartialFunction[A,B]


    { case Literal(y: Int, IntegerType) => Literal(y + 1) }
    

A Partial Function is a normal lambda paired with a match

"Unmatched" cases should be handled gracefully.

With tree.transform on an Expression


  { x => x match {
      case Literal(y: Int, IntegerType) => Literal(y + 1) 
      case _                            => x
  } }
    

  myExpression.transform {
    case Literal(y: Int, IntegerType) => Literal(y + 1) 
  }
    

What does this do?

Replace every integer in the tree with the same integer + 1

Digression over...
How does Spark encode Relational Algebra?

Logical Plans

Relational Algebra Nodes
Case Classes for each RA operator.
More than RA
Query support for JSON, ML Models, more.
SQL DDL/DML
e.g., CREATE TABLE (To make the parser simpler?)

LogicalPlan


  abstract class LogicalPlan
    extends QueryPlan[LogicalPlan]
    with AnalysisHelper
    with LogicalPlanStats
    with QueryPlanConstraints
    with Logging {
    ...
  }
    

Most useful tools live in QueryPlan

Spark on GitHub: LogicalPlan.scala

QueryPlan


  abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
    extends TreeNode[PlanType] with SQLConfHelper {
    ...
    def output: Seq[Attribute]
    ...
    def transformExpressions(rule: PartialFunction[Expression, Expression]): ...
    ...
    def transformAllExpressions(rule: PartialFunction[Expression, Expression]): ...
    ...
    lazy val schema: StructType = StructType.fromAttributes(output)
    ...
  }
    
Spark on GitHub: QueryPlan.scala

Relational Algebra in Spark


  case class Project(projectList: Seq[NamedExpression], child: LogicalPlan)

  case class Filter(condition: Expression, child: LogicalPlan)

  case class Union(children: Seq[LogicalPlan], byName: Boolean = false, 
                   allowMissingCol: Boolean = false)

  case class Join(left: LogicalPlan, right: LogicalPlan, joinType: JoinType, 
                  condition: Option[Expression], hint: JoinHint)
    

Note: Spark's union is a Bag-union.

Spark on GitHub: basicLogicalOperators.scala

Relational Algebra in Spark


    /**
     * Holds the name of a relation that has yet to be looked up in a catalog.
     *
     * @param multipartIdentifier table name
     * @param options options to scan this relation. Only applicable to v2 table scan.
     */
    case class UnresolvedRelation(
        multipartIdentifier: Seq[String],
        options: CaseInsensitiveStringMap = CaseInsensitiveStringMap.empty(),
        override val isStreaming: Boolean = false)
    extends LeafNode with NamedRelation {
    
Spark on GitHub: analysis/unresolved.scala

QueryPlan


  abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
    extends TreeNode[PlanType] with SQLConfHelper {
    ...
    def output: Seq[Attribute]
    ...
    def transformExpressions(rule: PartialFunction[Expression, Expression]): ...
    ...
    def transformAllExpressions(rule: PartialFunction[Expression, Expression]): ...
    ...
    lazy val schema: StructType = StructType.fromAttributes(output)
    ...
  }
    

Challenge: Remove all projection operators from a LogicalPlan

Spark on GitHub: QueryPlan.scala

  plan.transform { 
    case Project(_, child) => child
  }
    

Aside: Dataframe/Dataset is a thin wrapper around LogicalPlan.

class Dataset[T] private[sql](...) extends Serializable {
    ...
    /**
     * Filters rows using the given condition.
     * {{{
     *   // The following are equivalent:
     *   peopleDs.filter($"age" > 15)
     *   peopleDs.where($"age" > 15)
     * }}}
     *
     * @group typedrel
     * @since 1.6.0
     */
    def filter(condition: Column): Dataset[T] = withTypedPlan {
      Filter(condition.expr, logicalPlan)
    }
    ...
  }
    
Spark on GitHub: Dataset.scala

Relational Algebra

A simple way to think about and work with
computations over collections.

… simple → easy to evaluate

… simple → easy to optimize

Next time, Optimizing RA