- Relational Algebra

Relational Algebra

February 9, 2021

Garcia-Molina/Ullman/Widom: Ch. 2.4, 5.1

The running theme...

Replace [Thing A] with better, but equivalent [Thing B].

How can we tell if [Thing B] is equivalent to [Thing A]?
How can we tell if [Thing B] is better than [Thing A]?

Replace [Query A] with better, but equivalent [Query B].

How can we tell if [Query B] is equivalent to [Query A]?
How can we tell if [Query B] is better than [Query A]?

... but first let's remind ourselves of few definitions.

Relational Data

Relation (Table): A collection of Tuples of Primitive Values; All tuples have the same set of attributes, or schema

Relational Data

Types of relations...

Set

SPC_COMMON	BORONAME
'honeylocust'	'Brooklyn'
'American linden'	'Brooklyn'
'London planetree'	'Brooklyn'

(Unique Only)

Bag

SPC_COMMON	BORONAME
'honeylocust'	'Brooklyn'
'London planetree'	'Brooklyn'
'American linden'	'Brooklyn'
'London planetree'	'Brooklyn'
'honeylocust'	'Brooklyn'

List

SPC_COMMON	BORONAME
'American linden'	'Brooklyn'
'honeylocust'	'Brooklyn'
'honeylocust'	'Brooklyn'
'London planetree'	'Brooklyn'
'London planetree'	'Brooklyn'

(Order Matters)

Declarative Languages

Declarative	Imperative
Say what you want	Say how you want to get it
"Get me the TPS Reports"	`Look at every T report For each week: Sum up the sprocket count Find that week's S report ...`
SQL, RA, R, …	C, Scala, Java, Python, …

Declarative languages make it easier to explore equivalent computations to find the best one.

SQL


                      SELECT R.A FROM R, S 
                      WHERE R.B = S.B AND S.C = 10

Python


          with open_csv("R.csv") as R:
            with open_csv("S.csv") as S:
              for r_row in R:
                for s_row in S:
                  if r_row[1] == s_row[0] and s_row[1] == "10":
                    print(r_row[0])

Declarative Queries

SQL: Readability: Focus on what and where from
Relational Algebra: Data Flow: Focus on transformations
Relational Calculus: Minimality: Focus on attribute relationships

Relational Algebra

$$\pi_{R.A}(\sigma_{S.C=10}(\bowtie_{R.B = S.B}(\textbf{R}, \textbf{S})))$$

Relational Calculus

$$Q(A) :- R(A, B), S(B, 10)$$

Declarative Queries

SQL: Readability: Focus on what and where from
Relational Algebra: Data Flow: Focus on transformations
Relational Calculus: Minimality: Focus on attribute relationships

Preliminaries

We start with a database instance with a fixed schema

Queries are applied to Relations $$Q(\textbf{Trees}, \textbf{SpeciesInfo})$$

Queries are also Relations! $$Q_2(\textbf{SpeciesInfo}, Q_1(\textbf{Trees}))$$ (Relational Algebra is Closed)

Relational Algebra

Operation	Sym	Meaning
Selection	$\sigma$	Select a subset of the input rows
Projection	$\pi$	Delete unwanted columns
Cross-product	$\times$	Combine two relations
Set-difference	$-$	Tuples in Rel 1, but not Rel 2
Union	$\cup$	Tuples either in Rel 1 or in Rel 2

Also: Intersection, Join, Division, Renaming

(Not essential, but can be useful)

Input		Query Language		Output
Sets of Tuples	$\rightarrow$	Set Relational Algebra	$\rightarrow$	Set of Tuples
Bags of Tuples	$\rightarrow$	Bag Relational Algebra	$\rightarrow$	Bag of Tuples
Lists of Tuples	$\rightarrow$	Extended Relational Algebra	$\rightarrow$	List of Tuples

First we focus on sets and bags.

Selection ($\sigma_{c}$)

Delete rows that fail the condition $c$.

$$\sigma_{(\textbf{BORONAME} = \texttt{'Brooklyn'})} \textbf{Trees}$$

TREE_ID	SPC_COMMON	BORONAME	...
204026	'honeylocust'	'Brooklyn'	...
204337	'honeylocust'	'Brooklyn'	...
189565	'American linden'	'Brooklyn'	...
192755	'London planetree'	'Brooklyn'	...
189465	'London planetree'	'Brooklyn'	...
... and 177287 more


          SELECT * FROM Trees [[ WHERE BORONAME = 'Brooklyn' ]]

Projection ($\pi_{A}$)

Delete attributes not in the projection list $A$.

$$\pi_{\textbf{BORONAME}}(\textbf{Trees})$$

BORONAME
Queens
Brooklyn
Manhattan
Bronx
Staten Island

Only 5 results... not 683788?

Set and Bag Projection are different


              SELECT [[ DISTINCT BORONAME ]] FROM Trees;

Reminder: Queries are Relations

What are these queries' schemas?

$$\pi_{\textit{TREEID},\ \textit{SPC_COMMON},\ \textit{BORONAME}} \textbf{Trees}$$

$$\sigma_{(\textit{BORONAME} = \texttt{'Brooklyn'})} \textbf{Trees}$$

$$\sigma_{(\textit{BORONAME} = \texttt{'Brooklyn'})}(\pi_{\textit{TREEID},\ \textit{SPC_COMMON},\ \textit{BORONAME}} \textbf{Trees})$$

Union ($\cup$)

Takes two relations that are union-compatible...

(Both relations have the same number of fields with the same types)

... and returns all tuples appearing in either relation

$$(\sigma_{(BORONAME=\texttt{'Brooklyn'})} \textbf{Trees}) \cup (\sigma_{(BORONAME=\texttt{'Manhattan'})} \textbf{Trees})$$

We use $\uplus$ if we explicitly mean bag union

Intersection ($\cap$)

Return all tuples appearing in both
of two union-compatible relations

$$(\pi_{SPC\_COMMON} ( \sigma_{(BORONAME=\texttt{'Brooklyn'})} \textbf{Trees}))\\ ~~~~~~~~~\cap (\pi_{SPC\_COMMON} (\sigma_{(BORONAME=\texttt{'Manhattan'})} \textbf{Trees}))$$

What is this query asking?

Set Difference

Return all tuples appearing in the first, but not the second
of two union-compatible relations

$$(\pi_{SPC\_COMMON} (\sigma_{(BORONAME=\texttt{'Brooklyn'})} \textbf{Trees})) \\ ~~~~~~~~~- (\sigma_{(BORONAME=\texttt{'Brooklyn'})} (\sigma_{(BORONAME=\texttt{'Manhattan'})} \textbf{Trees}))$$

What is this query asking?

Union, Intersection, Set Difference

What is the schema of the result of any of these operators?

Cross (Cartesian) Product ($\times$)

Create all pairs of tuples.

$$\pi_{SPC\_COMMON, BORONAME} (\textbf{Trees}) \times \pi_{SPC\_COMMON, AVG\_HEIGHT} (\textbf{TreeInfo})$$

TreeInfo

SPC_COMMON	AVG_HEIGHT
cedar elm	60
lacebark elm	45
... and more

SPC_COMMON	BORONAME	SPC_COMMON	AVG_HEIGHT
honeylocust	Brooklyn	cedar elm	60
honeylocust	Brooklyn	cedar elm	60
American linden	Brooklyn	cedar elm	60
London planetree	Manhattan	cedar elm	60
London planetree	Manhattan	cedar elm	60
...
honeylocust	Brooklyn	lacebark elm	45
honeylocust	Brooklyn	lacebark elm	45
American linden	Brooklyn	lacebark elm	45
London planetree	Manhattan	lacebark elm	45
London planetree	Manhattan	lacebark elm	45
... and more

Cross (Cartesian) Product ($\times$)

$$\pi_{SPC\_COMMON, BORONAME} (\textbf{Trees}) \times \pi_{SPC\_COMMON, AVG\_HEIGHT} (\textbf{TreeInfo})$$

What is the schema of the resulting relation?

The relation has a naming conflict
(two attributes with the same name)

Renaming ($\rho$)

$$\rho_{TNAME,\ BORO,\ INAME,\ HEIGHT}\left( \pi_{SPC\_COMMON,\ BORONAME} (\textbf{Trees}) \times \pi_{SPC\_COMMON,\ AVG\_HEIGHT} (\textbf{TreeInfo})\right)$$

What is the schema of the resulting relation?

When writing cross-products,
I will use implicit renaming

Join ($\bowtie_c$)

Pair tuples according to a condition c.

$$\pi_{SPC\_COMMON,\ BORONAME} (\textbf{Trees}) \bowtie_{T.SPC\_COMMON = TI.SPC\_COMMON} \\\pi_{SPC\_COMMON,\ AVG\_HEIGHT} (\textbf{TreeInfo})$$

Identical to... $$\sigma_{T.SPC\_COMMON = TI.SPC\_COMMON}\left(\pi_{SPC\_COMMON,\ BORONAME} (\textbf{Trees}) \times \pi_{SPC\_COMMON,\ AVG\_HEIGHT} (\textbf{TreeInfo})\right)$$

$$R \bowtie_c S \equiv \sigma_c(R \times S)$$

Join Shorthands

Equi-joins are joins with only equality tests in the condition.

Join on attribute(s): $R \bowtie_{A} S \equiv R \bowtie_{R.A = S.A} S$; Same values on the listed attributes
Natural Join: $R \bowtie S \equiv R \bowtie_{attrs(R) \cap attrs(S)} S$; Same values on all shared attributes

Join Shorthands

Formally: The output schema of a natural or equi-join still has both copies of each attribute.

Informally: I'll write things on the board assuming only one copy passes the join.

Which operators can create duplicates?

(Which operators behave differently in Set- and Bag-RA?)

Operator	Symbol	Creates Duplicates?
Selection	$\sigma$	No
Projection	$\pi$	Yes
Cross-product	$\times$	No
Set-difference	$-$	No
Union	$\cup$	Yes
Join	$\bowtie$	No

Group Work

Using Set-Relational Algebra, find the BORONAMEs of all boroughs that do have tree species with an average height of below 45 inches

TreeInfo

SPC_COMMON	AVG_HEIGHT
cedar elm	60
lacebark elm	45
... and more

Trees

SPC_COMMON	BORONAME
'honeylocust'	'Brooklyn'
'American linden'	'Brooklyn'
'London planetree'	'Manhattan'
... and more

Respond on Sli.Do

$$\pi_{BORONAME}(\sigma_{AVG\_HEIGHT < 45}(\textbf{Trees}\bowtie\textbf{TreeInfo}))$$

$$\pi_{BORONAME}(\textbf{Trees}\bowtie\sigma_{AVG\_HEIGHT < 45}(\textbf{TreeInfo}))$$

Which of these queries is "faster"?

Division ($/$)

Not typically supported as a primitive operator,
but useful for expressing queries like:

Find species that appear in all boroughs

$$\pi_{BORONAME,\ SPC\_COMMON}(\textbf{Trees}) \;\;/\;\;\pi_{SPC\_COMMON}(\textbf{Trees})$$ (using set relational algebra)

$$R / S \equiv \{\; \left<\vec t\right> \;|\; \forall \left<\vec s\right> \in S, \left< \vec t \vec s \right> \in R \;\}$$

Division ($/$)

BORO	SPC_COMMON
Brooklyn	honeylocust
Brooklyn	American linden
Brooklyn	London planetree
Manhattan	honeylocust
Manhattan	American linden
Manhattan	pin oak
Queens	honeylocust
Queens	American linden
Bronx	honeylocust

SPC_COMMON
honeylocust

BORO
Brooklyn
Manhattan
Queens
Bronx

SPC_COMMON
honeylocust
American linden

BORO
Brooklyn
Manhattan
Queens

SPC_COMMON
honeylocust
American linden
pin oak

BORO
Manhattan

Group Work

If time permits: Implement division using other operators.

Abstract Syntax Trees (ASTs)

A tree-shaped representation of a computation.

RA is a model for SQL ASTs

Spark ASTs

org.apache.spark.sql.catalyst.expressions.Expression: Primitive-valued expressions
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan: "Relational Algebra" (plus things)

SparkSQL uses a combination of Bag and Extended RA.

Expression


  abstract class Expression extends TreeNode[Expression] {
    ...
    def nullable: Boolean
    ...
    def references: AttributeSet = ...
    ...
    def eval(input: InternalRow = null): Any
    ...
    def dataType: DataType
    ...
  }

Expression


  abstract class Expression extends TreeNode[Expression] {

↑↑
TreeNode[Expression]


  abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
    ...
    def children: Seq[BaseType]
    ...
    def foreach(f: BaseType => Unit): Unit
    ...
    def map[A](f: BaseType => A): Seq[A]
    ...
    def transform(rule: PartialFunction[BaseType, BaseType]): BaseType
    ...
    final def treeString: String = treeString(verbose = true)
    ...
  }

Spark on GitHub: TreeNode.scala

PartialFunction[A,B]


    { case Literal(y: Int, IntegerType) => Literal(y + 1) }

A Partial Function is a normal lambda paired with a match

"Unmatched" cases should be handled gracefully.

With tree.transform on an Expression


  { x => x match {
      case Literal(y: Int, IntegerType) => Literal(y + 1) 
      case _                            => x
  } }


  myExpression.transform {
    case Literal(y: Int, IntegerType) => Literal(y + 1) 
  }

What does this do?

Replace every integer in the tree with the same integer + 1

Digression over...
How does Spark encode Relational Algebra?

Logical Plans

Relational Algebra Nodes: Case Classes for each RA operator.
More than RA: Query support for JSON, ML Models, more.
SQL DDL/DML: e.g., CREATE TABLE (To make the parser simpler?)

LogicalPlan


  abstract class LogicalPlan
    extends QueryPlan[LogicalPlan]
    with AnalysisHelper
    with LogicalPlanStats
    with QueryPlanConstraints
    with Logging {
    ...
  }

Most useful tools live in QueryPlan

Spark on GitHub: LogicalPlan.scala

QueryPlan


  abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
    extends TreeNode[PlanType] with SQLConfHelper {
    ...
    def output: Seq[Attribute]
    ...
    def transformExpressions(rule: PartialFunction[Expression, Expression]): ...
    ...
    def transformAllExpressions(rule: PartialFunction[Expression, Expression]): ...
    ...
    lazy val schema: StructType = StructType.fromAttributes(output)
    ...
  }

Spark on GitHub: QueryPlan.scala

Relational Algebra in Spark


  case class Project(projectList: Seq[NamedExpression], child: LogicalPlan)

  case class Filter(condition: Expression, child: LogicalPlan)

  case class Union(children: Seq[LogicalPlan], byName: Boolean = false, 
                   allowMissingCol: Boolean = false)

  case class Join(left: LogicalPlan, right: LogicalPlan, joinType: JoinType, 
                  condition: Option[Expression], hint: JoinHint)

Note: Spark's union is a Bag-union.

Spark on GitHub: basicLogicalOperators.scala

Relational Algebra in Spark


    /**
     * Holds the name of a relation that has yet to be looked up in a catalog.
     *
     * @param multipartIdentifier table name
     * @param options options to scan this relation. Only applicable to v2 table scan.
     */
    case class UnresolvedRelation(
        multipartIdentifier: Seq[String],
        options: CaseInsensitiveStringMap = CaseInsensitiveStringMap.empty(),
        override val isStreaming: Boolean = false)
    extends LeafNode with NamedRelation {

Spark on GitHub: analysis/unresolved.scala

QueryPlan


  abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
    extends TreeNode[PlanType] with SQLConfHelper {
    ...
    def output: Seq[Attribute]
    ...
    def transformExpressions(rule: PartialFunction[Expression, Expression]): ...
    ...
    def transformAllExpressions(rule: PartialFunction[Expression, Expression]): ...
    ...
    lazy val schema: StructType = StructType.fromAttributes(output)
    ...
  }

Challenge: Remove all projection operators from a LogicalPlan

Spark on GitHub: QueryPlan.scala


  plan.transform { 
    case Project(_, child) => child
  }

Aside: Dataframe/Dataset is a thin wrapper around LogicalPlan.

class Dataset[T] private[sql](...) extends Serializable {
    ...
    /**
     * Filters rows using the given condition.
     * {{{
     *   // The following are equivalent:
     *   peopleDs.filter($"age" > 15)
     *   peopleDs.where($"age" > 15)
     * }}}
     *
     * @group typedrel
     * @since 1.6.0
     */
    def filter(condition: Column): Dataset[T] = withTypedPlan {
      Filter(condition.expr, logicalPlan)
    }
    ...
  }

Spark on GitHub: Dataset.scala

Relational Algebra

A simple way to think about and work with
computations over collections.

… simple → easy to evaluate

… simple → easy to optimize

Next time, Optimizing RA