February 9, 2021
Replace [Thing A] with better, but equivalent [Thing B].
Replace [Query A] with better, but equivalent [Query B].
... but first let's remind ourselves of few definitions.
Set
(Unique Only) |
Bag
|
List
(Order Matters) |
Declarative | Imperative |
---|---|
Say what you want | Say how you want to get it |
"Get me the TPS Reports" |
|
SQL, RA, R, … | C, Scala, Java, Python, … |
Declarative languages make it easier to explore equivalent computations to find the best one.
SELECT R.A FROM R, S
WHERE R.B = S.B AND S.C = 10
with open_csv("R.csv") as R:
with open_csv("S.csv") as S:
for r_row in R:
for s_row in S:
if r_row[1] == s_row[0] and s_row[1] == "10":
print(r_row[0])
$$Q(A) :- R(A, B), S(B, 10)$$
We start with a database instance with a fixed schema
Queries are applied to Relations $$Q(\textbf{Trees}, \textbf{SpeciesInfo})$$
Queries are also Relations! $$Q_2(\textbf{SpeciesInfo}, Q_1(\textbf{Trees}))$$ (Relational Algebra is Closed)
Operation | Sym | Meaning |
---|---|---|
Selection | $\sigma$ | Select a subset of the input rows |
Projection | $\pi$ | Delete unwanted columns |
Cross-product | $\times$ | Combine two relations |
Set-difference | $-$ | Tuples in Rel 1, but not Rel 2 |
Union | $\cup$ | Tuples either in Rel 1 or in Rel 2 |
Also: Intersection, Join, Division, Renaming
(Not essential, but can be useful)
Input | Query Language | Output | ||
---|---|---|---|---|
Sets of Tuples | $\rightarrow$ | Set Relational Algebra | $\rightarrow$ | Set of Tuples |
Bags of Tuples | $\rightarrow$ | Bag Relational Algebra | $\rightarrow$ | Bag of Tuples |
Lists of Tuples | $\rightarrow$ | Extended Relational Algebra | $\rightarrow$ | List of Tuples |
First we focus on sets and bags.
Delete rows that fail the condition $c$.
TREE_ID | SPC_COMMON | BORONAME | ... |
---|---|---|---|
204026 | 'honeylocust' | 'Brooklyn' | ... |
204337 | 'honeylocust' | 'Brooklyn' | ... |
189565 | 'American linden' | 'Brooklyn' | ... |
192755 | 'London planetree' | 'Brooklyn' | ... |
189465 | 'London planetree' | 'Brooklyn' | ... |
... and 177287 more |
SELECT * FROM Trees [[ WHERE BORONAME = 'Brooklyn' ]]
Delete attributes not in the projection list $A$.
BORONAME |
---|
Queens |
Brooklyn |
Manhattan |
Bronx |
Staten Island |
Only 5 results... not 683788?
Set and Bag Projection are different
SELECT [[ DISTINCT BORONAME ]] FROM Trees;
What are these queries' schemas?
Takes two relations that are union-compatible...
(Both relations have the same number of fields with the same types)
... and returns all tuples appearing in either relation
We use $\uplus$ if we explicitly mean bag union
Return all tuples appearing in both
of two union-compatible relations
What is this query asking?
Return all tuples appearing in the first, but not the second
of two union-compatible relations
What is this query asking?
What is the schema of the result of any of these operators?
Create all pairs of tuples.
TreeInfo
SPC_COMMON | AVG_HEIGHT |
---|---|
cedar elm | 60 |
lacebark elm | 45 |
... and more |
SPC_COMMON | BORONAME | SPC_COMMON | AVG_HEIGHT |
---|---|---|---|
honeylocust | Brooklyn | cedar elm | 60 |
honeylocust | Brooklyn | cedar elm | 60 |
American linden | Brooklyn | cedar elm | 60 |
London planetree | Manhattan | cedar elm | 60 |
London planetree | Manhattan | cedar elm | 60 |
... | |||
honeylocust | Brooklyn | lacebark elm | 45 |
honeylocust | Brooklyn | lacebark elm | 45 |
American linden | Brooklyn | lacebark elm | 45 |
London planetree | Manhattan | lacebark elm | 45 |
London planetree | Manhattan | lacebark elm | 45 |
... and more |
What is the schema of the resulting relation?
The relation has a naming conflict
(two attributes with the same name)
What is the schema of the resulting relation?
When writing cross-products,
I will use implicit renaming
Pair tuples according to a condition c.
Equi-joins are joins with only equality tests in the condition.
Formally: The output schema of a natural or equi-join still has both copies of each attribute.
Informally: I'll write things on the board assuming only one copy passes the join.
(Which operators behave differently in Set- and Bag-RA?)
Operator | Symbol | Creates Duplicates? |
---|---|---|
Selection | $\sigma$ | No |
Projection | $\pi$ | Yes |
Cross-product | $\times$ | No |
Set-difference | $-$ | No |
Union | $\cup$ | Yes |
Join | $\bowtie$ | No |
Using Set-Relational Algebra, find the BORONAMEs of all boroughs that do have tree species with an average height of below 45 inches
TreeInfo
SPC_COMMON | AVG_HEIGHT |
---|---|
cedar elm | 60 |
lacebark elm | 45 |
... and more |
Trees
SPC_COMMON | BORONAME |
---|---|
'honeylocust' | 'Brooklyn' |
'American linden' | 'Brooklyn' |
'London planetree' | 'Manhattan' |
... and more |
Which of these queries is "faster"?
Not typically supported as a primitive operator,
but useful for expressing queries like:
Find species that appear in all boroughs
$$R / S \equiv \{\; \left<\vec t\right> \;|\; \forall \left<\vec s\right> \in S, \left< \vec t \vec s \right> \in R \;\}$$
BORO | SPC_COMMON |
---|---|
Brooklyn | honeylocust |
Brooklyn | American linden |
Brooklyn | London planetree |
Manhattan | honeylocust |
Manhattan | American linden |
Manhattan | pin oak |
Queens | honeylocust |
Queens | American linden |
Bronx | honeylocust |
/
| |||||||
/
| |||||||
/
|
If time permits: Implement division using other operators.
A tree-shaped representation of a computation.
RA is a model for SQL ASTs
SparkSQL uses a combination of Bag and Extended RA.
abstract class Expression extends TreeNode[Expression] {
...
def nullable: Boolean
...
def references: AttributeSet = ...
...
def eval(input: InternalRow = null): Any
...
def dataType: DataType
...
}
More on Expressions later, but...
abstract class Expression extends TreeNode[Expression] {
↑↑
TreeNode[Expression]
abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
...
def children: Seq[BaseType]
...
def foreach(f: BaseType => Unit): Unit
...
def map[A](f: BaseType => A): Seq[A]
...
def transform(rule: PartialFunction[BaseType, BaseType]): BaseType
...
final def treeString: String = treeString(verbose = true)
...
}
{ case Literal(y: Int, IntegerType) => Literal(y + 1) }
A Partial Function is a normal lambda paired with a match
"Unmatched" cases should be handled gracefully.
With tree.transform on an Expression
{ x => x match {
case Literal(y: Int, IntegerType) => Literal(y + 1)
case _ => x
} }
myExpression.transform {
case Literal(y: Int, IntegerType) => Literal(y + 1)
}
What does this do?
Replace every integer in the tree with the same integer + 1
Digression over...
How does Spark encode Relational Algebra?
abstract class LogicalPlan
extends QueryPlan[LogicalPlan]
with AnalysisHelper
with LogicalPlanStats
with QueryPlanConstraints
with Logging {
...
}
Most useful tools live in QueryPlan
abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
extends TreeNode[PlanType] with SQLConfHelper {
...
def output: Seq[Attribute]
...
def transformExpressions(rule: PartialFunction[Expression, Expression]): ...
...
def transformAllExpressions(rule: PartialFunction[Expression, Expression]): ...
...
lazy val schema: StructType = StructType.fromAttributes(output)
...
}
case class Project(projectList: Seq[NamedExpression], child: LogicalPlan)
case class Filter(condition: Expression, child: LogicalPlan)
case class Union(children: Seq[LogicalPlan], byName: Boolean = false,
allowMissingCol: Boolean = false)
case class Join(left: LogicalPlan, right: LogicalPlan, joinType: JoinType,
condition: Option[Expression], hint: JoinHint)
Note: Spark's union is a Bag-union.
/**
* Holds the name of a relation that has yet to be looked up in a catalog.
*
* @param multipartIdentifier table name
* @param options options to scan this relation. Only applicable to v2 table scan.
*/
case class UnresolvedRelation(
multipartIdentifier: Seq[String],
options: CaseInsensitiveStringMap = CaseInsensitiveStringMap.empty(),
override val isStreaming: Boolean = false)
extends LeafNode with NamedRelation {
abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
extends TreeNode[PlanType] with SQLConfHelper {
...
def output: Seq[Attribute]
...
def transformExpressions(rule: PartialFunction[Expression, Expression]): ...
...
def transformAllExpressions(rule: PartialFunction[Expression, Expression]): ...
...
lazy val schema: StructType = StructType.fromAttributes(output)
...
}
Challenge: Remove all projection operators from a LogicalPlan
plan.transform {
case Project(_, child) => child
}
Aside: Dataframe/Dataset is a thin wrapper around LogicalPlan.
class Dataset[T] private[sql](...) extends Serializable {
...
/**
* Filters rows using the given condition.
* {{{
* // The following are equivalent:
* peopleDs.filter($"age" > 15)
* peopleDs.where($"age" > 15)
* }}}
*
* @group typedrel
* @since 1.6.0
*/
def filter(condition: Column): Dataset[T] = withTypedPlan {
Filter(condition.expr, logicalPlan)
}
...
}
A simple way to think about and work with
computations over collections.
… simple → easy to evaluate
… simple → easy to optimize
Next time, Optimizing RA