CSE-4/562 Spring 2019 - Relational Algebra

Relational Algebra

CSE-4/562 Spring 2019

February 4, 2019

Textbook: Ch. 2.4, 5.1

The running theme...

Replace [Thing A] with better, but equivalent [Thing B].

  1. How can we tell if [Thing B] is equivalent to [Thing A]?
  2. How can we tell if [Thing B] is better than [Thing A]?

Replace [Query A] with better, but equivalent [Query B].

  1. How can we tell if [Query B] is equivalent to [Query A]?
  2. How can we tell if [Query B] is better than [Query A]?

... but first a few definitions.

Relational Data

Relation (Table)
A collection of Tuples of Values
All tuples have the same set of attributes, or schema

Relational Data

Types of relations...

Set

SPC_COMMONBORONAME
'honeylocust''Brooklyn'
'American linden''Brooklyn'
'London planetree''Brooklyn'
  
  

(Unique Only)

Bag

SPC_COMMONBORONAME
'honeylocust''Brooklyn'
'London planetree''Brooklyn'
'American linden''Brooklyn'
'London planetree''Brooklyn'
'honeylocust''Brooklyn'

List

SPC_COMMONBORONAME
'American linden''Brooklyn'
'honeylocust''Brooklyn'
'honeylocust''Brooklyn'
'London planetree''Brooklyn'
'London planetree''Brooklyn'

(Order Matters)

Declarative Languages

DeclarativeImperative
Say what you want Say how you want to get it
"Get me the TPS Reports"
Look at every T report 
For each week:
  Sum up the sprocket count
  Find that week's S report
  ... 
SQL, RA, R, … C, Scala, Java, Python, …

Declarative languages make it easier to explore equivalent computations to find the best one.

SQL


                      SELECT R.A FROM R, S 
                      WHERE R.B = S.B AND S.C = 10
    

Python


          with open_csv("R.csv") as R:
            with open_csv("S.csv") as S:
              for r_row in R:
                for s_row in S:
                  if r_row[1] == s_row[0] and s_row[1] == "10":
                    print(r_row[0])
    

Declarative Queries

SQL
Readability: Focus on what and where from
Relational Algebra
Data Flow: Focus on transformations
Relational Calculus
Minimality: Focus on attribute relationships

Relational Algebra

$$\pi_{R.A}(\sigma_{S.C=10}(\bowtie_{R.B = S.B}(\textbf{R}, \textbf{S})))$$

Relational Calculus

$$Q(A) := R(A, B), S(B, 10)$$

Declarative Queries

SQL
Readability: Focus on what and where from
Relational Algebra
Data Flow: Focus on transformations
Relational Calculus
Minimality: Focus on attribute relationships

Preliminaries

We start with a database instance with a fixed schema

Queries are applied to Relations $$Q(\textbf{Trees}, \textbf{SpeciesInfo})$$

Queries are also Relations! $$Q_2(\textbf{SpeciesInfo}, Q_1(\textbf{Trees}))$$ (Relational Algebra is Closed)

Relational Algebra

OperationSymMeaning
Selection$\sigma$Select a subset of the input rows
Projection$\pi$Delete unwanted columns
Cross-product$\times$Combine two relations
Set-difference$-$Tuples in Rel 1, but not Rel 2
Union$\cup$Tuples either in Rel 1 or in Rel 2

Also: Intersection, Join, Division, Renaming

(Not essential, but can be useful)

Input Query Language Output
Sets of Tuples $\rightarrow$ Set Relational Algebra $\rightarrow$ Set of Tuples
Bags of Tuples $\rightarrow$ Bag Relational Algebra $\rightarrow$ Bag of Tuples
Lists of Tuples $\rightarrow$ Extended Relational Algebra $\rightarrow$ List of Tuples

First we focus on sets and bags.

Selection ($\sigma_{c}$)

Delete rows that fail the condition $c$.

$$\sigma_{(\textbf{BORONAME} = \texttt{'Brooklyn'})} \textbf{Trees}$$
TREE_IDSPC_COMMONBORONAME...
204026'honeylocust''Brooklyn'...
204337'honeylocust''Brooklyn'...
189565'American linden''Brooklyn'...
192755'London planetree''Brooklyn'...
189465'London planetree''Brooklyn'...
... and 177287 more

          SELECT * FROM Trees [[ WHERE BORONAME = 'Brooklyn' ]]
      

Projection ($\pi_{A}$)

Delete attributes not in the projection list $A$.

$$\pi_{\textbf{BORONAME}}(\textbf{Trees})$$
BORONAME
Queens
Brooklyn
Manhattan
Bronx
Staten Island

Only 5 results... not 683788?

Set and Bag Projection are different


              SELECT [[ DISTINCT BORONAME ]] FROM Trees;
      

Reminder: Queries are Relations

What are these queries schemas?

$$\pi_{\textit{TREEID},\ \textit{SPC_COMMON},\ \textit{BORONAME}} \textbf{Trees}$$
$$\sigma_{(\textit{BORONAME} = \texttt{'Brooklyn'})} \textbf{Trees}$$
$$\sigma_{(\textit{BORONAME} = \texttt{'Brooklyn'})}(\pi_{\textit{TREEID},\ \textit{SPC_COMMON},\ \textit{BORONAME}} \textbf{Trees})$$

Union ($\cup$)

Takes two relations that are union-compatible...

(Both relations have the same number of fields with the same types)

... and returns all tuples appearing in either relation

$$(\sigma_{(BORONAME=\texttt{'Brooklyn'})} \textbf{Trees}) \cup (\sigma_{(BORONAME=\texttt{'Manhattan'})} \textbf{Trees})$$

We use $\uplus$ if we explicitly mean bag union

Intersection ($\cap$)

Return all tuples appearing in both
of two union-compatible relations

$$(\sigma_{(BORONAME=\texttt{'Brooklyn'})} (\pi_{SPC\_COMMON} \textbf{Trees})) \\ ~~~~~~~~~\cap (\sigma_{(BORONAME=\texttt{'Manhattan'})} (\pi_{SPC\_COMMON} \textbf{Trees}))$$

What is this query asking?

Set Difference

Return all tuples appearing in the first, but not the second
of two union-compatible relations

$$(\sigma_{(BORONAME=\texttt{'Brooklyn'})} (\pi_{SPC\_COMMON} \textbf{Trees})) \\ ~~~~~~~~~- (\sigma_{(BORONAME=\texttt{'Manhattan'})} (\pi_{SPC\_COMMON} \textbf{Trees}))$$

What is this query asking?

Union, Intersection, Set Difference

What is the schema of the result of any of these operators?

Cross (Cartesian) Product ($\times$)

Create all pairs of tuples.

$$\pi_{SPC\_COMMON, AVG\_HEIGHT} (\textbf{TreeInfo}) \times \pi_{SPC\_COMMON, BORONAME} (\textbf{Trees})$$

TreeInfo

SPC_COMMONAVG_HEIGHT
cedar elm60
lacebark elm45
... and more
SPC_COMMONBORONAMESPC_COMMONAVG_HEIGHT
honeylocustBrooklyncedar elm60
honeylocustBrooklyncedar elm60
American lindenBrooklyncedar elm60
London planetreeManhattancedar elm60
London planetreeManhattancedar elm60
...
honeylocustBrooklynlacebark elm45
honeylocustBrooklynlacebark elm45
American lindenBrooklynlacebark elm45
London planetreeManhattanlacebark elm45
London planetreeManhattanlacebark elm45
... and more

Cross (Cartesian) Product ($\times$)

$$\pi_{SPC\_COMMON,\ BORONAME} (\textbf{Trees}) \times \pi_{SPC\_COMMON,\ AVG\_HEIGHT} (\textbf{TreeInfo})$$

What is the schema of the resulting relation?

The relation has a naming conflict
(two attributes with the same name)

Renaming ($\rho$)

$$\rho_{TNAME,\ BORO,\ INAME,\ HEIGHT}\left( \pi_{SPC\_COMMON,\ BORONAME} (\textbf{Trees}) \times \pi_{SPC\_COMMON,\ AVG\_HEIGHT} (\textbf{TreeInfo})\right)$$

What is the schema of the resulting relation?

When writing cross-products on the board,
I will use implicit renaming

Join ($\bowtie_c$)

Pair tuples according to a condition c.

$$\pi_{SPC\_COMMON,\ BORONAME} (\textbf{Trees}) \bowtie_{T.SPC\_COMMON = TI.SPC\_COMMON} \pi_{SPC\_COMMON,\ AVG\_HEIGHT} (\textbf{TreeInfo})$$
Identical to... $$\sigma_{T.SPC\_COMMON = TI.SPC\_COMMON}\left(\pi_{SPC\_COMMON,\ BORONAME} (\textbf{Trees}) \times \pi_{SPC\_COMMON,\ AVG\_HEIGHT} (\textbf{TreeInfo})\right)$$
$$R \bowtie_c S \equiv \sigma_c(R \times S)$$

Join Shorthands

Equi-joins are joins with only equality tests in the condition.

Join on attribute(s)
$R \bowtie_{A} S \equiv R \bowtie_{R.A = S.A} S$
Same values on the listed attributes
Natural Join
$R \bowtie S \equiv R \bowtie_{attrs(R) \cap attrs(S)} S$
Same values on all shared attributes

Which operators can create duplicates?

(Which operators behave differently in Set- and Bag-RA?)

Operator Symbol Creates Duplicates?
Selection $\sigma$ No
Projection $\pi$ Yes
Cross-product $\times$ No
Set-difference$-$ No
Union $\cup$ Yes
Join $\bowtie$No

Group Work

Using Set-Relational Algebra, find the BORONAMEs of all boroughs that do have tree species with an average height of below 45 inches

TreeInfo

SPC_COMMONAVG_HEIGHT
cedar elm60
lacebark elm45
... and more

Trees

SPC_COMMONBORONAME
'honeylocust''Brooklyn'
'American linden''Brooklyn'
'London planetree''Manhattan'
... and more
$$\pi_{BORONAME}(\sigma_{AVG\_HEIGHT < 45}(\textbf{Trees}\bowtie\textbf{TreeInfo}))$$
$$\pi_{BORONAME}(\textbf{Trees}\bowtie\sigma_{AVG\_HEIGHT < 45}(\textbf{TreeInfo}))$$

Division ($/$)

Not typically supported as a primitive operator,
but useful for expressing queries like:

Find species that appear in all boroughs

$$\pi_{BORONAME,\ SPC\_COMMON}(\textbf{Trees}) \;\;/\;\;\pi_{SPC\_COMMON}(\textbf{Trees})$$ (using set relational algebra)

$$R / S \equiv \{\; \left<\vec t\right> \;|\; \forall \left<\vec s\right> \in S, \left< \vec t \vec s \right> \in R \;\}$$

Division ($/$)

BORO SPC_COMMON
Brooklyn honeylocust
Brooklyn American linden
Brooklyn London planetree
Manhattan honeylocust
Manhattan American linden
Manhattan pin oak
Queens honeylocust
Queens American linden
Bronx honeylocust
/
SPC_COMMON
honeylocust
=
BORO
Brooklyn
Manhattan
Queens
Bronx
/
SPC_COMMON
honeylocust
American linden
=
BORO
Brooklyn
Manhattan
Queens
/
SPC_COMMON
honeylocust
American linden
pin oak
=
BORO
Manhattan

Group Work

If time permits: Implement division using other operators.

Relational Algebra

A simple way to think about and work with
computations over collections.

… simple → easy to evaluate

… simple → easy to optimize

Next time, Optimizing RA