CSE-4/562 Spring 2019

Extended Relational Algebra

February 13, 2019

Textbook: Ch. 5.2, 15.4

Extended Relational Algebra

Set/Bag Operations: Select ($\sigma$), Project ($\pi$), Join ($\bowtie$), Union ($\cup$)
Bag Operations: Distinct ($\delta$), Outer Joins (⟗)
List Operations: Sort ($\tau$), Limit (L)
Arithmetic Operations: Extended Projection ($\pi$), Aggregation ($\Sigma$), Grouping ($\gamma$)

Extended Projection

Like normal projection, but can create new columns

$\pi_{M \leftarrow A+B*C,\; N \leftarrow 2}(R)$

produces 1 row for every row of R, with 2 columns: M and N

Outer Join

... but first

`NULL` Values

Field values can be unknown or inapplicable.
- A tree with an unknown species.
- The street of a tree in a park.
- The '.' on many of your ID cards
SQL provides a special NULL value for these cases

NULL makes things more complicated.

$$\textbf{Trees.SPC_COMMON} = \texttt{'Brooklyn'}$$

What happens if Trees.SPC_COMMON is NULL?

$$\texttt{NULL} = \textbf{'Brooklyn'} \equiv \textbf{Unknown}$$

Unknown	AND	Unknown	$\equiv$	Unknown
Unknown	AND	True	$\equiv$	Unknown
Unknown	AND	False	$\equiv$	False

Unknown	OR	Unknown	$\equiv$	Unknown
Unknown	OR	True	$\equiv$	True
Unknown	OR	False	$\equiv$	Unknown

	NOT	Unknown	$\equiv$	Unknown

WHERE clauses eliminate all non-True rows

$$Streets \bowtie_{StreetName} Trees$$

What happens if some streets have no trees?

Outer Join (⟗, ⟕, ⟖)

Include all results from the normal (inner) join.
Also include rows that don't get joined.

Outer Join

Inner Join: Normal, plain, simple join
Left Outer Join (⟕): Include un-joined rows from the left hand side
Right Outer Join (⟖): Include un-joined rows from the right hand side
[Full] Outer Join (⟗): Include un-joined rows from either side

Streets

District	StreetName
A	Apple St.
B	Border St.
C	Coventry St.
D	Durham St.

Trees

TreeId	StreetName
1	Apple St.
2	Border St.
3	Durham St.
4	Elicott St.

Streets ⟗ Trees

District	StreetName	TreeId	StreetName
A	Apple St.	1	Apple St.
B	Border St.	2	Border St.
C	Coventry St.	NULL	NULL
D	Durham St.	3	Durham St.
NULL	NULL	4	Elicott St.

Only LEFT outer join

Only RIGHT outer join

Sort / Limit

$$\tau_{A}(R)$$ The tuples of $R$ in ascending order according to 'A'

$$\textbf{L}_{n}(R)$$ The first $n$ tuples of R

(Typically combined with sort. If not, pick arbitrarily.)

Sort

Pick your favorite sort algorithm.

What happens if you don't have enough memory?

Key Idea: Merging 2 sorted lists requires $O(1)$ memory.

2-Way Sort

Pass 1: Create lots of (small) sorted lists.
Pass 2+: Merge sorted lists of size $N$ into sorted lists of size $2N$

Pass 1: Create Sorted Runs

Pass 2: Merge Sorted Runs

Repeat Pass 2 As Needed.

What's the bottleneck?

IO Cost: $O(N \cdot \lceil\log_2(N)\rceil)$
(with $N$ blocks)

Using More Memory

Pass 1: Sort Bigger Buffers; Re-Use Memory When Done
Pass 2: Merge $K$ Runs Simultaneously: $O(N \cdot \lceil\log_K(N)\rceil)$ IO

Replacement Sort

On average, we'll get runs of size $2 \cdot |WS|$

Aggregation

Normal Aggregates: SELECT COUNT(*) FROM R; SELECT SUM(A) FROM R
Group-By Aggregates: SELECT A, SUM(B) FROM R GROUP BY A
Distinct: SELECT DISTINCT A FROM R; SELECT A FROM R GROUP BY A

Normal Aggregates

TREE_ID	SPC_COMMON	BORONAME	TREE_DBH
180683	'red maple'	'Queens'	3
315986	'pin oak'	'Queens'	21
204026	'honeylocust'	'Brooklyn'	3
204337	'honeylocust'	'Brooklyn'	10
189565	'American linden'	'Brooklyn'	21
... and 683783 more

SELECT COUNT(*) FROM TREES

TREE_ID	SPC_COMMON	BORONAME	TREE_DBH
COUNT = 0
180683	'red maple'	'Queens'	3
COUNT = 1
315986	'pin oak'	'Queens'	21
COUNT = 2
204026	'honeylocust'	'Brooklyn'	3
COUNT = 3
204337	'honeylocust'	'Brooklyn'	10
COUNT = 4
189565	'American linden'	'Brooklyn'	21
COUNT = 5
... and 683783 more
COUNT = 683788

SELECT SUM(TREE_DBH) FROM TREES

TREE_ID	SPC_COMMON	BORONAME	TREE_DBH
SUM = 0
180683	'red maple'	'Queens'	3
SUM = 3
315986	'pin oak'	'Queens'	21
SUM = 24
204026	'honeylocust'	'Brooklyn'	3
SUM = 27
204337	'honeylocust'	'Brooklyn'	10
SUM = 37
189565	'American linden'	'Brooklyn'	21
SUM = 58
... and 683783 more

Basic Aggregate Pattern

This is also sometimes called a "fold"

Init: Define a starting value for the accumulator
Fold(Accum, New): Merge a new value into the accumulator

COUNT(*)

Init: $0$
Fold(Accum, New): $Accum + 1$

SUM(A)

Init: $0$
Fold(Accum, New): $Accum + New$

AVG(A)

Init: $\{ sum = 0, count = 0 \}$
Fold(Accum, New): $\{ sum = Accum.sum + New, \\\;count = Accum.count + 1\}$
Finalize(Accum): $\frac{Accum.sum}{Accum.count}$

Basic Aggregate Pattern

Init: Define a starting value for the accumulator
Fold(Accum, New): Merge a new value into the accumulator
Finalize(Accum): Extract the aggregate from the accumulator.

Basic Aggregate Types

Grey et. al. "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals

Distributive: Finite-sized accumulator and doesn't need a finalize (COUNT, SUM)
Algebraic: Finite-sized accumulator but needs a finalize (AVG)
Holistic: Unbounded accumulator (MEDIAN)

Group-By Aggregates

SELECT SPC_COMMON, COUNT(*) FROM TREES GROUP BY SPC_COMMON

Naive Idea: Keep a separate accumulator for each group

TREE_ID	SPC_COMMON	BORONAME	TREE_DBH
{}
180683	'red maple'	'Queens'	3
{ 'red maple' = 1 }
204337	'honeylocust'	'Brooklyn'	10
{ 'red maple' = 1, 'honeylocust' = 1 }
315986	'pin oak'	'Queens'	21
{ 'red maple' = 1, 'honeylocust' = 1, 'pin oak' = 1 }
204026	'honeylocust'	'Brooklyn'	3
{ 'red maple' = 1, 'honeylocust' = 2, 'pin oak' = 1 }

What could go wrong?

Alternative Grouping Algorithms

2-pass Hash Aggregate: Like 2-pass Hash Join: Distribute groups across buckets, then do an in-memory aggregate for each bucket.
Sort-Aggregate: Like Sort-Merge Join: Sort data by groups, then group elements will be adjacent.

TREE_ID	SPC_COMMON	BORONAME	TREE_DBH
{}
204337	'honeylocust'	'Brooklyn'	10
{ 'honeylocust' = 1 }
204026	'honeylocust'	'Brooklyn'	3
{ 'honeylocust' = 2 }
... and more
315986	'pin oak'	'Queens'	21
{ 'honeylocust' = 3206, 'pin oak' = 1 }
... and more
180683	'red maple'	'Queens'	3
{ 'pin oak' = 53814, 'red maple' = 1 }

Set/Bag Operations: Select ($\sigma$), Project ($\pi$), Join ($\bowtie$), Union ($\cup$)
Bag Operations: Distinct ($\delta$), Outer Joins (⟗)
List Operations: Sort ($\tau$), Limit (L)
Arithmetic Operations: Extended Projection ($\pi$), Aggregation ($\sigma$), Grouping ($\gamma$)

Extended Relational Algebra

CSE-4/562 Spring 2019

Textbook: Ch. 5.2, 15.4

Extended Relational Algebra

Extended Projection

Outer Join

NULL Values

Outer Join (⟗, ⟕, ⟖)

Outer Join

Sort / Limit

Sort

2-Way Sort

Pass 1: Create Sorted Runs

Pass 2: Merge Sorted Runs

Using More Memory

Replacement Sort

Replacement Sort

Replacement Sort

Replacement Sort

Replacement Sort

Replacement Sort

Replacement Sort

Replacement Sort

Aggregation

Normal Aggregates

Basic Aggregate Pattern

COUNT(*)

SUM(A)

AVG(A)

Basic Aggregate Pattern

Basic Aggregate Types

Group-By Aggregates

Alternative Grouping Algorithms

`NULL` Values