CSE 4/562 - Database Systems

# Index Scans, Aggregates

### General Query Optimizers

1. Apply blind heuristics (e.g., push down selections)
2. Enumerate all possible execution plans by varying (or for a reasonable subset)
• Join/Union Evaluation Order (commutativity, associativity, distributivity)
• Algorithms for Joins, Aggregates, Sort, Distinct, and others
• Data Access Paths
3. Estimate the cost of each execution plan
4. Pick the execution plan with the lowest cost

### Data Access Paths

Original Query: $\pi_A\left(\sigma_{B = 1 \wedge C < 3}(R)\right)$

Possible Implementations:

• Full Table Scan
• Index Scan on Tree/Hash Index over $B$
• Index Scan on Tree Index over $C$
• Index Scan on Tree Index over $B,C$

### Full Table Scan

1. Project down to $A$, reading from...
2. Filter on $B = 1 \wedge C < 3$, reading from...
3. All of the rows in $R$

### Index Scan on Tree/Hash Index over $B$

1. Project down to $A$, reading from...
2. Filter on $C < 3$, reading from...
3. The index which provides all rows where $R.B = 1$

### Index Scan on Tree Index over $C$

1. Project down to $A$, reading from...
2. Filter on $B = 1$, reading from...
3. The index which provides all rows where $R.C < 3$

### Index Scan on Tree Index over $B, C$

Lexical Sort: First sort by B, with C as a tiebreaker.

1. Project down to $A$, reading from...
2. The index which provides all rows between
$\left\lt 1, -\infty\right\gt \lt \left\lt R.B, R.C\right\gt \lt \left\lt 1, 3\right\gt$

Which index to use (if several are available)?

### Strategies for Implementing $(\ldots \bowtie_{c} S)$

Sort/Merge Join
Sort all of the data upfront, then scan over both sides.
In-Memory Index Join (1-pass Hash; Hash Join)
Build an in-memory index on one table, scan the other.
Partition Join (2-pass Hash; External Hash Join)
Partition both sides so that tuples don't join across partitions.
Index Nested Loop Join
Use an existing index instead of buildling one.

### Index Nested Loop Join

To compute $R \bowtie_{R.A < S.B} S$ with an index on $S.B$
1. Read One Row of $R$
2. Get the value of $R.A$
3. Start index scan on $S.B > [R.A]$
4. Return rows as normal
Aggregation

### Aggregation

Normal Aggregates
SELECT COUNT(*) FROM R
SELECT SUM(A) FROM R
Group-By Aggregates
SELECT A, SUM(B) FROM R GROUP BY A
Distinct
SELECT DISTINCT A FROM R
SELECT A FROM R GROUP BY A

### Normal Aggregates

TREE_IDSPC_COMMONBORONAMETREE_DBH
180683'red maple''Queens'3
315986'pin oak''Queens'21
204026'honeylocust''Brooklyn'3
204337'honeylocust''Brooklyn'10
189565'American linden''Brooklyn'21
... and 683783 more
SELECT COUNT(*) FROM TREES
TREE_IDSPC_COMMONBORONAMETREE_DBH
COUNT = 0
180683'red maple''Queens'3
COUNT = 1
315986'pin oak''Queens'21
COUNT = 2
204026'honeylocust''Brooklyn'3
COUNT = 3
204337'honeylocust''Brooklyn'10
COUNT = 4
189565'American linden''Brooklyn'21
COUNT = 5
... and 683783 more
COUNT = 683788
SELECT SUM(TREE_DBH) FROM TREES
TREE_IDSPC_COMMONBORONAMETREE_DBH
SUM = 0
180683'red maple''Queens'3
SUM = 3
315986'pin oak''Queens'21
SUM = 24
204026'honeylocust''Brooklyn'3
SUM = 27
204337'honeylocust''Brooklyn'10
SUM = 37
189565'American linden''Brooklyn'21
SUM = 58
... and 683783 more

### Basic Aggregate Pattern

This is also sometimes called a "fold"

Init
Define a starting value for the accumulator
Fold(Accum, New)
Merge a new value into the accumulator

### COUNT(*)

Init
$0$
Fold(Accum, New)
$Accum + 1$

### SUM(A)

Init
$0$
Fold(Accum, New)
$Accum + New$

### AVG(A)

Init
$\{ sum = 0, count = 0 \}$
Fold(Accum, New)
$\{ sum = Accum.sum + New, \\\;count = Accum.count + 1\}$
Finalize(Accum)
$\frac{Accum.sum}{Accum.count}$

### Basic Aggregate Pattern

Init
Define a starting value for the accumulator
Fold(Accum, New)
Merge a new value into the accumulator
Finalize(Accum)
Extract the aggregate from the accumulator.

### Basic Aggregate Types

Grey et. al. "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals

Distributive
Finite-sized accumulator and doesn't need a finalize (COUNT, SUM)
Algebraic
Finite-sized accumulator but needs a finalize (AVG)
Holistic
Unbounded accumulator (MEDIAN)

### Group-By Aggregates

SELECT SPC_COMMON, COUNT(*) FROM TREES GROUP BY SPC_COMMON

Naive Idea: Keep a separate accumulator for each group

TREE_IDSPC_COMMONBORONAMETREE_DBH
{}
180683'red maple''Queens'3
{ 'red maple' = 1 }
204337'honeylocust''Brooklyn'10
{ 'red maple' = 1, 'honeylocust' = 1 }
315986'pin oak''Queens'21
{ 'red maple' = 1, 'honeylocust' = 1, 'pin oak' = 1 }
204026'honeylocust''Brooklyn'3
{ 'red maple' = 1, 'honeylocust' = 2, 'pin oak' = 1 }

What could go wrong?

### Alternative Grouping Algorithms

2-pass Hash Aggregate
Like 2-pass Hash Join: Distribute groups across buckets, then do an in-memory aggregate for each bucket.
Sort-Aggregate
Like Sort-Merge Join: Sort data by groups, then group elements will be adjacent.
TREE_IDSPC_COMMONBORONAMETREE_DBH
{}
204337'honeylocust''Brooklyn'10
{ 'honeylocust' = 1 }
204026'honeylocust''Brooklyn'3
{ 'honeylocust' = 2 }
... and more
315986'pin oak''Queens'21
{ 'honeylocust' = 3206, 'pin oak' = 1 }
... and more
180683'red maple''Queens'3
{ 'pin oak' = 53814, 'red maple' = 1 }