CSE 4/562 - Database Systems

Index Scans, Aggregates

CSE 4/562 – Database Systems

February 26, 2018

General Query Optimizers

  1. Apply blind heuristics (e.g., push down selections)
  2. Enumerate all possible execution plans by varying (or for a reasonable subset)
    • Join/Union Evaluation Order (commutativity, associativity, distributivity)
    • Algorithms for Joins, Aggregates, Sort, Distinct, and others
    • Data Access Paths
  3. Estimate the cost of each execution plan
  4. Pick the execution plan with the lowest cost

Data Access Paths

Original Query: $\pi_A\left(\sigma_{B = 1 \wedge C < 3}(R)\right)$

Possible Implementations:

  • Full Table Scan
  • Index Scan on Tree/Hash Index over $B$
  • Index Scan on Tree Index over $C$
  • Index Scan on Tree Index over $B,C$

Full Table Scan

  1. Project down to $A$, reading from...
  2. Filter on $B = 1 \wedge C < 3$, reading from...
  3. All of the rows in $R$

Index Scan on Tree/Hash Index over $B$

  1. Project down to $A$, reading from...
  2. Filter on $C < 3$, reading from...
  3. The index which provides all rows where $R.B = 1$

Index Scan on Tree Index over $C$

  1. Project down to $A$, reading from...
  2. Filter on $B = 1$, reading from...
  3. The index which provides all rows where $R.C < 3$

Index Scan on Tree Index over $B, C$

Lexical Sort: First sort by B, with C as a tiebreaker.

  1. Project down to $A$, reading from...
  2. The index which provides all rows between
    $\left\lt 1, -\infty\right\gt \lt \left\lt R.B, R.C\right\gt \lt \left\lt 1, 3\right\gt$

Which index to use (if several are available)?

Strategies for Implementing $(\ldots \bowtie_{c} S)$

Sort/Merge Join
Sort all of the data upfront, then scan over both sides.
In-Memory Index Join (1-pass Hash; Hash Join)
Build an in-memory index on one table, scan the other.
Partition Join (2-pass Hash; External Hash Join)
Partition both sides so that tuples don't join across partitions.
Index Nested Loop Join
Use an existing index instead of buildling one.

Index Nested Loop Join

To compute $R \bowtie_{R.A < S.B} S$ with an index on $S.B$
  1. Read One Row of $R$
  2. Get the value of $R.A$
  3. Start index scan on $S.B > [R.A]$
  4. Return rows as normal
Aggregation

Aggregation

Normal Aggregates
SELECT COUNT(*) FROM R
SELECT SUM(A) FROM R
Group-By Aggregates
SELECT A, SUM(B) FROM R GROUP BY A
Distinct
SELECT DISTINCT A FROM R
SELECT A FROM R GROUP BY A

Normal Aggregates

TREE_IDSPC_COMMONBORONAMETREE_DBH
180683'red maple''Queens'3
315986'pin oak''Queens'21
204026'honeylocust''Brooklyn'3
204337'honeylocust''Brooklyn'10
189565'American linden''Brooklyn'21
... and 683783 more
SELECT COUNT(*) FROM TREES
TREE_IDSPC_COMMONBORONAMETREE_DBH
COUNT = 0
180683'red maple''Queens'3
COUNT = 1
315986'pin oak''Queens'21
COUNT = 2
204026'honeylocust''Brooklyn'3
COUNT = 3
204337'honeylocust''Brooklyn'10
COUNT = 4
189565'American linden''Brooklyn'21
COUNT = 5
... and 683783 more
COUNT = 683788
SELECT SUM(TREE_DBH) FROM TREES
TREE_IDSPC_COMMONBORONAMETREE_DBH
SUM = 0
180683'red maple''Queens'3
SUM = 3
315986'pin oak''Queens'21
SUM = 24
204026'honeylocust''Brooklyn'3
SUM = 27
204337'honeylocust''Brooklyn'10
SUM = 37
189565'American linden''Brooklyn'21
SUM = 58
... and 683783 more

Basic Aggregate Pattern

This is also sometimes called a "fold"

Init
Define a starting value for the accumulator
Fold(Accum, New)
Merge a new value into the accumulator

COUNT(*)

Init
$0$
Fold(Accum, New)
$Accum + 1$

SUM(A)

Init
$0$
Fold(Accum, New)
$Accum + New$

AVG(A)

Init
$\{ sum = 0, count = 0 \}$
Fold(Accum, New)
$\{ sum = Accum.sum + New, \\\;count = Accum.count + 1\}$
Finalize(Accum)
$\frac{Accum.sum}{Accum.count}$

Basic Aggregate Pattern

Init
Define a starting value for the accumulator
Fold(Accum, New)
Merge a new value into the accumulator
Finalize(Accum)
Extract the aggregate from the accumulator.

Basic Aggregate Types

Grey et. al. "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals

Distributive
Finite-sized accumulator and doesn't need a finalize (COUNT, SUM)
Algebraic
Finite-sized accumulator but needs a finalize (AVG)
Holistic
Unbounded accumulator (MEDIAN)

Group-By Aggregates

SELECT SPC_COMMON, COUNT(*) FROM TREES GROUP BY SPC_COMMON

Naive Idea: Keep a separate accumulator for each group

TREE_IDSPC_COMMONBORONAMETREE_DBH
{}
180683'red maple''Queens'3
{ 'red maple' = 1 }
204337'honeylocust''Brooklyn'10
{ 'red maple' = 1, 'honeylocust' = 1 }
315986'pin oak''Queens'21
{ 'red maple' = 1, 'honeylocust' = 1, 'pin oak' = 1 }
204026'honeylocust''Brooklyn'3
{ 'red maple' = 1, 'honeylocust' = 2, 'pin oak' = 1 }

What could go wrong?

Alternative Grouping Algorithms

2-pass Hash Aggregate
Like 2-pass Hash Join: Distribute groups across buckets, then do an in-memory aggregate for each bucket.
Sort-Aggregate
Like Sort-Merge Join: Sort data by groups, then group elements will be adjacent.
TREE_IDSPC_COMMONBORONAMETREE_DBH
{}
204337'honeylocust''Brooklyn'10
{ 'honeylocust' = 1 }
204026'honeylocust''Brooklyn'3
{ 'honeylocust' = 2 }
... and more
315986'pin oak''Queens'21
{ 'honeylocust' = 3206, 'pin oak' = 1 }
... and more
180683'red maple''Queens'3
{ 'pin oak' = 53814, 'red maple' = 1 }