CSE 4/562 - Database Systems

- Apply blind heuristics (e.g., push down selections)
- Enumerate all possible *execution plans* (or a reasonable subset) by varying:
    - Join/Union Evaluation Order (commutativity, associativity, distributivity)
    - Algorithms for Joins, Aggregates, Sort, Distinct, and others
    - Data Access Paths

- Estimate the cost of each execution plan
- Pick the execution plan with the lowest cost
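The last two steps amount to an argmin over the candidate plans. A minimal sketch, assuming the plans have already been enumerated and costed; the plan names and cost values here are made-up placeholders:

```python
def choose_plan(plans_with_costs):
    """Pick the execution plan with the lowest estimated cost."""
    best_plan, best_cost = None, float("inf")
    for plan, cost in plans_with_costs:
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan

# Hypothetical candidate plans with estimated costs (e.g., in IOs).
plans = [("full table scan", 1000.0),
         ("index scan on B", 120.0),
         ("index scan on B,C", 45.0)]
```

A real optimizer interleaves enumeration and costing (and prunes), but the selection rule is the same.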

Original Query: $\pi_A\left(\sigma_{B = 1 \wedge C < 3}(R)\right)$

Possible Implementations:

- Full Table Scan
- Index Scan on Tree/Hash Index over $B$
- Index Scan on Tree Index over $C$
- Index Scan on Tree Index over $B,C$

- Project down to $A$, reading from...
    - Filter on $B = 1 \wedge C < 3$, reading from...
        - All of the rows in $R$

- Project down to $A$, reading from...
    - Filter on $C < 3$, reading from...
        - The index which provides all rows where $R.B = 1$

- Project down to $A$, reading from...
    - Filter on $B = 1$, reading from...
        - The index which provides all rows where $R.C < 3$

Lexical Sort: First sort by B, with C as a tiebreaker.

- Project down to $A$, reading from...
    - The index which provides all rows between
      $\left\langle 1, -\infty\right\rangle < \left\langle R.B, R.C\right\rangle < \left\langle 1, 3\right\rangle$
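Python's built-in tuple comparison is lexicographic, so it models exactly the $\langle B, C \rangle$ sort order a tree index over $B,C$ provides; a small sketch of the range test above:

```python
# Tuples compare lexicographically: B first, C as the tiebreaker -- the same
# order as a tree index over (B, C). float("-inf") is the open lower bound on C.
def in_scan_range(b, c):
    return (1, float("-inf")) < (b, c) < (1, 3)
```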

Which index to use (if several are available)?

- Sort/Merge Join
    - Sort all of the data upfront, then scan over both sides.
- In-Memory Index Join (1-pass Hash; Hash Join)
    - Build an in-memory index on one table, scan the other.
- Partition Join (2-pass Hash; External Hash Join)
    - Partition both sides so that tuples don't join across partitions.
- Index Nested Loop Join
    - Use an *existing* index instead of building one.
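The 1-pass hash join fits in a few lines. A sketch assuming an equi-join on hypothetical attributes R.A = S.B, with rows as dicts:

```python
from collections import defaultdict

def hash_join(R, S):
    """1-pass hash join on R.A = S.B (assumes the index on R fits in memory)."""
    index = defaultdict(list)
    for r in R:                        # build: in-memory index on one table
        index[r["A"]].append(r)
    for s in S:                        # probe: scan the other table
        for r in index[s["B"]]:
            yield {**r, **s}
```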

- Read One Row of $R$
- Get the value of $R.A$
- Start index scan on $S.B > [R.A]$
- Return rows as normal
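A sketch of those steps for the inequality join $S.B > R.A$, using a sorted list with binary search as a stand-in for a tree index over S.B (attribute names are illustrative):

```python
from bisect import bisect_right

def index_nested_loop_join(R, S):
    """Index nested loop join yielding (r.A, s.B) pairs with s.B > r.A."""
    S_sorted = sorted(S, key=lambda s: s["B"])     # stand-in for a tree index
    keys = [s["B"] for s in S_sorted]
    for r in R:                                    # read one row of R
        start = bisect_right(keys, r["A"])         # start index scan on S.B > R.A
        for s in S_sorted[start:]:                 # return rows as normal
            yield (r["A"], s["B"])
```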

- Normal Aggregates
`SELECT COUNT(*) FROM R`

`SELECT SUM(A) FROM R`

- Group-By Aggregates
`SELECT A, SUM(B) FROM R GROUP BY A`

- Distinct
`SELECT DISTINCT A FROM R`

`SELECT A FROM R GROUP BY A`
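The equivalence of the two queries above (DISTINCT is a GROUP BY with no aggregate functions) can be checked directly, here with Python's built-in sqlite3 and a made-up table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INT, B INT)")
conn.executemany("INSERT INTO R VALUES (?, ?)", [(1, 10), (1, 20), (2, 30)])

# Both queries collapse duplicate values of A into one output row each.
distinct = conn.execute("SELECT DISTINCT A FROM R ORDER BY A").fetchall()
grouped = conn.execute("SELECT A FROM R GROUP BY A ORDER BY A").fetchall()
```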

| TREE_ID | SPC_COMMON | BORONAME | TREE_DBH |
|---|---|---|---|
| 180683 | 'red maple' | 'Queens' | 3 |
| 315986 | 'pin oak' | 'Queens' | 21 |
| 204026 | 'honeylocust' | 'Brooklyn' | 3 |
| 204337 | 'honeylocust' | 'Brooklyn' | 10 |
| 189565 | 'American linden' | 'Brooklyn' | 21 |
| ... and 683783 more | | | |

`SELECT COUNT(*) FROM TREES`

| TREE_ID | SPC_COMMON | BORONAME | TREE_DBH |
|---|---|---|---|
| COUNT = 0 | | | |
| 180683 | 'red maple' | 'Queens' | 3 |
| COUNT = 1 | | | |
| 315986 | 'pin oak' | 'Queens' | 21 |
| COUNT = 2 | | | |
| 204026 | 'honeylocust' | 'Brooklyn' | 3 |
| COUNT = 3 | | | |
| 204337 | 'honeylocust' | 'Brooklyn' | 10 |
| COUNT = 4 | | | |
| 189565 | 'American linden' | 'Brooklyn' | 21 |
| COUNT = 5 | | | |
| ... and 683783 more | | | |
| COUNT = 683788 | | | |

`SELECT SUM(TREE_DBH) FROM TREES`

| TREE_ID | SPC_COMMON | BORONAME | TREE_DBH |
|---|---|---|---|
| SUM = 0 | | | |
| 180683 | 'red maple' | 'Queens' | 3 |
| SUM = 3 | | | |
| 315986 | 'pin oak' | 'Queens' | 21 |
| SUM = 24 | | | |
| 204026 | 'honeylocust' | 'Brooklyn' | 3 |
| SUM = 27 | | | |
| 204337 | 'honeylocust' | 'Brooklyn' | 10 |
| SUM = 37 | | | |
| 189565 | 'American linden' | 'Brooklyn' | 21 |
| SUM = 58 | | | |
| ... and 683783 more | | | |

This pattern is also sometimes called a "fold".

- Init
    - Define a starting value for the accumulator
- Fold(Accum, New)
    - Merge a new value into the accumulator

- Init
    - $0$
- Fold(Accum, New)
    - $Accum + 1$

- Init
    - $0$
- Fold(Accum, New)
    - $Accum + New$

- Init
    - $\{ sum = 0, count = 0 \}$
- Fold(Accum, New)
    - $\{ sum = Accum.sum + New,\; count = Accum.count + 1 \}$
- Finalize(Accum)
    - $\frac{Accum.sum}{Accum.count}$

- Init
    - Define a starting value for the accumulator
- Fold(Accum, New)
    - Merge a new value into the accumulator
- Finalize(Accum)
    - Extract the aggregate from the accumulator.
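The Init/Fold/Finalize pattern above, instantiated for AVG as a runnable sketch:

```python
def fold_avg(values):
    accum = {"sum": 0, "count": 0}                  # Init
    for new in values:                              # Fold(Accum, New)
        accum = {"sum": accum["sum"] + new,
                 "count": accum["count"] + 1}
    return accum["sum"] / accum["count"]            # Finalize(Accum)
```

The accumulator is a fixed-size record regardless of how many rows are folded in.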

Gray et al. "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals"

- Distributive
    - Finite-sized accumulator and doesn't need a finalize (COUNT, SUM)
- Algebraic
    - Finite-sized accumulator but needs a finalize (AVG)
- Holistic
    - Unbounded accumulator (MEDIAN)
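MEDIAN illustrates why holistic aggregates are costly: cast into the same Init/Fold/Finalize shape, the fold must retain every value it has seen. A sketch:

```python
def fold_median(values):
    accum = []                       # Init: the accumulator is unbounded
    for new in values:               # Fold: must keep every value
        accum.append(new)
    accum.sort()                     # Finalize: only now can we answer
    n = len(accum)
    mid = n // 2
    return accum[mid] if n % 2 else (accum[mid - 1] + accum[mid]) / 2
```

Unlike COUNT, SUM, or AVG, no fixed-size summary of the prefix suffices.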

`SELECT SPC_COMMON, COUNT(*) FROM TREES GROUP BY SPC_COMMON`

**Naive Idea:** Keep a separate accumulator for each group

| TREE_ID | SPC_COMMON | BORONAME | TREE_DBH |
|---|---|---|---|
| {} | | | |
| 180683 | 'red maple' | 'Queens' | 3 |
| { 'red maple' = 1 } | | | |
| 204337 | 'honeylocust' | 'Brooklyn' | 10 |
| { 'red maple' = 1, 'honeylocust' = 1 } | | | |
| 315986 | 'pin oak' | 'Queens' | 21 |
| { 'red maple' = 1, 'honeylocust' = 1, 'pin oak' = 1 } | | | |
| 204026 | 'honeylocust' | 'Brooklyn' | 3 |
| { 'red maple' = 1, 'honeylocust' = 2, 'pin oak' = 1 } | | | |

What could go wrong?
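A sketch of the naive idea, one counter per group (the risk: with enough distinct groups, the accumulator table itself may not fit in memory):

```python
from collections import defaultdict

def naive_group_count(rows):
    accum = defaultdict(int)               # one accumulator per group
    for row in rows:
        accum[row["SPC_COMMON"]] += 1      # fold the row into its group
    return dict(accum)
```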

- 2-pass Hash Aggregate
    - Like the 2-pass Hash Join: distribute groups across buckets, then do an in-memory aggregate for each bucket.
- Sort-Aggregate
    - Like the Sort-Merge Join: sort the data by group, so that rows of the same group become adjacent.
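A sketch of the sort-aggregate for COUNT: after sorting by the group key, each group's rows are adjacent, so only one accumulator is live at a time (itertools.groupby relies on exactly this adjacency):

```python
from itertools import groupby

def sort_aggregate_count(rows, key):
    """COUNT(*) ... GROUP BY key, computed by sorting first."""
    for group, members in groupby(sorted(rows, key=key), key=key):
        yield group, sum(1 for _ in members)
```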

| TREE_ID | SPC_COMMON | BORONAME | TREE_DBH |
|---|---|---|---|
| {} | | | |
| 204337 | 'honeylocust' | 'Brooklyn' | 10 |
| { 'honeylocust' = 1 } | | | |
| 204026 | 'honeylocust' | 'Brooklyn' | 3 |
| { 'honeylocust' = 2 } | | | |
| ... and more | | | |
| 315986 | 'pin oak' | 'Queens' | 21 |
| { 'honeylocust' = 3206, 'pin oak' = 1 } | | | |
| ... and more | | | |
| 180683 | 'red maple' | 'Queens' | 3 |
| { 'pin oak' = 53814, 'red maple' = 1 } | | | |