CSE 4/562 - Database Systems

- All-At-Once (Collections)
- Bottom-up, one operator at a time.
- Volcano-Style (Iterators)
- Operators "request" one tuple at a time from children.
- Push-Style (Buffers)
- Operators continuously produce/consume tuples.

```
r = get_table("R")
s = get_table("S")
temp1 = apply_join(r, s, "R.B = S.B")
temp2 = apply_select(temp1, "S.C = 10")
result = apply_projection(temp2, "R.A")
```

```
def build_tree(operator):
if """ operator is a base table """:
return get_table(...)
elif """ operator is a selection """:
return apply_select(operator.child, operator.condition)
elif """ handle remaining cases similarly """:
```

$$\sigma_{A \neq 3} R$$

A | B |
---|---|

1 | 2 |

3 | 4 |

5 | 6 |

```
def apply_select(input, condition)
result = []
for row in input:
if condition(row):
result += [row]
return result;
```

(All-At-Once)

$$\sigma_{A \neq 3} R$$

A | B | |
---|---|---|

`getNext()` | `for row in input:` | |

1 | 2 | `return row;` |

`getNext()` | `for row in input:` | |

3 | 4 | X |

5 | 6 | `return row;` |

`getNext()` | `for row in input:` | |

`None` | `return None;` |

```
def apply_cross(lhs, rhs):
result = []
for r in lhs:
for s in rhs:
result += [r + s]
return result
```

What's the complexity of this cross-product algorithm?

... in terms of compute

... in terms of IOs

- Need to scan the inner relation multiple times!
- Load data intelligently to mitigate expensive IOs
- Every tuple needs to be paired with every other tuple!
- Exploit join conditions to minimize pairs of tuples

Nested-Loop Join

```
def apply_cross(lhs, rhs):
result = []
while r = lhs.next():
while s = rhs.next():
result += [r + s]
rhs.reset()
return result
```

**Problem**: We need to evaluate `rhs`

iterator

once per record in `lhs`

**Naive Solution**: Preload records from `lhs`

```
def apply_cross(lhs, rhs):
result = []
rhs_preloaded = []
while s = rhs.next():
rhs_preloaded += [s]
while r = lhs.next():
for s in rhs_preloaded:
result += [r + s]
return result
```

Any problems with this?

**Better Solution**: Load both `lhs`

and `rhs`

records in blocks.

```
def apply_cross(lhs, rhs):
result = []
while r_block = lhs.take(100):
while s_block = rhs.take(100):
for r in r_block:
for s in s_block:
result += [r + s]
rhs.reset()
return result
```

How big should the blocks be?

What is the IO complexity of the algorithm?

**Problem**: Naively, any tuple matches any other

**Solution**: First organize the data

- Sort/Merge Join
- Sort all of the data upfront, then scan over both sides.
- In-Memory Index Join (1-pass Hash; Hash Join)
- Build an in-memory index on one table, scan the other.
- Partition Join (2-pass Hash; External Hash Join)
- Partition both sides so that tuples don't join across partitions.

- Limited Queries
- Only supports join conditions of the form $R.A = S.B$
- Low Memory
- Only needs to keep ~2 rows in memory at a time (not counting sort).
- Low Added CPU/IO Cost
- Only requires 1 scan over each input (not counting sort).

- A hash function is a function that maps a large data value to a small fixed-size value
- Typically is deterministic & pseudorandom

- Used in Checksums, Hash Tables, Partitioning, Bloom Filters, Caching, Cryptography, Password Storage, …
- Examples: MD5, SHA1, SHA2
- MD5() part of OpenSSL (on most OSX / Linux / Unix)

- Can map h(k) to range [0,N) with h(k) % N (modulus)

$$h(X) \mod N$$

- Pseudorandom output between $[0, N)$
- Always the same output for a given $X$

- Limited Queries
- Only supports join conditions of the form $R.A = S.B$
- Moderate-High Memory
- Keeps 1 relation in memory
- Low Added CPU/IO Cost
- Only requires 1 scan over each input.

Can use other in-memory indexes to support other join conditions.

- Limited Queries
- Only supports join conditions of the form $R.A = S.B$
- Low Memory
- Never need more than 1 pair of partitions in memory
- High IO Cost
- Every record gets written out to disk, and back in.

Can partition on data-values to support other types of queries.

Why is it important that the hash function is pseudorandom?

More operators, More algorithms