Query Evaluation

CSE 4/562 – Database Systems

February 12, 2018

Query Evaluation Styles

All-At-Once (Collections): Bottom-up, one operator at a time.
Volcano-Style (Iterators): Operators "request" one tuple at a time from children.
Push-Style (Buffers): Operators continuously produce/consume tuples.

Basic Mindset


  r = get_table("R")

  s = get_table("S")
  
  temp1 = apply_join(r, s, "R.B = S.B")
  
  temp2 = apply_select(temp1, "S.C = 10")
  
  result = apply_projection(temp2, "R.A")

Basic Mindset


      def build_tree(operator):

        if """ operator is a base table """:
          return get_table(...)

        elif """ operator is a selection """:
          return apply_select(operator.child, operator.condition)

        elif """ handle remaining cases similarly """:

Select

$$\sigma_{A \neq 3} R$$

A	B
1	2
3	4
5	6

Select


                  def apply_select(input, condition)
                    result = []

                    for row in input:
                      if condition(row):
                        result += [row]

                    return result;

(All-At-Once)

Select

$$\sigma_{A \neq 3} R$$

A	B
`getNext()`		`for row in input:`
1	2	`return row;`
`getNext()`		`for row in input:`
3	4	X
5	6	`return row;`
`getNext()`		`for row in input:`
`None`		`return None;`

Select

Project

Union

Cross


                    def apply_cross(lhs, rhs):
                      result = []

                      for r in lhs:
                        for s in rhs:
                          result += [r + s]

                      return result

Cross

What's the complexity of this cross-product algorithm?

... in terms of compute

... in terms of IOs

Cross Product Problems

Need to scan the inner relation multiple times!: Load data intelligently to mitigate expensive IOs
Every tuple needs to be paired with every other tuple!: Exploit join conditions to minimize pairs of tuples

Preloading Data

Nested-Loop Join


                    def apply_cross(lhs, rhs):
                      result = []

                      while r = lhs.next():
                        while s = rhs.next():
                          result += [r + s]
                        rhs.reset()

                      return result

Nested-Loop Join

Problem: We need to evaluate rhs iterator
once per record in lhs

Preloading Data

Naive Solution: Preload records from lhs


                    def apply_cross(lhs, rhs):
                      result = []
                      rhs_preloaded = []
                      
                      while s = rhs.next():
                        rhs_preloaded += [s]

                      while r = lhs.next():
                        for s in rhs_preloaded:
                          result += [r + s]

                      return result

Any problems with this?

Preloading Data

Better Solution: Load both lhs and rhs records in blocks.


                    def apply_cross(lhs, rhs):
                      result = []

                      while r_block = lhs.take(100):
                        while s_block = rhs.take(100):
                          for r in r_block:
                            for s in s_block: 
                              result += [r + s]
                        rhs.reset()

                      return result

Block-Nested Loop Join

How big should the blocks be?

What is the IO complexity of the algorithm?

Join Conditions

Problem: Naively, any tuple matches any other

Join Conditions

Solution: First organize the data

Strategies for Implementing $R \bowtie_{R.A = S.A} S$

Sort/Merge Join: Sort all of the data upfront, then scan over both sides.
In-Memory Index Join (1-pass Hash; Hash Join): Build an in-memory index on one table, scan the other.
Partition Join (2-pass Hash; External Hash Join): Partition both sides so that tuples don't join across partitions.

Sort/Merge Join

Limited Queries: Only supports join conditions of the form $R.A = S.B$
Low Memory: Only needs to keep ~2 rows in memory at a time (not counting sort).
Low Added CPU/IO Cost: Only requires 1 scan over each input (not counting sort).

Hash Functions

A hash function is a function that maps a large data value to a small fixed-size value
- Typically is deterministic & pseudorandom
Used in Checksums, Hash Tables, Partitioning, Bloom Filters, Caching, Cryptography, Password Storage, …
Examples: MD5, SHA1, SHA2
- MD5() part of OpenSSL (on most OSX / Linux / Unix)
Can map h(k) to range [0,N) with h(k) % N (modulus)

Hash Functions

$$h(X) \mod N$$

Pseudorandom output between $[0, N)$
Always the same output for a given $X$

1-Pass Hash Join

Limited Queries: Only supports join conditions of the form $R.A = S.B$
Moderate-High Memory: Keeps 1 relation in memory
Low Added CPU/IO Cost: Only requires 1 scan over each input.

Can use other in-memory indexes to support other join conditions.

2-Pass Hash Join

Limited Queries: Only supports join conditions of the form $R.A = S.B$
Low Memory: Never need more than 1 pair of partitions in memory
High IO Cost: Every record gets written out to disk, and back in.

Can partition on data-values to support other types of queries.

Why is it important that the hash function is pseudorandom?

Next Class

More operators, More algorithms