CSE 4/562 - Database Systems

Query Evaluation

CSE 4/562 – Database Systems

February 12, 2018

Query Evaluation Styles

All-At-Once (Collections)
Bottom-up, one operator at a time.
Volcano-Style (Iterators)
Operators "request" one tuple at a time from children.
Push-Style (Buffers)
Operators continuously produce/consume tuples.

Basic Mindset


  r = get_table("R")

  s = get_table("S")
  
  temp1 = apply_join(r, s, "R.B = S.B")
  
  temp2 = apply_select(temp1, "S.C = 10")
  
  result = apply_projection(temp2, "R.A")
          

Basic Mindset


      def build_tree(operator):

        if """ operator is a base table """:
          return get_table(...)

        elif """ operator is a selection """:
          return apply_select(operator.child, operator.condition)

        elif """ handle remaining cases similarly """:
          

Select

$$\sigma_{A \neq 3} R$$

AB
12
34
56

Select


                  def apply_select(input, condition)
                    result = []

                    for row in input:
                      if condition(row):
                        result += [row]

                    return result;
          

(All-At-Once)

Select

$$\sigma_{A \neq 3} R$$

AB
getNext()for row in input:
12return row;
getNext()for row in input:
34X
56return row;
getNext()for row in input:
Nonereturn None;

Select

Project

Union

Cross


                    def apply_cross(lhs, rhs):
                      result = []

                      for r in lhs:
                        for s in rhs:
                          result += [r + s]

                      return result
          

Cross

What's the complexity of this cross-product algorithm?

... in terms of compute

... in terms of IOs

Cross Product Problems

Need to scan the inner relation multiple times!
Load data intelligently to mitigate expensive IOs
Every tuple needs to be paired with every other tuple!
Exploit join conditions to minimize pairs of tuples

Preloading Data

Nested-Loop Join


                    def apply_cross(lhs, rhs):
                      result = []

                      while r = lhs.next():
                        while s = rhs.next():
                          result += [r + s]
                        rhs.reset()

                      return result
          

Nested-Loop Join

Problem: We need to evaluate rhs iterator
once per record in lhs

Preloading Data

Naive Solution: Preload records from lhs


                    def apply_cross(lhs, rhs):
                      result = []
                      rhs_preloaded = []
                      
                      while s = rhs.next():
                        rhs_preloaded += [s]

                      while r = lhs.next():
                        for s in rhs_preloaded:
                          result += [r + s]

                      return result
          

Any problems with this?

Preloading Data

Better Solution: Load both lhs and rhs records in blocks.


                    def apply_cross(lhs, rhs):
                      result = []

                      while r_block = lhs.take(100):
                        while s_block = rhs.take(100):
                          for r in r_block:
                            for s in s_block: 
                              result += [r + s]
                        rhs.reset()

                      return result
          

Block-Nested Loop Join

How big should the blocks be?

What is the IO complexity of the algorithm?

Join Conditions

Problem: Naively, any tuple matches any other

Join Conditions

Solution: First organize the data

Strategies for Implementing $R \bowtie_{R.A = S.A} S$

Sort/Merge Join
Sort all of the data upfront, then scan over both sides.
In-Memory Index Join (1-pass Hash; Hash Join)
Build an in-memory index on one table, scan the other.
Partition Join (2-pass Hash; External Hash Join)
Partition both sides so that tuples don't join across partitions.

Sort/Merge Join

Sort/Merge Join

Limited Queries
Only supports join conditions of the form $R.A = S.B$
Low Memory
Only needs to keep ~2 rows in memory at a time (not counting sort).
Low Added CPU/IO Cost
Only requires 1 scan over each input (not counting sort).

Hash Functions

  • A hash function is a function that maps a large data value to a small fixed-size value
    • Typically is deterministic & pseudorandom
  • Used in Checksums, Hash Tables, Partitioning, Bloom Filters, Caching, Cryptography, Password Storage, …
  • Examples: MD5, SHA1, SHA2
    • MD5() part of OpenSSL (on most OSX / Linux / Unix)
  • Can map h(k) to range [0,N) with h(k) % N (modulus)

Hash Functions

$$h(X) \mod N$$

  • Pseudorandom output between $[0, N)$
  • Always the same output for a given $X$

1-Pass Hash Join

1-Pass Hash Join

Limited Queries
Only supports join conditions of the form $R.A = S.B$
Moderate-High Memory
Keeps 1 relation in memory
Low Added CPU/IO Cost
Only requires 1 scan over each input.

Can use other in-memory indexes to support other join conditions.

2-Pass Hash Join

2-Pass Hash Join

Limited Queries
Only supports join conditions of the form $R.A = S.B$
Low Memory
Never need more than 1 pair of partitions in memory
High IO Cost
Every record gets written out to disk, and back in.

Can partition on data-values to support other types of queries.

Why is it important that the hash function is pseudorandom?

Next Class

More operators, More algorithms