CSE-4/562 Spring 2021 - Checkpoint 4

Checkpoint 4

CSE-4/562 Spring 2021

April, 2021

Garcia-Molina/Ullman/Widom: Ch. 8.3-8.4, 14.1-14.2, 14.4

Checkpoint 4

Just like Checkpoint 3, but now...

  • Tighter Bounds
  • More Start-Up Time
  • Hints About the Table

Primary Key


    CREATE TABLE CUSTOMER(
      CUSTKEY INT, 
      NAME STRING, 
      ADDRESS STRING, 
      NATIONKEY INT, 
      PHONE STRING, 
      ACCTBAL FLOAT, 
      MKTSEGMENT STRING, 
      COMMENT STRING
    ) USING csv OPTIONS(
      path 'data/CUSTOMER.data', 
      delimiter = '|',
      primary_key = 'custkey'
    )
    

Primary Key


    primary_key = 'custkey',
    
or

    primary_key = 'orderkey,lineitem',
    

comma-separated list describing the primary key of the table

The Reference Implementation
In-Memory Tables
Primary Key Index
Index Scans
Ideas From Past Submissions
Index-Nested Loop Join
Materialized Views
Gather Statistics
Secondary Indexes

In-Memory Tables

Time to scan SF 0.1 LINEITEM

SourceTime
CSV on NVME SSD 0.88745s
   

 

What does that 1s include?

  • Disk to Ram
  • Buffering in the OS
  • IPC
  • Split on |
  • Parse Int, Float, Date
  • Iterate Over Each Tuple

In-Memory Tables

Time to scan SF 0.1 LINEITEM

SourceTime
CSV on NVME SSD 0.88745s
IndexedSeq[InternalRow] 0.018s

~30x speedup

Takeaway: Read data in at CREATE TABLE

Compiler


      Table(...)
    

Return an iterator over the preloaded table.

Incorporating Trees into Queries

$\sigma_C(R)$ and $(\ldots \bowtie_C R)$

Original Query: $\pi_A\left(\sigma_{B = 1 \wedge C < 3}(R)\right)$

Possible Implementations:

$\pi_A\left(\sigma_{B = 1 \wedge C < 3}(R)\right)$
Always works... but slow
$\pi_A\left(\sigma_{B = 1}( IndexScan(R,\;C < 3) ) \right)$
Requires a non-hash index on $C$
$\pi_A\left(\sigma_{C < 3}( IndexScan(R,\;B=1) ) \right)$
Requires a any index on $B$
$\pi_A\left( IndexScan(R,\;B = 1, C < 3) \right)$
Requires any index on $(B, C)$

Lexical Sort (Non-Hash Only)

Sort data on $(A, B, C, \ldots)$

First sort on $A$, $B$ is a tiebreaker for $A$,
$C$ is a tiebreaker for $B$, etc...

All of the $A$ values are adjacent.
Supports $\sigma_{A = a}$ or $\sigma_{A \geq b}$
For a specific $A$, all of the $B$ values are adjacent
Supports $\sigma_{A = a \wedge B = b}$ or $\sigma_{A = a \wedge B \geq b}$
For a specific $(A,B)$, all of the $C$ values are adjacent
Supports $\sigma_{A = a \wedge B = b \wedge C = c}$ or $\sigma_{A = a \wedge B = b \wedge C \geq c}$
...

For a query $\sigma_{c_1 \wedge \ldots \wedge c_N}(R)$

  1. For every $c_i \equiv (A = a)$: Do you have any index on $A$?
  2. For every $c_i \in \{\; (A \geq a), (A > a), (A \leq a), (A < a)\;\}$: Do you have a tree index on $A$?
  3. For every $c_i, c_j$, do you have an appropriate index?
  4. A simple table scan is also an option

Which one do we pick?

(Start picking arbitrarily, then experiment)

These are called "Access Paths"

Primary Key Index

Time to filter SF 0.1 LINEITEM for one orderkey

SourceTime
CSV on NVME SSD 0.9196s
IndexedSeq[InternalRow] 0.0624s
   

 

What does that 62ms include?

  • EqualTo(...).eval(...) for each row
  • Iterate Over Each Tuple

Primary Key Index

Time to filter SF 0.1 LINEITEM for one orderkey

SourceTime
CSV on NVME SSD 0.9196s
IndexedSeq[InternalRow] 0.0624s
Sorted IndexedSeq[InternalRow] + Bin Search 0.0008s

~80x speedup

Takeaway: Sort on primary key and binary search.

Compiler


      Filter(expression, Table(...))
    

If expression is a ...

If expression is a ...

EqualTo between the Table key and a constant
Binary search for the key!
[Greater|Less]Than[OrEquals] between the Table key and a constant
Binary search for the lower/upper bound
Greater and Lower
Binary search for the lower and upper bound
One of the above and more
Binary search + Filter the rest

But TPC-H doesn't have filters on keys...

Also Index


      USING csv OPTIONS(
      path '../TPCH/LINEITEM.csv', 
      delimiter = '|',
      primary_key = 'orderkey,linenumber',
      tree_index = 'shipdate',
      hash_index = 'linestatus|shipmode'
    )
    

tree_index and hash_index are |-separated lists of ,-separated indexes.

Questions?