- Checkpoint 4

Checkpoint 4

April, 2021

Garcia-Molina/Ullman/Widom: Ch. 8.3-8.4, 14.1-14.2, 14.4

Checkpoint 4

Just like Checkpoint 3, but now...

Tighter Bounds
More Start-Up Time
Hints About the Table

Primary Key


    CREATE TABLE CUSTOMER(
      CUSTKEY INT, 
      NAME STRING, 
      ADDRESS STRING, 
      NATIONKEY INT, 
      PHONE STRING, 
      ACCTBAL FLOAT, 
      MKTSEGMENT STRING, 
      COMMENT STRING
    ) USING csv OPTIONS(
      path 'data/CUSTOMER.data', 
      delimiter = '|',
      primary_key = 'custkey'
    )

Primary Key


    primary_key = 'custkey',


    primary_key = 'orderkey,lineitem',

comma-separated list describing the primary key of the table

The Reference Implementation: In-Memory Tables; Primary Key Index; Index Scans
Ideas From Past Submissions: Index-Nested Loop Join; Materialized Views; Gather Statistics; Secondary Indexes

In-Memory Tables

Time to scan SF 0.1 LINEITEM

Source	Time
CSV on NVME SSD	0.88745s

What does that 1s include?

Disk to Ram
Buffering in the OS
IPC
Split on |
Parse Int, Float, Date
Iterate Over Each Tuple

In-Memory Tables

Time to scan SF 0.1 LINEITEM

Source	Time
CSV on NVME SSD	0.88745s
IndexedSeq[InternalRow]	0.018s

~30x speedup

Takeaway: Read data in at CREATE TABLE

Compiler


      Table(...)

Return an iterator over the preloaded table.

Incorporating Trees into Queries

$\sigma_C(R)$ and $(\ldots \bowtie_C R)$

Original Query: $\pi_A\left(\sigma_{B = 1 \wedge C < 3}(R)\right)$

Possible Implementations:

$\pi_A\left(\sigma_{B = 1 \wedge C < 3}(R)\right)$: Always works... but slow
$\pi_A\left(\sigma_{B = 1}( IndexScan(R,\;C < 3) ) \right)$: Requires a non-hash index on $C$
$\pi_A\left(\sigma_{C < 3}( IndexScan(R,\;B=1) ) \right)$: Requires a any index on $B$
$\pi_A\left( IndexScan(R,\;B = 1, C < 3) \right)$: Requires any index on $(B, C)$

Lexical Sort (Non-Hash Only)

Sort data on $(A, B, C, \ldots)$

First sort on $A$, $B$ is a tiebreaker for $A$,
$C$ is a tiebreaker for $B$, etc...

All of the $A$ values are adjacent.: Supports $\sigma_{A = a}$ or $\sigma_{A \geq b}$
For a specific $A$, all of the $B$ values are adjacent: Supports $\sigma_{A = a \wedge B = b}$ or $\sigma_{A = a \wedge B \geq b}$
For a specific $(A,B)$, all of the $C$ values are adjacent: Supports $\sigma_{A = a \wedge B = b \wedge C = c}$ or $\sigma_{A = a \wedge B = b \wedge C \geq c}$
...

For a query $\sigma_{c_1 \wedge \ldots \wedge c_N}(R)$

For every $c_i \equiv (A = a)$: Do you have any index on $A$?
For every $c_i \in \{\; (A \geq a), (A > a), (A \leq a), (A < a)\;\}$: Do you have a tree index on $A$?
For every $c_i, c_j$, do you have an appropriate index?
A simple table scan is also an option

Which one do we pick?

(Start picking arbitrarily, then experiment)

These are called "Access Paths"

Primary Key Index

Time to filter SF 0.1 LINEITEM for one orderkey

Source	Time
CSV on NVME SSD	0.9196s
IndexedSeq[InternalRow]	0.0624s

What does that 62ms include?

EqualTo(...).eval(...) for each row
Iterate Over Each Tuple

Primary Key Index

Time to filter SF 0.1 LINEITEM for one orderkey

Source	Time
CSV on NVME SSD	0.9196s
IndexedSeq[InternalRow]	0.0624s
Sorted IndexedSeq[InternalRow] + Bin Search	0.0008s

~80x speedup

Takeaway: Sort on primary key and binary search.

Compiler


      Filter(expression, Table(...))

If expression is a ...

EqualTo between the Table key and a constant: Binary search for the key!
[Greater|Less]Than[OrEquals] between the Table key and a constant: Binary search for the lower/upper bound; Greater and Lower; Binary search for the lower and upper bound; One of the above and more; Binary search + Filter the rest

But TPC-H doesn't have filters on keys...

Also Index


      USING csv OPTIONS(
      path '../TPCH/LINEITEM.csv', 
      delimiter = '|',
      primary_key = 'orderkey,linenumber',
      tree_index = 'shipdate',
      hash_index = 'linestatus|shipmode'
    )

tree_index and hash_index are |-separated lists of ,-separated indexes.

Questions?