Overview: Optimize your implementation to support join algorithms.
Deadline: May 17
Grade: Scored out of 10 points
- 50% Correctness
- 50% Efficiency

This checkpoint is, in effect, a more rigorous form of Checkpoints 1 and 2. The requirements are identical: We give you a query and some data, you evaluate the query on the data and give us a response as quickly as possible.

We'll be expecting a more feature-complete submission. Your code will be evaluated on more queries from TPC-H benchmark, which exercises a broader range of SQL features than the previous checkpoints did.

Join Ordering

The order in which you join tables together is incredibly important, and can change the runtime of your query by multiple orders of magnitude. Picking between different join orderings is incredibly important! However, to do so, you will need statistics about the data, something that won't really be feasible until later. Instead, here's a present for those of you paying attention. The tables in each FROM clause are ordered so that you will get our recommended join order by building a left-deep plan going in-order of the relation list (something that many of you are doing already), and (for hybrid hash joins) using the left-hand-side relation to build your hash table.

Query Rewriting

In the prior checkpoints, you were encouraged to parse SQL into a relational algebra tree. This checkpoint is where that design choice will begins to pay off. We've discussed expression equivalences in relational algebra, and identified several that are always good (e.g., pushing down selection operators). The reference implementation uses some simple recursion to identify patterns of expressions that can be optimized and rewrite them. For example, if I wanted to define a new HashJoin operator, I might go through and replace every qualifying Selection operator sitting on top of a CrossProduct operator with a HashJoin.

The checkpoint 3 reference implementation optimizer implements three improvements:

Sort-Merge Join: An implementation of sort-merge join for use on out-of-memory joins.
1-Pass Hash Join: An implementation of the in-memory hash join algorithm.
Join Specialization: Rewrite Selection + CrossProduct operators into Hash Join operators

Interface

Your code will be evaluated in exactly the same way as Project 1. Data will drawn from a 1GB (SF 1) TPC-H dataset.

java -cp build:jsqlparser.jar 
     dubstep.Main 
     --data [data] 
     [sqlfile1] [sqlfile2] ...

This example uses the following directories and files:

[data]: Table data stored in '|' separated files. As before, table names match the names provided in the matching CREATE TABLE with the .dat suffix.
[sqlfileX]: A file containing CREATE TABLE and SELECT statements, defining the schema of the dataset and the query to process

Grading

Your code will be subjected to a sequence of test cases and evaluated on speed and correctness. Note that unlike Project 1, you will neither receive a warning about, nor partial credit for out-of-order query results if the outermost query includes an ORDER BY clause.

Phase 1 (big queries) will be graded on a TPC-H SF 0.1 dataset (100 MB of raw text data). Phase 2 (limited memory) will be graded on datasets of varying sizes. Grades are assigned based on per-query thresholds:

0/10 (F): Your submission does not compile, does not produce correct output, or fails in some other way. Resubmission is highly encouraged.
5/10 (C): Your submission runs the test query faster than the C threshold (listed below for each query), and produces the correct output.
7.5/10 (B): Your submission runs the test query faster than the B threshold (listed below for each query), and produces the correct output.
10/10 (A): Your submission runs the test query faster than the A threshold (listed below for each query), and produces the correct output.

TPC-H Query	Phase 1 Runtimes		Phase 2 Runtimes	Phase 2 Scaling Factor
1	TBD	A	TBD	SF = 0.1
	TBD	B	TBD
	TBD	C	TBD
3	TBD	A	TBD	SF = 0.1
	TBD	B	TBD
	TBD	C	TBD
5	TBD	A	TBD	SF = 0.1
	TBD	B	TBD
	TBD	C	TBD
10	TBD	A	TBD	SF = 0.5
	TBD	B	TBD
	TBD	C	TBD
12	TBD	A	TBD	SF = 0.5
	TBD	B	TBD
	TBD	C	TBD