Overview: Optimize your implementation to support indexing and join algorithms.
Deadline: May 11 (Tentative)
Grade: Scored out of 10 points
- 50% Correctness
- 50% Efficiency
Waring: The values given here are still tentative.

This checkpoint is, in effect, a more rigorous form of Checkpoints 1, 2 and 3. The requirements are almost identical: We give you a query and some data, you evaluate the query on the data and give us a response as quickly as possible, with limited memory available to you.

We'll be expecting a more feature-complete submission. Your code will be evaluated on queries from TPC-H benchmark, which exercises a broader range of SQL features than the previous checkpoints did.

Join Ordering

The order in which you join tables together is incredibly important, and can change the runtime of your query by multiple orders of magnitude. Picking between different join orderings is incredibly important! However, to do so, you will need statistics about the data, something that won't really be feasible until later. Instead, here's a present for those of you paying attention. The tables in each FROM clause are ordered so that you will get our recommended join order by building a left-deep plan going in-order of the relation list (something that many of you are doing already), and (for hybrid hash joins) using the left-hand-side relation to build your hash table.

Preprocessing

You will have 100MB of memory and 5 minutes with all CREATE TABLE statements. The total datasize will be 500MB. The reference implementation uses this time to build indexes over the data – in-memory and/or on-disk. Students in prior years have come up with other creative ways to use this time. You can create indexes over the data, collect statistics to optimize your query plan, and reshape the data in a way you can most efficiently use it. After you read the last CREATE TABLE query, do NOT print the prompt until you complete the preprocessing because when you do, we will start to issue SELECT queries.

CREATE TABLE statements will include index suggestions, both via unique PRIMARY KEY attributes, foreign key REFERENCES attributes and non-unique INDEX fields. You can get access to both through the getIndexes() method of CreateTable, and also getColumnSpecStrings() in column definitions.

Query Rewriting

In the prior checkpoints, you were encouraged to parse SQL into a relational algebra tree. This checkpoint is where that design choice will begin to pay off. We've discussed expression equivalences in relational algebra, and identified several that are always good (e.g., pushing down selection operators). The reference implementation uses some simple recursion to identify patterns of expressions that can be optimized and rewrite them. For example, if I wanted to define a new HashJoin operator, I might go through and replace every qualifying Selection operator sitting on top of a CrossProduct operator with a HashJoin.

The checkpoint 4 reference implementation optimizer implements three improvements:

Sort-Merge Join: An implementation of sort-merge join for use on out-of-memory joins.
1-Pass Hash Join: An implementation of the in-memory hash join algorithm.
Index Nested-Loop Join: An implementation of index nested loop join algorithm.
Join Specialization: Rewrite Selection + CrossProduct operators into Hash Join operators.
Index Scanner: Search values with index lookup.

Grading Workflow

All .java files in the src directory at the root of your repository will be compiled (and linked against JSQLParser). A main file that you can take as an example is given here. As before, the class edu.buffalo.www.cse4562.Main will be invoked, and a stream of semicolon-delimited queries will be printed to System.in (after you print out a prompt). Also, make sure that you use the path we provide you in --data argument. Hardcoding the location may cause problems.

For example (red text is entered by the user/grader):

bash> ls data
R.dat
S.dat
T.dat
bash> cat data/R.dat
1|1|5
1|2|6
2|3|7
bash> cat data/S.dat
1|2|6
3|3|2
3|5|2
bash> find {code root directory} -name \*.java -print > compile.list
bash> javac -cp {libs location}/commons-csv-1.5.jar:{libs location}/evallib-1.0.jar:{libs location}/jsqlparser-1.0.0.jar -d {compiled directory name} @compile.list
bash> java -cp {compiled directory name}/src/:{libs location}/commons-csv-1.5.jar:{libs location}/evallib-1.0.jar:{libs location}/jsqlparser-1.0.0.jar edu.buffalo.www.cse4562.Main --data data/
$> CREATE TABLE R(A int, B int, C int);
$> CREATE TABLE S(D int, E int, F int);
$> SELECT B, C FROM R WHERE A = 1;
1|5
2|6
$> SELECT A, E FROM R, S WHERE R.A = S.D;
1|2
1|2

For this project, we will issue 5 queries to your program excluding CREATE TABLE queries. All of these queries will be timed, and they will be evaluated based on the correctness of the query results. Answering each query successfully will bring you 1 point each. An example file you will read the data from is given here. You will gain performance points for each query for matching or beating the reference implementation timewise. Also keep in mind that for ALL queries, the grader will time out and exit after 5 minutes. There is also a 100MB memory limit that will not allow you to load full tables and cross product them for joins.

	Points for Correctness	Points for Performance	Table Size
Query 1	1	1	~500 MB
Query 2	1	1	~500 MB
Query 3	1	1	~500 MB
Query 4	1	1	~500 MB
Query 5	1	1	~500 MB