# Checkpoint 3

• Overview: New SQL features (Aggregation), Limited Memory, Faster Performance, Different Join Algorithms
• Grade: 15% of Project Component
• 8% Correctness
• 7% Efficiency

This project follows the same outline as Checkpoints 1 and 2. Your code receives SQL queries and is expected to answer them. You are expected to implement all the features from Checkpoints 1 and 2. Additionally, there are two key differences:

• Queries may now include a GROUP BY clause and the MIN(), MAX(), SUM(), COUNT(), and AVG() aggregate functions.
• You will be expected to process queries faster and use less memory.
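All five of the listed aggregates can be maintained in a single pass with constant state per group; AVG in particular falls out of SUM and COUNT, so no second pass over the data is needed. The sketch below illustrates this, assuming nothing about your operator interfaces; the class and field names are purely illustrative.

```java
// Illustrative sketch: one accumulator holding O(1) state that supports
// MIN, MAX, SUM, COUNT, and AVG in a single pass over the input.
public class RunningAggregate {
    long count = 0;
    double sum = 0;
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;

    // Fold one value into the running state.
    void accumulate(double v) {
        count++;
        sum += v;
        if (v < min) min = v;
        if (v > max) max = v;
    }

    // AVG is derived from SUM and COUNT; no extra pass required.
    double avg() { return sum / count; }

    public static void main(String[] args) {
        RunningAggregate agg = new RunningAggregate();
        for (double v : new double[]{5, 6, 7}) agg.accumulate(v);
        System.out.println(agg.min + " " + agg.max + " " + agg.sum
                + " " + agg.count + " " + agg.avg());
        // prints 5.0 7.0 18.0 3 6.0
    }
}
```

In a grouped query you would keep one such accumulator per group, which is what makes the memory discussion in the next section matter: the state per group is tiny, but the number of groups can be large.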

## Grouping Data

Just like Order-by, Group-by aggregation is a blocking operator. If you run out of memory for the groups, you will need to implement a memory-aware grouping operator. One idea is to re-use the sort operator to bring rows of the same group together, then aggregate with the sorted-grouping technique. Queries will have at most one or two attributes in the group-by clause, so you do not need to handle an unlimited number of group-by attributes.
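The sorted-grouping idea can be sketched as follows: after sorting by the group-by key, all rows of a group are adjacent, so a single streaming pass with O(1) per-group state suffices. This is a minimal illustration (SUM over integer pairs), not the project's required interface; the class and method names are assumptions.

```java
import java.util.*;

// Illustrative sketch of sort-based grouping: sort rows by the group-by key,
// then stream through them, closing a group each time the key changes.
public class SortedGroupBy {

    // Computes SUM(value) per key over rows of the form {key, value}.
    public static Map<Integer, Integer> sumByKey(List<int[]> rows) {
        // Step 1: sort by the grouping attribute. In the real operator this
        // would be your (possibly external) sort operator, so the data need
        // not fit in memory.
        rows.sort(Comparator.comparingInt(r -> r[0]));

        Map<Integer, Integer> result = new LinkedHashMap<>();
        Integer currentKey = null;
        int runningSum = 0;

        // Step 2: one pass over the sorted rows; equal keys are adjacent,
        // so only one group's state is live at a time.
        for (int[] row : rows) {
            if (currentKey != null && row[0] != currentKey) {
                result.put(currentKey, runningSum);  // group boundary: emit
                runningSum = 0;
            }
            currentKey = row[0];
            runningSum += row[1];
        }
        if (currentKey != null) result.put(currentKey, runningSum);
        return result;
    }

    public static void main(String[] args) {
        List<int[]> rows = new ArrayList<>(Arrays.asList(
            new int[]{2, 7}, new int[]{1, 5}, new int[]{1, 6}));
        System.out.println(sumByKey(rows));  // {1=11, 2=7}
    }
}
```

The key property is that the aggregation pass itself is pipelined: only the sort is blocking, and you have presumably already made your sort memory-aware for Checkpoint 2's ORDER BY.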

## Optimization/Query Rewriting

In the prior checkpoints, you were encouraged to parse SQL into a relational algebra tree. This checkpoint is where that design choice begins to pay off. We've discussed expression equivalences in relational algebra and identified several that are always good (e.g., pushing down selection operators). You should have implemented selection pushdown for Checkpoint 2. The reference implementation uses some simple recursion to identify patterns of expressions that can be optimized and rewrite them. For example, if I wanted to define a new HashJoin operator, I might go through the tree and replace every qualifying Selection operator sitting on top of a CrossProduct operator with a HashJoin.
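The recursion described above can be sketched like this. The node classes below are placeholders standing in for whatever tree representation you built in earlier checkpoints, and `isEquiJoinPred` assumes some helper has already classified the predicate; none of these names come from the project skeleton.

```java
// Placeholder relational algebra nodes for illustration only.
abstract class RANode { }

class Scan extends RANode {
    final String table;
    Scan(String table) { this.table = table; }
}

class CrossProduct extends RANode {
    RANode left, right;
    CrossProduct(RANode left, RANode right) { this.left = left; this.right = right; }
}

class Selection extends RANode {
    RANode child;
    final String predicate;        // e.g. "R.A = S.D" (a parsed form in practice)
    final boolean isEquiJoinPred;  // assume a helper classified the predicate
    Selection(RANode child, String predicate, boolean isEquiJoinPred) {
        this.child = child;
        this.predicate = predicate;
        this.isEquiJoinPred = isEquiJoinPred;
    }
}

class HashJoin extends RANode {
    final RANode left, right;
    final String predicate;
    HashJoin(RANode left, RANode right, String predicate) {
        this.left = left;
        this.right = right;
        this.predicate = predicate;
    }
}

public class Rewriter {
    // Recurse into children first so nested patterns are rewritten
    // before their parents are inspected.
    public static RANode rewrite(RANode node) {
        if (node instanceof Selection) {
            Selection sel = (Selection) node;
            sel.child = rewrite(sel.child);
            // The pattern: an equi-join Selection directly over a CrossProduct.
            if (sel.isEquiJoinPred && sel.child instanceof CrossProduct) {
                CrossProduct cp = (CrossProduct) sel.child;
                return new HashJoin(cp.left, cp.right, sel.predicate);
            }
            return sel;
        }
        if (node instanceof CrossProduct) {
            CrossProduct cp = (CrossProduct) node;
            cp.left = rewrite(cp.left);
            cp.right = rewrite(cp.right);
            return cp;
        }
        return node; // Scan and any other leaf
    }
}
```

The same recursive skeleton works for other rewrites (selection pushdown, projection pushdown): each rule is just another pattern test inside the recursion.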

Another optimization that is always good is projection pushdown. Essentially, you read only the attributes that the query needs from each database file, and discard all the attributes that you will not use. In practice, it is expensive to copy the values of a tuple into a new tuple, so this is especially helpful when an operator changes the schema of its input and outputs a tuple with a different schema (e.g., Cross Product and Join). You will also save considerable memory with this improvement.
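At the scan, projection pushdown can be as simple as keeping only the needed column positions while parsing each '|'-delimited line, rather than materializing the full tuple. This is a minimal sketch under that assumption; the class and method names are made up for illustration.

```java
import java.util.*;

// Illustrative sketch of projection pushdown at the scan: given the set of
// column positions the query actually uses, keep only those fields.
public class ProjectingScan {

    // Parses one '|'-delimited line, keeping only the requested columns,
    // in the order given.
    public static String[] project(String line, int[] neededCols) {
        String[] all = line.split("\\|");
        String[] kept = new String[neededCols.length];
        for (int i = 0; i < neededCols.length; i++) {
            kept[i] = all[neededCols[i]];
        }
        return kept;
    }

    public static void main(String[] args) {
        // Suppose the query only needs columns B and C (positions 1 and 2)
        // of R(A, B, C); the row "1|2|6" shrinks to its last two fields.
        System.out.println(Arrays.toString(project("1|2|6", new int[]{1, 2})));
        // prints [2, 6]
    }
}
```

In a full implementation, the rewriter from the previous section would compute the needed-columns set for each scan by walking the tree top-down and collecting attribute references.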

All .java files in the src directory at the root of your repository will be compiled (and linked against JSQLParser). A main file that you can take as an example is given here. As before, the class edu.buffalo.www.cse4562.Main will be invoked, and a stream of semicolon-delimited queries will be provided on System.in (after you print out a prompt). Also, make sure that you use the path we provide you in the --data argument; hardcoding the location may cause problems.

For example (text following the `bash>` and `$>` prompts is entered by the user/grader):

```
bash> ls data
R.dat
S.dat
T.dat
bash> cat data/R.dat
1|1|5
1|2|6
2|3|7
bash> cat data/S.dat
1|2|6
3|3|2
3|5|2
bash> find {code root directory} -name \*.java -print > compile.list
bash> javac -cp {libs location}/commons-csv-1.5.jar:{libs location}/evallib-1.0.jar:{libs location}/jsqlparser-1.0.0.jar -d {compiled directory name} @compile.list
bash> java -cp {compiled directory name}/src/:{libs location}/commons-csv-1.5.jar:{libs location}/evallib-1.0.jar:{libs location}/jsqlparser-1.0.0.jar edu.buffalo.www.cse4562.Main --data data/
$> CREATE TABLE R(A int, B int, C int);
$> CREATE TABLE S(D int, E int, F int);
$> SELECT B, C FROM R WHERE A = 1;
1|5
2|6
$> SELECT A, E FROM R, S WHERE R.A = S.D;
1|2
1|2
```


For this project, we will issue 5 queries to your program, excluding CREATE TABLE queries. 2 of these queries will NOT be timed; they will be evaluated based only on the correctness of the query results, and answering each of them successfully is worth 2 points. An example file you will read the data from is given here. The remaining three queries will be timed, and they will run on files that are around 500 MB in total. You will receive performance points for each timed query by matching or beating the reference implementation's time. Also keep in mind that for ALL queries, the grader will time out and exit after 5 minutes. There is also a memory limit that will not allow you to load full tables and cross-product them for joins.

|         | Points for Correctness | Points for Performance | Table Size |
|---------|------------------------|------------------------|------------|
| Query 1 | 2                      | 0                      | ~500 MB    |
| Query 2 | 2                      | 0                      | ~500 MB    |
| Query 3 | 1                      | 2                      | ~500 MB    |
| Query 4 | 2                      | 2                      | ~500 MB    |
| Query 5 | 1                      | 3                      | ~500 MB    |