Checkpoint 2

Overview: New SQL features, Limited Memory, Faster Performance
Deadline: March 16
Grade: 10% of Project Component
- 6% Correctness
- 4% Efficiency

This project follows the same outline as Checkpoint 1. Your code gets SQL queries and is expected to answer them. There are a few key differences:

Queries may now include CROSS PRODUCTs and JOINs.
Queries may now include a ORDER BY clause. Sort is a blocking, or 2-pass operator. Ideally you need to handle both the case where you can fit everything into memory and the case where you can not. However, this time, you are only required to handle the case where you can fit everything into memory. However, everything does not cover SQL to RA translation mistakes, so be careful where you place your PROJECTION.
Queries may now include a LIMIT clause, and/or a nested subquery. The subquery can be either in the FROM clause or in the WHERE clause.

Sorting and Grouping Data

Sort is a blocking operator. Before it emits even one row, it needs to see the entire dataset. If you have enough memory to hold the entire input to be sorted, then you can just use Java's built-in Collections.sort method. However, for at least a few queries you will likely not have enough memory to keep everything available. In that case, a good option is to use the 2-pass sort algorithm that we discussed in class.

Grading Workflow

All .java files in the src directory at the root of your repository will be compiled (and linked against JSQLParser). A main file that you can take as an example is given here. As before, the class edu.buffalo.www.cse4562.Main will be invoked with no arguments, and a stream of semicolon-delimited queries will be printed to System.in (after you print out a prompt)

For example (red text is entered by the user/grader):

bash> ls data
R.dat
S.dat
T.dat
bash> cat data/R.dat
1|1|5
1|2|6
2|3|7
bash> cat data/S.dat
1|2|6
3|3|2
3|5|2
bash> find {code root directory} -name \*.java -print > compile.list
bash> javac -cp {libs location}/commons-csv-1.5.jar:{libs location}/evallib-1.0.jar:{libs location}/jsqlparser-1.0.0.jar -d {compiled directory name} @compile.list
bash> java -cp {compiled directory name}/src/:{libs location}/commons-csv-1.5.jar:{libs location}/evallib-1.0.jar:{libs location}/jsqlparser-1.0.0.jar edu.buffalo.www.cse4562.Main --data data/
$> CREATE TABLE R(A int, B int, C int);
$> CREATE TABLE S(D int, E int, F int);
$> SELECT B, C FROM R WHERE A = 1;
1|5
2|6
$> SELECT A, E FROM R, S WHERE R.A = S.D;
1|2
1|2

For this project, we will issue 5 queries to your program excluding CREATE TABLE queries. 3 of these queries will NOT be timed, and they will evaluated based on the correctness of the query results. Answering each query successfully will bring you 1 point each. An example file you will read the data from is given here. The remaining two queries will be timed, and they will run on files that have around 1000 tuples (~100 KB). You will receive 1.5 points for each query if you can return correct results. You will receive additional 2 points for each query for matching or beating the reference implementation timewise. Also keep in mind that for ALL queries, the grader will time out and exit after 5 minutes. There is also a memory limit that will not allow you to load full tables and cross product them for joins.

	Points for Correctness	Points for Performance	Table Size
Query 1	1	0	~500 tuples each table
Query 2	1	0	~500 tuples each table
Query 3	1	0	~500 tuples each table
Query 4	1.5	2	~1000 tuples each table
Query 5	1.5	2	~1000 tuple each table