Checkpoint 2

This project follows the same outline as Checkpoint 1. Your code gets SQL queries and is expected to answer them. There are a few key differences:

Sorting and Grouping Data

Sort is a blocking operator. Before it emits even one row, it needs to see the entire dataset. If you have enough memory to hold the entire input to be sorted, then you can just use Java's built-in Collections.sort method. However, for the memory-restricted part of the workflow, you will likely not have enough memory to keep everything available. In that case, a good option is to use the 2-pass sort algorithm that we discussed in class.

Join Ordering

The order in which you join tables together is incredibly important, and can change the runtime of your query by multiple orders of magnitude.  Picking between different join orderings is incredibly important!  However, to do so, you will need statistics about the data, something that won't really be feasible until the next project.  Instead, here's a present for those of you paying attention.  The tables in each FROM clause are ordered so that you will get our recommended join order by building a left-deep plan going in-order of the relation list (something that many of you are doing already), and (for hybrid hash joins) using the left-hand-side relation to build your hash table.

Query Rewriting

In Project 1, you were encouraged to parse SQL into a relational algebra tree.  Project 2 is where that design choice begins to pay off.  We've discussed expression equivalences in relational algebra, and identified several that are always good (e.g., pushing down selection operators). The reference implementation uses some simple recursion to identify patterns of expressions that can be optimized and rewrite them.  For example, if I wanted to define a new HashJoin operator, I might go through and replace every qualifying Selection operator sitting on top of a CrossProduct operator with a HashJoin.

if(o instanceof Selection){
  Selection s = (Selection)o;
  if(s.getChild() instanceof CrossProduct){
    CrossProduct prod = 
       (CrossProduct)s.getChild();
    Expression join_cond = 
       // find a good join condition in 
       // the predicate of s.
    Expression rest =      
       // the remaining conditions
    return new Selection(
      rest, 
      new HashJoin(
        join_cond, 
        prod.getLHS(), 
        prod.getRHS()
      )
    );
  }
}
return o;

The reference implementation has a function similar to this snippet of code, and applies the function to every node in the relational algebra tree.

Because selection can be decomposed, you may find it useful to have a piece of code that can split AndExpressions into a list of conjunctive terms:

List<Expression> splitAndClauses(Expression e) 
{
  List<Expression> ret = 
     new ArrayList<Expression();
  if(e instanceof AndExpression){
    AndExpression a = (AndExpression)e;
    ret.addAll(
      splitAndClauses(a.getLeftExpression())
    );
    ret.addAll(
      splitAndClauses(a.getRightExpression())
    );
  } else {
    ret.add(e);
  }
}

Grading Workflow

As before, the class dubstep.Main will be invoked and a stream of semicolon-delimited queries will be printed to System.in (one after after each time you print out a prompt)

All .java / .scala files in your repository will be compiled (and linked against JSQLParser). Your code will be subjected to a sequence of test cases and evaluated on speed and correctness.  Note that unlike Project 1, you will neither receive a warning about, nor partial credit for out-of-order query results if the outermost query includes an ORDER BY clause. For this checkpoint, we will use predominantly queries chosen from the TPC-H benchmark workload.

Phase 1 (big queries) will be graded on a TPC-H SF 1 dataset (1 GB of raw text data).  Phase 2 (limited memory) will be graded on either a TPC-H SF 1 or SF 0.2 (200 MB of raw text data).  Grades are assigned based on per-query thresholds:

Unlike before, your code will be given arguments. During the initial phase of the workload, your code will be launched with --in-mem as one of its arguments. During the memory-restricted phase of the workload, your code will be launched with --on-disk as one of its arguments. You may use the data/ directory to store temporary files.

For example (red text is entered by the user/grader):

bash> ls data
R.dat
S.dat
T.dat
bash> cat data/R.dat
1|1|5
1|2|6
2|3|7
bash> cat data/S.dat
1|2|6
3|3|2
3|5|2
bash> find {code root directory} -name \*.java -print > compile.list
bash> javac -cp {libs location}/commons-csv-1.5.jar:{libs location}/evallib-1.0.jar:{libs location}/jsqlparser-1.0.0.jar -d {compiled directory name} @compile.list
bash> java -cp {compiled directory name}/src/:{libs location}/commons-csv-1.5.jar:{libs location}/evallib-1.0.jar:{libs location}/jsqlparser-1.0.0.jar edu.buffalo.www.cse4562.Main - --in-mem
$> CREATE TABLE R(A int, B int, C int);
$> CREATE TABLE S(D int, E int, F int);
$> SELECT B, C FROM R WHERE A = 1;
1|5
2|6
$> SELECT A, E FROM R, S WHERE R.A = S.D;
1|2
1|2

For this project, we will issue a sequence of queries to your program and time your performance. A randomly chosen subset of these queries will be checked for correctness. Producing an incorrect answer on any query will result in a 0.