In this project, you will implement a simple SQL query evaluator with support for Select, Project, Join, Bag Union, and Aggregate operations.  You will receive a set of data files, schema information, and be expected to evaluate multiple SELECT queries over those data files. Your code is expected to evaluate the SELECT statements on provided data, and produce output in a standardized form. Your code will be evaluated for both correctness and performance (in comparison to a naive evaluator based on iterators and nested-loop joins).

Parsing SQL

A parser converts a human-readable string into a structured representation of the program (or query) that the string describes. A fork of the JSQLParser open-source SQL parser (JSQLParser) will be provided for your use.  The JAR may be downloaded from

And documentation for the fork is available at

You are not required to use this parser (i.e., you may write your own if you like). However, we will be testing your code on SQL that is guaranteed to parse with JSqlParser. Basic use of the parser requires a or from which the file data to be parsed (For example, a Let's assume you've created one already (of either type) and called it inputFile.
CCJSqlParser parser = new CCJSqlParser(inputFile);
Statement statement;
while((statement = parser.Statement()) != null){
  // `statement` now has one of the several 
  // implementations of the Statement interface
// End-of-file.  Exit!
At this point, you'll need to figure out what kind of statement you're dealing with. For this project, we'll be working with Select and CreateTable. There are two ways to do this. JSqlParser defines a Visitor style interface that you can use if you're familiar with the pattern. However, my preference is for the simpler and lighter-weight instanceof relation:
if(statement instanceof Select) {
  Select selectStatement = (Select)statement;
  // handle the select
} else if(statement instanceof CreateTable) {
  // and so forth



JSQLParser includes an object called Expression that represents a primitive-valued expression parse tree.  In addition to the parser, we are providing a collection of classes for manipulating and evaluating Expressions.  The JAR may be downloaded from

 Documentation for the library is available at

To use the Eval class, you will need to define a method for dereferencing Column objects.  For example, if I have a Map called tupleSchema that contains my tuple schema, and an ArrayList called tuple that contains the tuple I am currently evaluating, I might write:

public void LeafValue eval(Column x){
  int colID = tupleSchema.get(x.getName());
  return tuple.get(colID);

After doing this, you can use Eval.eval() to evaluate any expression in the context of tuple.

Source Data

Because you are implementing a query evaluator and not a full database engine, there will not be any tables -- at least not in the traditional sense of persistent objects that can be updated and modified. Instead, you will be given a Table Schema and a CSV File with the instance in it. To keep things simple, we will use the CREATE TABLE statement to define a relation's schema. You do not need to allocate any resources for the table in reaction to a CREATE TABLE statement -- Simply save the schema that you are given for later use. Sql types (and their corresponding java types) that will be used in this project are as follows:
SQL Type Java Equivalent
string StringValue
varchar StringValue
char StringValue
int LongValue
decimal DoubleValue
date DateValue
In addition to the schema, you will be given a data directory containing multiple data files who's names correspond to the table names given in the CREATE TABLE statements. For example, let's say that you see the following statement in your query file:
CREATE TABLE R(A int, B int, C int);
That means that the data directory contains a data file called 'R.dat' that might look like this:
Each line of text (see corresponds to one row of data. Each record is delimited by a vertical pipe '|' character.  Integers and floats are stored in a form recognized by Java’s Long.parseLong() and Double.parseDouble() methods. Dates are stored in YYYY-MM-DD form, where YYYY is the 4-digit year, MM is the 2-digit month number, and DD is the 2-digit date. Strings are stored unescaped and unquoted and are guaranteed to contain no vertical pipe characters.


Your code is expected to support both aggregate and non-aggregate queries with the following features.  Keep in mind that this is only a minimum requirement.


Your code is expected output query results in the same format as the input data:

Example Queries and Data

These are only examples.  Your code will be expected to handle these queries, as well as others. Sanity Check Examples: A thorough suite of test cases covering most simple query features. Example NBA Benchmark Queries: Some very simple queries to get you started. The TPC-H Benchmark: This benchmark consists of two parts: DBGen (generates the data) and a specification document (defines the queries).  A nice summary of the TPC-H queries can be found here. The SQL implementation used by TPC-H differs in a few subtle ways from the implementation used by JSqlParser.  Minor structural rewrites to the queries in the specification document will be required: Queries that conform to the specifications for this project include: Q1, Q3, Q5, Q6, Q8*, Q9, Q10, Q12*, Q14*, Q15*, Q19* (Asterisks mean that the query doesn't meet the spec as written, but can easily be rewritten into one that does)

Code Submission

As before, all .java files in the src directory at the root of your repository will be compiled (and linked against JSQLParser). Also as before, the class
will be invoked with the following arguments: For example:
$> ls data
$> cat R.dat
$> cat query.sql
CREATE TABLE R(A int, B int, C int)
$> java -cp build:jsqlparser.jar edu.buffalo.cse562.Main --data data query.sql
Once again, the data directory contains files named table name.dat where table name is the name used in a CREATE TABLE statement. Notice the effect of CREATE TABLE statements is not to create a new file, but simply to link the given schema to an existing .dat file. These files use vertical-pipe (’|’) as a field delimiter, and newlines (’\n’) as record delimiters. The testing environment is configured with the Sun JDK version 1.8.


Your code will be subjected to a sequence of test cases, most of which are provided in the project code (though different data will be used). Two evaluation phases will be performed. Phase 1 will be performed on small datasets (< 100 rows per input table) and each run will be graded on a per-test-case basis as follows: Phase 2 will evaluate your code on more complex queries that create large intermediate states (100+ MB). Queries for which your submission does not produce correct output, or which your submission takes over 1 minute to process will receive an F. Otherwise, your submission will be graded on the runtime of each test as follows Your overall project grade will be a weighted average of the individual components.  It will be possible to earn extra credit by beating the reference implementation. Additionally, there will be a per-query leader-board for all groups who manage to beat the reference implementation.

This page last updated 2024-05-06 11:22:17 -0400