Checkpoint 1: Basic Queries
- Overview: Submit a simple SPJUA query evaluator.
- Deadline: Feb 25
- Grade: 10% of Project Component
- 5% Correctness
- 5% Efficiency
In this project, you will implement a simple SQL query evaluator with support for Select, Project, Join, and Bag Union operations. You will receive a set of data files, schema information, and be expected to evaluate multiple SELECT queries over those data files.
Your code is expected to evaluate the SELECT statements on provided data, and produce output in a standardized form. Your code will be evaluated for both correctness and performance (in comparison to a naive evaluator based on iterators and nested-loop joins).
Parsing SQL
A parser converts a human-readable string into a structured representation of the program (or query) that the string describes. A fork of the
JSQLParser open-source SQL parser (JSQLParser) will be provided for your use. The JAR may be downloaded from
http://maven.mimirdb.info/info/mimirdb/jsqlparser/1.0.0/jsqlparser-1.0.0.jar
And documentation for the fork is available at
http://doc.odin.cse.buffalo.edu/jsqlparser/
You are not required to use this parser (i.e., you may write your own if you like). However, we will be testing your code on SQL that is guaranteed to parse with JSqlParser.
Basic use of the parser requires a
java.io.Reader or
java.io.InputStream from which the file data to be parsed (For example, a
java.io.FileReader). Let's assume you've created one already (of either type) and called it
inputFile.
CCJSqlParser parser = new CCJSqlParser(inputFile);
System.out.println("$> "); // print a prompt
Statement statement;
while((statement = parser.Statement()) != null){
// `statement` now has one of the several
// implementations of the Statement interface
System.out.println("$> "); // print a prompt after executing each command
}
// End-of-file. Exit!
At this point, you'll need to figure out what kind of statement you're dealing with. For this project, we'll be working with
Select and
CreateTable. There are two ways to do this. JSqlParser defines a Visitor style interface that you can use if you're familiar with the pattern. However, my preference is for the simpler and lighter-weight
instanceof relation:
if(statement instanceof Select) {
Select selectStatement = (Select)statement;
// handle the select
} else if(statement instanceof CreateTable) {
// and so forth
}
Expressions
JSQLParser includes an object called Expression that represents a primitive-valued expression parse tree. In addition to the parser, we are providing a collection of classes for manipulating and evaluating Expressions. The JAR may be downloaded from
http://maven.mimirdb.info/info/mimirdb/evallib/1.0/evallib-1.0.jar
Documentation for the library is available at
https://github.com/UBOdin/evallib/blob/master/README.md
To use the Eval class, you will need to define a method for dereferencing Column objects. For example, if I have a Map called tupleSchema that contains my tuple schema, and an ArrayList called tuple that contains the tuple I am currently evaluating, I might write:
public void PrimitiveValue eval(Column x){
int colID = tupleSchema.get(x.getName());
return tuple.get(colID);
}
After doing this, you can use Eval.eval() to evaluate any expression in the context of tuple.
Source Data
Because you are implementing a query evaluator and not a full database engine, there will not be any tables -- at least not in the traditional sense of persistent objects that can be updated and modified. Instead, you will be given a
Table Schema and a
CSV File with the instance in it. To keep things simple, we will use the
CREATE TABLE statement to define a relation's schema. You do not need to allocate any resources for the table in reaction to a
CREATE TABLE statement -- Simply save the schema that you are given for later use. Sql types (and their corresponding java types) that will be used in this project are as follows:
SQL Type |
Equivalent PrimitiveValue |
string |
StringValue |
varchar |
StringValue |
char |
StringValue |
int |
LongValue |
decimal |
DoubleValue |
date |
DateValue |
In addition to the schema, you will be given a data directory containing multiple data files who's names correspond to the table names given in the
CREATE TABLE statements. For example, let's say that you see the following statement in your query file:
CREATE TABLE R(A int, B int, C int);
That means that the data directory contains a data file called 'R.csv' that might look like this:
1|1|5
1|2|6
2|3|7
Each line of text (see
java.io.BufferedReader.readLine()) corresponds to one row of data. Each record is delimited by a vertical pipe '|' character. Integers and floats are stored in a form recognized by Java’s Long.parseLong() and Double.parseDouble() methods. Dates are stored in YYYY-MM-DD form, where YYYY is the 4-digit year, MM is the 2-digit month number, and DD is the 2-digit date. Strings are stored unescaped and unquoted and are guaranteed to contain no vertical pipe characters.
Queries
Your code is expected to support non-aggregate queries with the following features. Keep in mind that this is only a minimum requirement; you're welcome to support more.
- SelectItems may include:
- SelectExpressionItem: Any expression that ExpressionLib can evaluate. Note that Column expressions may or may not include an appropriate source. Where relevant, column aliases will be given, unless the SelectExpressionItem's expression is a Column (in which case the Column's name attribute should be used as an alias)
- AllTableColumns: For any aliased term in the from clause
- AllColumns: If present, this will be the only SelectItem in a given PlainSelect.
- From/Joins may include:
- Join: All joins will be simple joins
- Table: Tables may or may not be aliased. Non-Aliased tables should be treated as being aliased to the table's name.
- SubSelect: SubSelects may be aggregate or non-aggregate queries, as here.
- The Where/Having clauses may include:
- Any expression that ExpressionLib will evaluate to an instance of BooleanValue
- Allowable Select Options include
- UNION ALL (but not UNION)
Output
Your code is expected output query results in the same format as the input data:
- One output row per ('\n'-delimited) line. If there is no ORDER BY clause, you may emit the rows in any order.
- One output value per ('|'-delimited) field. Columns should appear in the same order that they appear in the query. Table Wildcards should be resolved in the same order that the columns appear in the CREATE TABLE statement. Global Wildcards should be resolved as Table Wildcards with the tables in the same order that they appear in the FROM clause.
- A trailing newline as the last character of the file.
- You should not output any header information or other formatting.
Example Queries and Data
These are only examples. Your code will be expected to handle these queries, as well as others.
- Sanity Check Examples
- A thorough suite of test cases covering most simple query features. For this checkpoint, your code should support all of the TABLEXX and UNIONXX queries.
- Example NBA Benchmark Queries
- Some very simple queries to get you started. For this checkpoint, your code should support
- nba01.sql
- nba03.sql
- nba05.sql (but not nba05-short.sql)
- nba06.sql (but not nba06-short.sql)
Code Submission
As before, all .java and .scala files your GIT repository will be compiled. Also as before, the class dubstep.Main will be called and you will be expected to read SQL from System.in. Data will live in the data directory; Each CREATE TABLE indicates that there is a corresponding file data/[tablename].csv
Once your code is ready, it should print out a prompt:
$>
It should also print out this prompt after completing every SQL operation (After parsing each CREATE TABLE and after producing results for each SELECT). If you do not print out the prompt, the grader will assume your code is frozen, your trial will time out, and you will receive a 0 for the test.
As before, data will live in the data directory and
For example, assuming the file data/R.csv contains
1|1|5
1|2|6
2|3|7
The resulting interaction should look like:
bash> java -cp your_code.jar:..other jars.. dubstep.Main
$> CREATE TABLE R(A int, B int, C int);
$> SELECT B, C FROM R WHERE A = 1;
1|5
2|6
$>
The testing environment is configured with the Sun JDK version 1.8. and the Scala 11.8 library.
Grading
Your code will be subjected to a sequence of test cases and evaluated for both correctness and performance. Incorrect results receive a grade of 0.
Correct results receive a grade based on performance. The following thresholds are approximate, but as a rough guideline:
- 5/10 (C): Roughly within two orders of magnitude of the reference implementation.
- 7.5/10 (B): Roughly within one order of magnitude of the reference implementation.
- 10/10 (A): Comparable to the reference implementation.
Specific time thresholds and per-query scores will be reported as part of the execution log. Your overall project grade will be a weighted average of the individual components.
Additionally, there will be a per-query leader-board.
To appear on the leader board, your group needs to submit an implementation that is approximately 50% faster than the reference implementation on that query.