Checkpoint 2
In this project, you'll extend your SQL runtime to support larger queries. This checkpoint is worth 12 points.
Requirements
- All .scala files in /src/main/scala and its subdirectories will be compiled and the main function of the object microbase.Microbase will be run.
- On startup, the grader will wait until your code prints $>\n. If this takes more than 2 seconds, you will receive a 0.
- The grader will write a series of CREATE TABLE and SELECT commands to your code's System.in, with one command per \n-delimited line. After processing each statement, your code must print $>\n on a new line to indicate that it is done. If your code exceeds a per-operation time-out, you will receive a 0 for that query and all subsequent parts of the assignment.
- When your code is provided with a CREATE TABLE statement, this indicates that there is a file called data/[tableName].data, where [tableName] is the name of the table. This file contains UTF-8-encoded records, one per \n-delimited line, with fields in human-readable string representation (i.e., as in a CSV file) delimited by the pipe character (|). Note that there will not be any INSERT statements.
- When your code is provided with a SELECT statement, it must evaluate the SELECT statement and print the results to System.out, one per \n-delimited line, with fields in human-readable string representations delimited by the pipe character (|). You will be expected to support the following features of SQL:
- Arbitrary expression targets
- Attribute and relation aliasing (i.e., SELECT bar.foo AS baz FROM table AS bar)
- Project, Filter, Table, and Equi-Joins
- Order-By and LIMIT queries
- FROM-Nested Subqueries
- Your responses to SELECT queries will be checked against SQLite3.
- Once again, your code must print $>\n on a new line after each CREATE TABLE or SELECT to indicate that it is done processing the statement.
- You may use System.err to print debugging information to yourself. This output (or a subset of it) will be included in your debug log in autolab and will be ignored by the grading script.
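The statement/prompt protocol above can be sketched as a simple loop over standard input. Note that `handleStatement` is a hypothetical placeholder, not part of the assignment API: a real implementation would parse the SQL, register the table (for CREATE TABLE) or evaluate the query and print pipe-delimited rows (for SELECT) before emitting the prompt.

```scala
// Minimal sketch of the grader protocol. `handleStatement` is a
// hypothetical placeholder for your parser/evaluator.
def handleStatement(sql: String): Unit = {
  // parse + dispatch goes here: CREATE TABLE registers the table,
  // SELECT evaluates and prints pipe-delimited rows to System.out
}

def repl(lines: Iterator[String]): Unit =
  for (line <- lines if line.trim.nonEmpty) {
    handleStatement(line)
    println("$>")        // signal that this statement is fully processed
    Console.out.flush()  // ensure the grader actually sees the prompt
  }

// In microbase.Microbase's main:
//   repl(scala.io.Source.stdin.getLines())
```

Flushing after each prompt matters: if output is buffered, the grader may time out waiting for a `$>` your code has already "printed".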
Grading Rubric
All tests will be run on dedicated hardware equipped with an Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz with a standard 5400 RPM HDD. Queries will have a per-query timeout as listed below. Grading will be based on total runtime for each batch of queries.
- 30 randomly-generated Select-Project queries
- Under 20 seconds total: 4 of 4 points
- Under 10 seconds per query: 2 of 4 points
- 3 randomly-generated queries based on the TPC-H benchmark, with a scale factor of 0.1 (100MB) and templates listed below
- Under 30 seconds total: 8 of 8 points + leaderboard ranking
- Under 50 seconds total: 8 of 8 points
- Under 120 seconds total: 4 of 8 points
- Under 120 seconds per query: 2 of 8 points
Note in particular that these queries make extensive use of equi-joins, order-by, and limit clauses, which will now need to be supported.
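For the ORDER BY ... LIMIT n pattern these queries use, one common approach (a sketch, not a requirement) is to avoid sorting the full result and instead keep only the current best n rows in a bounded heap:

```scala
import scala.collection.mutable

// Sketch: compute ORDER BY ... LIMIT n without materializing and sorting
// the whole input. A size-bounded heap keeps the best n rows seen so far;
// under the reversed ordering, the heap's head is the easiest row to evict.
def topK[A](rows: Iterator[A], n: Int)(implicit ord: Ordering[A]): Seq[A] = {
  val heap = mutable.PriorityQueue.empty[A](ord.reverse)
  for (row <- rows) {
    heap.enqueue(row)
    if (heap.size > n) heap.dequeue() // evict the worst of the n+1 rows
  }
  heap.dequeueAll.reverse // best-first order
}
```

This does O(|input| log n) work and holds only n rows in memory, versus O(|input| log |input|) time and full materialization for sort-then-truncate.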
Selection Pushdown and Join Conversion
Equi-joins, such as those in the TPC-H queries, will be too expensive to compute as cartesian products. You will need to implement some form of equi-join algorithm (hash, sort-merge, or similar), as well as some form of selection pushdown and join conversion. Both can be implemented almost exactly as described in class.
The easiest way to do this is to use pattern matching on the logical plan. For example, you can find filters sitting directly over joins like so:
plan match {
  case Filter(condition, Join(lhs, rhs, joinType, joinCondition, hint)) =>
    ???
}
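Putting the two rewrites together, here is a self-contained sketch of one rewrite step that pushes single-side predicates below a join and converts cross-side equality atoms into join conditions. The case classes below are simplified stand-ins invented for this sketch, not your actual plan and expression classes; in particular, this toy Filter/Join shape omits the joinType and hint fields shown above.

```scala
// Simplified, illustrative stand-ins for the real plan/expression classes.
sealed trait Expr
case class And(a: Expr, b: Expr) extends Expr
case class Eq(l: String, r: String) extends Expr // attribute = attribute
case class Pred(attr: String) extends Expr       // predicate over one attribute

sealed trait Plan { def attrs: Set[String] }
case class Table(name: String, attrs: Set[String]) extends Plan
case class Filter(cond: Expr, child: Plan) extends Plan {
  def attrs = child.attrs
}
case class Join(lhs: Plan, rhs: Plan, on: Seq[Eq]) extends Plan {
  def attrs = lhs.attrs ++ rhs.attrs
}

def splitAnds(expr: Expr): Seq[Expr] = expr match {
  case And(a, b) => splitAnds(a) ++ splitAnds(b)
  case _         => Seq(expr)
}

// One rewrite step: split the filter into conjunctive atoms, push each
// single-side atom below the join, and absorb cross-side equalities into
// the join condition. Anything left over stays in a residual filter.
def pushDown(plan: Plan): Plan = plan match {
  case Filter(cond, Join(lhs, rhs, on)) =>
    val atoms = splitAnds(cond)
    def onlyOver(side: Plan)(e: Expr): Boolean = e match {
      case Pred(a) => side.attrs(a)
      case _       => false
    }
    val (leftOnly, rest)   = atoms.partition(onlyOver(lhs))
    val (rightOnly, rest2) = rest.partition(onlyOver(rhs))
    val (equis, residual)  = rest2.partition {
      case Eq(l, r) =>
        (lhs.attrs(l) && rhs.attrs(r)) || (lhs.attrs(r) && rhs.attrs(l))
      case _ => false
    }
    def wrap(p: Plan, cs: Seq[Expr]): Plan =
      if (cs.isEmpty) p else Filter(cs.reduce(And(_, _)), p)
    wrap(Join(wrap(lhs, leftOnly), wrap(rhs, rightOnly),
              on ++ equis.collect { case e: Eq => e }),
         residual)
  case other => other
}
```

A real version would apply this rule repeatedly (bottom-up or to fixpoint) so predicates sink through chains of joins.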
You may also find it helpful to decompose a boolean Expression into its component conjunctive atoms:
def splitAnds(expr: Expression): Seq[Expression] =
  expr match {
    case And(a, b) => splitAnds(a) ++ splitAnds(b)
    case _         => Seq(expr)
  }
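As for the join itself, an in-memory hash join is one workable choice among the algorithms suggested above. A minimal sketch over plain iterators (the names and tuple-based rows are illustrative; mapping this onto your own row and operator classes is up to you):

```scala
// Sketch of an in-memory hash equi-join. Build a hash table over one
// input keyed by its join attribute, then stream the other input and
// probe -- roughly O(|L| + |R|) work instead of the O(|L| * |R|) of a
// cartesian product followed by a filter.
def hashJoin[K, L, R](
    left: Iterator[L], leftKey: L => K,
    right: Iterator[R], rightKey: R => K
): Iterator[(L, R)] = {
  // build phase: group the (ideally smaller) left input by join key
  val built: Map[K, Seq[L]] = left.toSeq.groupBy(leftKey)
  // probe phase: emit every (left, right) pair with matching keys
  right.flatMap { r =>
    built.getOrElse(rightKey(r), Seq.empty).iterator.map(l => (l, r))
  }
}
```

Build on whichever side you expect to be smaller, since the build side is fully materialized while the probe side streams.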
Example Queries
TPC-H is a standard database benchmark consisting of a dataset generator and 22 standard query templates. This checkpoint uses three queries based on TPC-H Queries 3, 5, and 6. The dataset generator and template values can be found at the TPC-H website; data is generated at scale factor (SF) 0.1. Minor variations in the queries may be made.
Based on TPC-H Query 3:
SELECT
LINEITEM.ORDERKEY,
LINEITEM.EXTENDEDPRICE*(CAST(1.0 as float)-LINEITEM.DISCOUNT) AS REVENUE,
ORDERS.ORDERDATE,
ORDERS.SHIPPRIORITY
FROM
CUSTOMER,
ORDERS,
LINEITEM
WHERE CUSTOMER.MKTSEGMENT = '[[SEGMENT]]' AND CUSTOMER.CUSTKEY = ORDERS.CUSTKEY
AND LINEITEM.ORDERKEY = ORDERS.ORDERKEY
AND ORDERS.ORDERDATE < DATE '[[DATE]]'
AND LINEITEM.SHIPDATE > DATE '[[DATE]]'
ORDER BY REVENUE DESC, ORDERDATE
LIMIT 10
Based on TPC-H Query 5:
SELECT
NATION.NAME,
LINEITEM.EXTENDEDPRICE * (CAST(1.0 as float) - LINEITEM.DISCOUNT) AS REVENUE
FROM
REGION, NATION, CUSTOMER, ORDERS, LINEITEM, SUPPLIER
WHERE CUSTOMER.CUSTKEY = ORDERS.CUSTKEY
AND LINEITEM.ORDERKEY = ORDERS.ORDERKEY
AND LINEITEM.SUPPKEY = SUPPLIER.SUPPKEY
AND CUSTOMER.NATIONKEY = NATION.NATIONKEY
AND SUPPLIER.NATIONKEY = NATION.NATIONKEY
AND NATION.REGIONKEY = REGION.REGIONKEY
AND REGION.NAME = '[[REGION]]'
AND ORDERS.ORDERDATE >= DATE '[[DATE]]'
AND ORDERS.ORDERDATE < DATE '[[DATE_PLUS_ONE]]'
ORDER BY REVENUE DESC
LIMIT 40
Based on TPC-H Query 6:
SELECT
LINEITEM.EXTENDEDPRICE*LINEITEM.DISCOUNT AS REVENUE
FROM
LINEITEM
WHERE LINEITEM.SHIPDATE >= DATE '[[DATE]]'
AND LINEITEM.SHIPDATE < DATE '[[DATE_PLUS_ONE]]'
AND LINEITEM.DISCOUNT > CAST([[DISCOUNT_LOW]] as float) AND LINEITEM.DISCOUNT < CAST([[DISCOUNT_HIGH]] as float)
AND LINEITEM.QUANTITY < CAST([[QUANTITY]] as float)
ORDER BY REVENUE DESC
LIMIT 10