Checkpoint 2
In this project, you'll extend your SQL runtime to support larger queries. This checkpoint is worth 12 points.
Requirements
- All .scala files in /src/main/scala and its subdirectories will be compiled and the main function of the object microbase.Microbase will be run.
- On startup, the grader will wait until your code prints $>\n. If this takes more than 2 seconds, you will receive a 0.
- The grader will write a series of CREATE TABLE and SELECT commands to your code's System.in, with one command per \n-delimited line. After processing each statement, your code must print $>\n on a new line to indicate that it is done. If your code exceeds a per-operation time-out, you will receive a 0 for that query and all subsequent parts of the assignment.
- When your code is provided with a CREATE TABLE statement, this indicates that there is a file called data/[tableName].data, where [tableName] is the name of the table. This file contains UTF-8-encoded records, one per \n-delimited line, with fields in human-readable string representation (i.e., as in a CSV file) delimited by the pipe character (|). Note that there will not be any INSERT statements.
- When your code is provided with a SELECT statement, it must evaluate the SELECT statement and print the results to System.out, one per \n-delimited line, with fields in human-readable string representations delimited by the pipe character (|). You will be expected to support the following features of SQL:
- Arbitrary expression targets
- Attribute and relation aliasing (i.e., SELECT bar.foo AS baz FROM table AS bar)
- Project, Filter, Table, and Equi-Joins
- Order-By and LIMIT queries
- FROM-Nested Subqueries
- Your responses to SELECT queries will be checked against SQLite3.
- Once again, your code must print $>\n on a new line after each CREATE TABLE or SELECT to indicate that it is done processing the statement.
- You may use System.err to print debugging information to yourself. This output (or a subset of it) will be included in your debug log in autolab and will be ignored by the grading script.
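The statement/prompt protocol above can be sketched as a simple loop over standard input. Note that `handleStatement` is a hypothetical placeholder, not part of the assignment API: a real implementation would parse the SQL, register the table (for CREATE TABLE) or evaluate the query and print pipe-delimited rows (for SELECT) before emitting the prompt.

```scala
// Minimal sketch of the grader protocol. `handleStatement` is a
// hypothetical placeholder for your parser/evaluator.
def handleStatement(sql: String): Unit = {
  // parse + dispatch goes here: CREATE TABLE registers the table,
  // SELECT evaluates and prints pipe-delimited rows to System.out
}

def repl(lines: Iterator[String]): Unit =
  for (line <- lines if line.trim.nonEmpty) {
    handleStatement(line)
    println("$>")        // signal that this statement is fully processed
    Console.out.flush()  // ensure the grader actually sees the prompt
  }

// In microbase.Microbase's main:
//   repl(scala.io.Source.stdin.getLines())
```

Flushing after each prompt matters: if output is buffered, the grader may time out waiting for a `$>` your code has already "printed".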
Grading Rubric
All tests will be run on dedicated hardware equipped with an Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz with a standard 5400 RPM HDD. Queries will have a per-query timeout as listed below. Grading will be based on total runtime for each batch of queries.
- 30 randomly-generated Select-Project queries
- Under 20 seconds total: 4 of 4 points
- Under 10 seconds per query: 2 of 4 points
- 3 randomly-generated queries based on the TPC-H benchmark, with a scale factor of 0.1 (100MB) and templates listed below
- Under 30 seconds total: 8 of 8 points + leaderboard ranking
- Under 50 seconds total: 8 of 8 points
- Under 120 seconds total: 4 of 8 points
- Under 120 seconds per query: 2 of 8 points
Note in particular that these queries make extensive use of equi-joins, order-by, and limit clauses, which will now need to be supported.
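For the ORDER BY ... LIMIT n pattern these queries use, one common approach (a sketch, not a requirement) is to avoid sorting the full result and instead keep only the current best n rows in a bounded heap:

```scala
import scala.collection.mutable

// Sketch: compute ORDER BY ... LIMIT n without materializing and sorting
// the whole input. A size-bounded heap keeps the best n rows seen so far;
// under the reversed ordering, the heap's head is the easiest row to evict.
def topK[A](rows: Iterator[A], n: Int)(implicit ord: Ordering[A]): Seq[A] = {
  val heap = mutable.PriorityQueue.empty[A](ord.reverse)
  for (row <- rows) {
    heap.enqueue(row)
    if (heap.size > n) heap.dequeue() // evict the worst of the n+1 rows
  }
  heap.dequeueAll.reverse // best-first order
}
```

This does O(|input| log n) work and holds only n rows in memory, versus O(|input| log |input|) time and full materialization for sort-then-truncate.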
Selection Pushdown and Join Conversion
Equi-joins, such as those in the TPC-H queries, will be too expensive to compute as cartesian products. You will need to implement some form of equi-join algorithm (hash, sort-merge, or similar), as well as some form of selection pushdown and join conversion. Both can be implemented almost exactly as described in class.
The easiest way to do this is to use pattern matching on the logical plan. For example, you can find filters sitting directly over joins like so:
plan match {
  case Filter(condition, Join(lhs, rhs, joinType, joinCondition, hint)) =>
    ???
}
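Putting the two rewrites together, here is a self-contained sketch of one rewrite step that pushes single-side predicates below a join and converts cross-side equality atoms into join conditions. The case classes below are simplified stand-ins invented for this sketch, not your actual plan and expression classes; in particular, this toy Filter/Join shape omits the joinType and hint fields shown above.

```scala
// Simplified, illustrative stand-ins for the real plan/expression classes.
sealed trait Expr
case class And(a: Expr, b: Expr) extends Expr
case class Eq(l: String, r: String) extends Expr // attribute = attribute
case class Pred(attr: String) extends Expr       // predicate over one attribute

sealed trait Plan { def attrs: Set[String] }
case class Table(name: String, attrs: Set[String]) extends Plan
case class Filter(cond: Expr, child: Plan) extends Plan {
  def attrs = child.attrs
}
case class Join(lhs: Plan, rhs: Plan, on: Seq[Eq]) extends Plan {
  def attrs = lhs.attrs ++ rhs.attrs
}

def splitAnds(expr: Expr): Seq[Expr] = expr match {
  case And(a, b) => splitAnds(a) ++ splitAnds(b)
  case _         => Seq(expr)
}

// One rewrite step: split the filter into conjunctive atoms, push each
// single-side atom below the join, and absorb cross-side equalities into
// the join condition. Anything left over stays in a residual filter.
def pushDown(plan: Plan): Plan = plan match {
  case Filter(cond, Join(lhs, rhs, on)) =>
    val atoms = splitAnds(cond)
    def onlyOver(side: Plan)(e: Expr): Boolean = e match {
      case Pred(a) => side.attrs(a)
      case _       => false
    }
    val (leftOnly, rest)   = atoms.partition(onlyOver(lhs))
    val (rightOnly, rest2) = rest.partition(onlyOver(rhs))
    val (equis, residual)  = rest2.partition {
      case Eq(l, r) =>
        (lhs.attrs(l) && rhs.attrs(r)) || (lhs.attrs(r) && rhs.attrs(l))
      case _ => false
    }
    def wrap(p: Plan, cs: Seq[Expr]): Plan =
      if (cs.isEmpty) p else Filter(cs.reduce(And(_, _)), p)
    wrap(Join(wrap(lhs, leftOnly), wrap(rhs, rightOnly),
              on ++ equis.collect { case e: Eq => e }),
         residual)
  case other => other
}
```

A real version would apply this rule repeatedly (bottom-up or to fixpoint) so predicates sink through chains of joins.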
You may also find it helpful to decompose a boolean Expression into its component conjunctive atoms:
def splitAnds(expr: Expression): Seq[Expression] =
  expr match {
    case And(a, b) => splitAnds(a) ++ splitAnds(b)
    case _         => Seq(expr)
  }
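As for the join itself, an in-memory hash join is one workable choice among the algorithms suggested above. A minimal sketch over plain iterators (the names and tuple-based rows are illustrative; mapping this onto your own row and operator classes is up to you):

```scala
// Sketch of an in-memory hash equi-join. Build a hash table over one
// input keyed by its join attribute, then stream the other input and
// probe -- roughly O(|L| + |R|) work instead of the O(|L| * |R|) of a
// cartesian product followed by a filter.
def hashJoin[K, L, R](
    left: Iterator[L], leftKey: L => K,
    right: Iterator[R], rightKey: R => K
): Iterator[(L, R)] = {
  // build phase: group the (ideally smaller) left input by join key
  val built: Map[K, Seq[L]] = left.toSeq.groupBy(leftKey)
  // probe phase: emit every (left, right) pair with matching keys
  right.flatMap { r =>
    built.getOrElse(rightKey(r), Seq.empty).iterator.map(l => (l, r))
  }
}
```

Build on whichever side you expect to be smaller, since the build side is fully materialized while the probe side streams.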
Example Queries
TPC-H is a standard database benchmark consisting of a dataset generator and 22 standard query templates. This checkpoint uses three queries based on TPC-H Queries 3, 5, and 6. The dataset generator and template values can be found at the TPC-H website; data is generated at scale factor (SF) 0.1. Minor variations in the queries may be made.
Based on TPC-H Query 3:
SELECT
LINEITEM.ORDERKEY,
LINEITEM.EXTENDEDPRICE*(CAST(1.0 as float)-LINEITEM.DISCOUNT) AS REVENUE,
ORDERS.ORDERDATE,
ORDERS.SHIPPRIORITY
FROM
CUSTOMER,
ORDERS,
LINEITEM
WHERE CUSTOMER.MKTSEGMENT = '[[SEGMENT]]' AND CUSTOMER.CUSTKEY = ORDERS.CUSTKEY
AND LINEITEM.ORDERKEY = ORDERS.ORDERKEY
AND ORDERS.ORDERDATE < DATE '[[DATE]]'
AND LINEITEM.SHIPDATE > DATE '[[DATE]]'
ORDER BY REVENUE DESC, ORDERDATE
LIMIT 10
Based on TPC-H Query 5:
SELECT
NATION.NAME,
LINEITEM.EXTENDEDPRICE * (CAST(1.0 as float) - LINEITEM.DISCOUNT) AS REVENUE
FROM
REGION, NATION, CUSTOMER, ORDERS, LINEITEM, SUPPLIER
WHERE CUSTOMER.CUSTKEY = ORDERS.CUSTKEY
AND LINEITEM.ORDERKEY = ORDERS.ORDERKEY
AND LINEITEM.SUPPKEY = SUPPLIER.SUPPKEY
AND CUSTOMER.NATIONKEY = NATION.NATIONKEY
AND SUPPLIER.NATIONKEY = NATION.NATIONKEY
AND NATION.REGIONKEY = REGION.REGIONKEY
AND REGION.NAME = '[[REGION]]'
AND ORDERS.ORDERDATE >= DATE '[[DATE]]'
AND ORDERS.ORDERDATE < DATE '[[DATE_PLUS_ONE]]'
ORDER BY REVENUE DESC
LIMIT 40
Based on TPC-H Query 6:
SELECT
LINEITEM.EXTENDEDPRICE*LINEITEM.DISCOUNT AS REVENUE
FROM
LINEITEM
WHERE LINEITEM.SHIPDATE >= DATE '[[DATE]]'
AND LINEITEM.SHIPDATE < DATE '[[DATE_PLUS_ONE]]'
AND LINEITEM.DISCOUNT > CAST([[DISCOUNT_LOW]] as float) AND LINEITEM.DISCOUNT < CAST([[DISCOUNT_HIGH]] as float)
AND LINEITEM.QUANTITY < CAST([[QUANTITY]] as float)
ORDER BY REVENUE DESC
LIMIT 10