This checkpoint is, in effect, a more rigorous form of Checkpoints 1 and 2. The requirements are identical: We give you a query and some data, you evaluate the query on the data and give us a response as quickly as possible.
We'll be expecting a more feature-complete submission. Your code will be evaluated on more queries from TPC-H benchmark, which exercises a broader range of SQL features than the previous checkpoints did.
The order in which you join tables together is incredibly important, and can change the runtime of your query by multiple orders of magnitude. Picking between different join orderings is incredibly important! However, to do so, you will need statistics about the data, something that won't really be feasible until later. Instead, here's a present for those of you paying attention. The tables in each FROM clause are ordered so that you will get our recommended join order by building a left-deep plan going in-order of the relation list (something that many of you are doing already), and (for hybrid hash joins) using the left-hand-side relation to build your hash table.
In the prior checkpoints, you were encouraged to parse SQL into a relational algebra tree. This checkpoint is where that design choice will begins to pay off. We've discussed expression equivalences in relational algebra, and identified several that are always good (e.g., pushing down selection operators). The reference implementation uses some simple recursion to identify patterns of expressions that can be optimized and rewrite them. For example, if I wanted to define a new HashJoin operator, I might go through and replace every qualifying Selection operator sitting on top of a CrossProduct operator with a HashJoin.
The checkpoint 3 reference implementation optimizer implements three improvements:
Your code will be evaluated in exactly the same way as Project 1. Data will drawn from a 1GB (SF 1) TPC-H dataset.
java -cp build:jsqlparser.jar dubstep.Main --data [data] [sqlfile1] [sqlfile2] ...This example uses the following directories and files:
Your code will be subjected to a sequence of test cases and evaluated on speed and correctness. Note that unlike Project 1, you will neither receive a warning about, nor partial credit for out-of-order query results if the outermost query includes an ORDER BY clause.
Phase 1 (big queries) will be graded on a TPC-H SF 0.1 dataset (100 MB of raw text data). Phase 2 (limited memory) will be graded on datasets of varying sizes. Grades are assigned based on per-query thresholds:
TPC-H Query | Phase 1 Runtimes | Phase 2 Runtimes | Phase 2 Scaling Factor | |
---|---|---|---|---|
1 | TBD | A | TBD | SF = 0.1 |
TBD | B | TBD | ||
TBD | C | TBD | ||
3 | TBD | A | TBD | SF = 0.1 |
TBD | B | TBD | ||
TBD | C | TBD | ||
5 | TBD | A | TBD | SF = 0.1 |
TBD | B | TBD | ||
TBD | C | TBD | ||
10 | TBD | A | TBD | SF = 0.5 |
TBD | B | TBD | ||
TBD | C | TBD | ||
12 | TBD | A | TBD | SF = 0.5 |
TBD | B | TBD | ||
TBD | C | TBD |
This page last updated 2024-12-09 13:28:25 -0500