CSE 562

March 16 - Checkpoint 2

Checkpoint 2

sif$ javac -cp build:*.jar -D build {all .java files} sif$ java -cp build:jsqlparser.jar:evallib.jar dubstep.Main
 - --preload --in-mem $> CREATE TABLE R( A int, B int, C int, PRIMARY KEY (A), INDEX B_INDEX (B) ); $> SELECT A, SUM(C) FROM R WHERE B < 5 GROUP BY A; 1|5 ... $>

Checkpoint 2

  • Two parts, one in-memory, one with more data than memory.
  • Preprocessing Step (You get 2 or 5 minutes for each CREATE TABLE)
  • New SQL features (GROUP-BY, LIMIT, SORT, From-Nesting)
  • Create table now includes "hints" (PRIMARY KEY, INDEX)
  • Tighter Constraints (Bigger data and less time/query)
--in-mem
Java will be allowed to use 1GB of heap space (-Xmx1g), and the CSV representation of all input data data will be under 50MB (similar size to checkpoint 1's 'Big Data').
--on-disk
Java will be allowed to use 150MB of heap space (-Xmx150m), and the textual representation of your data will be over 100MB.

As a general guideline, Java becomes incredibly slow once you hit 50% memory usage as you spend more time in the garbage collector than in your code.

--preload
Signals that CREATE TABLE statements will be allowed to take 2 minutes in in-memory mode, or 5 minutes in on-disk mode. This is a good opportunity to sort your data, and/or create the index structures suggested in the statement body.

Professor's note: For an illustration of the potential benefits of a longer CREATE TABLE step, have a look at the leaderboards for Checkpoint 1.

New SQL Features: Sort

ORDER BY col1 asc/desc, col2 asc/desc, ...
Sort the data on col1 (using col2, col3, ... as tiebreakers) in ascending or descending order.
  • You will likely need 2 implementations:
    • In-Memory (Reference Impl Uses Java's Collections.sort)
    • On-Disk (Reference Impl Uses 2-Pass Sort)

New SQL Features: Limit

LIMIT N
Return only the first N rows. If the data is sorted, return the first rows according to the sort order. If not, return an arbitrary N rows.

JSQLParser also supports more expressive limit clauses (e.g., including offsets). You will not be required to support anything more complex than LIMIT N.

New SQL Features: Group-By

SELECT A, B, SUM(C), ... FROM ... GROUP BY A, B
Group the data by the A and B columns and apply the aggregate functions to each group.

New SQL Features: From Nesting

SELECT * FROM (SELECT A, B FROM R) q
You're still not going to get more than one FromItem, but now it can be a SELECT query instead of a table.

If you used iterators for checkpoint 1, you should just be able to replace a Table Scan iterator with the iterator you construct for the nested query

Handling Tight Constraints

Index Scans
You'll need to detect when opportunities to use them arise. Remember that your WHERE clause may be an AND of different options, and you may have multiple indices that you can use.
Selection Pushdown
Selection Pushdown is ALWAYS good. This is most likely to arise in the context of nested subqueries, but if you implement it in a general way you'll benefit in Checkpoint 3 as well.

Pattern Matching

Remember pattern matching on Statement and Expression objects? Try it on Iterators!

Iterator optimize(Iterator query) {
  if(query instanceof Filter){
    Filter f = (Filter)query;
    if(f.input instanceof Project){
      Project p (Project)f.input;
      // replace query with a new pair of iterators where 
      // the Project uses the Filter as an input. 
    }
  }
  // Recur
  query.input = optimize(query.input);
  return query;
}
		

Questions?