August 27
Capen 212
Always add [CSE662] to the title of emails
Each group will have a separate project. I don't expect cheating to be an issue, but to be clear...
Databases | Programming Languages |
---|---|
Indexes | Data Structures |
Transactions/Logging | Software Transactional Memory |
Incremental Views | Self-Adapting Computation |
Query Rewriting / Performance Models | Compiler Optimization / Program Analysis |
Probabilistic DBs | Probabilistic Programming |
Tuesday | Thursday |
---|---|
Classical Lecture (Paper of the Week) | Group Presentations and Meetings |
I use a 3 point system:
You get 2 excused absences (guarantees I won't call on you) for the term. You must let me know beforehand.
Lazy evaluation of transactions in database systems
Jose M. Faleiro, Alexander Thomson, Daniel J. Abadi
Be ready to intelligently discuss the paper's contents Tuesday Sept. 3
"The Case for Learned Index Structures"
by Kraska, Beutel, Chi, Dean, Polyzotis
$f(key) \mapsto position$
(not exactly true, but close enough for today)
Simplified Use Case: Static data with "infinite" prep time.
We have infinite prep time, so fit a (tiny) neural network to the CDF.
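A minimal sketch of the idea, under loud assumptions: it swaps the (tiny) neural network for a single least-squares line fit to the CDF, and the class name `LearnedIndex` and its methods are hypothetical, not taken from the paper. The model predicts a position, the worst-case error is recorded at build time, and a bounded binary search corrects the guess. Keys are assumed numeric and non-empty.

```python
import bisect


class LearnedIndex:
    """Hypothetical illustration: a linear model of key -> position."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Least-squares fit of position ~ slope * key + intercept.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # "Infinite" prep time: measure the worst prediction error up front.
        self.max_err = max(
            abs(self._predict(k) - i) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        pos = round(self.slope * key + self.intercept)
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key, or None if it is absent."""
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, len(self.keys))
        # Search only the window the error bound guarantees.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None


index = LearnedIndex([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
```

The closer the keys follow the fitted line, the smaller `max_err` gets and the less searching each lookup needs.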
`if` statements are really expensive on modern processors.
Try to repeat the authors' success!
Try to find overlooked problems with the authors' success!
When disks are involved, sorted data layouts are fantastic for reads! $O(1)$ IOs to access any data.
... but horrible for writes. $O(N)$ IOs for writes. ($O(N^2)$ total cost)
Idea 1: Add a buffer. Now each buffer merge is $O(N)$ IOs. ($O(\frac{N^2}{B})$ total cost)
Idea 2: Multiple layers of buffers. Merge buffers with other similarly sized buffers. ($O(N\log(N))$ total cost)
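The three total costs above can be checked with a small counting sketch. Assumptions: it counts records written rather than disk IOs, the function names are hypothetical, and the layered scheme is the simple "merge two runs of equal size" variant, which may differ from what a real system does.

```python
def cost_sorted_array(n):
    # Every insert rewrites the whole sorted file: 1 + 2 + ... + n writes.
    return sum(range(1, n + 1))


def cost_single_buffer(n, b):
    # Every b inserts, the buffer is merged into the main file,
    # which rewrites the whole file at its new size.
    total, size = 0, 0
    for _ in range(n // b):
        size += b
        total += size
    return total


def cost_layered(n, b):
    # Level i holds at most one run of b * 2**i records. Two runs of
    # the same size are merged into one run at the next level.
    occupied = []  # occupied[i] is True if level i currently holds a run
    total = 0
    for _ in range(n // b):
        total += b  # flush the buffer as a new run of size b
        i = 0
        while i < len(occupied) and occupied[i]:
            occupied[i] = False
            total += b * 2 ** (i + 1)  # write the merged run
            i += 1
        if i == len(occupied):
            occupied.append(True)
        else:
            occupied[i] = True
    return total


naive = cost_sorted_array(1024)
buffered = cost_single_buffer(1024, 16)
layered = cost_layered(1024, 16)
```

With $N = 1024$ and $B = 16$, each idea is cheaper than the previous one, matching the $O(N^2)$, $O(\frac{N^2}{B})$, $O(N\log(N))$ progression.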
Everything reduces to a small set of primitive operations with well-defined equivalence rules and simplifications.
One approach: Scalable Linear Algebra on a Relational Database System
Leverage an existing relational database optimizer to schedule execution for evaluating a linear algebra expression
Another approach: Spark MLlib Linear Methods
Unoptimized implementation directly over Spark RDDs (Top Hit: "Spark for Linear Algebra: don't - ChapterZero")
Apples-to-apples comparison
Ideas: Combine the Two: Linear algebra over Spark DataFrames/Catalyst, and/or build your own optimizer
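To make "linear algebra on a relational optimizer" concrete, here is a minimal sketch that stores each matrix as `(i, j, v)` tuples and expresses matrix multiplication as a join plus aggregate. This is one simple relational encoding chosen for illustration, not necessarily the one used in the work cited above; the table names `a` and `b` and the helper `to_rows` are hypothetical. It uses SQLite from the Python standard library.

```python
import sqlite3


def to_rows(matrix):
    # Flatten a list-of-lists matrix into (row, column, value) tuples.
    return [(i, j, v) for i, row in enumerate(matrix) for j, v in enumerate(row)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (i INTEGER, j INTEGER, v REAL)")
conn.execute("CREATE TABLE b (i INTEGER, j INTEGER, v REAL)")
conn.executemany("INSERT INTO a VALUES (?, ?, ?)", to_rows([[1, 2], [3, 4]]))
conn.executemany("INSERT INTO b VALUES (?, ?, ?)", to_rows([[5, 6], [7, 8]]))

# C[i][j] = sum over k of A[i][k] * B[k][j]:
# join on the shared index k, then aggregate per output cell.
product = conn.execute(
    """
    SELECT a.i, b.j, SUM(a.v * b.v)
    FROM a JOIN b ON a.j = b.i
    GROUP BY a.i, b.j
    ORDER BY a.i, b.j
    """
).fetchall()
```

Once the expression is a query, join ordering and aggregation placement are decisions the database optimizer already knows how to make.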
How do you benchmark database systems?
Find a dataset
What (representative) queries do you ask?
Find a query log
Data is often full of PII and harder to access
Easy to get data OR queries, but hard to get both.
Idea 3: Get one from the other.
You may also have affected-row counts
Given a log of SQL queries + DDL expressions, generate a (representative) dataset that the query log will run over.
Example: Phone Lab Trace
Example: Sloan Digital Skyserver
UPDATE foo SET A = 1 WHERE B > 5;
Replace DDL/DML operations with equivalent "reenactment queries"
CREATE VIEW foo_v2 AS
SELECT CASE WHEN B > 5 THEN 1 ELSE A END AS A,
       B, C, D, ...
FROM foo_v1;
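A quick sanity check of the reenactment idea, as a sketch: it runs the `UPDATE` on a copy of the table and compares the result to the view. The three-column schema and the sample rows are made up for illustration; SQLite (via Python's standard library) stands in for whatever engine the log came from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data for the "before" version of foo.
conn.execute("CREATE TABLE foo_v1 (A INTEGER, B INTEGER, C TEXT)")
conn.executemany(
    "INSERT INTO foo_v1 VALUES (?, ?, ?)",
    [(10, 3, "x"), (20, 6, "y"), (30, 9, "z")],
)

# Reenactment: the update becomes a read-only view over the old version.
conn.execute(
    """
    CREATE VIEW foo_v2 AS
    SELECT CASE WHEN B > 5 THEN 1 ELSE A END AS A, B, C
    FROM foo_v1
    """
)
reenacted = conn.execute("SELECT A, B, C FROM foo_v2 ORDER BY B").fetchall()

# Reference: actually apply the update to a copy of the data.
conn.execute("CREATE TABLE foo AS SELECT * FROM foo_v1")
conn.execute("UPDATE foo SET A = 1 WHERE B > 5")
updated = conn.execute("SELECT A, B, C FROM foo ORDER BY B").fetchall()
```

Both queries return the same rows, and `foo_v1` is never modified, so every version of the table stays queryable.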
But you wind up with gnarly queries...
Idea: Automate table extraction
Directory → Table[s]
(Project co-advised by William Spoth)