# CSE 662 - Fall 2019

Addressing the challenges of big data requires a combination of human intuition and automation. Rather than tackling these challenges head-on with build-from-scratch solutions, or through general-purpose database systems, developer and analyst communities are turning to building blocks: Specialized languages, runtimes, data-structures, services, compilers, and frameworks that simplify the task of creating a systems that are powerful enough to handle terabytes of data or more or efficient enough to run on your smartphone. In this class, we will explore these fundamental building blocks and how they relate to the constraints imposed by workloads and the platforms they run on.

Coursework consists of lectures and a multi-stage final project. Students are expected to attend all lectures and present at least one paper related to their project. Projects may be performed individually or in groups. Projects will be evaluated in three stages through code deliverables, reports, and group meetings with the instructor (or in rare cases a designated project supervisor). During these meetings, instructors will question the entire group extensively about the group's report, deliverables, and any related tools and technology.

1. At initial stage, students are expected to demonstrate a high level of proficiency with the tools, techniques, data structures, algorithms and source code that will form the basis of their project. The group is expected to submit and defend a roughly 5-page report surveying the space in which their project will be performed. This report and presentation constitute 15% of the final grade.
2. At the second stage, students are expected to provide a detailed design for their final project. A roughly 5-page report should outline the group’s proposed design, any algorithms or data structures being introduced, as well as a strategy for evaluating the resulting project against the current state of the art. This report and presentation constitute 15% of the final grade.
3. At the final stage, students are expected to provide a roughly 5-page report detailing their project, any algorithms or data structures developed, and evaluating their project against any comparable state of the art systems and techniques. Groups will also be expected to demonstrate their project and present their findings in-class, or in a meeting with both instructors if necessitated by time constraints. This report and presentation constitute 50% of the final grade.

## Information

• Instructors
• Oliver Kennedy (Capen 212; Office Hours TBD)
• Time: TR 5:00-6:20 PM
• Location: NSC 201
• Checkpoint Submissions
• Checkpoint 1: Sept. 25
• Checkpoint 2: Oct. 23
• Checkpoint 3: Dec. 6

## Course Objectives

After the taking the course, students should be able to:

• Design domain specific query languages, by first developing an understanding the common tropes of a target domain, exploring ways of allowing users to efficiently express those tropes, and developing ways of mapping the resulting programs to an efficient evaluation strategy.

• Identify concurrency challenges in data-intensive computing tasks, and address them through locking, code associativity, and correctness analysis.

• Understand a variety of index data structures, as well as their application and use in data management systems for high velocity, volume, veracity, and/or variety data.

• Understand query and program compilation techniques, including the design of intermediate representations, subexpression equivalence, cost estimation, and the construction of target-representation code.

## Lecture Notes

• Aug 27 - Intro and Seeds (slides)
• Sep 3 - Functional Data Structures (slides)
• Sep 5 - Lazy Transactions (reading)
• Sep 10 - Consistency through Monotonicity (reading | slides)
• Sep 12 - Recursive View Maintenance (reading)
• Sep 19 - Learned Index Structures presented by Alpha Nebula (slides)
• Sep 24 - Program Slicing (slides)
• Sep 26 - Linear Algebra on SQL presented by the Lannisters (reading | slides)
• Oct 1 - Interpretable Deep Learning (reading 1 | reading 2)
• Oct 3 - SkyServer on MonetDB presented by SQLatin (reading | slides)
• Oct 8 - Software Transactional Memory (reading)
• Oct 10 - Skyserver (continued)
• Oct 15 - NoDB / RAW (paper 1 | paper 2)
• Oct 17 - Group Presentations
• Oct 22 - Legorithmics (paper)
• Oct 24 - Streaming (paper)
• Oct 29 - Scan Sharing (paper)
• Oct 31 - Group Presentations

## Project Teams

### The Lannisters

• Varsha Ganesh
• Lakshmi Narasimhavihari Vemuri
• Srinivas Rishindra Rishindra Pothireddi

(Linear Algebra on Spark)

### SQLatin

• Cheng En Chuang
• William Lawrence Stewart
• Zhenyu Yang
• Komlan T Aziagba

(Simulating Data from SQL Logs)

### Alpha Nebula

• Deepak Ranjan
• Yash Narendra Saraf

(Replicate CrimsonDB)

## Project Seeds

### Replicate Learned Index Structures

A few years ago, a group at Google/MIT/Brown developed a new form of data structure for read-heavy workloads based on the observation that the goal of an index structure was to minimize random IOs by getting you as close to a target value as possible. If your data values have a uniformly distributed key, you can get reasonably close to any target key by taking $position = \frac{target}{high - low}$.
This type of trick works on on any dataset where you can devise a magic function $f(target)$ that returns a position reasonably close to your target (i.e., mirroring the CDF). The paper's insight is that a small neural network, trained on the CDF of the keys could be used to implement $f$. The overarching goal of this project is to replicate the author's results, implementing components as needed, across a wide range of datasets and workloads, and comparing learned data structures to alternative indexes. For comparison points, see Trevor Brown's homepage, which includes a catalog of competitive datastructures. A successful project would ideally confirm results in the paper, and identify workloads not explored in the paper where learned datastructures do not perform as well as other options.

### Replicate CrimsonDB

Papers: CrimsonDB Website

Log Structured Merge Trees are a family of write-favored index structures that maintain data in a set of internally sorted runs of data. Runs are organized into "levels" or "tiers" where runs within a level are of comparable size and typically a constant factor larger than the prior level. As data is added, runs are merged into larger runs, moving data to progressively lower/larger levels. Although the basic principles are similar across LSM-Trees, there are a wide range of specific implementations, each making a range of different design decisions. The researchers developing CrimsonDB have developed a generic LSM-tree framework generalizing, and building on most popular LSM-tree implementations. The overarching goal of this project is to replicate the author's results, implementing components as needed, across a wide range of datasets and workloads, and comparing learned data structures to alternative indexes. For comparison points, see Trevor Brown's homepage, which includes a catalog of competitive datastructures. A successful project would ideally confirm specific results from CrimsonDB publications, and identify workloads or configurations where the CrimsonDB approach does not perform as well as other options.

### A Linear Algebra Optimizer on Spark

Linear algebra and relational algebra share much in common.
Both deal with computationally straightforward operations replicated over large data. Both have standard equivalence rules that can be used to re-organize computation into a more efficient, yet equivalent form. The key difference is that linear algebra expressions operate over dense, heavily structure data, while relational algebra targets sparse heavily structured data. There have been several efforts to bridge the two (the paper above being one such effort), which if successful could create a much needed bridge between databases and tools for machine learning. The goal of this project is to integrate some of these ideas to a distributed relational data processing framework: Spark. A successful project would demonstrate an efficient tool for computing standard linear algebra operations (at a minimum: matrix/vector addition/multiplication) through Spark's dataset infrastructure.
Hacking on Spark's Catalyst optimizer will be likely required to be successful.

### Simulating Data from SQL Logs

Systems work in databases requires effective benchmarks, which in turn require two things: Realistic query/update workloads and realistic datasets. Datasets are plentiful, and query/update logs can also be found if one looks hard enough (e.g., SDSS, PhoneLab), but datasets with accompanying query/update logs are extremely rare. The goal of this project is to work backwards: Given a log of SQL queries, DDL operations, and optionally accompanying statistics (e.g., how many records are returned for a query), can you develop a model of the data that can be used to synthesize a dataset for the workload. For example, a SQL query constrains the schema of all tables it accesses, mandating that they contain specific fields. Similarly, an equi-join between two collumns suggests a similar domain, if not distribution of data. A successful project would develop a model of constraints recoverable from a SQL log, use something like a constraint solver to synthesize a datset that satisfies these constraints, and evaluate its performance on one or more logs.

Although classical databases support SQL DDL operations, there are a wide variety of reasons why one might wish to keep data immutable. (1) Most data sources (CSV files, Hive, URLs) are either unfriendly to point updates or outright don't support writes. (2) Changing data in-place loses track of older versions of the data, which may be useful for some applications.
Reenactment is a technique for translating SQL DDL operations (INSERT, UPDATE, DELETE) into equivalent queries.
Similar techniques have been developed for SQL DML operations (CREATE, DROP, ALTER), as in the PRISM workbench paper above. Combining both, we can simulate DDL/DML operations through view definitions: each version of the database is a new view defined relative to the previous one. Unfortunately, the complexity of this approach grows with the number of DDL/DML operations applied. The core challenge of this project is to work out ways to make efficient query processing possible on reenactment-style data tables. You may find it convenient to work with an existing relational-agebra based query engine such as Mimir

### Schema Discovery for Directories and Zip Files

(Supervised by Will Spoth)

Researchers, scientific tools, and data repositories often provide datasets not as individual data files, but rather as directory hierarchies of files (e.g. execution traces).
Ingesting such datasets into a standard data analytics tool like Spark, a spreadsheet, or a relational database is challenging, as it requires users to explore and understand the structure of each individual file, as well as how all of the files relate to one another. The goal of this project is to automate the process, or at least streamline it with a human in the loop. A successful project would develop an a tool (could be graphical or textual) that targets a directory, infers a structure over it, and proposes a set of relational tables through which files in the directory may be queried. As a bonus goal, a project could also develop a query optimizer that allows efficient in-situ querying of the data. For this project, you may wish to build on existing work on schema detection in JSON and XML data (as in DataGuides above).