# CSE 662 - Fall 2017

Addressing the challenges of big data requires a combination of human intuition and automation. Rather than tackling these challenges head-on with build-from-scratch solutions, or through general-purpose database systems, developer and analyst communities are turning to building blocks: Specialized languages, runtimes, data-structures, services, compilers, and frameworks that simplify the task of creating a systems that are powerful enough to handle terabytes of data or more or efficient enough to run on your smartphone. In this class, we will explore these fundamental building blocks and how they relate to the constraints imposed by workloads and the platforms they run on.

Coursework consists of lectures and a multi-stage final project. Students are expected to attend all lectures. Projects may be performed individually or in groups. Projects will be evaluated in three stages through code deliverables, reports, and group meetings with either or both of the instructors. During these meetings, instructors will question the entire group extensively about the group's report, deliverables, and any related tools and technology.

1. At initial stage, students are expected to demonstrate a high level of proficiency with the tools, techniques, data structures, algorithms and source code that will form the basis of their project. The group is expected to submit and defend a roughly 5-page report surveying the space in which their project will be performed. This report and presentation constitute 15% of the final grade.
2. At the second stage, students are expected to provide a detailed design for their final project. A roughly 5-page report should outline the group’s proposed design, any algorithms or data structures being introduced, as well as a strategy for evaluating the resulting project against the current state of the art. This report and presentation constitute 35% of the final grade.
3. At the final stage, students are expected to provide a roughly 5-page report detailing their project, any algorithms or data structures developed, and evaluating their project against any comparable state of the art systems and techniques. Groups will also be expected to demonstrate their project and present their findings in-class, or in a meeting with both instructors if necessitated by time constraints. This report and presentation constitute 50% of the final grade.

## Information

• Instructors
• Oliver Kennedy (Davis 338H; Office Hours Weds 1:00-3:00)
• Time: MWF 12:00-12:50
• Location: Knox 04
• Slack Channel: #cse662-fall2017 on https://ubodin.slack.com/

## Course Objectives

After the taking the course, students should be able to:

• Design domain specific query languages, by first developing an understanding the common tropes of a target domain, exploring ways of allowing users to efficiently express those tropes, and developing ways of mapping the resulting programs to an efficient evaluation strategy.

• Identify concurrency challenges in data-intensive computing tasks, and address them through locking, code associativity, and correctness analysis.

• Understand a variety of index data structures, as well as their application and use in data management systems for high velocity, volume, veracity, and/or variety data.

• Understand query and program compilation techniques, including the design of intermediate representations, subexpression equivalence, cost estimation, and the construction of target-representation code.

## Course Schedule

• Aug. 28 : Introduction [ slides | form groups ]
• Aug. 30 : Project Seeds [ slides ]
• Sept. 01 : Functional Data Structures [ slides ]
• Sept. 04 : No Class, Labor Day
• Sept. 06 : Database Cracking [ paper | feedback | slides ]
• Sept. 08 : Just-in-Time Data Structures [ paper | slides ]
• Sept. 11 : Incomplete Databases 1 [ paper | feedback | notes ]
• Sept. 13 : Incomplete Databases 2
• Sept. 15 : Sampling From Probabilistic Queries
• Sept. 18 : Mimir [ paper ]
• Sept. 20 : Probabilistic Constraint Repair [ paper | feedback ]
• Sept. 22 : Online Aggregation [ paper ]
• Sept. 25 : R-Trees and Multidimensional Indexing [ paper ]
• Checkpoint 1 report due by 11:59 PM Sept. 26
• Sept. 27 : Elevator Pitches
• Sept. 29 : CSE50/Emerging Trends in Computing [ agenda ]
• Oct. 02 : BloomL [ paper-1, paper-2 ]
• Oct. 04 - Oct. 6 : Oliver Away (Content TBD)
• Oct. 09 : NoDB [ paper ]
• Oct. 11 - Oct. 13 : Student Project Presentations
• Oct. 16 : Lazy Transactions [ paper ]
• Oct. 18 - Oct. 20 : Student Project Presentations
• Checkpoint 2 report due by 11:59 PM Oct. 22
• Oct. 23 - Oct. 27 : Checkpoint 2 Reviews
• Oct. 30 : Streaming [ paper ]
• Nov. 01 - Nov. 3 : Student Project Presentations
• Nov. 06 - Nov. 10 : Luke Ziarek: Software-Transactional Memory
• Nov. 13 : Scan Sharing [ paper ]
• Nov. 15 - Nov. 17 : Student Project Presentations
• Nov. 20 : Declarative Games [ paper ]
• Nov. 22 - Nov. 24 : Fall Break
• Nov. 27 : Buffer
• Nov. 29 - Dec. 1 : Student Project Presentations
• Checkpoint 3 report due by 11:59 PM Dec. 3
• Dec. 04 - Dec. 8 : Checkpoint 3 Reviews
• Demo Day Time/Location To Be Announced

## Project Seeds

#### Deferred Constraint-Based Data Validation

There are a number of reasons that data might go bad: sensor errors, data entry errors, lag spikes, filesystem corruption, and more. One thing is certain though: you don't want to make decisions based on bad data. What people will often do is do basic sanity checks. For example, if we have a record of someone checking out of a hospital, we should have a record of them checking in at some point earlier. Declaring these sanity checks is hard, but fixing violations is even harder. In this project, you will explore ways to safely defer repairing the data. Challenges include:

1. Deciding what types of sanity constraints you want to be able to support.
2. Interfacing with an existing database (Spark, SQLite, or Oracle) to identify sets of tuples that come together to violate a sanity constraint.
3. Using Mimir to warning users when a query result depends on a tuple that participates in a violation
4. Suggesting and ranking modifications that repair violations

#### Query Sampling Optimizer

Most probabilistic database systems aim to produce all possible results. A few, most notably MCDB, instead generate samples of possible results. The basic idea is to split the database into a fixed number (N) of possible worlds, and run the query on all N possible worlds in parallel. There are actually a few different ways to do this. Three relatively common examples include:

• Naive: Literally run N copies of the query and union the results at the end.
• Interleave: Tag each tuple with the possible world that it comes from, and then just run one query. Make sure the query ensures that tuples from different possible worlds can't interact (i.e., Joins always happen between tuples from the same world and the world becomes another group-by column)
• Tuple Bundle: Create mega-tuples, that represent alternative versions of the same tuple in different possible worlds. If an attribute value is the same in all possible worlds store only one copy of it. (See MCDB)

Perhaps counterintuitively, our preliminary implementations of the Interleave and Tuple Bundle algorithms suggest that none of these approaches will be the best in all cases. For example, in a simple select-aggregate query, tuple-bundles are the most efficient. Conversely, if you're joining on an attribute with different values in each possible world, interleave will be faster. We suspect that there are some cases where Naive will win out as well. The aim of this project is to implement a query optimizer for sampling-based probabilistic database queries. If I hand you a query, you tell me which strategy is fastest for that query. As an optional extension, you may be able to interleave different strategies, each evaluating a different part of the query.

#### Explaining Offset-Outliers

When looking at queries, a common question is "why is this result the way it is?" While a broad question, database researchers have been hard at work isolating and addressing specific cases. For this particular project, we'd like to explore one specific category of explanation, where users have provided us with points of stability: Group-by aggregates that are supposed to remain stable over time. For example, consider  SELECT author, STDDEV(cnt) FROM ( SELECT author, year, COUNT(*) AS cnt FROM Publications );  This query gives the variation per user in terms of number of publications per year. We might use a query like this to define a constraint that says "For any author, the number of publications per year stays roughly constant". Constraints like this help can us to explain aggregate values. For example let's say you run the following query and the result is lower than expected.  SELECT COUNT(*) FROM Publications WHERE author = 'Alice' AND venue = 'ICDE' AND year = 2017;  If you ask "Why is this result so low", the system can look at the above constraint and figure out that there's another aggregate query that is higher than usual (to preserve the stability of the publications/year constraint defined above)  SELECT COUNT(*) FROM Publications WHERE author = 'Alice' AND venue = 'ICDE' AND year = 2017;  The aim of this project would be to implement a simple frontend to an existing database system (Spark, SQLite, or Oracle) that accepts a set of constrants and answers questions like this.

(This project is part of ongoing joint work with Boris Glavic and Sudeepa Roy)

#### Physical Layouts for Multiversion (Uncertain) Data

Classical versioning is a monotone operation: It’s rare that someone will want to maintain parallel versions of the data. Conversely, data cleaning requires us to keep track of many different versions of a dataset. For example, there exist some very powerful regression algorithms that can detect outliers very effectively. However, these techniques can't really point out why those outliers are there. Maybe there's missing context that would explain the outlier? Maybe there's an actual data error? Maybe there's a problem with how the data is being interpreted. In short, every outlier should be classified as an "optional" version. In other words, for every outlier, we may wish to fork the data, creating one set of versions with and an otherwise equivalent set without the outlier. Obviously, this will create an exponential number of versions, so we need some ways to eliminate redundancy in the stored version.

Fundamentally, the aim of this project is to outline a range of different workflow options for uncertain data, and derive one or more techniques for how to store, sort, index, and query this data.

#### Garbage Collection in Embedded Databases

The PocketData project explores the performance of database systems designed for embedded devices (e.g., smartphones, tablets, or sensor networks). As part of this project, we have developed several benchmark workloads aimed at Android and other Java-based settings. Although Java is a garbage-collected language, the short duration of most of our workloads means that they rarely (if ever) involve interactions with the garbage collector. The aim of this project is to develop an understanding of when (if ever) the garbage collector impacts the performance of embedded databases like SQLite or BerkeleyDB.

(This project will be co-advised by Lukasz Ziarek)

###### Background Material:

Indexes work by reducing the effort required to locate specific data records. For example, in a tree index, if the range of records in a given subtree doesn't overlap with the query, the entire subtree can be ruled out (or ruled in). Not surprisingly, this means that data partitioning plays a large role in how effective the index is. The fewer partitions lie on query boundaries, the less work is required to respond to those queries.

Partitioning is especially a problem in 2-dimensional (and 3-, 4-, etc... dimensional) indexes, where there are always two entirely orthogonal dimensions to partition on. Accordingly, there's a wide range of techniques for organizing 2-dimensional data, including a family of indexes based on R-Trees. The aim of this project is to develop a "dynamic" r-like tree structure that adaptively partitions its contents, and (if time permits) that adapts its partition boundaries to changing workloads.

#### Mimir on SparkSQL

Spark's DataFrames are a powerful set of relational-algebra-like primitives for defining computation that can efficiently run locally or in a distributed setting. However, because Spark aimed at predominantly analytical workloads, it can not be used directly as a drop-in replacement for SQLite. The aim of this project is to transition a large database application (Mimir) from a classical relational database to Spark. Key challenges include:

1. Spark is generally designed to be read-only. Mimir needs to keep track of a variety of metadata. That means this metadata needs to be stored somewhere on the side. Step one will to create a metadata storage and lookup layer.
2. Rewriting components of Mimir to use this metadata layer.
• The View Manager stores and tracks view definitions and associated metadata.
• The Model Manager stores and tracks materialized instances of a variety of different models.
3. The Compiler infrastructure and Backend will need to be modified to work with Spark Data Frames. Because Data Frames are relatively close to relational algebra, it may be best to go directly from one to the other without using SQL as an intermediate.