CSE 662 - Fall 2018

Addressing the challenges of big data requires a combination of human intuition and automation. Rather than tackling these challenges head-on with build-from-scratch solutions, or through general-purpose database systems, developer and analyst communities are turning to building blocks: Specialized languages, runtimes, data-structures, services, compilers, and frameworks that simplify the task of creating a systems that are powerful enough to handle terabytes of data or more or efficient enough to run on your smartphone. In this class, we will explore these fundamental building blocks and how they relate to the constraints imposed by workloads and the platforms they run on.

Coursework consists of lectures and a multi-stage final project. Students are expected to attend all lectures. Projects may be performed individually or in groups. Projects will be evaluated in three stages through code deliverables, reports, and group meetings with either or both of the instructors. During these meetings, instructors will question the entire group extensively about the group's report, deliverables, and any related tools and technology.

At initial stage, students are expected to demonstrate a high level of proficiency with the tools, techniques, data structures, algorithms and source code that will form the basis of their project. The group is expected to submit and defend a roughly 5-page report surveying the space in which their project will be performed. This report and presentation constitute 15% of the final grade.
At the second stage, students are expected to provide a detailed design for their final project. A roughly 5-page report should outline the group’s proposed design, any algorithms or data structures being introduced, as well as a strategy for evaluating the resulting project against the current state of the art. This report and presentation constitute 35% of the final grade.
At the final stage, students are expected to provide a roughly 5-page report detailing their project, any algorithms or data structures developed, and evaluating their project against any comparable state of the art systems and techniques. Groups will also be expected to demonstrate their project and present their findings in-class, or in a meeting with both instructors if necessitated by time constraints. This report and presentation constitute 50% of the final grade.

Information

Instructors
- Oliver Kennedy (Capen 212; Office Hours Weds 10:00-11:45)
Time: MWF 12:00-12:50
Location: Clemens 21

Course Objectives

After the taking the course, students should be able to:

Design domain specific query languages, by first developing an understanding the common tropes of a target domain, exploring ways of allowing users to efficiently express those tropes, and developing ways of mapping the resulting programs to an efficient evaluation strategy.
Identify concurrency challenges in data-intensive computing tasks, and address them through locking, code associativity, and correctness analysis.
Understand a variety of index data structures, as well as their application and use in data management systems for high velocity, volume, veracity, and/or variety data.
Understand query and program compilation techniques, including the design of intermediate representations, subexpression equivalence, cost estimation, and the construction of target-representation code.

Course Schedule

Aug 27: Introduction ( slides )
Aug 29: Project Seeds ( slides )
Aug 31: Functional Data Structures
Sept 3: NO CLASS: LABOR DAY
Sept 5: Learned Index Structures ( paper )

Project Seeds

Decentralized IoT Plumbing

Raspberry Pis, Arduinos, and a slew of commercial products make it easy to deploy sensors and actuators backed by beefy computers everywhere. Of course, while deploying them is easy, orchestrating them is hard. So most IoT systems are centralized: The devices report their readings to the cloud, and the cloud tells them what to do.

For example, let's say I want the thermostat to adjust temperature based on the current temperature of the room that I'm in, and I have temperature and motion sensors in each room of my house. All of the sensors could continuously upload both temperature and motion readings to the cloud, which could then figure out what the thermostat should do, and then tell it.

This approach is, of course A BAD IDEA (e.g., https://mickens.seas.harvard.edu/publications/deadbolt-securing-iot-deployments). So what can we do? Well, one idea is to make the devices coordinate among themselves instead.

We can take advantage of the computing power on each of the sensors and the thermostat in a number of ways:

Have the thermostat poll all of the motion/temperature sensors every N seconds
Have the motion/temperature sensors push readings to the thermostat every time they change.
Have the motion/temperature sensors push readings to the thermostat every time they change, but only if the motion sensor is on... or better still, only if the temperature crosses the threshold for the thermostat to start heating the house. Which of these is best? It depends on the design requirements, available computing/network capabilities, and failure modes that we want to support.

In short, the goal of this seed is to design an orchestration manager for a decentralized IoT system. At a high level, the result should: (1) Take input in a declarative form (e.g., a subset of SQL), (2) Propose several action plans for what events each device should watch for, and how they should react, and (3) Decide which action plan is "best". Bonus kudos if you implement a runtime to actually deploy this plan (we can get you some hardware).

Uncertainty-Aware Machine Learning

Most popular machine learning techniques assume that their data is perfect (or at least that the noise outweighs the signal). This project seed explores what you should do if that's not the case. Let's say you don't have enough signal, or you know that specific parts of your data are incomplete or rough approximations. Can you use that information and still be able to generate meaningful models of your data? At a high level, the outcome of this project should be able to take as input an uncertainty-annotated dataset (e.g., with cells might be defined by probability distributions, or values might include bounds or confidence weights), and produce a reliable model that accounts for these potential errors. The specific type of modeling framework (graphical models, neural networks, regression) is up to you, as is/are the type(s) of uncertainty you want to support. Bear in mind that this problem has likely been explored in the past, and it will also be up to you to first assess what techniques people have tried, and what techniques you want to use.

Web-of-Trust for Crowdsourced Data

The "Web-of-Trust" (https://ieeexplore.ieee.org/abstract/document/883720/) was a de-centralized model for identity management from the late '90s. Without getting too deep into the scheme, the idea was to model indirect trust. With one level of indirection: Alice believes Fact A, and Bob trusts Alice, so Bob believes Fact A as well. Unfortunately this leads to a problem... now if Carol also trusts Bob, should Carol believe Fact A also? The idea behind web of trust was to "decay" trust: In the above example, Bob would really trust Alice (factor = 0.9), so he would have a 0.9 belief in Fact A. Now Carol only slightly trusts Bob (factor = 0.5), so her belief in Fact A would be lower (e.g., 0.5 * 0.9 = 0.45). Let's say Carol decides to trust Dave a lot (0.9) and Dave already trusts Alice a lot (0.9). Now Carol's belief in Fact A goes up (by how much depends on the specific scheme)!

The idea here would be to develop a similar scheme for crowdsourced data. Create a system that allows multiple users to contribute, and develop a way for them to see what possible facts there are, and evaluate how much they should trust each of these facts, based on how much they trust other people.

Sensitivity Analysis in Mimir

https://dl.acm.org/citation.cfm?id=1989411 https://dl.acm.org/citation.cfm?id=1989376

These papers explore the idea of sensitivity analysis in databases. Both are built on tuple-independent probabilistic databases: databases that allow "Uncertain" records to appear in tables. Uncertain records are associated with probabilities: There's some chance that these records do (or do not) actually appear in the database. Normally, probabilistic databases are designed to answer, based on these probabilities, how likely some query result is going to be. Sensitivity analysis takes a different approach: How important is any one of these uncertain records? If I knew for certain whether an uncertain record were in/not in the database, how would that change my confidence in a query result?

Unfortunately, tuple-independent probabilistic databases are not generally considered to be a good model for implementing probabilistic databases. It's a model that's easy to reason about, but much harder to do anything useful with. The Mimir probabilistic database implements a more complex model called C-Tables. The goal of this seed is to implement sensitivity analysis to Mimir, likely through some sort of approximation scheme.

Sandboxed Python

Python is an incredibly powerful language, used extensively in the scientific community for data processing through tools like SciPy and Pandas. Although Python is a great way to command distributed systems (e.g., using Spark), deploying it as part of the distributed computation can be hard if the code is untrusted. Python has full access to the filesystem, network stack, and pretty much everything else on the local machine.

The aim of this seed is to design a lightweight wrapper around Python that: (1) Allows python code to be executed without risk to the underlying platform, (2) Allows computation inputs to be delivered to the executing python through some narrow interface, and (3) Allows computation results to be exported through some narrow interface.

Academic Content

The course will involve lectures and readings drawn from an assortment of academic papers selected by the instructors. There is no textbook for the course.

Academic Integrity

Students may discuss and advise one another on their lab projects, but groups are expected to turn in their own work. Cheating on any course deliverable will result in automatic failure of the course. The University policy on academic integrity can be reviewed at: http://academicintegrity.buffalo.edu/policies/index.php

Accessibility Resources

If you have a diagnosed disability (physical, learning, or psychological) that will make it difficult for you to carry out the course work as outlined, or that requires accommodations such as recruiting note-takers, readers, or extended time on exams or assignments, please advise the instructor during the first two weeks of the course so that we may review possible arrangements for reasonable accommodations. In addition, if you have not yet done so, contact the Office of Accessibility Resources (formerly the Office of Disability Services).