Addressing the challenges of big data requires a combination of human intuition and automation. Rather than tackling these challenges head-on with build-from-scratch solutions, or through general-purpose database systems, developer and analyst communities are turning to building blocks: Specialized languages, runtimes, data-structures, services, compilers, and frameworks that simplify the task of creating a systems that are powerful enough to handle terabytes of data or more or efficient enough to run on your smartphone. In this class, we will explore these fundamental building blocks and how they relate to the constraints imposed by workloads and the platforms they run on.
Coursework consists of lectures and a multi-stage final project. Students are expected to attend all lectures. Projects may be performed individually or in groups. Projects will be evaluated in three stages through code deliverables, reports, and group meetings with either or both of the instructors. During these meetings, instructors will question the entire group extensively about the group's report, deliverables, and any related tools and technology.
After the taking the course, students should be able to:
Design domain specific query languages, by first developing an understanding the common tropes of a target domain, exploring ways of allowing users to efficiently express those tropes, and developing ways of mapping the resulting programs to an efficient evaluation strategy.
Identify concurrency challenges in data-intensive computing tasks, and address them through locking, code associativity, and correctness analysis.
Understand a variety of index data structures, as well as their application and use in data management systems for high velocity, volume, veracity, and/or variety data.
Understand query and program compilation techniques, including the design of intermediate representations, subexpression equivalence, cost estimation, and the construction of target-representation code.
Adaptive indexes are a mechanisms to improve the performance of a database by modifying indexing structures "on-the-fly" instead of in one large and costly bulk operation. Adaptive index have been shown to work very well, but only on worklaods that exhibiti sepcific characteristics. JITDs are a generalization of adapative indexes, allowing for the expression of many different types of adaptive indexes. To do this, they provide basic building blocks called nodes that define index structure. How nodes "fit" together ends up defining index behavior.
This project is aimed at coming up with and studying a specific workload and structuring a policy to manage the creation and usage of nodes based on this workload. The policy should result in runtime behavoir that is beneficial for the workload over classic, workload agnostic policies.
While "big data" is all the rage, another important type of data is the data gathered and curated on your mobile devices. We call this data "pocket data".
In this project you will be given access to anonymized traces of database usage patterns on mobile devices. This data has been gathered "in the wild" through the phonelab project. You will be tasked with analyzing this data set and characterising different applicaitons based on how they utilize pocket data. Based on your analysis, you will be create a benchmark that exhibits the characteristics you observed and test the performance of modern mobile databases and compare and contrast how they perform with respect to classic database implementations.
Mimir is a probabilistic database for data cleaning. The idea is to acknowledge that, while automation is amazing and makes data management easy, automation usually relies on heuristics that could screw things up. Instead of hiding potential screw-ups from the user, Mimir keeps track of where things might go wrong, communicates that information to users, and helps them overcome potential consequences of screw-ups.
While Mimir makes data quality more intuitive, it's still limited to classical SQL queries. For this project, you will explore a variety of graphical user interfaces for data, and integrate one or more of them into Mimir itself.
The goal of this project idea is to first apply a few performance improvements to Mimir's existing SQLite backend, and then port those improvements to Postgres. The main optimization target is a sort of placeholder that Mimir uses to mark places in the data where automated heuristics needed to make a choice (that could be screwed up). The placeholders (or VG Terms) get filled in with Mimir's best guess when the user runs a query... a process that is pretty expensive the way Mimir does it right now. After improving the situation for SQLite, the next step of the project (time permitting) would be to port Mimir (and the fix) to Postgres.
The course will involve lectures and readings drawn from an assortment of academic papers selected by the instructors. There is no textbook for the course.
Students may discuss and advise one another on their lab projects, but groups are expected to turn in their own work. Cheating on any course deliverable will result in automatic failure of the course. The University policy on academic integrity can be reviewed at: http://academicintegrity.buffalo.edu/policies/index.php
If you have a diagnosed disability (physical, learning, or psychological) that will make it difficult for you to carry out the course work as outlined, or that requires accommodations such as recruiting note-takers, readers, or extended time on exams or assignments, please advise the instructor during the first two weeks of the course so that we may review possible arrangements for reasonable accommodations. In addition, if you have not yet done so, contact the Office of Accessibility Resources (formerly the Office of Disability Services).