The ODIn Lab - Research Projects

Making Compilers and Program Analysis Declarative

Modern compilation pipelines (compilers, optimizers, program analysis tools) consume, transform, and produce abstract syntax trees. If you squint a little, these processes start to look a lot like database queries, updates, and views. The ODIn Lab is looking at ways to leverage expertise from the database community to improve the performance of compiler tool chains, while at the same time making it easier to write compilers and program analyses by decoupling the core specification logic from logic aimed at making the compiler faster.

Active Participants:

Darshana Balakrishnan, Nick Brown, Victoria Dib

Alumni:

Ankur Upadhyay, Daniel Bellinger, Hank Lin

Sponsors:

NSF (Award IIS-1617586)

Recent Papers:

On-Disk Datalog Engine using Semirings

Nick Brown, Oliver Kennedy

OnDBD 2024 (Workshop) [ paper | bibtex ]

@inproceedings{brown:2024:ondbd:on-disk,
   author = {Brown, Nick and Kennedy, Oliver},
   title = {On-Disk Datalog Engine using Semirings},
   booktitle = {OnDBD},
   year = {2024}
}

Multiquery Optimization for Declarative Compilers

Darshana Balakrishnan, Oliver Kennedy, Lukasz Ziarek, Johannes Luong, Hinnerk Gildhoff, Gaurav Saxena

OnDBD 2024 (Workshop) [ paper | talk | bibtex ]

@inproceedings{balakrishnan:2024:ondbd:multiquery,
   author = {Balakrishnan, Darshana and Kennedy, Oliver and Ziarek, Lukasz and Luong, Johannes and Gildhoff, Hinnerk and Saxena, Gaurav},
   title = {Multiquery Optimization for Declarative Compilers},
   booktitle = {OnDBD},
   year = {2024}
}

TreeToaster: Towards an IVM-Optimized Compiler

Darshana Balakrishnan, Carl Nuessle, Oliver Kennedy, Lukasz Ziarek

SIGMOD 2021 [ paper | bibtex ]

@inproceedings{balakrishnan:2021:sigmod:treetoaster,
   author = {Balakrishnan, Darshana and Nuessle, Carl and Kennedy, Oliver and Ziarek, Lukasz},
   title = {TreeToaster: Towards an IVM-Optimized Compiler},
   booktitle = {SIGMOD},
   year = {2021}
}

Draupnir [💾📜]

Multiple efforts in the PL community have adopted Datalog for compiler specification, due to its ability to easily define recursive properties. However, datalog also conflates identity and value, and as a result it struggles with aggregation. Draupnir is a compiler for a datalog-adjacent language that draws on group-/ring-theoretic provenance models (e.g., Semiring Provenance, Ring Databases) to capture aggregation.

Spinning Rust [💾📜]

Team: Nick Brown

As the PL community adopts database techniques for building compilers, the scale of analyses that appear feasible is growing. For analyzing binaries or large codebases like Chrome or Linux, purely in-memory techniques are no longer viable. The Spinning Rust project is exploring how we can adapt classical on disk (out-of-core) database techniques for use with declarative program analysis.

DiDB

Team: Victoria Dib

Incremental computation produces a unique data access pattern that is highly predictable. The DiDB (Distributed Incremental Data Base) project is exploring ways to leverage the predictability of these access patterns to minimize the overhead of concurrent updates.

Automating Re-usable and Reproducible Data Science

Data Science is frequently an iterative process, with scientists frequently revisiting and past efforts and adapting them for new research goals. A key challenge in this process is ensuring that efforts are re-used safely. Just like compilers for strongly typed languages help users understand the potential implications of changes in their code, the ODIn lab is looking to develop similar techniques to assist data scientists in developing safe, re-usable, and reproducible data science pipelines.

Active Participants:

Pratik Pokharel

Alumni:

Poonam Kumari, William Spoth, Ting Xie, Ying Yang

Sponsors:

NASA (Awards CURE-C STTR Phase 2, CURE-C STTR Phase 1)
NSF (Awards CNS-2125516, ACI-1640864)

Recent Papers:

Embedding The Context of Attributes in Longitudinal Study Schemas To Refine Their Semantic Match Candidates

Pratik Pokharel, Oliver Kennedy

OnDBD 2024 (Workshop) [ paper | bibtex ]

@inproceedings{pokharel:2024:ondbd:embedding,
   author = {Pokharel, Pratik and Kennedy, Oliver},
   title = {Embedding The Context of Attributes in Longitudinal Study Schemas To Refine Their Semantic Match Candidates},
   booktitle = {OnDBD},
   year = {2024}
}

Drag, Drop, Merge: A Tool for Streamlining Integration of Longitudinal Survey Instruments

Pratik Pokharel, Juseung Lee, Oliver Kennedy, Jeff Good, Marianthi Markatou, Andrew H. Talal, Raktim Mukhopadhyay

HILDA 2024 (Workshop) [ paper (original) | paper (fixed typos) | bibtex ]

@inproceedings{pokharel:2024:hilda:drag,
   author = {Pokharel, Pratik and Lee, Juseung and Kennedy, Oliver and Good, Jeff and Markatou, Marianthi and Talal, Andrew H. and Mukhopadhyay, Raktim},
   title = {Drag, Drop, Merge: A Tool for Streamlining Integration of Longitudinal Survey Instruments},
   booktitle = {HILDA},
   year = {2024}
}

Overlay Spreadsheets

Oliver Kennedy, Boris Glavic, Michael Brachmann

HILDA 2023 (Workshop) [ preprint | slides | video | bibtex ]

@inproceedings{kennedy:2023:hilda:overlay,
   author = {Kennedy, Oliver and Glavic, Boris and Brachmann, Michael},
   title = {Overlay Spreadsheets},
   booktitle = {HILDA},
   year = {2023}
}

Vizier [🕸️💾📜]

Collaborators: Oliver Kennedy, Boris Glavic, Michael Brachmann, Marianthi Markatou, Raktim Mukhopadhyay, Arpan Dharia, Andrew H. Talal, Juliana Freire, Nachiket Deo, Su Feng, Aaron Huber, William Spoth, Ying Lu, Beda Hammerschmidt, Zhen Hua Liu, Poonam Kumari, Fatemeh Nargesian, Heiko Mueller, Sonia Castelo, Carlos Bautista, Zhengjie Miao, Qitian Zeng, Chenjie Li, Sudeepa Roy, Juliana Freire, Remi Rampin, Ying Yang, Ting Xie, Dieter Gawlick, Bahareh Sadat Arab, Eric S. Chan, Adel Ghoneimy, Seokki Lee, Xing Niu, Bahareh Arab, Vasudha Krishnaswamy

Notebooks are a convenient tool for data exploration, allowing users to rapidly explore datasets, while preserving a record of the steps they took to get to their results. However, the notebook metaphor takes several design shortcuts in the name of efficiency that make it difficult to create safely reusable notebooks, and that create confusion for novice users. Vizier is the prototype of a new "workbook" metaphor that presents a notebook style interface to a "microkernel workflow system", which keeps cell results up-to-date, and doesn't have a kernel that needs to be re-started and re-populated.

Drag, Drop, Merge [📜]

Team: Pratik Pokharel

Collaborators: Oliver Kennedy, Poonam Kumari, Juseung Lee, Jeff Good, Marianthi Markatou, Andrew H. Talal, Raktim Mukhopadhyay

Over the course of a multi-year (longitudinal) study, as social sciences researchers gain an improved understanding of the study's domain and subjects, it may be necessary to make incremental changes to the study's data collection procedures. Drag, Drop, Merge (DDM) is a set of tools for working with data collected under evolving methodologies, allowing study designers to flag differences between study iterations, and allowing researchers to quickly select the subset of the data that is appropriate for their analysis.

Safe, Efficient Management of Uncertain Data

Ambiguous, incomplete, conflicting, and otherwise uncertain data is difficult to use safely, often requiring days or even weeks of careful study to understand the uncertainty's implication on downstream analyses. Exploration, and re-use of uncertain data pose a particularly significant challenge, as it becomes easy to lose track of assumptions made during one phase of a data science project. The ODIn lab is exploring how uncertainty in data can be precisely modeled without impeding a researcher's ability to explore or analyze their data.

Alumni:

Aaron Huber, Poonam Kumari, William Spoth, Ting Xie, Gourab Mitra, Vinayak Karuppasamy, Arindam Nandi, Niccolò Meneghetti, Ying Yang, Sneha Krishnamurthy, Anand Sankar Bhagavandas, Shivang Aggarwal

Sponsors:

NASA (Awards CURE-C STTR Phase 2, CURE-C STTR Phase 1)
Oracle University Relations (Awards liu-2641, 9325-756926)
NSF (Awards IIS-1956149, IIS-1750460, ACI-1640864)
The US Naval Postgraduate School (Award N00244-16-1-0022)

Recent Papers:

FastPDB: Towards Bag-Probabilistic Queries at Interactive Speeds

Aaron Huber, Oliver Kennedy, Atri Rudra, Zhuoyue Zhao, Su Feng, Boris Glavic

SIGMOD 2025 [ paper | test-scripts | fastpdb | bibtex ]

@inproceedings{huber:2025:sigmod:fastpdb,
   author = {Huber, Aaron and Kennedy, Oliver and Rudra, Atri and Zhao, Zhuoyue and Feng, Su and Glavic, Boris},
   title = {FastPDB: Towards Bag-Probabilistic Queries at Interactive Speeds},
   booktitle = {SIGMOD},
   year = {2025}
}

Efficient Approximation of Certain and Possible Answers for Ranking and Window Queries over Uncertain Data

Su Feng, Boris Glavic, Oliver Kennedy

pVLDB 2023 [ paper | bibtex ]

@article{feng:2023:pvldb:efficient,
   author = {Feng, Su and Glavic, Boris and Kennedy, Oliver},
   title = {Efficient Approximation of Certain and Possible Answers for Ranking and Window Queries over Uncertain Data},
   journal = {pVLDB},
   year = {2023}
}

Computing expected multiplicities for bag-TIDBs with bounded multiplicities

Su Feng, Boris Glavic, Aaron Huber, Oliver Kennedy, Atri Rudra

ArXiv 2022 (Tech Report) [ paper | bibtex ]

@techreport{feng:2022:arxiv:computing,
   author = {Feng, Su and Glavic, Boris and Huber, Aaron and Kennedy, Oliver and Rudra, Atri},
   title = {Computing expected multiplicities for bag-TIDBs with bounded multiplicities},
   institution = {ArXiv},
   year = {2022}
}

Mimir [💾📜]

Collaborators: Aaron Huber, Oliver Kennedy, Atri Rudra, Zhuoyue Zhao, Su Feng, Boris Glavic, Juliana Freire, Michael Brachmann, Poonam Kumari, Zhengjie Miao, Qitian Zeng, Chenjie Li, Sudeepa Roy, Niccolò Meneghetti, Wolfgang Gatterbauer, William Spoth, Ting Xie, Ying Yang, Beda Hammerschmidt, Zhen Hua Liu, Dieter Gawlick, Bahareh Sadat Arab, Eric S. Chan, Adel Ghoneimy, Seokki Lee, Xing Niu, Said Achmiz, Heiko Mueller, Bahareh Arab, Vasudha Krishnaswamy, Arindam Nandi, Ronny Fehling, Jan Chomicki, Gourab Mitra, Vinayak Karuppasamy, Sneha Krishnamurthy, Anand Sankar Bhagavandas, Shivang Aggarwal

Mimir adds support for uncertain data to Apache Spark by using lightweight annotations called "Caveats" that are propagated through queries. Caveated data and records can carry additional metadata (e.g., textual descriptions of uncertainty, or details about the scope of the uncertainty), but this information is not propagated during normal query evaluation. By only propagating the presence of a caveat in a query result, the performance impact on query evaluation is minimal, and the content of the resulting caveats can be subsequently derived through a form of program slicing applied to queries.

FastPDB [📜]

Collaborators: Aaron Huber, Oliver Kennedy, Atri Rudra, Zhuoyue Zhao, Su Feng, Boris Glavic

Although probabilistic databases (PDBs) are incredibly important for helping data scientists keep track of uncertainty and ambiguity in their data, existing PDBs can also be orders of magnitude slower than normal database engines. FastPDB is the first system to couple a PDB with an Approximate Query Processing engine, providing accurate approximations of probabilistic query results with competitive runtimes.

Uncertainty Annotated Databases [📜]

Collaborators: Su Feng, Boris Glavic, Oliver Kennedy, Aaron Huber

The UADB project, in collaboration with UIC and Nanjing Tech, is a lightweight model of incomplete data based on ranges. Instead of attempting to produce exact results, UADBs instead answer queries with bounds (both lower and upper) on the space of possible results. In addition to providing more valuable feedback for users, these bounds can be relaxed to improve performance.

Paused and Completed Projects

Loki [📜]

Collaborators: Poonam Kumari, William Spoth, Oliver Kennedy, Fatemeh Nargesian, Gourab Mitra

The Label Once and Keep It project looks at how we can incrementally curate data lakes by tracking how people integrate data from multiple sources. Like popular work on Unionability, our goal is to identify related datasets, but we also support datasets that must be transformed (or 'mapped') in order to be unionable.

Status: Evolved into the DDM project

Pocketdata [📜]

Collaborators: Carl Nuessle, Oliver Kennedy, Lukasz Ziarek, Jerry Antony Ajay, Geoffrey Challen, Gökhan Kul, Gourab Mitra, Grant Wrazen

Mobile phone CPUs can slow down to save power. An OS component called the 'governor' chooses when and how this happens. It turns out most governors on Android are over-engineered, making predictions based on useless data. The Pocketdata project explores how we can use live information from apps to make better choices for power management.

Status: Project completed

Jxplain [📜]

Collaborators: William Spoth, Oliver Kennedy, Ying Lu, Beda Hammerschmidt, Zhen Hua Liu

Ad-hoc data models like Json simplify schema evolution and enable multiplexing various data sources into a single stream. While useful when writing data, this flexibility makes Json harder to validate and query, forcing such tasks to rely on automated schema discovery techniques. Unfortunately, ambiguity in the schema design space forces existing schema discovery systems to make simplifying, data-independent assumptions about schema structure. When these assumptions are violated, most notably by APIs, the generated schemas are imprecise, creating numerous opportunities for false positives during validation. The Jxplain system adopts a set of heuristics that effectively resolve this ambiguity.

Status: Project completed

Ettu [📜]

Collaborators: Ting Xie, Varun Chandola, Oliver Kennedy, Gökhan Kul, Duc Thanh Luong, Shambhu Upadhyaya, Patrick Coonan, Shambhu Upadhyay, Thomas Mitchell

One of the greatest threats to a the security of a database system comes from within: Users who have been granted access to data using it in a malicious or illegitimate way. Our goal is to develop new types of statistical signatures for a user or role's behavior as they access a database. Using these signatures, we can identify non-standard behvaior that could be evidence of malicious activity.

Status: Project completed

Leaky Joins [📜]

Collaborators: Ying Yang

One of the primary challenges in graphical models is inference, or re-constructing a marginal probability from the graphical model's factorized representation. While tractable for some graphs, the cost of inference grows exponentially with the graphical model's complexity, necessitating approximation for more complex graphs. The leaky join algorithm couples together incremental view maintenance with approximate query processing to produce an anytime algorithm for inference in a very large-scale Bayes net.

Status: Project completed

DBToaster [🕸️📜]

Collaborators: Christoph Koch, Yanif Ahmad, Oliver Kennedy, Milos Nicolic, Andres Nötzli, Daniel Lupei, Amir Shaikhana

DBToaster is an SQL-to-native-code compiler. It generates lightweight, specialized, embeddable query engines for applications that require real-time, low-latency data processing and monitoring capabilities. The DBToaster compiler generates code that can be easily incorporated into any C++ or JVM-based (Java, Scala, ...) project. Since 2009, DBToaster has spearheaded the currently ongoing database compilers revolution. If you are looking for the fastest possible execution of continuous analytical queries, DBToaster is the answer. DBToaster code is 3-6 orders of magnitude faster than all other systems known to us.

Status: This project continues at EPFL and the University of Edinburgh; Here, it evolved into the declatative compiler project

Pip [📜]

Collaborators: Suman Nath, Steve Lee, Slawek Smyl, Charles Loboz, Oliver Kennedy, Christoph Koch

Pip was one of the first probabilistic database systems to cope with value-level uncertainty (as opposed to row-level uncertainty). In contrast to purely sampling-based approaches like MCDB, Pip was able to produce exact analytical results in a fraction of the time.

Status: This project evolved into the Mimir system

This page last updated 2025-07-15 10:35:33 -0400