The ODIn Lab - Database Systems for Data Quality and Curation

Database Systems for Data Quality and Curation

This course will survey research systems, tools, and techniques for data quality management. Aspects of data quality covered will include interfaces and human-in-the-loop approaches to data refinement, techniques for schema and structure detection, and automated data ingest.

Details

Instructor: Oliver Kennedy (Capen 211; Inside the Library)
Office Hours: Weds: 10AM-Noon
Seminar Time: Weds: 2:00-3:30
Seminar Location: Capen 212A (Inside the Library)

The course is graded Sat/Unsat. For a satisfactory grade:

Present at least one paper listed below
Attend class (Max 3 absences regardless of reason)
Come prepared, having read the paper and ready to participate in discussion about the paper. If necessary, I will call on students randomly to answer questions.

Schedule

Feb 6	Class Canceled due to Snow
Feb 13	William Spoth presents 'Fusing data with correlations'
Feb 20	Jason Kim presents 'A sample-and-clean framework for fast and accurate query processing on dirty data'
Feb 27	Carl Nuessle presents 'ICARUS: Minimizing Human Effort in Iterative Data Completion'
Mar 6	Darshana Balakrishnan presents 'Query Optimization for Dynamic Imputation'
Mar 13	Ting Xie presents 'Incremental knowledge base construction using deepdive'
Mar 20	Spring Break
Mar 27	Poonam Kumari presents 'Wrangler: Interactive visual specification of data transformation scripts'
Apr 3	Poojitha Alahari presents 'Northstar: An Interactive Data Science System'
Apr 10	Carl Nuessle presents 'Synthesizing Type-Detection Logic for Rich Semantic Data Types using Open-source Code'
Apr 17	Jason Kim presents 'Descriptive and Prescriptive Data Cleaning'
Apr 24	Poonam Kumari presents 'Data canopy: Accelerating exploratory statistical analysis'
May 1	William Spoth presents 'Controlling False Discoveries During Interactive Data Exploration'
May 8	No One Signed Up (yet)

Suggested Papers

1. A sample-and-clean framework for fast and accurate query processing on dirty data (claimed by Jason Kim)

In emerging Big Data scenarios, obtaining timely, high-quality answers to aggregate queries is difficult due to the challenges of processing and cleaning large, dirty data sets. To increase the speed of query processing, there has been a resurgence of interest in sampling-based approximate query processing (SAQP). In its usual formulation, however, SAQP does not address data cleaning at all, and in fact, exacerbates answer quality problems by introducing sampling error. In this paper, we explore an intriguing opportunity. That is, we explore the use of sampling to actually improve answer quality. We introduce the Sample-and-Clean framework, which applies data cleaning to a relatively small subset of the data and uses the results of the cleaning process to lessen the impact of dirty data on aggregate query answers. We derive confidence intervals as a function of sample size and show how our approach addresses error bias. We evaluate the Sample-and-Clean framework using data from three sources: the TPC-H benchmark with synthetic noise, a subset of the Microsoft academic citation index and a sensor data set. Our results are consistent with the theoretical confidence intervals and suggest that the Sample-and-Clean framework can produce significant improvements in accuracy compared to query processing without data cleaning and speed compared to data cleaning without sampling.

2. Big Data Linkage for Product Specification Pages

An increasing number of product pages are available from thousands of web sources, each page associated with a product, containing its attributes and one or more product identifiers. The sources provide overlapping information about the products, using diverse schemas, making web-scale integration extremely challenging. In this paper, we take advantage of the opportunity that sources publish product identifiers to perform big data linkage across sources at the beginning of the data integration pipeline, before schema alignment. To realize this opportunity, several challenges need to be addressed: identifiers need to be discovered on product pages, made difficult by the diversity of identifiers; the main product identifier on the page needs to be identified, made difficult by the many related products presented on the page; and identifiers across pages need to be resolved, made difficult by the ambiguity between identifiers across product categories. We present our RaF (Redundancy as Friend) solution to the problem of big data linkage for product specification pages, which takes advantage of the redundancy of identifiers at a global level, and the homogeneity of structure and semantics at the local source level, to effectively and efficiently link millions of pages of head and tail products across thousands of head and tail sources. We perform a thorough empirical evaluation of our RaF approach using the publicly available Dexter dataset consisting of 1.9M product pages from 7.1k sources of 3.5k websites, and demonstrate its effectiveness in practice.

3. Controlling False Discoveries During Interactive Data Exploration (claimed by William Spoth)

Recent tools for interactive data exploration significantly increase the chance that users make false discoveries. They allow users to (visually) examine many hypotheses and make inference with simple interactions, and thus incur the issue commonly known in statistics as the "multiple hypothesis testing error." In this work, we propose a solution to integrate the control of multiple hypothesis testing into interactive data exploration systems. A key insight is that existing methods for controlling the false discovery rate (such as FDR) are not directly applicable to interactive data exploration. We therefore discuss a set of new control procedures that are better suited for this task and integrate them in our system, QUDE. Via extensive experiments on both real-world and synthetic data sets we demonstrate how QUDE can help experts and novice users alike to efficiently control false discoveries.

4. DEX: Query Execution in a Delta-based Storage System

The increasing reliance on robust data-driven decision-making across many domains has made it necessary for data management systems to manage many thousands to millions of versions of datasets, acquired or constructed at various stages of analysis pipelines over time. Delta encoding is an effective and widely-used solution to compactly store a large number of datasets, that simultaneously exploits redundancies across them and keeps the average retrieval cost of reconstructing any dataset low. However, supporting any kind of rich retrieval or querying functionality, beyond single dataset checkout, is challenging in such storage engines. In this paper, we initiate a systematic study of this problem, and present DEX, a novel stand-alone delta-oriented execution engine, whose goal is to take advantage of the already computed deltas between the datasets for efficient query processing. In this work, we study how to execute checkout, intersection, union and t-threshold queries over record-based files; we show that processing of even these basic queries leads to many new and unexplored challenges and trade-offs. Starting from a query plan that confines query execution to a small set of deltas, we introduce new transformation rules based on the algebraic properties of the deltas, that allow us to explore the search space of alternative plans. For the case of checkout, we present a dynamic programming algorithm to efficiently select the optimal query plan under our cost model, while we design efficient heuristics to select effective plans that vastly outperform the base checkout-then-query approach for other queries. A key characteristic of our query execution methods is that the computational cost is primarily dependent on the size and the number of deltas in the expression (typically small), and not the input dataset versions (which can be very large). We have implemented DEX prototype on top of git, a widely used version control system. We present an extensive experimental evaluation on synthetic data with diverse characteristics, that shows that our methods perform exceedingly well compared to the baseline.

5. Data canopy: Accelerating exploratory statistical analysis (claimed by Poonam Kumari)

During exploratory statistical analysis, data scientists repeatedly compute statistics on data sets to infer knowledge. Moreover, statistics form the building blocks of core machine learning classification and filtering algorithms. Modern data systems, software libraries, and domain-specific tools provide support to compute statistics but lack a cohesive framework for storing, organizing, and reusing them. This creates a significant problem for exploratory statistical analysis as data grows: Despite existing overlap in exploratory workloads (which are repetitive in nature), statistics are always computed from scratch. This leads to repeated data movement and recomputation, hindering interactive data exploration.

We address this challenge in Data Canopy, where descriptive and dependence statistics are synthesized from a library of basic aggregates. These basic aggregates are stored within an in-memory data structure, and and are reused for overlapping data parts and for various statistical measures. What this means for exploratory statistical analysis is that repeated requests to compute different statistics do not trigger a full pass over the data. We discuss in detail the basic design elements in Data Canopy, which address multiple challenges: (1) How to decompose statistics into basic aggregates for maximal reuse? (2) How to represent, store, maintain, and access these basic aggregates? (3) Under different scenarios, which basic aggregates to maintain? (4) How to tune Data Canopy in a hardware conscious way for maximum performance and how to maintain good performance as data grows and memory pressure increases?

We demonstrate experimentally that Data Canopy results in an average speed-up of at least 10x after just 100 exploratory queries when compared with state-of-the-art systems used for exploratory statistical analysis.

6. Descriptive and Prescriptive Data Cleaning (claimed by Jason Kim)

Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations using some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some tar get report generated by transformations over multiple data sources. This creates a situation where the violations detected in the report are decoupled in space and time from the actual source of errors. In addition, applying the repair on the report would need to be repeated whenever the data sources change. Finally, even if repairing the report is possible and affordable, this would be of little help towards identifying and analyzing the actual sources of errors for future prevention of violations at the target. In this paper, we propose a system to address this decoupling. The system takes quality rules defined over the output of a transformation and computes explanations of the errors seen on the output. This is performed both at the target level to describe these errors and at the source level to prescribe actions to solve them. We present scalable techniques to detect, propagate, and explain errors. We also study the effectiveness and efficiency of our techniques using the TPC-H Benchmark for different scenarios and classes of quality rules.

7. Fusing data with correlations (claimed by William Spoth)

Many applications rely on Web data and extraction systems to accomplish knowledge-driven tasks. Web information is not curated, so many sources provide inaccurate, or conflicting information. Moreover, extraction systems introduce additional noise to the data. We wish to automatically distinguish correct data and erroneous data for creating a cleaner set of integrated data. Previous work has shown that a naive voting strategy that trusts data provided by the majority or at least a certain number of sources may not work well in the presence of copying between the sources. However, correlation between sources can be much broader than copying: sources may provide data from complementary domains (negative correlation), extractors may focus on different types of information (negative correlation), and extractors may apply common rules in extraction (positive correlation, without copying). In this paper we present novel techniques modeling correlations between sources and applying it in truth finding. We provide a comprehensive evaluation of our approach on three real-world datasets with different characteristics, as well as on synthetic data, showing that our algorithms outperform the existing state-of-the-art techniques.

8. Holistic Query Evaluation over Information Extraction Pipelines

We introduce holistic in-database query processing over information extraction pipelines. This requires considering the joint conditional distribution over generic Conditional Random Fields that uses factor graphs to encode extraction tasks. Our approach introduces Canopy Factor Graphs, a novel probabilistic model for effectively capturing the joint conditional distribution given a canopy clustering of the data, and special query operators for retrieving resolution information. Since inference on such models is intractable, we introduce an approximate technique for query processing and optimizations that cut across the integrated tasks for reducing the required processing time. Effectiveness and scalability are verified through an extensive experimental evaluation using real and synthetic data.

9. HomeRun: Scalable Sparse-Spectrum Reconstruction of Aggregated Historical Data (claimed by William Spoth)

Recovering a time sequence of events from multiple aggregated and possibly overlapping reports is a major challenge in historical data fusion. The goal is to reconstruct a higher resolution event sequence from a mixture of lower resolution samples as accurately as possible. For example, we may aim to disaggregate overlapping monthly counts of people infected with measles into weekly counts. In this paper, we propose a novel data disaggregation method, called HOMERUN, that exploits an alternative representation of the sequence and finds the spectrum of the target sequence. More specifically, we formulate the problem as so-called basis pursuit using the Discrete Cosine Transform (DCT) as a sparsifying dictionary and impose non-negativity and smoothness constraints. HOMERUN utilizes the energy compaction feature of the DCT by finding the sparsest spectral representation of the target sequence that contains the largest (most important) coefficients. We leverage the Alternating Direction Method of Multipliers to solve the resulting optimization problem with scalable and memory efficient steps. Experiments using real epidemiological data show that our method considerably outperforms the state-of-the-art techniques, especially when the DCT of the sequence has a high degree of energy compaction.

10. ICARUS: Minimizing Human Effort in Iterative Data Completion
(claimed by Carl Nuessle)

An important step in data preparation involves dealing with incomplete datasets. In some cases, the missing values are unreported because they are characteristics of the domain and are known by practitioners. Due to this nature of the missing values, imputation and inference methods do not work and input from domain experts is required. A common method for experts to fill missing values is through rules. However, for large datasets with thousands of missing data points, it is laborious and time consuming for a user to make sense of the data and formulate effective completion rules. Thus, users need to be shown subsets of the data that will have the most impact in completing missing fields. Further, these subsets should provide the user with enough information to make an update. Choosing subsets that maximize the probability of filling in missing data from a large dataset is computationally expensive. To address these challenges, we present ICARUS, which uses a heuristic algorithm to show the user small subsets of the database in the form of a matrix. This allows the user to iteratively fill in data by applying suggested rules based on their direct edits to the matrix. The suggested rules amplify the users’ input to multiple missing fields by using the database schema to infer hierarchies. Simulations show ICARUS has an average improvement of 50% across three datasets over the baseline system. Further, in-person user studies demonstrate that naive users can fill in 68% of missing data within an hour, while manual rule specification spans weeks.

11. Incremental knowledge base construction using deepdive
(claimed by Ting Xie)

Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the tradeoff space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.

12. Northstar: An Interactive Data Science System
(claimed by Poojitha Alahari)

In order to democratize data science, we need to fundamentally rethink the current analytics stack, from the user interface to the “guts.” Most importantly, enabling a broader range of users to unfold the potential of (their) data requires a change in the interface and the “protection” we offer them. On the one hand, visual interfaces for data science have to be intuitive, easy, and interactive to reach users without a strong background in computer science or statistics. On the other hand, we need to protect users from making false discoveries. Furthermore, it requires that technically involved (and often boring) tasks have to be automatically done by the system so that the user can focus on contributing their domain expertise to the problem. In this paper, we present Northstar, the Interactive Data Science System, which we have developed over the last 4 years to explore designs that make advanced analytics and model building more accessible.

13. Query Optimization for Dynamic Imputation
(claimed by Darshana Balakrishnan)

Missing values are common in data analysis and present a usability challenge. Users are forced to pick between removing tuples with missing values or creating a cleaned version of their data by applying a relatively expensive imputation strategy. Our system, ImputeDB, incorporates imputation into a cost-based query optimizer, performing necessary imputations on-the-fly for each query. This allows users to immediately explore their data, while the system picks the optimal placement of imputation operations. We evaluate this approach on three real-world survey-based datasets. Our experiments show that our query plans execute between 10 and 140 times faster than first imputing the base tables. Furthermore, we show that the query results from on-the-fly imputation differ from the traditional base-table imputation approach by 0–8%. Finally, we show that while dropping tuples with missing values that fail query constraints discards 6–78% of the data, on-the-fly imputation loses only 0–21%

14. SMOKE: Fine-grained Lineage at Interactive Speed
(claimed by Darshana Balakrishnan)

Data lineage describes the relationship between individual input and output data items of a workflow and is an integral ingredient for both traditional (e.g., debugging or auditing)and emergent (e.g., explanations or cleaning) applications. The core, long-standing problem that lineage systems need to address—and the main focus of this paper—is to quickly capture lineage across a workflow in order to speed up future queries over lineage. Current lineage systems, however, either incur high lineage capture overheads, high lineage query processing costs, or both. In response, developers resort to manual implementations of applications that, in principal, can be expressed and optimized in lineage terms. This paper describes SMOKE, an in-memory database engine that provides both fast lineage capture and lineage query processing. To do so, SMOKE tightly integrates the lineage capture logic into physical database operators; stores lineage in efficient lineage representations; and employs optimizations if future lineage queries are known up-front. Our experiments on microbenchmarks and realistic workloads show that SMOKE reduces the lineage capture overhead and lineage query costs by multiple orders of magnitude as compared to state-of-the-art alternatives. On real-world applications, we show that SMOKE meets the latency requirements of interactive visualizations (e.g., < 150ms) and outperforms hand-written implementations of data profiling primitives.

15. Synthesizing Type-Detection Logic for Rich Semantic Data Types using Open-source Code
(claimed by Carl Nuessle)

Given a table of data, existing systems can often detect basic atomic types (e.g., strings vs. numbers) for each column. A new generation of data-analytics and data-preparation systems are starting to automatically recognize rich semantic types such as date-time, email address, etc., for such metadata can bring an array of benefits including better table understanding, improved search relevance, precise data validation, and semantic data transformation. However, existing approaches only detect a limited number of types using regular-expression-like patterns, which are often inaccurate, and cannot handle rich semantic types such as credit card and ISBN numbers that encode semantic validations (e.g., checksum).

We developed AUTOTYPE from open-source repositories like GitHub. Users only need to provide a set of positive examples for a target data type and a search keyword, our system will automatically identify relevant code, and synthesize type-detection functions using execution traces. We compiled a benchmark with 112 semantic types, out of which the proposed system can synthesize code to detect 84 such types at a high precision. Applying the synthesized type-detection logic on web table columns have also resulted in a significant increase in data types discovered compared to alternative approaches.

16. Wrangler: Interactive visual specification of data transformation scripts
(claimed by Poonam Kumari)

Though data analysis tools continue to improve, analysts still expend an inordinate amount of time and effort manipulating data and assessing data quality issues. Such "data wrangling" regularly involves reformatting data values or layout, correcting erroneous or missing values, and integrating multiple data sources. These transforms are often difficult to specify and difficult to reuse across analysis tasks, teams, and tools. In response, we introduce Wrangler, an interactive system for creating data transformations. Wrangler combines direct manipulation of visualized data with automatic inference of relevant transforms, enabling analysts to iteratively explore the space of applicable operations and preview their effects. Wrangler leverages semantic data types (e.g., geographic locations, dates, classification codes) to aid validation and type conversion. Interactive histories support review, refinement, and annotation of transformation scripts. User study results show that Wrangler significantly reduces specification time and promotes the use of robust, auditable transforms instead of manual editing.