Data lakes are becoming increasingly popular. However, many in the database community have been hesitant to applaud this development, as a lack of curation can easily turn a data lake into a data swamp. These two outcomes mark the extreme ends of an ongoing debate about data curation: the NoSQL view, in which curation is handled entirely ad hoc on a per-project basis, versus the ETL view, in which analysts must be shielded from the harsh realities of real-world data. In practice, requirements often lie somewhere in between. Completely isolating analysts from the low veracity, high volatility, and high variability of collected data imposes high upfront costs for organizational efforts that may never be needed. Conversely, shunting the effort of curation onto analysts can lead to duplicated work when multiple analysts invariably attempt to curate the same data independently.

The intuitive data interpretation project seeks to transform the binary decision between upfront and lazy curation into a gradient by making ETL processes operate “On-Demand”. An ETL pipeline with On-Demand capabilities identifies potential tradeoffs between curation efforts (for example, updating a log-file extraction task for a new log format, or adjusting batch-processing frequencies) and the effects of those efforts on the reliability of queries over the information extracted by the ETL process. In the steady state, an On-Demand ETL process does not differ from traditional ETL. However, tracking information reliability relative to a given query workload allows curation tasks to be performed adaptively, deferring them until they are truly necessary.

We have created a system called Mimir (http://github.com/okennedy/mimir). Mimir generalizes materialized views into a construct called “Lenses” that associates potential sources of error with matching curation efforts. Mimir uses a form of provenance to trace each query result to the set of Lenses that could affect it. This provenance is initially used to annotate query results with metrics (margins of error, outliers, standard deviation, etc.) that outline the potential effects of data and process uncertainty on the result. If the user deems the results too uncertain, Mimir then uses the provenance to rank potential curation tasks by their cost and their ability to reduce uncertainty in the query outputs. Our proof-of-concept implementation of Mimir demonstrates a Lens for enforcing domain constraints; Lenses for schema matching and sampling are under development. For these Lenses, curation tasks manipulate the data being queried. In the next phase of the project, we propose to explore Lenses for which curation tasks manipulate the information extraction process itself.

As a focusing example, we will study the extraction of structured information from unstructured formats such as log data. Processes that extract structured information from application logs typically need to evolve alongside the application. Errors in this setting arise as applications add new types of log messages or modify existing message formats, or when corner cases go unhandled in the extraction process. As errors arise, the extraction process must be updated.
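To make these mechanics concrete, the sketch below shows (in Python, not Mimir's actual API) how a Lens wrapping a regex-based log extractor might track which output rows the current pattern fails to cover, annotate the extracted view with a coverage-style quality metric, and rank candidate curation tasks by estimated benefit per unit cost. All names (LogLens, CurationTask, rank_curation_tasks), the log formats, and the cost model are hypothetical.

```python
import re
from dataclasses import dataclass, field

@dataclass
class CurationTask:
    description: str
    cost: float               # estimated curation effort (hypothetical units)
    rows_repaired: int = 0    # uncertain rows this task would repair

@dataclass
class LogLens:
    """Wraps a regex-based extractor and remembers which output rows it could not parse."""
    pattern: re.Pattern
    rows: list = field(default_factory=list)
    uncertain: list = field(default_factory=list)   # indices of rows the pattern failed on

    def extract(self, lines):
        for i, line in enumerate(lines):
            m = self.pattern.fullmatch(line)
            if m:
                self.rows.append(m.groupdict())
            else:
                # Keep a placeholder row so provenance still points at the raw line.
                self.rows.append({"raw": line})
                self.uncertain.append(i)
        return self.rows

    def annotate(self):
        """A coverage metric summarizing how much of the view the stale pattern still handles."""
        total = len(self.rows)
        covered = total - len(self.uncertain)
        return {"rows": total,
                "uncertain_rows": len(self.uncertain),
                "coverage": covered / total if total else 1.0}

def rank_curation_tasks(tasks):
    # Prefer tasks that repair the most uncertain rows per unit of estimated cost.
    return sorted(tasks, key=lambda t: t.rows_repaired / t.cost, reverse=True)

if __name__ == "__main__":
    lens = LogLens(re.compile(r"OOM pid=(?P<pid>\d+) bytes=(?P<bytes>\d+)"))
    lens.extract([
        "OOM pid=42 bytes=1048576",
        "OOM pid=7 bytes=2097152 cgroup=web",   # evolved format the pattern misses
        "disk full on /var/log",                # message type with no extraction rule
    ])
    print(lens.annotate())
    for task in rank_curation_tasks([
        CurationTask("update the OOM pattern for the new 'cgroup' field", cost=1.0, rows_repaired=1),
        CurationTask("write an extraction rule for disk errors", cost=3.0, rows_repaired=1),
    ]):
        print(task.description)
```

In this sketch the ranking only counts rows repaired per unit cost; a fuller version would weight each uncertain row by how much it actually contributes to the uncertainty of the registered query workload.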
As part of the proposed work, we plan to address three specific challenges arising from Lenses that model evolving processes.

First, for larger source data, efficient query processing requires the extracted information to be maintained in a materialized view. As the Lens evolves, so too will its materialized output. However, minor changes to the Lens may not require full re-materialization. Our initial goal will be to develop a form of incremental view maintenance that reacts to changes in the query rather than changes in the data.

Second, as the base data evolves, information quality will degrade as the extraction process grows progressively more out of date. Our goal is to minimize this degradation by adopting approaches such as the “Flexible Schema Data” management techniques being pioneered by Zhen Hua Liu, together with machine learning techniques and careful design patterns. For example, consider a pattern that matches and extracts structure from out-of-memory error log messages. If the format of these messages changes (e.g., to add a new field), can we still use the old pattern to extract approximate information from the new messages? Even if the new format cannot be parsed completely, it may still be possible to extract specific features from it (e.g., that the message describes a memory error); a minimal sketch of this fallback idea appears at the end of this section.

Finally, time permitting, we will explore the idea of post-hoc query feedback. As a Lens evolves, queries written for and run against earlier incarnations of the Lens may become invalid. For example, changes in information granularity may invalidate queries written over earlier versions of the data, or cause them to produce results with unexpectedly high levels of uncertainty. We would like to allow users to register queries for uncertainty monitoring: as the Lens evolves, and as errors are introduced and resolved, the system will notify the user if the margin of error on the user’s query results becomes too large.
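Returning to the second challenge, the sketch below illustrates one possible form of graceful degradation for an evolving out-of-memory log format: if the exact pattern no longer matches, coarser fallback patterns recover individual features (such as the fact that the line still describes a memory error, or its process id), and the fields that cannot be recovered are flagged as uncertain. The patterns, field names, and example messages are hypothetical and not drawn from any real log format.

```python
import re

# Exact pattern for the old out-of-memory message format (hypothetical).
FULL_OOM = re.compile(r"OOM pid=(?P<pid>\d+) bytes=(?P<bytes>\d+)$")

# Coarser fallbacks: each recovers one feature even if the full format has changed.
FALLBACKS = {
    "is_memory_error": re.compile(r"\bOOM\b|out of memory", re.IGNORECASE),
    "pid":             re.compile(r"pid=(?P<pid>\d+)"),
}

def extract_oom(line):
    """Return (record, uncertain_fields) for a single log line."""
    m = FULL_OOM.match(line)
    if m:
        # The old pattern still applies: every field is extracted with full confidence.
        return {"is_memory_error": True, **m.groupdict()}, set()
    # The full pattern failed: recover whatever features the fallbacks still find.
    record = {"is_memory_error": bool(FALLBACKS["is_memory_error"].search(line))}
    pid = FALLBACKS["pid"].search(line)
    record["pid"] = pid.group("pid") if pid else None
    record["bytes"] = None
    uncertain = {f for f, v in record.items() if v is None}
    return record, uncertain

if __name__ == "__main__":
    old_format = "OOM pid=42 bytes=1048576"
    new_format = "out of memory: killed pid=7 (cgroup=web)"   # evolved message format
    for line in (old_format, new_format):
        print(extract_oom(line))
```

For the evolved message, the sketch still recovers that the line is a memory error and its process id, while marking the missing byte count as uncertain; a Lens built this way could propagate that per-field uncertainty into downstream query annotations rather than silently dropping the row.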