Note: This was originally an abstract submitted to CIDR. It's based on numerous discussions with lots of people, including but not limited to: Ying Yang, Niccolò Meneghetti, Poonam Kumari, Will Spoth, Aaron Huber, Arindam Nandi, Boris Glavic, Vinayak Karuppasamy, Dieter Gawlick, Zhen Hua-Liu, Beda Hammerschmidt, Ronny Fehling, and Lisa Lu.
Since their earliest days, databases have held themselves to a strict invariant: never give the user a wrong answer. So ingrained is this invariant in the psyche of the database community that those who violate it really want you to know you're committing sacrilege against Codd. Examples include adding features to SQL to support continuous data (e.g., MauveDB), to query Bayesian models (e.g., BayesStore), to tell the database how accurate you want your results to be (e.g., DBO), or to explicitly ask for specific types of summaries (e.g., MayBMS).
Sadly, by trying to enforce perfection in the database itself, database systems fail to acknowledge that the data being stored is rarely precise, correct, valid, or unambiguous. This emphasis on certain, deterministic data forces the use of complex, hard-to-manage extract-transform-load pipelines that emit deceptively certain, "truthy" data rather than acknowledging ambiguity or error. The resulting data is often (incorrectly) interpreted as fact by naive users who have no reason to believe otherwise. The problem is getting worse: as more decisions are automated, even small truthiness errors can drastically impact people's lives. Data errors in credit reports can cause perfectly honest people to be denied access to credit. Similarly, name-matching errors combined with rigid protocols have led to an 8-year-old being flagged as a terrorist.
System designers must choose between presenting erroneous data as truth and discarding potentially useful information, and many choose the former. The database community has already begun treating uncertainty as a first-class primitive in databases. Unfortunately, uncertainty also requires us to rethink how humans interact with data.
Here, industry has done significantly better than the database research community. Personal information managers like Apple Calendar and the iOS Phone app increasingly use facts data-mined from email to automatically populate their contacts and calendar databases: the OS X Calendar app finds events in your email and schedules them, while the iOS Phone app uses phone numbers it finds in your email to predict who's calling you.
Both examples illustrate a number of good design elements:
- The interface keeps uncertain facts distinct or clearly marks them as guesses.
  - The Calendar app uses greyed-out boxes and a separate calendar for guessed events.
  - The Phone app explicitly prefixes guessed names with "Maybe: ".
- The interface includes intuitive provenance mechanisms that help put the extracted information in context.
  - Both apps provide a "Show In Mail" link in the detailed information view.
- The interface includes overt feedback options to help the user correct or confirm uncertain data.
  - "Add To Calendar" or "Ignore".
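To make these design elements concrete, here is a minimal Python sketch of a guessed fact that carries all three: a visible "Maybe: " marker, a provenance link back to the source message, and an explicit confirm/reject action. All the names here (`GuessedFact`, `confirm`, `reject`, the `mail://` source string) are hypothetical, invented for illustration; they do not correspond to any real Apple API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GuessedFact:
    """A data-mined fact that remembers it is only a guess."""
    value: str                        # e.g., a caller name or event title
    source: str                       # provenance: where the guess came from
    confirmed: Optional[bool] = None  # None = still an unconfirmed guess

    def display(self) -> str:
        # Mark unconfirmed facts, like the Phone app's "Maybe: " prefix.
        return self.value if self.confirmed else f"Maybe: {self.value}"

    def confirm(self) -> None:        # the "Add To Calendar"-style action
        self.confirmed = True

    def reject(self) -> None:         # the "Ignore"-style action
        self.confirmed = False

# Usage: a caller name guessed from an email message.
caller = GuessedFact("Jane Doe", source="mail://message/1234")
print(caller.display())   # still prefixed with "Maybe: "
caller.confirm()
print(caller.display())   # shown plainly once the user confirms it
```

The key design choice mirrored here is that the guess never silently becomes a fact: the uncertainty marker only disappears after explicit user feedback, and the provenance string is retained so a "Show In Mail"-style link remains possible.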
We as a database community need to start adapting these techniques to more general data management settings. The presentation layer isn't the only problem: identifying sources of uncertainty requires developers to invest substantial upfront effort rethinking how they write code, and we need to make it worth their while. For example, we might provide infrastructure support to help developers draw generalizations from ambiguous choices, streamline imperative language support for uncertainty, or define higher-order data transformation primitives.
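As one illustration of what "imperative language support for uncertainty" might look like, here is a toy Python sketch of an uncertain value represented by samples, so that ordinary arithmetic propagates uncertainty automatically. This is an assumption-laden sketch of the general sampling idea, not any existing library's API.

```python
import random

class Uncertain:
    """A value known only approximately, represented by samples."""

    def __init__(self, samples):
        self.samples = list(samples)

    def __add__(self, other):
        # Propagate uncertainty through addition, sample by sample.
        other_samples = (other.samples if isinstance(other, Uncertain)
                         else [other] * len(self.samples))
        return Uncertain(a + b for a, b in zip(self.samples, other_samples))

    def expectation(self):
        return sum(self.samples) / len(self.samples)

    def probability(self, pred):
        # Estimate P[pred(x)] over the samples.
        return sum(1 for s in self.samples if pred(s)) / len(self.samples)

# Usage: a noisy sensor reading, adjusted by a known offset.
reading = Uncertain(random.gauss(20.0, 2.0) for _ in range(1000))
adjusted = reading + 5.0
print(adjusted.expectation())                  # roughly 25
print(adjusted.probability(lambda x: x > 20))  # close to 1
```

The point of such a primitive is that the developer writes plain imperative code (`reading + 5.0`) while the runtime keeps the ambiguity alive, so it can later be surfaced to the user rather than collapsed into a single truthy number.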
In summary, the illusion of accuracy in database query results can no longer be maintained. Database systems must learn to acknowledge errors in source data, and to use this information to communicate ambiguity to users effectively. Moreover, this must happen without overwhelming users, without breaking the decades-old abstractions that people understand and use day-to-day in their workflows, and without requiring a statistics background from every user.