What's Wrong With Probabilistic Databases? (Part 1)

{rants}

A large chunk of my graduate work has to do with a subfield of database research called probabilistic databases.

The idea is simple: Most databases store precise values. A row of a normal database might indicate that Bob's SS# is 199-..-....

A probabilistic database allows users to provide data specified by a probability distribution. Perhaps Bob was a little sloppy filling out a form, and the OCR software couldn't tell whether he intended to put down 199-... or 149-...

It might not be able to determine a precise value for that slot, but it can tell you that Bob's SS# is either 199-... (with some probability) or 149-... (with some other probability). The database can store both of these. When you write queries over this data, you treat them as normal, ordinary queries. When you get an answer, you get an answer with some probability distribution: If you're asking for someone with SS# 199-..., then you'll get the answer Bob (with the corresponding probability).

Probabilistic DBs are pretty cool. Unfortunately, they haven't managed to get much traction beyond the research community. They've been applied here and there (including in some of my own work), but as of this time, no major DB vendor supports ProbDB functionality. Why?

To answer that question, we first have to understand why people would use a probabilistic database. We have to start with sources of uncertainty. Most probabilistic database work attempts to address one (or both) of two types of uncertainty:

Noisy Data - Your data gathering process is flawed (e.g., using OCR software or web-scraping techniques). The data you have contains typos, omissions, or other mistakes. A thorough data-cleaning could potentially fix these errors, but you lack the necessary manpower or resources. That is to say that a hypothetical 'clean' version exists. When the data is queried, you want to find the query results most likely to correspond to the query results on this clean version.
Missing Knowledge - The data being queried is derived from a model, and has no corresponding 'clean' version. There are many possible outcomes, each with a varying likelihood. Queries over this type of data are typically the database to make a prediction, and you're typically looking for an expectation or a percentile result.

Although the underlying techniques used to query both types of data are extremely similar, the way users approach both of these types of data are quite different. Over the next two weeks, I'll talk about each of these, and try to understand what's keeping people from using probabilistic databases to address these problems.