What's Wrong With Probabilistic Databases? (Part 1)

A large chunk of my graduate work has to do with a subfield of database research called probabilistic databases.  

The idea is simple: Most databases store precise values.  A row of a normal database might indicate that Bob's SS# is 199-..-....

A probabilistic database allows users to provide data specified by a probability distribution.  Perhaps Bob was a little sloppy filling out a form, and the OCR software couldn't tell whether he intended to put down 199-... or 149-...

It might not be able to determine a precise value for that slot, but it can tell you that Bob's SS# is either 199-... (with some probability) or 149-... (with some other probability).  The database can store both of these.  When you write queries over this data, you treat them as normal, ordinary queries.  When you get an answer, you get an answer with some probability distribution: If you're asking for someone with SS# 199-..., then you'll get the answer Bob (with the corresponding probability).  

Probabilistic DBs are pretty cool.  Unfortunately, they haven't managed to get much traction beyond the research community.  They've been applied here and there (including in some of my own work), but as of this time, no major DB vendor supports ProbDB functionality.  Why?

To answer that question, we first have to understand why people would use a probabilistic database.  We have to start with sources of uncertainty.  Most probabilistic database work attempts to address one (or both) of two types of uncertainty:

Although the underlying techniques used to query both types of data are extremely similar, the way users approach both of these types of data are quite different.  Over the next two weeks, I'll talk about each of these, and try to understand what's keeping people from using probabilistic databases to address these problems.