Never tell me the odds

A while back, I wrote a series of articles on probabilistic databases and their shortcomings.  As a quick recap, probabilistic databases are databases that let you express data as probability distributions instead of precise values.  Such representations have a number of potential applications, such as developing and analyzing hypothetical "what-if" scenarios, or avoiding information loss due to errors in data (e.g., if the data comes from OCR software).

One of the conclusions that I reached was that people don't like working with probabilities.  Qualitative results are typically more meaningful to an end-user than quantitative ones.  Worse still, unless your data comes from some sort of automated source (like OCR software), how probabilities should be assigned is often unclear.  This is something statisticians get paid big money to do.  Expecting end-users to arbitrarily assign probabilities to data that they're not completely certain about is silly.

So... where does this leave us?  Well, fortunately, a lot of work in the probabilistic database area (especially more recent stuff like [1,2,3]) leaves the exact nature of the underlying probability distributions up to the end-user.  Conceptually, there's nothing to stop us from sticking something more qualitative in their place.  The question is: what?

Here's one thought.  Users may not have a good sense of how to assign precise probabilities, but they can certainly tell you whether a data value is definitively correct, or just a guess (maybe even something more, like an "educated guess" or a "possibly incorrect fact", but let's keep things simple).  In fact, you can get lots of users to give you this kind of information -- different users might even have differing guesses or "definitive" values.  When queries are posed on the data, you might get many possible outputs -- different guesses (or definitive values) can each produce a different query output.  Now each output can be annotated with the set of users who support (or contradict) it.
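
To make that a bit more concrete, here's a minimal sketch of the annotation step.  Everything in it is made up for illustration: the claims format, the "definitive"/"guess" labels as plain strings, the annotate_outputs function, and the example query.  It handles a single uncertain value and only tracks supporters, not contradictors.

```python
from collections import defaultdict

# Hypothetical claims about one uncertain value: (user, value, kind),
# where kind is either "definitive" or "guess".
claims = [
    ("alice", 42, "definitive"),
    ("bob", 40, "guess"),
    ("carol", 42, "guess"),
    ("dave", 17, "guess"),
]


def annotate_outputs(claims, query):
    """Run `query` on every claimed value and group supporters by output."""
    outputs = defaultdict(lambda: {"definitive": set(), "guess": set()})
    for user, value, kind in claims:
        # Each claim supports whichever output the query produces for it.
        outputs[query(value)][kind].add(user)
    return dict(outputs)


# Example query: is the value at least 40?
result = annotate_outputs(claims, lambda v: v >= 40)
```

Here the output True ends up supported by alice (definitively) and by bob and carol (as guesses), while False is supported only by dave's guess.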

These annotations effectively form a lattice over the outputs, giving us at least a partial order.  We can do things like give a skyline of the most likely answers.  We can use techniques like web of trust to find answers from people a user is likely to support, or use various measurements of past accuracy to identify users who are likely to provide accurate guesses.  If we have a way of validating guesses (e.g., ground truth eventually becomes available), users can also be ranked.  Low-performing users might even be identified and contacted with suggestions about how to improve their guesses.
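
Here's a rough sketch of two of these ideas, under some loud assumptions.  For the skyline, I'm assuming one output beats another when its supporters are a strict superset of the other's (that's just one possible ordering, not the only one).  For ranking, I'm assuming accuracy is simply the fraction of a user's guesses that match ground truth.  The names skyline and rank_users, and all of the data, are hypothetical.

```python
def skyline(support):
    """Return outputs whose supporter set is not strictly contained in another's."""
    return sorted(
        out
        for out, users in support.items()
        # `users < other` is Python's proper-subset test on sets.
        if not any(users < other for other in support.values())
    )


def rank_users(guesses, truth):
    """Rank users by the fraction of their checkable guesses that were right."""
    scores = {}
    for user, user_guesses in guesses.items():
        checked = [(k, v) for k, v in user_guesses.items() if k in truth]
        if checked:  # users with nothing validated yet are left unranked
            scores[user] = sum(truth[k] == v for k, v in checked) / len(checked)
    # Best score first; ties broken alphabetically by user name.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Hypothetical supporter sets for four candidate query outputs.
support = {
    "answer_a": {"alice", "bob", "carol"},
    "answer_b": {"alice", "bob"},
    "answer_c": {"dave"},
    "answer_d": set(),
}
best = skyline(support)

# Hypothetical past guesses, plus the ground truth that later became available.
guesses = {
    "alice": {"q1": 1, "q2": 2},
    "bob": {"q1": 1, "q2": 3},
    "carol": {"q3": 5},
}
truth = {"q1": 1, "q2": 2}
ranking = rank_users(guesses, truth)
```

With this data, answer_b and answer_d drop out because answer_a's supporters include all of theirs, leaving answer_a and answer_c incomparable on the skyline.  Carol is left out of the ranking since none of her guesses can be checked yet.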


[1] Green, T.J. et al. 2007. Provenance Semirings. In Proceedings of PODS '07 (New York, New York, USA, 2007), 31–40.

[2] Huang, J. et al. 2009. MayBMS: a probabilistic database management system. In Proceedings of SIGMOD '09 (New York, New York, USA, 2009), 1071.

[3] Kennedy, O. and Koch, C. 2010. PIP: A database system for great and small expectations. In Proceedings of ICDE 2010 (2010), 157–168.