What's Wrong With Probabilistic Databases? (Part 3)

Two weeks ago, I introduced the idea of probabilistic databases.  In this last installment of this little miniseries, I'm going to talk about the second major use of probabilistic databases: dealing with modeled data.

Unlike last week's discussion of missing or erroneous data, where there is some definitive ground truth, a probabilistic model attempts to capture a spread of possible outcomes.  Not only is there no ground truth, there usually never will be (at least not as long as questions are still being asked).

That's not to say that there's no overlap between modeled and erroneous data, just that there's a different mentality about how the data is used.  With modeled data, queries encode scenarios ("what would happen if...?") rather than questions ("what is...?").

Put another way, a probabilistic database has to take its uncertain inputs from somewhere.  At some level, there has to be a probabilistic model (or more likely, several) passed as input to the query.  Even if the probabilistic database is capable of filling in any parameters that the model needs, someone still has to sit down and figure out the general framework of the model.  That role generally falls to someone with a background in statistics.

This is where the problems come in.  The machinery required to get even a relatively simple model off the ground is usually pretty extensive.  Even something as simple as a Gaussian distribution can require days, or even weeks, of validation against test data.  By the time that work is done, the model itself is easy to reason about.  So, if all you want is to ask questions about your nice, simple, elegant model, you're not going to want or need the complex machinery of a database.
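To make that concrete, here's a minimal sketch of what fitting and validating a simple Gaussian model might look like in plain Python.  The synthetic data, the parameter estimates, and the choice of a Kolmogorov-Smirnov test are all illustrative assumptions on my part, not a prescription; the point is just that none of it involves a database.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements (in practice, real observations of whatever is being modeled).
rng = np.random.default_rng(0)
observations = rng.normal(loc=50.0, scale=8.0, size=2000)

# Hold out part of the data for validation.
train, test = observations[:1500], observations[1500:]

# "Fitting" a Gaussian is just estimating its two parameters.
mu, sigma = train.mean(), train.std(ddof=1)

# Validate the fit against the held-out data, e.g., with a Kolmogorov-Smirnov test.
statistic, p_value = stats.kstest(test, 'norm', args=(mu, sigma))
print(f"fit: mu={mu:.2f}, sigma={sigma:.2f}, KS p-value={p_value:.3f}")

# A high p-value is (weak) evidence that the Gaussian is an adequate fit.
# Real validation means repeating checks like this against many slices of
# data, which is where the days or weeks go.
```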

That said, where the machinery does come in handy is when you need to integrate multiple models, or to integrate your model with existing data.  A simple example I used in a paper a while back was for capacity planning: One (simple) model gives you the expected capacity (e.g., CPU) of a server cluster at any given time over the next few months (e.g., accounting for the probability of failures), while a second (also simple) model gives you the expected demand on that cluster.  Each of these can be tested, analyzed, and validated independently.  Then, the two models can be combined into a single model for the probability of having insufficient capacity on any given day.  That relationship can be expressed as an extremely simple SQL query, and then executed efficiently on a probabilistic database.
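I won't reproduce the query from the paper here, but as a rough illustration of what the combined model computes, here is a Monte Carlo sketch with entirely made-up parameters (cluster size, failure probability, demand curve).  A probabilistic database would let you state this combination declaratively, as the simple SQL query mentioned above, rather than hand-writing the sampling loop.

```python
import numpy as np

rng = np.random.default_rng(1)
DAYS, TRIALS = 90, 10_000            # planning horizon, Monte Carlo samples
SERVERS, CPU_PER_SERVER = 40, 32     # hypothetical cluster: 40 servers, 32 cores each

# Model 1 (capacity): each server independently fails on a given day with some
# small probability, reducing the available CPU.
p_fail = 0.02
alive = rng.binomial(SERVERS, 1 - p_fail, size=(TRIALS, DAYS))
capacity = alive * CPU_PER_SERVER

# Model 2 (demand): expected demand grows over the horizon, with noise.
base, growth, noise = 900.0, 3.0, 80.0
demand = base + growth * np.arange(DAYS) + rng.normal(0, noise, size=(TRIALS, DAYS))

# Combined model: probability of insufficient capacity on each day.
p_shortfall = (demand > capacity).mean(axis=0)
print("worst day:", int(p_shortfall.argmax()),
      "P(shortfall) ~", round(float(p_shortfall.max()), 3))
```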

In short, probabilistic databases can be used as a sort of scaffolding for combining multiple data sources (both real and modeled) to build more complex models.

So what's missing?

Given all of this, I think probabilistic database techniques could be adapted to this use case fairly easily.  The only real challenge standing in our way right now is interfaces: how do we present end-users with an interface that is not only as powerful as the tools they're used to working with, but also similar enough that the learning curve is minimized?