Two weeks ago, I introduced the idea of probabilistic databases. In this final installment of this little miniseries, I'm going to talk about the second major use of probabilistic databases: dealing with modeled data.
Unlike last week, when we talked about missing or erroneous data and there was some definitive ground truth, a probabilistic model attempts to capture a spread of possible outcomes. Not only is there no ground truth, there usually never will be (at least not so long as questions are still being asked).
That's not to say that there's no overlap between modeled and erroneous data, just that there's a different mentality about how this data is used. In this case, queries encode scenarios rather than questions.
That is to say that a probabilistic database must take its uncertain inputs from somewhere. At some level, there has to be a probabilistic model (or more likely, several) passed as input to the query. Even if the probabilistic database is capable of filling in any parameters that the model needs, someone still had to sit down and figure out the general framework of the model. This role generally falls to someone with a background in statistics.
This is where the problems come in. The machinery required to get even a relatively simple model off the ground is usually pretty extensive. Even something as simple as a Gaussian distribution can require days, or even weeks, of validation against test data. So, if all you want to do is ask questions about your nice, simple, elegant model, you're not going to want or need the complex machinery of a database.
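To make that concrete, here is a minimal sketch of what fitting and validating even a Gaussian model looks like. Everything here is hypothetical: the data is synthetic, and real validation would involve far more than a single held-out likelihood check.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical training and held-out samples (say, daily CPU demand).
train = [random.gauss(50, 10) for _ in range(1000)]
test = [random.gauss(50, 10) for _ in range(200)]

# Fit the Gaussian's two parameters from the training data.
mu = statistics.mean(train)
sigma = statistics.stdev(train)

def avg_log_likelihood(xs, mu, sigma):
    """Average log-likelihood of xs under Normal(mu, sigma)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu)**2 / (2 * sigma**2) for x in xs) / len(xs)

# Validate: held-out likelihood should be close to training likelihood.
# A large gap would suggest the fitted model doesn't generalize.
ll_train = avg_log_likelihood(train, mu, sigma)
ll_test = avg_log_likelihood(test, mu, sigma)
print(abs(ll_train - ll_test) < 0.5)
```

And that's the trivial case: one distribution, one fit, one check. Multiply this by every parameter and assumption in a realistic model and the weeks of validation add up quickly.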
That said, the machinery does come in handy when you need to integrate multiple models, or to integrate your model with existing data. A simple example I used in a paper a while back was for capacity planning: One (simple) model gives you the expected capacity (e.g., CPU) of a server cluster at any given time over the next few months (e.g., accounting for the probability of failures), while a second (also simple) model gives you the expected demand on that cluster. Each of these can be tested, analyzed, and independently validated. Then, these models can be combined to provide a single model for the probability of having insufficient capacity on any given day. This relationship can be represented as an extremely simple SQL query, and then executed efficiently on a probabilistic database.
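The capacity-planning combination above can be sketched directly as a Monte Carlo computation. All of the distributions and numbers below are made up for illustration; the point is that each model is sampled independently, and the combination — the analogue of a `SELECT ... WHERE demand > capacity` query — is a single comparison per sample:

```python
import random

random.seed(42)

def sample_capacity():
    """Toy capacity model: 100 servers, each independently up with
    probability 0.95, each contributing 8 CPU cores when up."""
    servers_up = sum(random.random() < 0.95 for _ in range(100))
    return servers_up * 8

def sample_demand():
    """Toy demand model: normally distributed CPU demand (in cores)."""
    return random.gauss(700, 40)

# Combine the two independently validated models: estimate the
# probability that demand exceeds capacity on a given day.
trials = 10_000
shortfalls = sum(sample_demand() > sample_capacity() for _ in range(trials))
p_insufficient = shortfalls / trials
print(round(p_insufficient, 3))
```

Each of the two `sample_*` functions could be validated on its own, exactly as the text describes, and then swapped out for a better model without touching the combining logic.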
In short, probabilistic databases can be used as a sort of scaffolding to combine multiple data sources (both real and modeled) together, to build more complex models.
So what's missing?
- Language Support: Although many statisticians are comfortable working with SQL, this is not the case for everyone who uses probabilistic models. Languages like R, Python, Java, and C++ are far more common, and less alien to researchers and model-builders. There has already been some work on integrating these languages with database techniques. The Scala guys are working on improvements that let you translate code written using certain fragments of Scala into equivalent database queries. There's no reason that we can't do something similar with Python. Similarly, there have been numerous efforts to translate R into some form of relational algebra.
- Efficiency: Over the years, database research has become synonymous with work on monolithic one-size-fits-all database systems. Most work with nontrivial models already requires extremely expensive Monte Carlo methods, and model-builders are often reluctant to hand their carefully hand-optimized code over to an automated system that they perceive (often correctly) as being less efficient. We need ways to give them good performance out of the box, with a minimum amount of coding overhead and setup. If this performance is insufficient, we need to make it possible for them to seamlessly transition to an environment where they can fine-tune the evaluation strategy, again, without needing to learn anything that they don't already know.
- Interfaces: R provides a number of useful analytic and visualization tools right out of the box. While I suspect that no probabilistic database will be quite as complete in the short term, we need to get there before people will start looking seriously at probabilistic databases as an effective analytics and probabilistic modeling tool.
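On the language-support point above, the reason translation is plausible at all is that everyday Python code over collections already mirrors relational algebra, so a translator has a natural target. A hypothetical sketch (the `servers` relation and the emitted SQL are invented for illustration; this is not a real translation library):

```python
# Hypothetical rows of a 'servers' relation, as plain Python dicts.
servers = [
    {"name": "web1", "cores": 8, "region": "us-east"},
    {"name": "web2", "cores": 16, "region": "us-east"},
    {"name": "db1", "cores": 32, "region": "eu-west"},
]

# An everyday comprehension: a selection plus a projection.
big_east = [s["name"] for s in servers
            if s["region"] == "us-east" and s["cores"] >= 16]

# The same logic a translator could emit as SQL:
#   SELECT name FROM servers
#   WHERE region = 'us-east' AND cores >= 16;
print(big_east)
```

A model-builder writes the comprehension; the system decides whether to run it in-process or push it down to the database. That is exactly the sort of bridge that would let Python and R users benefit from database machinery without ever writing SQL.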
Given all of this, I think probabilistic database techniques could be adapted to this setting fairly easily. The only real challenge standing in our way at this time is interfaces: how do we present end-users with an interface that is not only as powerful as the tools they're used to working with, but also similar enough that the learning curve is minimized?