Last week, I introduced the concept of probabilistic databases: databases that store values characterized by probability distributions, and not (necessarily) by specific values. Although it's a pretty cool, and potentially quite useful, idea, a number of practical concerns have prevented it from gaining traction. This week, we explore one class of problems that probabilistic databases are ideally suited for: noisy data.
Data is only useful if you can analyze it -- ask questions about it. The problem is that very few data-gathering pipelines are perfect. Data can be missing. Data can contain typos. Data can contain measurement error. And on top of it all, even if a data source is perfect, god help you if you have to combine it with another data source. Integrating multiple data sources means dealing not only with inconsistent formatting, but also with inconsistent data values (the same person could be referred to as Mary Sue, Ms. M Sue, Mrs Sue, or any of a practically infinite number of variations on the same theme). A whole research area (typically called entity resolution) has sprung up around this problem.
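To make the "Mary Sue" problem concrete, here's a minimal sketch of the fuzzy-matching step at the heart of entity resolution, using Python's standard-library `difflib`. The `normalize` helper and the honorific list are illustrative assumptions, not how any real entity-resolution system works -- production systems use far more sophisticated models.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Strip honorifics and punctuation so only informative tokens remain."""
    titles = {"mr", "mrs", "ms", "dr"}
    tokens = name.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(t for t in tokens if t not in titles)

def name_similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

for variant in ["Ms. M Sue", "Mrs Sue", "Mary Sue"]:
    print(f"{variant!r} vs 'Mary Sue': {name_similarity(variant, 'Mary Sue'):.2f}")
```

Note that the scores are graded, not yes/no -- which is exactly the kind of output that begs to be stored as a probability rather than collapsed into a single guess.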
In short, before using any dataset (or when merging two datasets), it's typically necessary to go through an (often time-consuming) data-cleaning process. Often, this process can be automated. You can call out obvious errors: duplicates of values that should be unique, values out of bounds, improperly formatted expressions, and so on. Many of these issues can be fixed automatically, and formatting mismatches between different datasets can be resolved by translating everything into a single common format.
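The kinds of checks described above are straightforward to mechanize. The sketch below flags duplicates, malformed values, and out-of-bounds values; the record layout and the SSN/age rules are made up for illustration.

```python
import re

# A hypothetical record layout; the checks mirror the ones described above.
records = [
    {"ssn": "123-45-6789", "age": 34},
    {"ssn": "123-45-6789", "age": 41},  # duplicate of a should-be-unique key
    {"ssn": "987654321",   "age": -3},  # bad format and out-of-bounds value
]

def find_issues(records):
    issues, seen = [], set()
    for i, r in enumerate(records):
        if r["ssn"] in seen:
            issues.append((i, "duplicate ssn"))
        seen.add(r["ssn"])
        if not re.fullmatch(r"\d{3}-\d{2}-\d{4}", r["ssn"]):
            issues.append((i, "malformed ssn"))
        if not 0 <= r["age"] <= 130:
            issues.append((i, "age out of bounds"))
    return issues

for index, problem in find_issues(records):
    print(f"record {index}: {problem}")
```

Detecting the problem is the easy half; as the next paragraph argues, deciding what to do about it is where automation runs out.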
Unfortunately, automated processes can only take you so far. If two people appear with the same social security number, then clearly at least one of them is wrong. But typically, an automated process can't decide which one is correct, nor can it decide what number to assign to the person left without an identifier. Typically, one of three things happens at this point:
- Immediately punt to the user: A common example of this is key constraints in traditional databases. The datastore simply won't allow data that it knows to be unclean to be entered into the system. This approach ensures that data is correct before anyone asks any questions about it, but requires end-users to put in a huge up-front effort on data quality before a single question can be asked.
- Guess: This happens often in situations where an automated system can efficiently compute the probability that a particular interpretation is correct, like handwriting recognition or sentence parsing. The system settles on one specific way of interpreting the data (the one with the highest probability), and discards the rest. Ironically, this can be just as much a source of data errors as any other data gathering process if the guessing algorithm isn't perfect.
- Ignore it: Failing all else, you can simply ignore the problem. You implicitly accept that answers to your questions may be erroneous, but don't especially care.
In short, either you put a huge amount of effort in upfront to clean your data, or you deal with mistakes in your answers.
Probabilistic databases aren't a magic bullet. They can't magically make your data clean, or fix the mistakes in your answers. What they can do, however, is tell you how much of a mistake there is. And you don't even have to do anything different. You can just query your data as if it were normal, ordinary data. All of the trickery for dealing with uncertainty happens under the hood; the only difference you see is a probability value attached to your output.
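Here's a toy sketch of what "query as normal, get a probability back" can look like, using the tuple-independent model (each row exists with some probability, independently of the others) that much of the probabilistic-database literature builds on. The table, its names, and its probabilities are all invented for illustration.

```python
from math import prod

# A toy tuple-independent table: each row exists with probability p,
# independently of the other rows. All values here are made up.
people = [
    {"name": "Mary Sue", "age": 42, "p": 0.75},
    {"name": "Bob Roe",  "age": 51, "p": 0.90},
    {"name": "Jan Poe",  "age": 29, "p": 0.95},
]

def prob_exists(rows, predicate):
    """P(at least one satisfying row exists), treating rows as independent."""
    return 1 - prod(1 - r["p"] for r in rows if predicate(r))

# An ordinary-looking query whose answer carries a confidence with it.
answer = prob_exists(people, lambda r: r["age"] > 40)
print(f"P(someone is over 40) = {answer:.3f}")
```

The caller writes an ordinary existence query; the probability arithmetic (here, one minus the product of each matching row's absence probability) stays entirely under the hood.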
So where's the problem? Why aren't probabilistic databases being used more aggressively?
As I see it, there are two issues at play here for the general populace:
- People don't know what to do with probabilities. Statisticians aside, very few people know how to deal with probabilities. If someone gets a response that is 75% likely to be correct, they're generally not going to want to perform a complex risk analysis. Either they trust the result or they don't. In other words, guessing is usually sufficient here, because ultimately the user is interested in the most probable result anyhow (though guessing doesn't always deliver it).
- People don't know how to define probabilities to begin with. Again, statisticians aside, very few people can build good statistical models. Sometimes your data comes with probabilities already associated with it, but more likely than not, the average data-user won't have a good sense of how to define their automated data cleaning processes probabilistically.
In short, the problem with applying probabilistic databases to the challenge of noisy data is the probabilities themselves. People are used to dealing with fuzzier notions: "Certainly Not", "Unlikely", "Possibly", "Likely", "Certainly So".
So what's the takeaway from all of this? Well, I'm not entirely certain. I think probabilistic database research needs to start looking at ways of isolating users from the specifics of the probability distributions underlying the system. Instead of presenting users with query results and raw probabilities, we need to give them a more intuitive way of visualizing the range of possible outputs, and a more intuitive sense of how to interpret any given confidence value.
Better still, we need to provide the user with things that they can do to improve the confidence level. Instead of immediately punting to the user when a data error occurs, let the user run queries on the noisy data, and then point them at the specific cleaning tasks they need to perform in order to get better results.
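One way such guidance could work, sketched under the same tuple-independent model: rank each uncertain row by how far the query's answer could swing if that row were manually verified, and hand the user that ranking as a cleaning to-do list. The table and all its values are hypothetical.

```python
from math import prod

# Hypothetical tuple-independent table: each row exists with probability p.
rows = [
    {"name": "Mary Sue", "age": 42, "p": 0.75},
    {"name": "Bob Roe",  "age": 51, "p": 0.90},
    {"name": "Jan Poe",  "age": 29, "p": 0.95},
]

def prob_exists(table, pred):
    """P(at least one satisfying row exists), treating rows as independent."""
    return 1 - prod(1 - r["p"] for r in table if pred(r))

def cleaning_priorities(table, pred):
    """Rank rows by how far the query's answer could move once verified:
    the gap between the answer with the row confirmed and with it deleted."""
    scores = []
    for row in table:
        saved = row["p"]
        row["p"] = 1.0
        confirmed = prob_exists(table, pred)
        row["p"] = 0.0
        deleted = prob_exists(table, pred)
        row["p"] = saved
        scores.append((confirmed - deleted, row["name"]))
    return sorted(scores, reverse=True)

for swing, name in cleaning_priorities(rows, lambda r: r["age"] > 40):
    print(f"verify {name!r}: answer could move by up to {swing:.3f}")
```

Rows that can't affect the query at all score zero, so the user's effort goes straight to the records that actually matter for the question they asked.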
And of course, we need to give users better tools for automating their data cleaning processes -- tools that natively integrate with probabilistic database techniques. Tools that know how to associate probabilities with the data they generate.
Next week, we look at a second class of problems that probabilistic databases can be used to address: modeling.