Finding truth in the bits

What is truth, and what is data?

At the very least, they're different.  Ask any scientist, and they'll caution you about conflating the two: data consists of measurements and observations, mere points and samples of the whole of the universe.

This may seem a bit philosophical, but my point is that, while there is often a strong correlation between data and truth, the two are distinct.  Even in the best case, when working with data of perfect quality, it represents only a subset of a bigger picture.  And data is very infrequently of perfect quality.  Substantial massaging is often required to get data into a standardized form for analysis, and each massaging step bakes in assumptions: floats are cast to integers, comment fields are dropped or ignored, extenuating circumstances or outliers are rolled into the core data.  In other words, the data cleaner's assumptions are being applied to interpret the data.
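A toy sketch of the kind of lossy step I mean (the records and field names here are my own invention, not from any real pipeline):

```python
# Hypothetical raw records: a measurement plus a free-text comment field.
raw = [
    {"reading": 98.6, "comment": "sensor recalibrated mid-run"},
    {"reading": 101.2, "comment": ""},
]

# A typical "cleaning" step: cast floats to ints, drop the comment field.
# The fractional part and the recalibration note are silently discarded.
cleaned = [int(row["reading"]) for row in raw]
print(cleaned)  # [98, 101]
```

Nothing here is wrong, exactly; the point is that the cleaned values embody a decision the original records did not.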

That's not to say that these transformations are bad.  Substantial effort goes into data cleaning, and rightly so.  But when you run a query on the database, it's important to realize that what you're getting back is data, not truth.

It might be nice to have a database that acknowledges this distinction.  

What would such a database look like?

I envision a database with two (or more) layers, each layer providing a view over the layer below it.  The bottom layer would consist of the base data: intact, unchanged, and as-gathered.  The uppermost layer would represent "truth".  The base data is completely deterministic; we know these values precisely, but the values themselves may be wrong or unrepresentative.  As we travel up the layers, we reach progressively lower levels of determinism.  Queries run on the higher layers are guaranteed to provide "true" results, but may emit annotated results, ranges of possible results, probability distributions, or simply "I don't know."
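To make the idea concrete, here is a minimal sketch of the two-layer structure.  All names (`Measurement`, `TruthLayer`, the `error` field) are mine, invented for illustration: the base layer holds values exactly as gathered, while the truth layer answers an aggregate query with a range when cleaning introduced uncertainty, or refuses to answer at all when a value is missing.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Measurement:
    value: Optional[float]   # as-gathered; None if never observed
    error: float = 0.0       # uncertainty introduced by cleaning/imputation

class TruthLayer:
    """A view over the base data that refuses to overstate certainty."""
    def __init__(self, base):
        self.base = base     # the bottom layer: raw records, untouched

    def total(self):
        # If anything is missing, the honest answer is "I don't know."
        if any(m.value is None for m in self.base):
            return None
        # Otherwise, answer with a range rather than a single point.
        lo = sum(m.value - m.error for m in self.base)
        hi = sum(m.value + m.error for m in self.base)
        return (lo, hi)

db = TruthLayer([Measurement(10.0), Measurement(4.5, error=0.5)])
print(db.total())  # (14.0, 15.0) -- a range, not a point estimate
```

A real system would of course need richer query operators and probability distributions rather than simple intervals, but the shape is the same: the base layer stays deterministic, and honesty about uncertainty lives in the view above it.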

The crucial challenge, then, is how to make such a database usable.  How can this process be integrated into a normal data-cleaning workflow with minimal changes and/or overhead?
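One possible low-overhead hook, purely my own speculation: wrap each cleaning step so the assumption it applies travels alongside the data, giving the upper layer something to report when asked how a result was derived.

```python
# Hypothetical decorator that records the assumption each cleaning
# step applies; names here are illustrative, not an existing API.
def step(assumption):
    def wrap(fn):
        def inner(rows, log):
            log.append(assumption)  # provenance accumulates as data flows
            return fn(rows)
        return inner
    return wrap

@step("floats truncated to integers")
def truncate(rows):
    return [int(r) for r in rows]

log = []
clean = truncate([98.6, 101.2], log)
print(clean, log)  # [98, 101] ['floats truncated to integers']
```

The cleaning code barely changes, which is the point: the workflow stays familiar while every query gains access to the chain of assumptions behind its inputs.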

This page last updated 2019-06-28 15:47:51 -0400