Re-using work in data integration

Data integration is a huge problem. There's a ton of work out there on automating the process of merging two datasets into a unified whole, but most of it misses one important factor: exceptions are the norm in data integration. That makes data integration a labor-intensive task, involving everything from encoding standard translations (e.g., (℉ - 32) * 5/9 = ℃) and dictionaries (e.g., NY = New York), to more complex relationships (e.g., geocoding street addresses). Worse, once datasets A and B are integrated, integrating dataset C is nearly as much work. Maybe you have some code left over from integrating A and B (and hopefully your student/employee is still around to explain it to you), but you still need to sit there and figure out which bits of that code can be re-used... or you do everything from scratch.
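
To make that concrete, here's the sort of one-off translation code that piles up during a typical integration. This is a minimal sketch; the function and dictionary names are illustrative, not from any particular system:

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    # Standard unit conversion: C = (F - 32) * 5/9
    return (temp_f - 32) * 5 / 9

# Dictionary translation: state abbreviations to full names.
STATE_NAMES = {"NY": "New York", "CA": "California", "TX": "Texas"}

def expand_state(abbrev: str) -> str:
    return STATE_NAMES.get(abbrev, abbrev)  # fall through on unknown codes
```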

This is why I got very excited when I ran into some work by Fatemeh Nargesian on searching for unionable datasets.
The idea is simple: you index a data lake, hand it a dataset, and it figures out which datasets in the lake have "similar" columns (based on a clever use of word embeddings). Enough similar columns, and there's a good chance that you can just union the datasets together.
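As a rough sketch of that intuition (not the actual table-union-search algorithm), imagine representing each column by the mean word embedding of its values and comparing columns by cosine similarity. Here `embed` is a hypothetical stand-in for a pretrained embedding lookup:

```python
import numpy as np

def column_embedding(values, embed):
    """Represent a column by the mean word embedding of its values.
    `embed` maps a string to a vector (or None if out-of-vocabulary)."""
    vectors = [vec for v in values if (vec := embed(v)) is not None]
    return np.mean(vectors, axis=0) if vectors else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def unionable(col_a, col_b, embed, threshold=0.8):
    """Two columns are union candidates if their embeddings are close."""
    ea, eb = column_embedding(col_a, embed), column_embedding(col_b, embed)
    return ea is not None and eb is not None and cosine(ea, eb) >= threshold
```
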

My students Will Spoth and Poonam Kumari got to talking with Fatemeh and me about how we could use this idea to make it easier to re-use data integration code --- basically, to re-use integration work. Our first steps towards this goal just got accepted at HILDA. Our approach, called Link Once and Keep It (Loki), is also simple: when you integrate two datasets, you record the translations that got you from each dataset to the common schema. This is something that can easily be done in our prototype provenance-aware notebook, Vizier. For example, we might record how body temperature in one dataset was translated from ℉ to the ℃ used in the other dataset, or the dictionary that translated state abbreviations (NY) into the full state names (New York) used in the other dataset.
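In code, the core idea amounts to logging each translation alongside the attribute families it connects. The sketch below is hypothetical (the names and structure are illustrative, not Loki's actual API), but it shows the shape of what gets recorded:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Translation:
    source_family: str   # e.g., "temperature_fahrenheit"
    target_family: str   # e.g., "temperature_celsius"
    apply: Callable      # the translation logic itself

registry: list[Translation] = []

def record(source_family: str, target_family: str, fn: Callable):
    """Log a translation used during an integration so it can be reused."""
    registry.append(Translation(source_family, target_family, fn))

# Translations written once during the A-B integration...
record("temperature_fahrenheit", "temperature_celsius",
       lambda f: (f - 32) * 5 / 9)
record("state_abbreviation", "state_name",
       lambda s: {"NY": "New York", "CA": "California"}.get(s, s))
# ...become candidates to propose when dataset C shows up.
```
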

Now that we have these mappings recorded, anytime we need to translate ℉ to ℃ or to expand abbreviated state names, we already have the logic needed to pull it off. What remains is figuring out when to propose a translation to the user. This is where Fatemeh's work on unionability comes in: whenever two columns are "similar", there's a good chance they're of the same type and that the same mappings apply. We took the opportunity to define a new similarity metric for numeric types, based on the distribution of values in the data. Compared to the prior approach based on word embeddings, ours is far more likely to give false positives, but in this setting that's ok, since our goal is only to find and suggest translations to the user.
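One plausible stand-in for a distribution-based similarity (the metric in the paper may well differ) is the two-sample Kolmogorov-Smirnov statistic, which measures how far apart the empirical distributions of two numeric columns are:

```python
from scipy.stats import ks_2samp

def numeric_similarity(col_a, col_b):
    """Distribution-based similarity for two numeric columns, using the
    two-sample Kolmogorov-Smirnov statistic (the maximum gap between
    the columns' empirical CDFs). 1.0 means the empirical distributions
    are indistinguishable; 0.0 means they are maximally distant."""
    result = ks_2samp(col_a, col_b)
    return 1.0 - result.statistic
```
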

Loki combines this with a graph-based search for chains of translations that can be used to translate a source attribute family into a target attribute family. This allows Loki to answer two classes of queries: (1) what transformations will get me from a source dataset to a target schema, and (2) is there a schema into which I can map two datasets with minimal work? While Loki is still in the exploratory prototype phase, we hope to be able to release it one day as a slowly growing repository of translation rules.
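As a sketch of what such a graph-based search could look like (hypothetical names, not Loki's implementation): attribute families are nodes, recorded translations are edges, and a breadth-first search finds a chain from source to target:

```python
from collections import deque

def find_translation_chain(graph, source, target):
    """Breadth-first search for a chain of recorded translations from a
    source attribute family to a target one. `graph` maps each family
    to the families reachable via a single recorded translation."""
    frontier = deque([[source]])
    seen = {source}
    while frontier:
        path = frontier.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None  # no known chain of translations

# Example: ℉ -> ℃ -> K via two separately-recorded rules.
graph = {"fahrenheit": ["celsius"], "celsius": ["kelvin"]}
print(find_translation_chain(graph, "fahrenheit", "kelvin"))
# ['fahrenheit', 'celsius', 'kelvin']
```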
