Data curation is the act of taking data and massaging it into a form that can be analyzed. There's a common saying among DBAs and librarians that data curation is the biggest time sink of data management. I can certainly cite a number of examples of this. My wildlife biologist girlfriend spends almost as much time organizing and inputting data as she does out in the field collecting it, or analyzing it. The kind folks working on the DBLP do nothing but data curation. If you squint a bit, data mining can be thought of as a specialized form of data curation, where signals (usable/analyzable data) are extracted from noisy, messy data.
In short, this is an area that a lot of people spend a lot of time worrying about, and one on which a considerable amount of effort has already been expended.
Why is that?
Although data curation is an extremely repetitive task (suggesting that it might be ideal for computers), the kernel of this repetition, the very heart of the task, is something entirely nontrivial: data validation.
- Do both of these datasets contain information about John Doe?
- Is John Doe the same person as Johnny Doe?
- Is "The House at the End of the Row, Birminghamshire, England" a valid address?
- How do I deal with John Doe not having a home phone number?
When analyzing data, just like writing code, we make certain assumptions about the data. For example, "This dataset contains one row for each unique individual". If these assumptions are invalid, then our analysis will be incorrect. In addition to getting the data into the right, readable format, data curation's primary task is to ensure that the assumptions analysts make about the data are valid.
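To make this concrete, here's a minimal sketch of checking that one-row-per-individual assumption explicitly rather than trusting it (the dataset and its `id` field are hypothetical):

```python
from collections import Counter

# Hypothetical dataset: one record per individual... or so we assume.
rows = [
    {"id": 1, "name": "John Doe"},
    {"id": 2, "name": "Jane Roe"},
    {"id": 1, "name": "Johnny Doe"},  # the same id appears twice
]

# Count how often each id occurs; anything above 1 violates the assumption.
id_counts = Counter(row["id"] for row in rows)
duplicates = {i: n for i, n in id_counts.items() if n > 1}
print(duplicates)  # a non-empty result means the analysis can't trust the data
```

A check like this is trivial to write, but someone has to remember to write it for every assumption, for every dataset.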
Of course, this requires us to explicitly declare these assumptions. Databases have a mechanism for this, called constraints (e.g., Primary Key constraints, Foreign Key constraints, Validation Triggers, etc...). However, even these are flawed.
Let's take the example I mentioned just now: "This dataset contains one row for each unique individual." This is a nontrivial constraint to encode. How does a database figure out whether two individuals are identical? "Joe" and "Joey" could be different names for the same person. Deduplication is something people have studied for a very long time, and even now there is no solution that is correct 100% of the time.
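You can see why deduplication is hard even in a toy sketch. A naive approach is string similarity (here, Python's built-in `difflib`; real entity-resolution systems use far more than this):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; a stand-in for real deduplication."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# "Joe" and "Joey" look very similar as strings, and may be the same person...
print(name_similarity("Joe", "Joey"))

# ...but "John" and "Joan" also score highly, and are almost
# certainly different people. Similarity alone can't decide identity.
print(name_similarity("John", "Joan"))
```

The scores don't tell you which pair is actually the same individual; that requires context the strings alone don't carry.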
Moreover, what happens when the database detects such a violation? The typical solution is for the database to simply reject any insertion that would cause the constraints to be violated.
This typically annoys users, who, at least in the short term, just want to load their data into the database. Consequently, constraints are used quite infrequently, and then, usually only for values that the database itself generates (e.g., entity ids/counters). Specifying more complex constraints is just out of the question.
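That rejection behavior is easy to demonstrate with SQLite. A sketch, where the `people` table and its columns are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE people (
        id    INTEGER PRIMARY KEY,   -- one row per individual, enforced
        name  TEXT NOT NULL,
        phone TEXT CHECK (phone IS NULL OR length(phone) >= 7)
    )
""")
conn.execute("INSERT INTO people VALUES (1, 'John Doe', NULL)")

# A second row for the same individual is simply rejected outright.
try:
    conn.execute("INSERT INTO people VALUES (1, 'Johnny Doe', '555-0101')")
except sqlite3.IntegrityError as e:
    err = str(e)
    print("rejected:", err)
```

From the user's point of view, the second record is real data that now has nowhere to go, which is exactly why constraints end up disabled or never written.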
Although constraints automate the data validation process, they are insufficient. There's a clear tension between how tightly we specify these constraints (e.g., an address is always a number, followed by a street, followed by a newline, followed by a city, etc.) and how usable the database is. Extremely tight constraints are convenient for analysts and database programmers, since they can make stronger assumptions about the data... but they make it difficult to insert data into the database (and hard to handle corner cases, like addresses in rural England). Weak constraints are the exact opposite.
Finding the right balance between strong and weak constraints is hard. It's a large part of data curation. How much do you automate, and how much do you put on the analysts?
Is there a middle ground? Are there other ways of creating constraints that are strong enough to satisfy the analysts, but that don't make inserting data into the database a miserable experience? How do we alert the analysts when a corner case appears in the database that violates their assumptions about the data?
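One possible middle ground is a "soft constraint": accept every row, but record violations for the analysts instead of rejecting the data. A minimal sketch, with hypothetical names throughout:

```python
def load_with_warnings(rows):
    """Load all rows, flagging (rather than rejecting) assumption violations."""
    seen_ids, accepted, violations = set(), [], []
    for row in rows:
        if row["id"] in seen_ids:
            violations.append(f"assumption violated: id {row['id']} seen twice")
        seen_ids.add(row["id"])
        accepted.append(row)       # the data still gets in...
    return accepted, violations    # ...and analysts get an audit trail

rows = [{"id": 1, "name": "John Doe"},
        {"id": 1, "name": "Johnny Doe"}]
accepted, violations = load_with_warnings(rows)
print(len(accepted), violations)
```

Nothing here resolves the underlying tension, but it shifts the question from "reject or allow?" to "how do we surface the corner cases?", which is arguably the question analysts actually care about.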