The ODIn Lab - The Analizerificationist

There's been a lot of talk lately about "wisdom of the crowd" and "tapping the collective consciousness" and the like, so I figure I might as well weigh in with my 2c by expanding on an idea that came up in a recent conversation with my colleague Jan Chomicki and his student Ying. (Credit should also go to Dieter Gawlick and Zhen Hua Lu of Oracle, who provided inspiration for this discussion.)

Recently, especially in high-profile events like the US presidential election, classical political punditry has been getting supplemented (and even in some cases replaced) by data mining algorithms. Powerful, and often quite accurate, algorithms exist to predict anything from elections, to ball games, to the stock market, to what you will be doing next Tuesday evening at 6:41 PM.

Yet, in spite of the daunting array of algorithmic predictors out there, there's still more to be done. Data mining is almost more of an art form than a science. Yes, there are practical, general-purpose techniques for finding correlations, outliers, and other interesting features of datasets, but ultimately, you need to know (or at least have a general sense of) what you're looking for. A lot of the beautiful work in data mining lies in finding clever ways to apply the general techniques to specific datasets.

So... where does the wisdom of the crowd come in? Well, let's start with tools like Google Fusion Tables, or Yahoo Pipes. Here, we have some pretty nifty mechanisms for doing data extraction and analysis, and even dataset lookup and organization. Can we do any better?

What's missing from these systems is a way of organizing the derivation process. So you've created a great visualization, and maybe you've even shared it with your friends. Now how can we take your efforts and use them to benefit even more people?

Let's say you have an idea. You think you know exactly how to predict the next election, but it will require a lot of data. What do you need to do? Well, first, you'll have to find and/or extract all that data from content on the internet. Here, Fusion Tables and Pipes have you covered. There are some fairly high-quality datasets available, as well as some nifty tools for getting useful data out of the interwebs. But now that you have it, you'll still need to massage it a bit.

Fortunately for you, it's quite likely that someone else has had to do data manipulations on similar datasets. It would be quite useful to have a system that could point you towards such efforts on the part of other people so that you might base your own efforts on theirs. As an added benefit, it might be possible to piggyback on the computational efforts already expended for the prior attempt(s) at massaging similar datasets.

Now that the data is in the right form to be analyzed, there's still that pesky analysis to be done. Here, once again, the system has the potential to help. What questions have other people asked about similar data? What kind of aggregate values might be useful? What kind of visualizations might be appropriate? Are there mash-ups that people have assembled out of similar data (Google Maps being the most general example)? What even qualifies as "similar" data?
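To make the "similar" question a bit more concrete, here's one deliberately naive way a system might take a first stab at it: compare the column names of two datasets and score their overlap (Jaccard similarity). This is just a sketch under my own assumptions, not how Fusion Tables or Pipes actually work; the function names, the catalog, and the dataset names below are all made up for illustration.

```python
def normalize(column_name):
    """Crude normalization so 'Poll Date' and 'poll_date' compare equal."""
    return column_name.strip().lower().replace(" ", "_")


def schema_similarity(cols_a, cols_b):
    """Jaccard similarity of two column-name lists: |A & B| / |A | B|."""
    a = {normalize(c) for c in cols_a}
    b = {normalize(c) for c in cols_b}
    if not a and not b:
        return 0.0  # two empty schemas: treat as "nothing in common"
    return len(a & b) / len(a | b)


def rank_similar(query_cols, catalog):
    """Return (score, dataset_name) pairs, most similar schema first."""
    scored = [
        (schema_similarity(query_cols, cols), name)
        for name, cols in catalog.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical catalog of datasets other people have already worked with.
catalog = {
    "county_votes_2008": ["State", "County", "Candidate", "Votes"],
    "state_polls": ["state", "candidate", "poll date", "share"],
    "nba_box_scores": ["Team", "Player", "Points"],
}

# The (equally hypothetical) dataset you just extracted.
my_columns = ["state", "county", "candidate", "votes", "turnout"]

for score, name in rank_similar(my_columns, catalog):
    print(f"{name}: {score:.2f}")
```

Matching on column names alone will obviously miss a lot (two datasets can describe the same thing with entirely different headers), which is part of why the question of what counts as "similar" is an interesting one.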

In fact, this works from both directions. Let's say you know what kind of information you're looking for. How could you ask the system for strategies that other people have applied to get similar answers? How would you even indicate what you're looking for to the system in the first place?