Gathering Data, Interactive Programming, and Analysis

Data exploration is an interactive process.  Say I have a dataset and I want to ask questions about it.  Often, though, I won't have a precise idea of what those questions are, even if I have a vague sense of them.  I want to be able to explore the data.

So what's standing in the way of me doing that?

Gathering the data: The data may not be immediately available.  Even if I know what I'm looking for, I might not yet have access to it.  Before anything else, I need to find the data that I'm interested in, and (if necessary) transport it to somewhere that allows me to compute over it.

Structuring the data: Data pulled from the outside world needs to be put into a structured form before any sort of automated analysis.  This may be as simple as parsing a CSV file, or more complex: I might be able to extract all manner of features from a log file, for example.  I might split the input by record, by line, or even by sets of records.  I might be interested in writing a parser that pulls out certain features from each log entry -- the timestamp, the message, or the component causing the alert.  This is a bit of an ad-hoc process -- I may only be interested in specific patterns and subsets of the data now, but that might change as I explore more of the data.
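
As a rough illustration of the log-file case, here is a minimal Python sketch that pulls the timestamp, component, and message out of each entry.  The log format, the pattern, and the names parse_log_line and parse_log are all assumptions made up for this example; a real log would need its own pattern.  Lines that fail to parse are kept (with their line numbers) rather than dropped, so they can be inspected later.

```python
import re
from datetime import datetime

# Hypothetical log format, assumed purely for illustration:
#   2019-06-28 15:47:51 [storage] disk usage above threshold
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<component>[^\]]+)\] "
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of extracted fields, or None if the line does not fit."""
    match = LOG_PATTERN.match(line.rstrip("\n"))
    if match is None:
        return None
    record = match.groupdict()
    try:
        record["timestamp"] = datetime.strptime(
            record["timestamp"], "%Y-%m-%d %H:%M:%S"
        )
    except ValueError:
        # Looks like a timestamp but is not a valid date/time.
        return None
    return record


def parse_log(lines):
    """Separate parsed records from rejected lines, keeping line numbers."""
    records, rejects = [], []
    for number, line in enumerate(lines, start=1):
        record = parse_log_line(line)
        if record is None:
            rejects.append((number, line))
        else:
            records.append(record)
    return records, rejects
```

Because the pattern lives in one place, changing which features get extracted as the exploration moves on only means editing LOG_PATTERN.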

Cleaning the data: Even after I've imposed some structure on the data, there's no guarantee that the data is 'correct'.  Strange entries, outliers, and missing or corrupted data can make any results I obtain useless.  At this stage, one typically goes through a set of sanity checks: examining schema warnings from the previous stage, asserting constraints like key dependencies, and validating against secondary data sources.  I may also want to apply my domain knowledge; past experience may have given me a sense of what could go wrong with my data collection process.
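
Two of these sanity checks can be sketched in a few lines of Python.  This assumes records are plain dicts; the function names and the field names passed to them are hypothetical.  The first check reports missing values, and the second asserts a key dependency (every value of one field should determine exactly one value of another).

```python
from collections import defaultdict


def find_missing(records, required_fields):
    """Return (record index, field) pairs where a required field is absent or empty."""
    problems = []
    for index, record in enumerate(records):
        for field in required_fields:
            if record.get(field) in (None, ""):
                problems.append((index, field))
    return problems


def find_dependency_violations(records, key, dependent):
    """Check the dependency key -> dependent.

    Returns a dict mapping each offending key value to the set of
    conflicting dependent values seen for it.
    """
    seen = defaultdict(set)
    for record in records:
        seen[record.get(key)].add(record.get(dependent))
    return {k: values for k, values in seen.items() if len(values) > 1}
```

Both functions return the offending entries rather than raising, since at this stage I usually want to look at what went wrong before deciding whether to repair or discard it.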

Query processing: Finally, I'm ready to actually manipulate the data.  This means transforming the data into a form that matches what I need -- merging datasets, rotating/pivoting the data, and/or filtering down to the entries of interest, for example.
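
To make the three operations concrete, here is a small sketch over lists of dicts using only the Python standard library.  The helper names merge and pivot and the field names in the usage note are assumptions for illustration; in practice a dataframe library or a SQL engine would do the same work.

```python
from collections import defaultdict


def merge(left, right, key):
    """Inner-join two lists of dicts on a shared key.

    Assumes each key value appears at most once in `right`.
    """
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def pivot(rows, row_field, column_field, value_field):
    """Rotate long-format rows into table[row][column] = summed value."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[row_field]][row[column_field]] += row[value_field]
    return {r: dict(columns) for r, columns in table.items()}
```

Given a list of alert counts keyed by host and a second list mapping hosts to regions, merge(alerts, hosts, "host") attaches the region to each alert, a list comprehension such as [r for r in merged if r["region"] == "east"] filters down to one region, and pivot(east, "host", "component", "count") yields one row per host with one column per component.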

Visualization: A step in the process that's often associated with this last query processing stage is summary and visualization.  Obtaining aggregates, samples, and/or graphical representations of the data is a crucial part of the entire analytical development process: (1) as I'm gathering the data, I need to be able to see bits and pieces of it so that I can be sure that it's what I'm looking for, (2) as I'm structuring the data, I want to make sure that my regular expression and/or parsing scheme is correct, (3) as I'm cleaning the data, I want to see/visualize outliers, and (4) obviously, I want to see the results.


Really, all of these aspects of analysis are interrelated.  One bounces back and forth between different stages, gathering more data, parsing out more fields, cleaning, etc.  A strong analytical pipeline relies on being able to see the data quickly, see results even if they're only estimates, and then go back to iterate on the analysis.
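
One simple way to get a quick estimate is to compute an aggregate over a random sample rather than the full dataset.  The sketch below is only one possible approach, not a prescribed method: the name estimate_mean is hypothetical, it assumes a non-empty in-memory list of numbers, and the standard error it reports ignores the finite-population correction.

```python
import math
import random
import statistics


def estimate_mean(values, sample_size, seed=None):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, standard_error).  Sampling is without replacement;
    `values` is assumed to be a non-empty list of numbers.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    estimate = statistics.fmean(sample)
    if len(sample) < 2:
        # A single observation gives no information about spread.
        return estimate, float("inf")
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return estimate, standard_error
```

Calling it with a small sample_size gives a fast, rough answer; rerunning with a larger sample tightens the standard error, which fits the loop of looking at a result and then iterating.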

How do we achieve this?  What kind of interfaces can we build to improve feedback and to anticipate the user's needs?  What infrastructure is needed to support this kind of anticipatory computation?

This page last updated 2019-06-28 15:47:51 -0400