Vizier Workflows (rant)

I'd like to talk a little about abstractions for communication. In particular, I want to talk about a favorite workhorse of the data science community these days: the Jupyter notebook. For those unfamiliar with it, Jupyter users work with blocks of code called "cells." Each cell has an opportunity to produce a result, which is then displayed inline, immediately after the cell. This makes it a lot easier for users to break up complex tasks and to see intermediate results alongside the code that produced them.

Let's spend a little time digging into how this works. For each language that supports Jupyter (Python, Scala, Ruby, and more), the developers have created a way to keep the language's state alive between cells: global variables, runtime information, file handles, and more. They call this a kernel. When you execute a cell, Jupyter sends the code to the kernel, which runs it against that state and returns the result.

This means that if you want to have code in one cell talk to code in another cell, the natural way to do it is to create a global variable. The fundamental communication abstraction in Jupyter is the kernel. On the one hand, this is a very powerful abstraction: anything that you can represent using a global variable in Python can be sent. On the other hand, it also means that the main way to communicate is through language-specific binary blobs.
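To make that concrete, here is a minimal sketch of two cells talking through global state. It is not how Jupyter is implemented: the `kernel_state` dictionary is a hypothetical stand-in for the kernel, and Python's built-in `exec` stands in for running a cell.

```python
# A plain dict stands in for the kernel's global state (an assumption
# for illustration; a real kernel is a separate long-running process).
kernel_state = {}

# Two "cells", written as source strings.
cell_1 = "totals = [3, 5, 8]"
cell_2 = "average = sum(totals) / len(totals)"

# Running a cell means executing its code against the shared state.
exec(cell_1, kernel_state)   # defines the global `totals`
exec(cell_2, kernel_state)   # reads `totals`, defines `average`

print(kernel_state["average"])
```

The second cell only works because the first one left `totals` behind as a global. Nothing outside the Python runtime knows what `totals` is or what shape it has.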

At UB, we're working with NYU and IIT on a data exploration tool called Vizier. Expect to hear more about Vizier here in the coming weeks and months, but what I want to focus on right now is the fact that cells in Vizier talk through tables (or DataFrames or Relations, if you like). The fact that they're tables isn't even all that important; what we care about is that they're in a standardized format that Vizier understands. This is why data debugging in Vizier is easier, and why we expect to be able to provide some powerful query optimization down the line. Again, more on each of those as they develop.

What I want to focus on today is interoperability. Because all communication in Vizier happens through tables, you can write a Python script that transforms data in one cell and a SQL query over the same data in the next. Better still, it means that we can allow direct manipulation of data: for Vizier, we're developing a new language called Vizual. Every expression in Vizual corresponds to an action in a spreadsheet (rewriting a cell, adding a formula, etc.). So, you can write a Python script, manually fine-tune the output table as a spreadsheet, and then query the results. None of that would be possible if the underlying communication abstraction were opaque to Vizier.
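As a rough illustration of that workflow, here is a sketch that uses Python's standard `sqlite3` module as a stand-in for a shared table store. This is not Vizier's API and not Vizual syntax: the table names, the sample rows, and the `UPDATE` used to mimic a spreadsheet edit are all assumptions made up for the example.

```python
import sqlite3

# An in-memory database stands in for the table store shared by all cells.
store = sqlite3.connect(":memory:")
store.execute("CREATE TABLE readings (city TEXT, temp_f REAL)")
store.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("Buffalo", 32.0), ("New York", 50.0), ("Chicago", 41.0)],  # made-up data
)

# "Cell 1" (Python): read a table, transform it, write a new table.
rows = store.execute("SELECT city, temp_f FROM readings").fetchall()
converted = [(city, (temp_f - 32.0) * 5.0 / 9.0) for city, temp_f in rows]
store.execute("CREATE TABLE readings_c (city TEXT, temp_c REAL)")
store.executemany("INSERT INTO readings_c VALUES (?, ?)", converted)

# "Cell 2" (spreadsheet-style edit): overwrite a single value by hand.
store.execute("UPDATE readings_c SET temp_c = 3.0 WHERE city = 'Chicago'")

# "Cell 3" (SQL): query the edited table.
warm = store.execute(
    "SELECT city FROM readings_c WHERE temp_c > 4 ORDER BY city"
).fetchall()
print(warm)
```

Each step sees only a table, so it does not matter which language produced it. That is the property the kernel-based approach gives up.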


This page last updated 2024-12-03 16:56:13 -0500