CSE 501: Microkernel Notebooks

Microkernel Notebooks

Oliver Kennedy

Boris Glavic, Juliana Freire, Michael Brachmann, William Spoth, Poonam Kumari, Ying Yang, Su Feng, Heiko Mueller, Aaron Huber, Nachiket Deo, and many more...

But first...

Databases?

openclipart.org

Data Structures?

Adapted from CC BY-SA 3.0, Wikimedia Commons

Adapted from Wat; Gary Bernhardt @ CodeMash 2012

CSE 562: Database Systems

A bit of operating systems
A bit of hardware
A bit of compilers
A bit of distributed systems

Applied Computer Science

For example...


      CREATE VIEW salesSinceLastMonth AS
        SELECT l.*
        FROM lineitem l, orders o
        WHERE l.orderkey = o.orderkey
        AND o.orderdate > CURRENT_DATE - INTERVAL '1' MONTH;


      SELECT partkey FROM salesSinceLastMonth
      ORDER BY shipdate DESC LIMIT 10;


      SELECT suppkey, COUNT(*)
      FROM salesSinceLastMonth
      GROUP BY suppkey;


      SELECT DISTINCT partkey
      FROM salesSinceLastMonth;


      def really_expensive_computation():
        return [
          expensive_computation(i)
          for i in range(1, 1000000)
          if expensive_test(i)
        ]


      print(sorted(really_expensive_computation())[:10])


      print(len(really_expensive_computation()))


      print(set(really_expensive_computation()))


      def really_expensive_computation():
        return [
          expensive_computation(i)
          for i in range(1, 1000000)
          if expensive_test(i)
        ]

      view = really_expensive_computation()


      print(sorted(view)[:10])


      print(len(view))


      print(set(view))

Opportunity: Views are queried frequently

Idea: Pre-compute and save the view’s contents!

Btw... this idea is the essence of CSE 250.

openclipart.org

When the base data changes,
the view needs to be updated too!


      def init():
        global view
        view = query(database)

Our view starts off initialized

Idea: Recompute the view from scratch when data changes.


      def update(changes):
        global database, view
        database = database + changes
        view = query(database) # includes changes

CC BY-SA 3.0, Wikimedia Commons


      def update(changes):
        global database, view
        view = view + delta(query, database, changes)
        database = database + changes

`delta`: (ideally) a small, fast query
`+`: (ideally) a fast "merge" operation
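A minimal sketch of this update pattern, assuming the view is the `COUNT(*) ... GROUP BY suppkey` query from earlier and that rows are plain dicts. The `query`, `delta`, and `update` functions are illustrative stand-ins, not a real database API. Only inserts are handled, and for this particular query the delta happens to need only the changes, not the database.

```python
from collections import Counter

def query(rows):
    """Full recomputation: count rows per suppkey."""
    return Counter(row["suppkey"] for row in rows)

def delta(changes):
    """Delta query for an insert-only COUNT: it reads only the changes."""
    return Counter(row["suppkey"] for row in changes)

database = [{"suppkey": 1}, {"suppkey": 1}, {"suppkey": 2}]
view = query(database)          # initialized once, up front

def update(changes):
    global database, view
    view = view + delta(changes)       # "merge" is Counter addition
    database = database + changes      # list concatenation

update([{"suppkey": 2}, {"suppkey": 3}])
```

After the update, `view` matches what `query(database)` would return, but only the two new rows were scanned to get there.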

Intuition

$$\mathcal{D} = \{\ 1,\ 2,\ 3,\ 4\ \} \hspace{1in} \Delta\mathcal{D} = \{\ 5\ \}$$

$$Q(\mathcal D) = \texttt{SUM}(\mathcal D)$$

$$1 + 2 + 3 + 4 + 5$$

$Q(\mathcal D+\Delta\mathcal D)$ $\sim O(|\mathcal D| + |\Delta\mathcal D|)$

$$10 + 5$$

$\texttt{VIEW} + \texttt{SUM}(\Delta\mathcal D)$ $\sim O(|\Delta\mathcal D|)$
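The same arithmetic as a small Python sketch. The variable names are made up for illustration; the point is only that both routes reach the same answer while touching different amounts of data.

```python
D = [1, 2, 3, 4]
delta_D = [5]

view = sum(D)                         # Q(D), computed once and saved

recomputed = sum(D + delta_D)         # re-reads all of D and delta_D
incremental = view + sum(delta_D)     # reads only delta_D

print(recomputed == incremental)
```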

Get off my database's lawn, punk kids

Why Jupyter Sucks

Microkernel Notebooks



  import pandas as pd


  df = pd.read_csv("AMS-USDA-Directories-FarmersMarkets.csv")
  df


  df.groupby("County").count()

...

  Python 3.9.7 (default, Sep 10 2021, 14:59:43)
  [GCC 11.2.0] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import pandas as pd
  >>> df = pd.read_csv("AMS-USDA-Directories-FarmersMarkets.csv")
  >>> df
         FMID                                         MarketName  ...  WildHarvested            updateTime
  0   1000519                      Alexandria Bay Farmers Market  ...              N  2/1/2021 11:02:22 AM
  1   1021329                              Aurora Farmers Market  ...              Y  1/30/2021 6:24:08 PM
  2   1002064                             Belmont Farmers Market  ...              N  1/27/2021 9:03:15 PM
  3   1021262              Broome County Regional Farmers Market  ...              Y  1/5/2021 10:02:05 AM
  4   1021202                      Canal Village Farmers' Market  ...              N   9/9/2020 7:55:23 PM
  ..      ...                                                ...  ...            ...                   ...
  82  1020021                        Waterloo Rotary Farm Market  ...              N   8/3/2020 2:28:33 PM
  83  1000384          Webster's Joe Obbie Farmers' Market, Inc.  ...              N  1/5/2021 10:18:30 AM
  84  1002177        West Point-Town of Highlands Farmers Market  ...              N  8/2/2018 12:58:13 AM
  85  1019038                            Woodstock Farm Festival  ...              N  4/4/2018 11:27:02 AM
  86  1007259  Yates County Cooperative Farm and Craft Market...  ...              N  2/3/2019 12:29:07 PM

  [87 rows x 59 columns]
  >>>

Cells are code snippets that get pasted into a long-running kernel.
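A toy model of that idea, assuming nothing about how Jupyter is actually implemented: one dictionary plays the role of the long-running kernel, and `run_cell` (a made-up helper) pastes a cell's source into it with `exec`.

```python
kernel_state = {}   # the one long-lived namespace shared by every cell

def run_cell(source):
    """Run a cell's source against the shared kernel state."""
    exec(source, kernel_state)

cells = [
    "x = 1",
    "y = x + 1",
]

run_cell(cells[0])
run_cell(cells[1])      # y is computed from x == 1

cells[0] = "x = 10"     # edit the first cell ...
run_cell(cells[0])      # ... and re-run only that cell

# The kernel now holds the new x next to a y computed from the old x.
print(kernel_state["x"], kernel_state["y"])
```

Nothing in the kernel records that `y` was derived from an earlier `x`, so the notebook on screen and the state in memory can quietly disagree.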

"I don't like notebooks." - Joel Grus (Allen Institute for Artificial Intelligence)

Evaluation Order ≠ Notebook Order

... but why?

In a monokernel...


  import pandas as pd


  df = pd.read_csv("really_big_dataset.csv")


  test = df.iloc[:800]
  train = df.iloc[800:]


  model = train_linear_regression(train, "target")


  evaluate_linear_regression(model, test, "target")


  import pandas as pd


  df = pd.read_csv("really_big_dataset.csv")


  test = df.iloc[:500]
  train = df.iloc[500:]


  model = train_linear_regression(train, "target")


  evaluate_linear_regression(model, test, "target")

Q1: Which cells need to be re-evaluated?

Idea 1: All of them!


  import pandas as pd

  df = pd.read_csv("really_big_dataset.csv")
  test = df.iloc[:500]
  train = df.iloc[500:]

  model = train_linear_regression(train, "target")

  evaluate_linear_regression(model, test, "target")

CC BY-SA 3.0, Wikimedia Commons

... but df is still around, and you can "re-use" it.

Idea 2: Skip cells that haven't changed.

... but you need to keep track of this.

Idea 3: Pull out your CSE 443 Textbooks

Data Flow Graph

Cell 3 changed, so re-evaluate only cell 3 and the cells downstream of it: cells 4 and 5
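A rough sketch of how such a data flow analysis might look, using Python's standard `ast` module to guess which variables each cell reads and writes. The helpers `reads_and_writes` and `stale_cells` are hypothetical names, the analysis is deliberately conservative, and it assumes cells run top to bottom. Cell indices are 0-based, so index 2 is "cell 3".

```python
import ast

def reads_and_writes(source):
    """Return (names read, names assigned) by a cell, via a rough AST scan."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:
                writes.add(node.id)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                writes.add((alias.asname or alias.name).split(".")[0])
    return reads, writes

def stale_cells(cells, changed):
    """Indices of the changed cell and every later cell downstream of it."""
    stale = {changed}
    dirty = set(reads_and_writes(cells[changed])[1])  # variables with new values
    for i in range(changed + 1, len(cells)):
        reads, writes = reads_and_writes(cells[i])
        if reads & dirty:
            stale.add(i)
            dirty |= writes
        else:
            dirty -= writes   # overwritten by an unaffected cell
    return sorted(stale)

cells = [
    'import pandas as pd',
    'df = pd.read_csv("really_big_dataset.csv")',
    'test = df.iloc[:500]\ntrain = df.iloc[500:]',
    'model = train_linear_regression(train, "target")',
    'evaluate_linear_regression(model, test, "target")',
]

print(stale_cells(cells, 2))
```

The cells are only parsed, never executed, so the undefined training functions are harmless here.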

... but

...


  model = train_linear_regression(train, "target")


  evaluate_linear_regression(model, test, "target")


  df = pd.read_csv("another_really_big_dataset.csv")
  test = df.iloc[:500]
  train = df.iloc[500:]

df has changed!

We want to "snapshot" df in between cells.

The kernel runs, snapshots its variables, and quits.

Microkernel Notebooks

Lots of small "micro-kernels"
Explicit inter-cell messaging
Messages are snapshotted for re-use
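The three points above can be sketched in a few lines. This is not how Vizier is implemented: a fresh dictionary stands in for a separate kernel process, `pickle` stands in for the snapshot format, and `run_cell` with its `reads`/`writes` arguments is a hypothetical interface.

```python
import pickle

snapshots = {}   # variable name -> pickled value (the inter-cell "messages")

def run_cell(source, reads=(), writes=()):
    """Run one cell in a fresh namespace, then snapshot its declared outputs."""
    # Each cell gets private copies of exactly the inputs it asked for.
    scope = {name: pickle.loads(snapshots[name]) for name in reads}
    exec(source, scope)          # the micro-kernel's entire lifetime
    for name in writes:
        snapshots[name] = pickle.dumps(scope[name])

run_cell("data = [3, 1, 2]", writes=["data"])
run_cell("top = sorted(data)[:2]", reads=["data"], writes=["top"])

# A cell that mutates its private copy cannot disturb the snapshot.
run_cell("data.append(99)", reads=["data"])

print(pickle.loads(snapshots["data"]), pickle.loads(snapshots["top"]))
```

Because every read goes through a snapshot, a later cell that reassigns or mutates `data` cannot change what an earlier cell saw, and unchanged snapshots can be re-used instead of recomputed.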

Demo

https://vizierdb.info

https://github.com/VizierDB/vizier-scala