ODIn: Online Data INteractions

Clangd

2026-02-05T00:00:00+00:00

TL;DR: To get clangd to run on Ubuntu 24.04 (or, in my case, PopOS), g++-14-dev (and more precisely, its dependency libstdc++-14-dev) needs to be installed.

When the C++ compiler is g++, there is an irritating interaction between CMake, clangd, and bear, where clangd may not be able to see libstdc++ header files. One of these tools (bear, I think... although possibly CMake) hardcodes paths to header files. Even though g++ can find headers, clangd won't be able to. To fix this, find a file on which clangd complains about opening files, and run:

clangd --check=path/to/file.cpp

This will dump out a bunch of output, one item of which will be the list of options being passed to g++.  In these, there should be an option like -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14.  The 14 there is the libstdc++ version.  For this, I had to install libstdc++-14-dev to get the appropriate header files.  I imagine in the future, this number will increase.
Future Oliver, or random internet person struggling to find a solution, you're welcome.



Fixing Anubis
2026-01-14T00:00:00+00:00
I finally tracked down (hopefully) the last of my Anubis issues: A stray 503 on the first page load after passing PoW check.
Before giving in and resorting to Anubis, I was trying to rate-limit directly in nginx.  I have no clue how this would turn into a 503, but disabling a stray limit directive that survived the transition seems to have fixed the problem.


NEDB Day 2026
2026-01-14T00:00:00+00:00
The ODIn Lab will be at NEDB_Day_2026 on Jan 16 with two posters!</p>
"Flow-centric Query Evaluation Pipelines" (Victoria, Andrew, Krishna) presents our preliminary efforts to make a scheduler-friendly query evaluation pipeline for our #Draupnir datalog engine.  The key insight behind our work is decoupling state from operators.  By making operators (mostly) stateless, we can inline better, and we can expose IO to the scheduler more efficiently.</p>
"Benchmarking Tabular Representation Models on Longitudinal Data" (Pratik) presents our work on data integration for longitudinal studies.  Longitudinal studies generate a slew of datasets that are almost alike, but not quite.  Coupled with the fact that attributes are identified by prose questions rather than simple identifiers, they aren't a great fit for existing data integration/unionability tools.  We'll specifically be presenting a benchmark, painstakingly adapted from the American National Election Survey, which shows that we need new data integration tools.</p>


Tricking Out My Console
2025-07-20T00:00:00+00:00
Over the past few months, I've gotten onto a bit of a console kick.  Consoles have gotten surprisingly capable since I last really mucked with my workflow, and more of my workflow now actually seems to be possible within the scope of a terminal emulator.  As always, this is a moving target, but here is a snapshot, as of Summer '25, of my efforts to trick out my console.</p>
The Terminal
A lot of the more advanced features of some of these apps require more advanced terminal emulators.  There's a wide range out there, but the two I've played with the most are KiTTY and GhosTTY.  Some relevant features of each (for me):

Support for the KiTTY graphics protocol
Tiling
Linux support
Extended color
Support for embedded URLs

I've been tending towards GhosTTY of late, largely, but not entirely, for aesthetic reasons:

GhosTTY lets me remove the header bar
GhosTTY uses standard GTK widgets
GhosTTY fades out the non-selected tiles, making the active one easier to spot.
GhosTTY handles drag & drop better than KiTTY; the latter pastes file:// URLs into the shell, while the former pastes standard paths.

In spite of the occasional glitch (KiTTY has been bulletproof in my experience), this is still enough that I tend to reach for GhosTTY first.
Bash
At some point, I should probably switch my shell to something cleaner, but I haven't been pushed into it yet.  I'm starting to push up against bash's rough edges, but for now it's sufficient, and omnipresent... meaning I can get consistent behavior everywhere.  That said, there are a few hacks:

Aliasing wl-copy and wl-paste to pbcopy and pbpaste (I used to use a mac, sosumi).
Aliasing xdg-open 2>/dev/null to open.
Aliasing ls --color=auto --hyperlink=auto to ls (the latter parameter makes listed items clickable in GhosTTY)
Powerline-Rust
Based on the Python powerline-shell, powerline provides pretty, informative output in your shell prompt, like the last command's exit status, git metadata, and a subset of the path.

I have powerline installed as my default shell prompt.
batcat
Bat is a drop-in replacement for cat and less/more.  It displays files with syntax highlighting, formatting, line numbers, and e.g., color-coding for CSV columns.  It auto-detects when it's connected to a tty, so it can fall back to classic cat behavior the rest of the time.

I aliased bat to cat, and have more/less retrained my muscle memory to reach for it instead of more/less.
fzf
Fuzzy find is a useful set of tricks for finding items in lists based on an interactive partial substring match.  This is a building block used by other services, and I expect that I'll revisit it to do stuff like e.g., building a console frontend to Pop! shell launcher services, but for now my main uses are:

Overriding bash's normal ctrl-r reverse history search to be a bit more friendly.
Bound to ctrl-t in bash to search for file path completions.
Zoxide
Zoxide builds on fzf to provide interactive path search.  Notably, it maintains a history of directory paths that you've visited, and will use that to filter its search.

Bound to alt-c to find and jump to an arbitrary path in my home dir.
broot
I recently started playing with broot; it fills a niche similar to zoxide, but provides a more interactive file exploration mode.  Notably, it can do things like preview files (including image files using KiTTY graphics).

Bound to alt-x to explore the current directory.
btop
Top, but hecka pretty.  'nuff said.

Aliased to top.
lazygit
My attitude towards git is that I know the 2-3 commands that I care about, and I use a gui for the rest.  This terminal-based git gui actually seems like it might punt me entirely into a gui.

Aliased to lgit
difftastic
This is a diff engine based on tree-sitter.  Super useful for diffing source code, since it can ignore whitespace.

I have git configured to use this as a diff engine.
jq
In principle, this is a general purpose query engine for json data... but 90% of my use of it is for pretty-printing json.
Ripdrag
In spite of me moving a lot of my workflow into the console, there are times when I need to interact with graphical apps.  GhosTTY already has a number of features that make this process more friendly, like the ability to click URLs and embedded URLs.  However, sometimes the only way to transfer context into an app is via dragging (e.g., adding an attachment in Thunderbird).  For these situations, there's ripdrag.  Running the command pops up a window with every file passed on the command line, allowing them to be dragged into other apps.
Helix
I'm still a bit on the fence, but Helix seems to be a solid LSP-based editor.  It has an interface scheme based on VI, which is taking a little getting used to... but at least I don't have to deal with lisp.


Minnowbrook / Aggregation Trees
2025-07-15T00:00:00+00:00
As part of the price of attendance at Kris Micinski's Minnowbrook Datalog Seminar, he asked everyone to produce a blog post.  It's now over two months since the seminar, and I'm out of excuses to produce something, so here goes a random thought that came out of a discussion with Thomas Gilray: Heaps are just specialized Aggregation Trees.</p>
The legend of Max and the DBToaster
First, a bit of background.  Our story starts about 14 years ago, with a system called DBToaster.  DBToaster is what's called an incremental view maintenance system.  That is, you give it one or more queries, and it produces some highly optimized code that will dynamically recompute the results of those queries as the inputs change.  What makes the system incremental is that instead of recomputing each query from scratch, every time the input changes, it only needs to recompute how the result changes.
As a very simplified example, say we have a collection of integers (R = {| 1, 3, 2, 1 |}).  The sum of this collection (SUM(R)) is 7.  Now say we insert a new number into the collection: 3.  We have two options:

1. Insert 3 into R and then compute SUM(R) = 1 + 3 + 2 + 1 + 3.
2. Compute the update SUM(R) += 3 and then insert 3 into R.

The latter operation is an option because addition (over the integers, reals, etc...) is a commutative, associative operation.  It's also far far far faster, since we don't need to revisit each and every element of R to compute an updated value.  The neat thing is that this still works when we try to delete an element.  For example, if we want to remove the 2, we again have 2 options:

1. Remove the 2 from R and then compute SUM(R) = 1 + 3 + 1 + 3.
2. Compute the update SUM(R) -= 2 and then remove 2 from R.

This works because addition (over the integers, etc...) has an inverse operation.  That is, for every A, there's a value -A such that A + B + (-A) = B.  This is really neat, because it means that we can store only the sum, and don't need any other additional information to maintain SUM(R) as R changes.  The problem is that, while this works for addition-based aggregates (e.g. SUM, COUNT, AVERAGE), not every aggregate has an inverse.  For example, in general, given the value max(A, B), there's nothing you can combine via max to recover B after "removing" A.
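Here is a minimal Rust sketch of the delta-based option from each list above. The SumView type and its method names are hypothetical, made up for illustration; this is not the code DBToaster generates.

```rust
/// Hypothetical incrementally maintained SUM(R) view (illustrative names only).
/// Only the running sum is stored; the collection R itself is not needed.
#[derive(Debug, Default)]
pub struct SumView {
    sum: i64,
}

impl SumView {
    /// Initialize the view by scanning the collection once.
    pub fn from_collection(items: &[i64]) -> Self {
        SumView { sum: items.iter().sum() }
    }

    /// Delta for inserting `x` into R: SUM(R) += x.
    pub fn insert(&mut self, x: i64) {
        self.sum += x;
    }

    /// Delta for deleting `x` from R, relying on the additive inverse: SUM(R) -= x.
    pub fn delete(&mut self, x: i64) {
        self.sum -= x;
    }

    /// Current value of SUM(R), read in constant time.
    pub fn value(&self) -> i64 {
        self.sum
    }
}
```

Starting from SumView::from_collection(&[1, 3, 2, 1]), calling insert(3) and then delete(2) walks through the same updates as the example, without ever revisiting the elements of R.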
The typical 'fix' for this problem is to maintain more information.  Specifically, for max (or min) we could maintain the entire collection of integers in R so that we could remove individual elements.  However, every time the underlying collection changes, we need to recompute the max value, something that, in general, would take us O(|R|) work (one 'step' per element in R).  That is... not ideal.
The Heap
Fortunately, there's a data structure that works out well with max (and min): The Heap.  A heap is a neat little tree-shaped data structure (there are some fancy ways of encoding a heap into an array that I won't get into here).  A heap stores a collection, with each node of the tree storing exactly one collection element (repetition allowed).  There are a few quirks to how a heap is constructed, but the basic requirement is that the value stored at every node of the tree is greater than or equal to the values of all of its descendants.  For example, the following heap stores the collection {| 1, 1, 2, 2, 3, 5 |}:
     5
   /   \
  2     3
 / \   /
1   2 1
Notice how every node's value is at least as large as that of the nodes below it.  Now, the key trick of the heap is that:

You can always read out the greatest element of the collection in constant time by looking at the root of the tree.
If you remove an element from the collection, you can restore the heap in a logarithmic (O(log |R|)) number of steps.
You can insert an element into the heap in a logarithmic (O(log |R|)) number of steps.

Note that this is much better than having to perform a linear number of steps to update the maximum value naively.
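As a rough illustration of those costs, here is a sketch using the Rust standard library's BinaryHeap, which is a max-heap. The example_heap helper is a made-up name. One caveat: BinaryHeap only removes the greatest element (pop); removing an arbitrary element in logarithmic time needs a heap that tracks element positions, which this sketch does not attempt.

```rust
use std::collections::BinaryHeap;

/// Build a max-heap holding the example collection {| 1, 1, 2, 2, 3, 5 |}.
/// (`example_heap` is a hypothetical helper, named for illustration.)
pub fn example_heap() -> BinaryHeap<i64> {
    let mut heap = BinaryHeap::new();
    for x in [1, 1, 2, 2, 3, 5] {
        // Each push restores the heap property in O(log n) steps.
        heap.push(x);
    }
    heap
}
```

With this heap, peek() reads the current maximum from the root in constant time, while push(x) and pop() each take a logarithmic number of steps.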
Getting back to DBToaster, about a decade ago, I was thinking about how to add support for max, min, and more.  Obviously we could use a heap structure to solve this problem, but a heap only gets us max (and min).  I was racking my brain looking for ways to generalize the heap to other aggregates.  It wasn't until a chat with Thomas Gilray that the solution hit me: Aggregation trees.
Aggregation Trees
An aggregation tree (which goes by other names as well) is a tree-shaped data structure that encodes a collection.  In an aggregation tree, (i) every leaf node stores an element of the collection; and (ii) every inner node stores the aggregate value of the collection elements in its subtree.  Using our example heap collection, and the SUM aggregate, we might get the tree:
        14
      /    \
    10      \
   /   \     \
  3     7     4
 / \   / \   / \
1   2 2   5 3   1
As with the heap (assuming we use some balanced tree):

We can read out the overall aggregate value in a constant number of steps from the root.
We can insert new elements into the tree and update all inner nodes in a logarithmic number of steps.
We can remove elements from the tree and update all inner nodes in a logarithmic number of steps.
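One way to sketch such a tree in Rust is as a complete binary tree packed into a Vec, with a fixed number of leaf slots. Everything here (the AggTree name, the slot-based interface, representing an empty slot by the aggregate's identity value) is an assumption made for illustration, not a description of any existing implementation.

```rust
/// Minimal aggregation tree sketch (hypothetical type, for illustration).
/// Leaves hold collection elements, or the identity for empty slots;
/// each inner node holds the aggregate of its two children.
pub struct AggTree<T, F> {
    leaves: usize, // number of leaf slots, rounded up to a power of two
    nodes: Vec<T>, // nodes[1] is the root; nodes[leaves..] are the leaves
    identity: T,   // identity of the aggregate (0 for SUM, i64::MIN for max)
    combine: F,    // the aggregate's binary operation
}

impl<T: Copy, F: Fn(T, T) -> T> AggTree<T, F> {
    /// Create a tree with room for at least `capacity` elements.
    pub fn new(capacity: usize, identity: T, combine: F) -> Self {
        let leaves = capacity.max(1).next_power_of_two();
        AggTree { leaves, nodes: vec![identity; 2 * leaves], identity, combine }
    }

    /// Store `value` in leaf `slot`, then recompute every ancestor.
    /// The walk to the root takes O(log n) steps. Panics if `slot` is out of range.
    pub fn set(&mut self, slot: usize, value: T) {
        let mut i = self.leaves + slot;
        self.nodes[i] = value;
        while i > 1 {
            i /= 2;
            self.nodes[i] = (self.combine)(self.nodes[2 * i], self.nodes[2 * i + 1]);
        }
    }

    /// Remove the element in `slot` by resetting the leaf to the identity.
    pub fn clear(&mut self, slot: usize) {
        let id = self.identity;
        self.set(slot, id);
    }

    /// Read the aggregate over the whole collection from the root in O(1).
    pub fn aggregate(&self) -> T {
        self.nodes[1]
    }
}
```

For SUM, the tree is built as AggTree::new(6, 0, |a, b| a + b) and filled with set(slot, value). Note that removal never needs an inverse operation: the ancestors of the cleared leaf are simply recomputed from their children, so the same structure also works with a combine function like max.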
The Insight
Now let's see what happens when we use an aggregation tree with the max aggregate.
         5
      /    \
     5      \
   /   \     \
  2     5     3
 / \   / \   / \
1   2 2   5 3   1
If you look closely, each inner node stores an exact copy of one of the leaf nodes.  This is a bit redundant.  What if we just stored one copy of each value, at the highest point in the tree where it appears:
     5
   /   \
  2     3
 / \   /
1   2 1
Hey!  We're back to the heap!
That's it.  That's the magical insight: A heap is just a highly specialized aggregation tree.  To be precise,

An aggregation tree can be used for any commutative monoid
If the commutative monoid is a group, we only need the root element
If the commutative monoid is absorptive, we can use a heap.

Until next time!


FastPDB
2025-06-01T00:00:00+00:00
Aaron Huber presented his paper on FastPDB at this year's SIGMOD, last week.  Here's a quick rundown, if you missed it!
Probabilistic databases, for those of you who may not have heard of them, are a type of database system that can cope with data that is not precisely defined.  For example, a typical Amtrak train schedule might say

"Departs at 5:00"

If that schedule were stored in a probabilistic database, you'd instead see

"Departs at 5:00" 50%
"Departs at 5:10" 10%
"Departs at 5:20" 10%
...
"Departs at 10:00" 5%

Why might you possibly want this?  Well, it turns out that most data sucks.  Unfortunately, people implicitly trust data that pops out of a computer, and if we're going to have any hope of changing that, we need the computer to be able to communicate uncertainty about information.  That's where probabilistic databases shine, because they can give you answers couched in established statistics.  For example, the database might be able to tell me that I have a 70% chance of leaving by 5:20.</p>
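To make the "70% chance of leaving by 5:20" figure concrete, here is a small Rust sketch that sums the probabilities of every possible departure time up to a deadline. The Time alias, the percent_chance_by function, and the use of whole-number percentages are assumptions made for illustration; a real probabilistic database would not store or query data this way.

```rust
/// A departure time as (hour, minute). Hypothetical representation.
pub type Time = (u32, u32);

/// Given possible departure times with their probabilities (in whole percent),
/// return the chance (in percent) of departing at or before `deadline`.
pub fn percent_chance_by(dist: &[(Time, u32)], deadline: Time) -> u32 {
    dist.iter()
        // Tuples compare lexicographically: hour first, then minute.
        .filter(|(t, _)| *t <= deadline)
        .map(|(_, p)| *p)
        .sum()
}
```

Feeding in just the three rows spelled out in the schedule above (5:00 at 50%, 5:10 at 10%, 5:20 at 10%) with a deadline of (5, 20) adds up to 70; the elided rows all fall after 5:20, so they would not change that answer.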
Unfortunately, probabilistic databases are... not ideal.  Even leaving aside the difficulty of getting meaningful data into the system, or helping users to understand results framed as probabilities (both projects ODIn students have worked on), there's the foundational problem of performance.  Probabilistic databases, even ones heavily optimized for speed, tend to be orders of magnitude slower than regular deterministic databases.</p>
FastPDB represents the first steps towards fixing this problem.  Aaron first asked whether there was something fundamental about probabilistic databases that made them slower.  Probabilistic databases that work with sets are known to be slow (#P runtime complexity), but the general consensus has been that bag probabilistic databases are "fast" (P-time).  However, it turns out that in general, the extra power you get from managing probabilities does come at a cost: Bag Probabilistic Databases (modulo the validity of several well-established complexity assumptions) scale super-linearly in the runtime of deterministic queries.</p>
Given this result, the logical next step was to build an approximation algorithm.  Typical probabilistic algorithms operate in two phases: first generating provenance for the query, and then analyzing the provenance to cope with correlated variables, which make it difficult to efficiently produce query results.  Again, looking at the complexity of different approaches, we settled on a strategy based on Provenance Circuits, and showed that the complexity bottleneck was in sampling from the circuit, rather than constructing it.  Aaron implemented an algorithm for estimating the expected count of a bag probabilistic query result over such a circuit, and it was off to the experimental benchmarks...</p>
... where the entire process failed.  Complexity-wise, building a provenance circuit is fast... but it has a really big constant factor.  So we started exploring for other options, and wound up making the observation that the structure of our sampling algorithm very closely mirrored the structure of an approximate query processing algorithm called WanderJoin.  Indeed, sampling directly from the circuit was no different from sampling directly from the query result using WanderJoin.  So Aaron implemented a simple query-query transpiler (using GProM) over XDB (the implementation of WanderJoin) that can produce results faster than Postgres can produce deterministic query results!
In short, we've developed the world's first Fast Probabilistic Database!
Aaron is presently on the market!  You should hire him.


CURE-C STTR Phase 2
2025-03-28T00:00:00+00:00
NASA has funded our STTR Phase 2 Project, CURE-C, developed in collaboration with XAnalytix Systems and Breadcrumb Analytics.
This project will develop tools to assist human analysts in integrating incomplete, inconsistent, or otherwise uncertain multi-modal data.  The key challenge in this space is that most existing approaches to managing uncertain data assume an accurate model of uncertainty, the development of which can be a very substantial undertaking.  The CURE-C project will streamline the process of fusing multi-modal sensor data into an accurate, self-consistent, uncertainty-aware model over which subsequent reasoning is possible.  Our initial target will be workflows that build datasets suitable for route planning for extra-terrestrial autonomous vehicles.</p>


PL/DB Sp 2024
2024-05-09T00:00:00+00:00
With a talk from Manos Athanassoulis earlier this week, we've wrapped up another semester of the PL/DB seminar here at UB.  We had a really fantastic lineup this year, including five guest speakers (Jelle Hellings, Hannah Gommerstadt, Ryan Kavanagh, Boris Glavic, and Manos).
Talks this semester spanned a range of different subjects, from distributed programming models, to indexing and data access methods, query processing, compiler optimization, and provenance.  On the one hand, it's amazing to see such a diverse range of topics represented.  On the other, it was also nifty to see students from across the board engaging with all of the speakers (student or otherwise).
Major props to Andrew Hirsch, who is more/less single-handedly responsible for reviving and bringing new life into the seminar.


ODIn @ HILDA '24
2024-05-07T00:00:00+00:00
'Grats to Pratik and Juseung on their #HILDA2024 accept for "Drag, Drop, Merge: A Tool for Streamlining Integration of Longitudinal Survey Instruments", which explores schema integration in longitudinal studies.
Longitudinal surveys, and specifically social sciences data collected through survey forms, are a really interesting case of schema integration.</p>
The data being collected is, on the most fundamental level, about only a single class of entity.
However, each year brings new knowledge, and new context to the survey, necessitating changes.
For example, researchers might learn that the culture of the study population uses different names in different social contexts, necessitating a change to the survey to clarify the social context of the name being recorded.
Alternatively, researchers might adapt a choice of phrasing like "how many of your family members live nearby" into "how many people are in your support network" to better address the nuanced situations.
Even without changes to the survey itself, changing context can result in changing interpretations of participant answers.  For example, take a multiple-choice question about income levels.  A single answer at the start of a 20-year study may indicate a wildly different socioeconomic status than the exact same answer given in the last year of the study.
The problem of integrating many years of forms is fundamentally similar to data integration, but is in some ways easier (there are few changes between successive years), and in some ways harder (there are many such changes over the lifetime of the survey).  Changes are also nuanced, with growing levels of divergence.
The paper lays the groundwork for a tool to help researchers conducting longitudinal studies to prepare their data for publication, and for researchers trying to use this study data to reliably develop derived, 'clean' datasets useful for the needs of their specific study.</p>
Side Note: This paper is the result of a massively interdisciplinary collaboration between CS, Linguistics, Medicine, Stats (and soon-to-be Environmental Health).  I'm really excited that we've hit on an opportunity to develop techniques that will benefit such a diverse range of fields of study.


Rust...
2024-05-01T00:00:00+00:00
I've been dinking around with Rust on a semi-serious basis for the past several months. I won't say that I've had enough time to form a fully educated opinion, but I do feel like I've gotten the general shape of the language... certainly enough to have at least a preliminary opinion.  So... here goes nothing.</p>
Background
I come from the school of thought in system-building where I trust the compiler wholeheartedly.  Refactoring a large codebase is far more pleasant in Scala than in Python... and the latter becomes far more pleasant with type annotations.  I appreciate it when the compiler yells at me for changing a reference but not the referent.  I'm willing to tolerate the compiler's pedantry, because ultimately it pays off in the long run.
In short, I'm the type of person Rust is aimed at.</p>
The Good
I've honestly had a ton of fun writing Rust.  Not quite Ruby (or Frontier) levels of fun, but it's a very snappy language with good developer tooling.  Rust's Struct/Trait model is, I think, much cleaner than normal object inheritance for most things that I typically use a class hierarchy for.  And the borrow checker does a really good job of forcing me to think through memory management issues that I'd normally wave off.
Developer Tooling
Rust's developer tooling is flat out the cleanest that I've seen in a long time.  The Rust compiler does an excellent job of providing context for error messages.  There's a lot of 'did you mean...' types of suggestions, and those are directly integrated into the LSP.  Now, getting the LSP properly configured with Sublime took a bit of time/effort, but once I did, everything just works: Easy access to docs, to suggested implementations, and to error messages.  Rust is incredibly pedantic about code artifacts, but the tooling does a lot to prevent that pedantry from becoming a barrier to authoring those artifacts.
Compile Times
My last major project relied on SBT.  'nuff said.</p>
Documentation
cargo doc is up there with JavaDoc and ScalaDoc.  Plus, doctests (which cargo test compiles and runs) keep code snippets up-to-date, and cargo doc synthesizes documentation for all dependencies, which makes the documentation ecosystem pretty nice.
The Bad
Rust is Judgy
Most languages optimize for the typical use case.  For example, managed languages like Java just put everything on the heap, and the user doesn't have to think about whether the size of an object can be predicted at compile time.  Rust, meanwhile, puts that flexibility behind an explicit Box type (resp., Rc, Arc, etc...).
In short, Rust, the language, sits there, judgmentally eying you every time you try to do something that would just be the default in any other language.
It took a while to recognize this fact, and just start using String, Box, and the like without worrying about whether there's a 'cleaner' solution that can avoid them.
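As a small illustration of where Box becomes unavoidable, here is a sketch of a recursive type. The List type and sum function are made-up names for this example. Without the Box, the compiler rejects the definition, since a List containing a List directly would have no finite size at compile time.

```rust
/// A singly linked list of integers (hypothetical example type).
/// `Box<List>` stores the tail on the heap, so `Cons` has a known, fixed size.
pub enum List {
    Nil,
    Cons(i32, Box<List>),
}

/// Add up every element of the list.
pub fn sum(list: &List) -> i32 {
    match list {
        List::Nil => 0,
        // `tail` is a `&Box<List>`, which coerces to the `&List` that `sum` expects.
        List::Cons(head, tail) => head + sum(tail),
    }
}
```

In a managed language the equivalent class would need no annotation at all, since every object reference already points to the heap.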
Compiler Magic is the Default
A lot of idiomatic Rust leverages non-obvious compiler trickery.  As a simple example, let's say you want to map a function that can return a Result.  If the map returns an Err on any of the inputs, we'd like to return an Err from the enclosing function, something that would normally be possible with the ? operator.  Inside the closure passed to map, though, ? returns from the closure rather than from the enclosing function.
The fact that this doesn't work makes sense when you realize that Iterator is lazy.  However, it turns out that there is an idiomatic way to implement the same idea:
let foo = ...;  // some Iterator<Item = Result<T, E>>
let bar: Vec<T> = foo.collect::<Result<Vec<T>, E>>()?;
The collect method leverages Rust's compositional From/FromIterator traits to factor the Result out of the collection when generating it.
All of the individual pieces are documented.  The rules by which they are composed are obvious.  However, the fact that this is the idiomatic way to get a Result out of a map is not at all obvious from any of the simple documentation.
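For a concrete version of that idiom, here is a small self-contained sketch that parses a list of strings into integers. The parse_all function is a made-up example; the standard-library pieces it relies on are str::parse, Iterator::collect, and the FromIterator implementation for Result.

```rust
use std::num::ParseIntError;

/// Parse every input as an i32, failing with the first parse error.
/// (`parse_all` is a hypothetical helper, named for illustration.)
pub fn parse_all(inputs: &[&str]) -> Result<Vec<i32>, ParseIntError> {
    let parsed: Vec<i32> = inputs
        .iter()
        // Each element becomes a Result<i32, ParseIntError>.
        .map(|s| s.parse::<i32>())
        // Collecting into Result<Vec<_>, _> stops at the first Err,
        // and `?` then returns that Err from `parse_all` itself.
        .collect::<Result<Vec<i32>, _>>()?;
    Ok(parsed)
}
```

Calling parse_all(&["1", "2", "3"]) yields the parsed vector wrapped in Ok, while any unparseable entry turns the whole result into an Err.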
In other languages (e.g., Scala, Ruby), there's a bevy of methods covering various specific, common use cases.  These are far less flexible than collect(), but are far more discoverable.


Librem5 and Mobile Linux
2023-04-27T00:00:00+00:00
It's been somewhere around a year with my Librem 5, and something like 3-4 daily driving a linux phone, so I figure it's time to do a quick recap of where things stand.  My experience has been almost exclusively through Phosh (Mobian and then PureOS), so it's worth noting that UBPorts, Plasma Mobile, or SXMO might offer a better experience.  There's also a new Gnome Mobile on the horizon, but that seems to share a lot with Phosh.  Additionally, I've heard that the OnePlus 6 (which appears to have solid PostmarketOS support) may address some of the hardware limitations I currently face.</p>
The Big Picture
Compared to where the ecosystem was when I got my original Pinephone, there's been a mountain of progress.  Progress continues, with major announcements every month or two.  My current configuration has reached a point where I'm entirely happy with my phone.  That said, I definitely would not recommend my setup to a non-technical person, and would still be concerned that it would be too many papercuts even for someone who knows what they're doing.  Still, things are headed in the right direction.</p>
Day-To-Day Usage
Calls
My PPP still has issues with audio (calls or otherwise), but the Librem5 has been solid in this respect.  Phone calls, speakerphone, all work well.  It even seems to generally do the right thing when the phone is docked, picking the dock speakers over the internals for speakerphone.  There's no built-in spam detection, which is a drawback, but I gather there are some external tools that provide it as an add-in.  In short, it works as a phone.</p>
Visual voicemail is also an option, but does require additional work on the user's part to install vvmd, and look up your carrier-specific visual voicemail settings.  There's also an ongoing issue with T-Mobile, where the IMAP servers (yep, that IMAP) that host their VVM data are misconfigured.  A fix is currently making its way downstream.
Text Chat
Chatty is a decent SMS app, with support for MMS.  There are a few client-side lag spikes here and there, usually when trying to load a chat with a large number of pictures, but overall it's usable.
My other major chat protocol is Matrix.  Element, in principle, runs, but doesn't offer an Electron build for arm64.  In principle Chatty supports it, but support is buggy.  FluffyChat and Fractal-Next are both available, but the latter is still in development, and the former has an ongoing bug involving GPU access, Flatpak, and Flutter.  That all being said, I recently revisited Nheko, and have so far been really happy with it.  The UI is a bit quirky, but it works.
Web Browsing
Firefox has been usable since the beginning, with solid Arm64 support.  Gnome Web is getting better, but still ends up being jittery on pages that FF loads without difficulty.  My one gripe with Firefox is that, although there's a mobile-friendly stylesheet applied to the interface, it's still a bit buggy.  Extension menus, for example, are unusable.  Still, I haven't had any difficulty browsing web pages.</p>
Alarm Clocks
My Librem has been waking me up reliably since September.  There were some hiccups early on with my Pinephone [Pro], where Gnome Clocks wouldn't wake the phone from suspend.  This was fixed a few kernel revisions ago, and now I'd trust both platforms to wake me up.</p>
Music
There are a ton of Phosh-friendly music players.  I used Lollypop for a while, and was pretty happy with it.  Recently, Cantata has taken over (since I can also use it as a remote for my home stereo).  Amberol is also quite nice (but leaves it up to you to organize your music).</p>
Data Sync
SyncThing keeps most of my stuff synced between computers.  No question about it, SyncThing is awesome.  It fills the void left for me when Dropbox and I went our own separate ways.  Most of the time, I don't even realize that it's running... it just works.
A few things (notably stuff I want to share) get stored in a Nextcloud instance.  PureOS and Mobian both benefit from solid Nextcloud integration into the Gnome ecosystem here.  Nextcloud syncs Contacts, Calendars, Todos, and gives me quick access to my shared files.
Also a quick shout out for Valent.  This is an implementation of the KDEConnect protocol that works with Gnome and Phosh.  It integrates a lot of capabilities between devices.  For example, with it running on two linked devices, media keys on one device will control a player on the other device.  Similarly, if you open a web page or sms:/tel: link on one device, you can relay the link to the other device.</p>
Tickets / Club Cards
Quick shoutout to an excellent plugin for phosh that lets you make a bunch of PDF files accessible from the lock screen.  I use this for a few frequent shopper cards, as well as tickets with QR codes.  Bonus: I can keep the directory synced with SyncThing.

Mostly There#</span></a>
</h3>
Mobile Connectivity#</span></a>
</h5>
LTE connectivity has improved massively since I started.  Both the PP[P] and Librem5 used to have issues where the modem would spontaneously disconnect.  With a recent kernel update, the Librem5 no longer has this issue at all.</p>
Battery Life#</span></a>
</h5>
The PureOS team has been hard at work on stretching the Librem's battery life.  I can get a realistic 6 hours out of it now, which is enough that I don't feel like I need to keep the phone charging at all times.  The PPP has 10+ hours of standby by comparison; it drains faster when in-use, but suspend is still an "in-development" feature on the Librem 5. The iMX8m also has a few anti-features that make power-management more difficult (e.g., it doesn't seem to have support for low power core idling).  Still, once suspend lands, I expect this to stop being a concern.</p>
Navigation#</span></a>
</h5>
PureMaps is great.  Navigation, directions, the whole bag.  GPS, on the other hand, is still a work in progress on the Librem5.  There seem to be a few minor glitches in how the modem communicates GPS data back to the phone, so whether the phone gets a GPS lock on a given boot is a bit of a coin flip.  I haven't tried it in a while, but the Pinephone Pro with Mobian seems to be better about GPS.</p>
Camera#</span></a>
</h5>
The camera app (Millipixels/Megapixels) is nice, and getting better.  The PureOS folks just landed a self-tuning feature to the camera itself.  The only reason I have 'camera' in the Mostly There column is because the camera hardware isn't exposed to userspace as a camera device.  Millipixels can talk to the raw camera interface, but most other apps (e.g., video conferencing apps, web browsers, etc...) don't see a camera yet.  Still, this is something their folks are hard at work on.</p>
Calendaring#</span></a>
</h5>
Gnome Calendar has some issues around time zone support.  It's also not fully mobile-friendly yet.  That said, there's an excellent Phosh plugin</a> that puts a list of upcoming events on the lock screen.</p>
Email#</span></a>
</h5>
Geary is a reasonable email client.  However, my org disables IMAP in Office365.  That limits me to Evolution or Thunderbird, neither of which have mobile-friendly interfaces.  I could get by with personal email, but thanks to a spate of spearphishing scams that my org gets regularly, signatures (which Geary doesn't support) are also a must.</p>
Not There Yet#</span></a>
</h3>
Convergence#</span></a>
</h5>
One of the biggest selling points of linux phones is the fact that they can be plugged into an external display/keyboard/mouse and run desktop linux apps.  The biggest win here is that you, in principle, only need one computing device.  Better still, you get continuity of state between interaction modalities:  You can just grab your phone out of the dock and continue editing or reading while e.g., riding the bus.  You can load a movie on your phone and then just connect to a projector.</p>
SyncThing and Valent subsume a lot of these needs by allowing you to migrate state between devices dedicated to a specific interaction modality (files, and browser state, respectively).  Valent in particular offers an interesting (if limited) form of convergence, by exporting device-specific capabilities (SMS, Calls, notifications, browser tabs, etc...) so that they can be accessed from a different device when the two are nearby.  Still, it would be nice to reach a point where I can just start typing a text and then plug in a big screen and keyboard.</p>
That all being said, I have occasionally tried to use my phone as a desktop replacement, and it has not stuck. I know several people who are successfully convergencing, but all of them have workflows built around the terminal.  I can see that working.  Unfortunately, the phone's 3GB of RAM and internal GPU struggle to keep up with the workflows I've gotten used to on the desktop.  For example, editing a LaTeX doc in SublimeText and compiling it is not unpleasant, but not as smooth as it is on my existing laptop or desktop.  I likely would not use the phone for anything more significant.  I gather from folks with a OnePlus 6 that the convergence experience is much better.  While I expect a motherboard upgrade will be required first, I fully expect that we'll get to this form of convergence eventually.</p>
What I'm less sure about is the Laptop/Tablet convergence story.  The NexDock2 is a wonderful piece of hardware, but has some critical ergonomic limitations.  First and foremost, the NexDock requires actual thought to set up.  I have to pull it out, open the display, swing out the phone mount arm, attach the phone, dig out the USB-C cable, plug it in, boot up the dock, and then power on the phone.  By comparison, with a regular laptop, I just pull it out.  Second, mounting the phone on the dock destabilizes it.  The Librem5's weight on the display destabilizes the friction-based hinges, making it difficult to use on a lap or in a moving vehicle.  The device is decent in tablet mode, if a bit top-heavy due to the phone's weight.  Honestly, these would all be reasonable trade-offs... but the big issue is that the NexDock is not</em> lighter than a normal laptop.   In fact, it's heavier than the Librem14, and about the same weight as my Framework.  There really isn't a compelling reason for me to cart the NexDock around at the moment.</p>
Most of the weight is in the battery and display, so I doubt there's a lot of room for weight savings. Rather, I think what needs to happen is someone figuring out the ergonomics of setting up and putting away the "laptop."  Even just getting rid of the cable would be huge.  One thought: Back in the mid 10s, Asus had a line of convergent phones called the PadFone.  Here, you would put the phone into a slot in the back of the screen.  You lose the phone's screen, but there's no faffing about looking for a display-compatible USB cable in your bag.</p>


View Serializability
2022-03-23T00:00:00+00:00
I'm a bit ashamed that it took me as long as it did to have this realization — after about six years of teaching databases, inspiration finally struck in the middle of answering a student's question.
It really is a subtle point, and a potential disconnect between folks teaching databases (or at least all of the database textbooks) and folks who actually use databases in practice.</p>
TL;DR: What virtually all database textbooks call view serializability is not</strong> what virtually all databases call view serializability.</p>
Background: Serializability#</span></a>
</h4>
First a quick bit of background: Programs are sequences of steps.  Often though, not all of those steps have to happen in sequential order.  Take the following pseudo-python code:</p>
a = f(1, n)
</span>b = f(2, n)
</span>print(f"The result is {g(a, b)}")
</span></code></pre>
We can't run line 3 until after we're done with lines 1 and 2, but we don't really care about the order in which lines 1 and 2 happen [1].  The following program should produce exactly the same result:</p>
b = f(2, n)
</span>a = f(1, n)
</span>print(f"The result is {g(a, b)}")
</span></code></pre>
When we say this reordering preserves "serializability", it just means that the end result is exactly the same</em>: There's no externally visible consequence to changing the order of lines 1 and 2.</p>
This style of reasoning is super helpful when we want to develop parallel code (or systems that run code in parallel).  If we can prove to ourselves that it doesn't matter what order we run two specific steps in, it means that we can safely run those steps in parallel.</p>
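As a rough illustration of that last point, here's a minimal sketch that runs lines 1 and 2 in parallel and checks that line 3 sees the same thing either way.  The bodies of f and g are made-up, side-effect-free stand-ins (the original snippet never defines them), and the thread pool is just one convenient way to get parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def f(i, n):
    # Hypothetical stand-in for f: pure, no side effects
    return i * n


def g(a, b):
    # Hypothetical stand-in for g
    return a + b


n = 10

# Sequential order, as in the original program
a = f(1, n)
b = f(2, n)
sequential = g(a, b)

# Lines 1 and 2 submitted to run in parallel; line 3 waits for both
with ThreadPoolExecutor(max_workers=2) as pool:
    future_a = pool.submit(f, 1, n)
    future_b = pool.submit(f, 2, n)
    parallel = g(future_a.result(), future_b.result())

print(f"The result is {parallel}")
print(sequential == parallel)  # True
```

This only works because the stand-in f has no side effects; see footnote 1.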
Alternatively, if we come up with some scheme that allows two programs to control how their steps are ordered, we can prove to ourselves that this scheme is correct by proving that any order of operations that the scheme would allow is guaranteed to produce an output that is exactly the same</strong> as the correct order we started with [2].</p>
Unfortunately "exactly the same" is a misleadingly hard-to-pin-down phrase.  Clearly, there will be differences.  Even if the values of variables a and b are identical, changing the order of execution will affect subtle things like where in memory a and b live, the value of a.id</code> and b.id</code>, and other minor (and generally inconsequential) details.  This makes it difficult to use "serializability" for any sort of proof.</p>
So, when trying to prove correctness of some concurrency scheme, folks tend to come up with more precise specifications.  For example, a common form of serializability called "conflict serializability" says that we're allowed to reorder any two lines of code (call them A</strong> and B</strong>) as long as:</p>

A</strong> does not read from a variable that B</strong> writes to.</li>
B</strong> does not read from a variable that A</strong> writes to.</li>
A</strong> and B</strong> do not write to the same variable.
If two programs differ only in reorderings that respect the above rules, we call them conflict equivalent</strong>.</li>
</ol>
In our example above, all three conditions are satisfied.  Line 1 reads from f</code> and n</code> and writes to a</code>, while line 2 reads from the same variables and writes to b</code>.

Reordering lines 1 and 2 in the example above preserves conflict equivalence.</p>
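The three rules are mechanical enough to check in a few lines.  Here's a minimal sketch, assuming each line of code is summarized by hand as a set of variables it reads and a set it writes; the Step class and can_swap function are names I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One line of code, summarized by the variables it reads and writes."""
    name: str
    reads: frozenset
    writes: frozenset


def can_swap(a, b):
    """True if reordering steps a and b respects all three rules."""
    rule1 = not (a.reads & b.writes)   # A does not read what B writes
    rule2 = not (b.reads & a.writes)   # B does not read what A writes
    rule3 = not (a.writes & b.writes)  # A and B do not write the same variable
    return rule1 and rule2 and rule3


line1 = Step("a = f(1, n)", frozenset({"f", "n"}), frozenset({"a"}))
line2 = Step("b = f(2, n)", frozenset({"f", "n"}), frozenset({"b"}))
line3 = Step("print(g(a, b))", frozenset({"g", "a", "b"}), frozenset())

print(can_swap(line1, line2))  # True
print(can_swap(line2, line3))  # False
```

Lines 1 and 2 can trade places, but line 3 reads b, which line 2 writes, so those two are stuck in order.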
As it turns out, conflict serializability is a pretty powerful tool.  You can prove to yourself that 2-phase locking</a> and optimistic concurrency control</a>, two of the most fundamental forms of concurrency control in databases, both guarantee that any sequence of steps that they allow must</strong> preserve conflict serializability [2].</p>
Both of these schemes, however, are a bit expensive.

2-phase locking is really</strong> conservative and often acquires locks it doesn't need to.
Conversely, optimistic concurrency control requires (i) the ability to do copy-on-write, and (ii) an expensive conflict checking and merge step.</p>
View Serializability#</span></a>
</h4>
Enter timestamp concurrency control</a>.  The idea here is that we attach two counters to each variable: a read timestamp and a write timestamp.
Each program (transaction) also gets assigned a timestamp (not really wall clock time... just a number that keeps growing).
The read timestamp is the timestamp of the newest (highest numbered) transaction to read from the variable.
The write timestamp is the timestamp of the newest (highest numbered) transaction to write to the variable.</p>
Timestamp concurrency control is a form of optimistic concurrency control: A transaction starts and then keeps plugging along until it discovers that it is about to perform an action out-of-order.  If/when that happens, the database cleans up after the transaction and restarts it.</p>
For example, if one transaction comes along and tries to read a variable that was already written by a newer (higher numbered) transaction, it's an out-of-order read and the transaction restarts[3].
Similarly, if one transaction tries to write to a variable that a newer transaction has already read from, the transaction is restarted.</p>
So far, so good.  Our target execution order is given by the timestamps, and after a little bit of thought you should be able to convince yourself that we're enforcing rules 1 and 2 of conflict serializability by restarting transactions that are about to violate them.</p>
What about rule 3 though?  It's easy enough to detect: A transaction trying to write to a variable with a write timestamp that's newer than the transaction's timestamp.</p>
a = f(1, n)
</span>a = f(2, n)
</span>print(f"The result is {g(a)}")
</span></code></pre>
Let's say that each line is one transaction, and that we're trying to run lines 1 and 2 in parallel.  Line 1 gets timestamp 1 and line 2 gets timestamp 2.  As it just so happens, line 2 finishes before line 1 gets a chance to write to a</code>.  Now a</code> has a write timestamp of 2 when line 1 comes along and tries to write to it.  This is clearly a violation of rule 3 (and we shouldn't allow it).  On the other hand, it's all the same if we just quietly throw its write away — the value is never actually used.</p>
Note that here we are</strong> violating rule 3, and so the resulting scheme cannot be called conflict serializable.  On the other hand, this is a very controlled violation of the rule.  We're only allowed to violate it because the value we're writing is never viewed.  This is how view serializability</strong> (textbook edition) is defined: It's conflict serializability, but we're allowed to ignore writes that are never read</em>.  Moreover, you can prove to yourself that timestamp concurrency control guarantees view serializability.</p>
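To make the bookkeeping concrete, here's a minimal single-threaded sketch of the scheme as described above: one read timestamp and one write timestamp per variable, restarts for out-of-order reads and writes, and a quietly dropped write for the rule 3 case (this trick usually goes by the name "Thomas write rule").  The class and function names are mine, and a real database would also need latching, rollback, and so on.

```python
class Restart(Exception):
    """Raised when a transaction is about to act out of timestamp order."""


class Variable:
    def __init__(self, value=None):
        self.value = value
        self.read_ts = 0   # newest (highest numbered) transaction to read
        self.write_ts = 0  # newest (highest numbered) transaction to write


def read(var, txn_ts):
    # A newer transaction already wrote this variable: out-of-order read.
    if txn_ts < var.write_ts:
        raise Restart(f"transaction {txn_ts} read too late")
    var.read_ts = max(var.read_ts, txn_ts)
    return var.value


def write(var, txn_ts, value):
    # A newer transaction already read this variable: out-of-order write.
    if txn_ts < var.read_ts:
        raise Restart(f"transaction {txn_ts} wrote too late")
    # Rule 3 case: a newer transaction already overwrote this variable.
    # Nobody can ever view our value, so quietly throw the write away.
    if txn_ts < var.write_ts:
        return False
    var.value = value
    var.write_ts = txn_ts
    return True


a = Variable()
applied_2 = write(a, txn_ts=2, value="f(2, n)")  # line 2 finishes first
applied_1 = write(a, txn_ts=1, value="f(1, n)")  # line 1 arrives late
print(applied_2, applied_1, a.value)  # True False f(2, n)
```

The late write from transaction 1 is dropped rather than restarted, and a ends up holding exactly what it would have held if the two lines had run in timestamp order.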
Textbooks and Databases#</span></a>
</h4>
So far, this sounds perfectly reasonable: View serializability (textbook edition) sounds perfectly sane and safe.  On the other hand, use the phrase "view serializability" in the vicinity of a DBA and you'll hear the explosion from miles away.</p>
Simply put, view serializability (database edition) describes a concurrency mode in several production database systems that is an incomplete implementation of timestamp concurrency control without read timestamps</em>.  This is understandable, since read timestamps are really really really expensive.  They introduce a concurrency bottleneck where there wasn't one before.  On the other hand, without read timestamps, you have no way to detect when a transaction writes to a variable after</strong> another transaction has already read from it.  For example:</p>
a = f(2, n)
</span>b = g(a)
</span>print(f"The result is {b}")
</span></code></pre>
If we parallelize lines 1 and 2, it's possible for line 2 to read whatever value happened to be sitting in a</code> from before.  We'll never detect this situation, because there's no read timestamp on a</code> for line 1 to check against.</p>
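Here's a small sketch of that interleaving, replayed once with read timestamps and once without.  It's a simplified model I wrote for illustration (the names and the "stale" placeholder value are made up), not the behavior of any specific database.

```python
class Variable:
    def __init__(self, value):
        self.value = value
        self.read_ts = 0
        self.write_ts = 0


def run(track_reads):
    """Replay: line 2 (timestamp 2) reads `a` before line 1 (timestamp 1) writes it."""
    a = Variable("stale")

    # Line 2, timestamp 2: b = g(a).  It sees whatever was sitting in a.
    seen_by_line_2 = a.value
    if track_reads:
        a.read_ts = max(a.read_ts, 2)

    # Line 1, timestamp 1: a = f(2, n).  With read timestamps, the write is
    # caught as out-of-order, because a newer transaction already read a.
    if 1 < a.read_ts:
        return seen_by_line_2, "line 1 restarted"
    a.value = "f(2, n)"
    a.write_ts = max(a.write_ts, 1)
    return seen_by_line_2, "undetected"


print(run(track_reads=True))   # ('stale', 'line 1 restarted')
print(run(track_reads=False))  # ('stale', 'undetected')
```

In both runs line 2 reads the stale value; the difference is that only the version with read timestamps ever notices.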
And that's pretty much it.  View serializability (textbook edition) is a slight variation of conflict serializability that permits us to ignore unseen writes.  View serializability (database edition) is a bastardized version of timestamp concurrency control with arguably useless ordering guarantees.</p>

1: Assuming f</code> is free of side effects.</p>
2: In databases, inter-transaction order doesn't matter, so there are technically multiple "correct" orders.  Still, the point holds.</p>
3: The database can keep around older versions of the variable for what's called multiversion concurrency control</a> to avoid read-after-write errors.</p>


UADBs win Reproducibility Award
2020-06-01T00:00:00+00:00
Probabilistic and Incomplete databases are a principled way to handle data that
isn't perfect (and really, whose data is perfect?).  Unfortunately, pretty much
every PDB and IDB developed to date is insanely slower than their deterministic
counterparts (to say nothing of how complex and finicky they are to use
correctly).  That's why, in collaboration with IIT, for the past five years,
we've been working towards a more user-friendly approach to incomplete data
management.  Instead of trying to give people perfect answers, we just help
them keep track of what</em> is uncertain through annotations and provenance
trickery.  In other words, we're developing an Uncertainty Annotated Database
System (or UADB).</p>
Thanks in large part to the heroic efforts of Su Feng</a>,
our latest UADB paper</a>
received the SIGMOD 2020 Reproducibility Award</a>.</p>


Re-using work in data integration
2020-05-03T00:00:00+00:00
Data integration is a huge problem.  There's a ton of work out there on
automating the process of merging two datasets into a unified whole, but most of
it misses one important factor: Exceptions are the norm in data integration.
That means that data integration is a labor intensive task, involving everything
from encoding standard translations (e.g., (℉ - 32) * 5/9 = ℃), dictionaries
(e.g., NY = New York), and more complex relationships (e.g., geocoding street
addresses).  Worse, once datasets A and B are integrated, integrating dataset
C is nearly as much work.  Maybe you have some code left over from integrating
A and B (and hopefully your student/employee is still around to explain it to
you), but you really need to sit there to try to figure out which bits of that
code can be re-used... or you do everything from scratch.</p>
This is why I got very excited when I ran into some work by Fatemeh Nargesian
on searching for unionable datasets</a>.

The idea is simple: you index a data lake, hand it a dataset, and it figures out
which datasets in the lake have "similar" columns (based on a clever use of
word embeddings).  Enough similar columns, and there's a good chance that you
can just union the datasets together.</p>
My students Will Spoth and Poonam Kumari got to talking with Fatemeh and me
about how we could use this idea to make it easier to re-use data integration
code --- basically, how could we make it easier to re-use integration work.
Our first steps towards this goal just got
accepted at HILDA</a>.
Our approach, called Link Once and Keep It (Loki) is also simple: When you
integrate two datasets together, you record the translations that got you from
the dataset to a common schema.  This is something that can easily be done in
our prototype, provenance-aware notebook Vizier</a>.  For
example, we might record how body temperature in one dataset was translated from
℉ to the ℃ used in the other dataset, or the dictionary translation of state
abbreviations (NY) to full state names (New York) used in the other dataset.</p>
Now that we have one mapping, anytime we need to translate ℉ to ℃ or to expand
abbreviated state names, we have the logic needed to pull it off.  What remains
is to figure out when to propose the translation to the user.  This is where
Fatemeh's work on unionability comes in: Whenever two columns are "similar",
there's a good chance they're of the same type and that the same mappings apply.
We took the opportunity to define a new similarity metric for numeric types,
based on the distribution of values in the data.  Unlike the prior approach
based on word embeddings, this is far more likely to give false positives, but
in this setting that's ok, since our goal is only to find and suggest
translations to the user.</p>
Loki combines this with a graph-based search for chains</em> of translations that
can be used to translate a source attribute family into a target attribute
family. This will allow Loki to answer two classes of queries: (1) What
transformations will get me from a source dataset to a target schema, and (2)
Is there a schema that I can map two datasets into with minimal work.  While
Loki is still in the exploratory prototype phase, we hope to be able to release
it one day as a slowly growing repository of translation rules.</p>
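To give a flavor of the chain search, here's a toy sketch: recorded translations are edges between attribute families, and a breadth-first search finds the shortest chain from a source family to a target family.  Everything here is hypothetical (the registry layout, the family names, and the Celsius-to-Kelvin rule are made up for the example); it is not Loki's actual implementation.

```python
from collections import deque

# Hypothetical recorded translations: (source family, target family) -> function
TRANSLATIONS = {
    ("fahrenheit", "celsius"): lambda f: (f - 32) * 5 / 9,
    ("celsius", "kelvin"): lambda c: c + 273.15,
    ("state_abbrev", "state_name"): {"NY": "New York"}.get,
}


def find_chain(source, target):
    """Breadth-first search for the shortest chain of recorded translations."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        family, chain = queue.popleft()
        if family == target:
            return chain
        for (src, dst) in TRANSLATIONS:
            if src == family and dst not in seen:
                seen.add(dst)
                queue.append((dst, chain + [(src, dst)]))
    return None  # no chain of recorded translations connects the two


def apply_chain(chain, value):
    """Push one value through each translation in the chain, in order."""
    for edge in chain:
        value = TRANSLATIONS[edge](value)
    return value


chain = find_chain("fahrenheit", "kelvin")
print(chain)  # [('fahrenheit', 'celsius'), ('celsius', 'kelvin')]
print(apply_chain(chain, 212))
```

A search like this covers the first class of query (getting from a source to a target); the second class would need to search outward from both datasets at once.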


VizierDB makes an appearance at CIDR
2019-10-18T00:00:00+00:00
It seemed like a close call (one pretty nasty reviewer), but
VizierDB</a> is getting presented at this year's CIDR in
Amsterdam!  First and foremost, congrats to Mike, Boris, Heiko, Sonia, Carlos,
and Will for creating and documenting a system that got folks excited and
talking!</p>
This work builds on a lot of bits and pieces, including work with Boris, Su, and
Aaron on lightweight uncertainty propagation</a>,
work with Poonam on visualizing uncertainty in tabular data</a>,
Boris' work on reenactment</a>,
and our work on integrating spreadsheets into a notebook</a>.</p>
Again, congrats all, and see anyone who's going in Amsterdam!</p>


Papers at DBPL, SIGMOD, and VLDB
2019-06-28T00:00:00+00:00
An active summer conference season for the ODIn Lab this year.</p>

At SIGMOD next week, come to Su Feng's talk on making Probabilistic/Incomplete Databases practical.  UADBs</a> glue together the (principled but slow) idea of certain answers and the (unprincipled, but easy) standard practice of throwing away uncertainty.  The result is a system that tracks data quality, but without sacrificing performance.</p>
You should also check out our demo of Vizier</a>, a next generation notebook that is data-centric.  By passing information through spark data frames, Vizier lets you combine multiple languages (Python, Scala, SQL), multiple modalities (scripting, spreadsheets, data widgets), and lets you do some cool things with provenance.  Check it out, and also check out our video demo</a>, or get your own server</a></p>

Coming up at VLDB, check out Ting Xie's talk on compressing SQL query logs</a>.  We reduce the size of a query log drastically, while preserving an optimizer's ability to analyze it for correlated features.  You can also read Ting's summary</a>.</p>
Also at VLDB, check out our demo of "CAPE". The system explains outliers in aggregate query results by identifying portions of the input space that would "counterbalance" the outlier.  In other words, the system finds which input data records are enough of an outlier to shift the aggregate result one way or another.</p>

Also in recent conference news, Darshana gave a great talk at DBPL on using handles</a> to create a data structure that has nearly all of the benefits of immutable data structures, while being much more amenable to adaptive on-line optimization.</p>


A retrospective on 2 years of linux
2019-06-01T00:00:00+00:00
It's been just over a year since I switched to my Librem 13v3 from a Macbook Air.  About two years since I migrated my Hackintoshes to Linux.  As the school year draws to a close, I think it's time for another quick retrospective on the switch.</p>
Why did I switch?#</span></a>
</h2>
It's hard to pin down exactly what triggered the big move.  It was a culmination of a lot of little things.  Apple swapping out the classic save/open file dialog boxes with new ones that weren't keyboard accessible.  Dropbox removing the Public folder.  A general trend towards more locked-down and less friendly hardware.  Eventually I came to the conclusion that The Switch was going to happen sooner rather than later, and I decided to test the waters.  I still have a lone mac laptop for the odd thing here and there that I still can't do on Linux, but the vast majority of my computing now happens on Linux-powered hardware.</p>
What do I miss?#</span></a>
</h2>
There are a few things that Linux still hasn't replicated.  I still keep a mac around for the following:</p>


Numbers</strong>: An under-appreciated gem of an application.  AlternativesTo.com lists Numbers as a spreadsheet, and that's certainly a role that it can play.  However, if spreadsheets were the next step in visual programming, Numbers is the step past that.  I can't point at any one thing that distinguishes it: Multiple distinct grids on a single sheet, explicitly distinguished header rows, footer rows, and header columns, and the use of header values to identify cells all combine to create a more intuitive, user-friendly data management tool.  I have yet to see another application mimic it on any platform.</p>
</li>

Preview</strong>: Apple has a history of shoving crazy amounts of utility into simple default apps.  For example, back in System 9, SimpleText could read VRML files.  Preview is another example of what was once a swiss army knife of PDF management.  You can easily add signatures, add annotations, insert/delete/reorder pages, rotate pages, and more.  The closest I've found for Linux is a combination of PDFMixTool and Evince.</p>
</li>

Painless Color Profiles</strong>: It's possible to get good color prints out of a Linux box.  It's just not easy, and in general seems to require some expensive calibration hardware.  OSX color profiles work more or less out-of-the-box.</p>
</li>

iTunes</strong>: iTunes itself started off bad and has been progressively getting worse.  However, as far as I can tell, it's presently the only (legal) online streaming service that allows you to purchase</em> and download</em> media.  You can buy BDs/DVDs and rip them.  You can stream off netflix/hulu/amazon for a monthly fee.  There are also a number of smaller sites like BandCamp that are amazing for more niche stuff.  However, for popular media that you only need to pay once for, iTunes is the only legal way to download (and there are a handful of ways of liberating videos once you legally purchase them).</p>
</li>

OmniGraffle</strong>: Although Inkscape has replaced it for me, OmniGraffle remains my favorite vector graphics app.  Inkscape is far more powerful, but far far far more irritating to use.  Omni knows how to design a user interface, and 90% of what I need to do is just straight up fewer clicks in OmniGraffle.  You can be a lot more pedantic and detailed in Inkscape, and Inkscape's plugins (PDF Import, Image-to-Vector) are nothing short of magic to me, but for most of what I do OmniGraffle is still easier.</p>
</li>

Microsoft Remote Desktop</strong>: It's not so much that I miss it as I occasionally need it for work.  There's OSX and Windows versions but not a Linux version.   There are Linux RDP clients, but I haven't been able to get them to work with the particular setup my employer deigns to use.</p>
</li>
</ul>
Pleasant Surprises#</span></a>
</h2>
It feels odd to say, but there's a few user-facing areas where Linux and the Librem are just straight up ahead.</p>


Freedom from Dongles</strong>: I can't say how liberating having an HDMI port is.  Everywhere I go there's an HDMI cable on the projector.  I can just plug it into my laptop and be on my way.</p>
</li>

Power Overwhelming</strong>: The Librem's battery is legitimately beefy.  When I turn the brightness down to 40-50%, I can use the laptop pretty much for an entire work day without plugging it in.</p>
</li>

Painless Printing</strong>: CUPS is fantastic (Thanks Apple, I guess).  With the exception of color profiles for photos, Linux has had just a trivial time working with each and every printer that I've tried to set up.  My wife's windows laptop, by comparison, struggles.</p>
</li>

Cantata</strong>: Cantata is hands-down the best music client I've used in a long time, and I think that's largely because it's purely a frontend for MPD.  MPD is an extremely powerful interface-free music player.  Cantata doesn't need to worry about support for media formats, playlist management, or nearly anything else, since all of that is handled by MPD.  That means it can focus entirely on being a great user experience, which it is.  I haven't been this happy with a music player since SoundJamMP.</p>
</li>

Gnome Calendar</strong>: It's a simple thing, but it's implemented well.  Since the Gnome Calendar devs added a "week" view some months back, it's been my favorite calendaring app across platforms.</p>
</li>

Inkscape</strong>: It's not as user-friendly as other options, but holy crap it can do just about anything to any vector format.  PDF to SVG conversion.  'nuff said</p>
</li>

Characters</strong>: (Note, NOT "Character Map")  Something that I'd unexpectedly found myself missing was the OSX Glyph picker.  It was great, a panel that would just open up into any app from the text input menu, find a glyph/emoji/etc..., and just type it as if by keyboard.  Gnome Characters replicates that convenience.  It lets you search visually for a desired glyph, and then copy it into the clipboard.</p>
</li>
</ul>
Revisiting Earlier Gripes#</span></a>
</h2>
Last year after a few months with my Librem 13v3, I observed a handful of quirks.  I'm pleased to say that most of them were transient.</p>


The keyboard took some getting used to.  The trigger point is different from where it was on the Macbook Air, which is, I suspect, why I was getting misfires.  A year in, and the problem is gone, probably as a result of me getting used to the keyboard.</p>
</li>

The keyboard pipe-character glitch was finally addressed.  Manual hacks no longer seem to be required.</p>
</li>

The WiFi LED now behaves as expected relative to the hardware kill-switch (although I'm still unsure of the value of also having a button on the keyboard to software-kill the wifi).</p>
</li>

It seems less frequent, but I still occasionally get the thing where the CPU thinks that it is overheating when it wakes up from sleep.  The fix was always pretty simple, put it back to sleep and wake it back up... but it's frustrating when it happens.</p>
</li>
</ul>
Alternative Software#</span></a>
</h2>
By the time I switched to Linux, I'd already been migrating to cross-platform software.  Things like SublimeText, Firefox, and all my electron apps were already there when I switched.  Apart from the apps I've already mentioned, here's a quick list of the apps I've brought into my workflow.</p>


iTerm2</strong>: kitty provides a similar feature set including tiling, Unicode fonts, and in-console image rendering.</p>
</li>

OmniFocus</strong>: I switched over to TaskWarrior.  The UI, in particular the task entry UI isn't quite as user-friendly (it's on the console), and it doesn't have a particularly good mobile app, but it's just as powerful, and has a self-hosted sync service.</p>
</li>

Mail.app</strong>: Evolution is a darn good mail app.  I'm not a fan of how it uses multiple levels of nesting for threaded conversations (as opposed to Mail.app which just nests the thread one level deep under the first message), and its implementation of unified mailboxes is a little slow and awkward to set up.  However, it handles multiple accounts well, has a nice friendly multipane view of inboxes, messages, and mail, is pretty zippy, and has built-in-support for GPG.</p>
</li>

iCloud</strong>: Nextcloud does most of what iCloud used to do for me.  Plus, it's self hosted.</p>
</li>
</ul>
Other Thoughts#</span></a>
</h2>
It's interesting to see Purism pushing its Librem One service.  It seemed a bit out of left field, but it really makes sense for them to do.  It's a supplemental revenue stream (5000 * $7/mo = $35,000/mo) that also starts rolling out some of the functionality (VPN, Matrix hosting, etc...) that folks are going to expect to have on their phone when the Librem 5 comes out.  It's a good business decision from a number of angles, and seems like it could be good for the community as well.</p>
I</p>


Query Log Compression for Workload Analytics
2018-11-02T00:00:00+00:00
Analyzing database access logs is a key part of performance tuning, intrusion
detection, and many other database administration tasks. Unfortunately, it is
common for production databases to deal with millions or even more queries
each day, so these logs must be summarized before they can be used. On one
hand, we want to compress logs to facilitate efficient storage and human inspection.
On the other hand, we want to accurately infer frequencies of patterns that
are of interest to workload-analytic applications. We established a framework
for inferring pattern frequencies in a principled way using only a small subset
of patterns and proposed an efficiently computable measure of overall inference
accuracy. Achieving higher accuracy requires more patterns, but we found that
the runtime of pattern mining algorithms also increases steeply. We hypothesized that
this is due to mixing workloads and proposed partitioning the log into separate
clusters. By clustering, the search space of candidate patterns is reduced and
we empirically showed that state-of-the-art pattern mining algorithms can be
greatly improved both in runtime and accuracy. We further improved the effectiveness
of clustering to the extent that as we create more clusters, each cluster
becomes easy enough for pattern mining such that different algorithms do not
vary much in accuracy. As a result, we finally proposed naive mixture encodings
which focus on partitioning workload mixtures and summarizing each partition
using the most efficient though naive encoding. We showed that naive mixture
encoding is orders of magnitude faster to construct and provides summarization
accuracy competitive with more complicated pattern mining algorithms.</p>
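To make the clustering argument concrete, here is a rough sketch of one way to read "naive mixture encoding." It is not the paper's implementation. It assumes each query is a set of features, that the "naive" summary of a partition is just the fraction of queries containing each feature, and that a pattern's frequency is estimated by treating features as independent. The function names and the toy log are hypothetical.

```python
from collections import Counter
from math import prod

def naive_encoding(queries):
    """Summarize a partition as the fraction of its queries containing each feature."""
    counts = Counter(feature for query in queries for feature in query)
    return {feature: count / len(queries) for feature, count in counts.items()}

def estimate_frequency(pattern, partitions):
    """Estimate the fraction of the whole log that contains every feature in `pattern`."""
    total = sum(len(partition) for partition in partitions)
    estimate = 0.0
    for partition in partitions:
        marginals = naive_encoding(partition)
        # Assumption: features are independent within a partition,
        # so the pattern's frequency is the product of its marginals.
        within = prod(marginals.get(feature, 0.0) for feature in pattern)
        estimate += (len(partition) / total) * within
    return estimate

# A toy log that mixes two workloads (hypothetical feature names).
users_workload = [frozenset({"users", "id"})] * 2
orders_workload = [frozenset({"orders", "total"})] * 2

mixed = [users_workload + orders_workload]       # one partition, no clustering
clustered = [users_workload, orders_workload]    # one partition per workload

pattern = {"users", "total"}                     # never occurs in the log
unclustered_estimate = estimate_frequency(pattern, mixed)
clustered_estimate = estimate_frequency(pattern, clustered)
```

On the mixed log the independence assumption invents a pattern that never occurs, while the clustered version correctly reports zero. That is the intuition for why partitioning the workload lets a very cheap encoding stay accurate.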
Read more in the preprint</a></p>


Papers in VLDB 2019 and TODS
2018-10-16T00:00:00+00:00
Two new fantastic publications out of the ODIn Lab.</p>
First, Niccolò Meneghetti's SIGMOD 2017 paper "Learning From Query-Answers: A Scalable Approach to Belief Updating and Parameter Learning," written in collaboration with Wolfgang Gatterbauer, was invited as a "Best of SIGMOD" paper to ACM TODS.  This paper has now been accepted (preprint here</a>).</p>
Next up, Ting Xie got a new VLDB 2019 paper: Query Log Compression for Workload Analytics</a>.  The paper explores techniques for compactly encoding lossy summaries of query logs for use in optimizers and workload analytics.</p>


Congratulations Graduates
2018-08-23T00:00:00+00:00
Congratulations to the ODIn Lab's two</del> three newest alumni: Gokhan Kul and Gourab Mitra and Lisa Lu</em>.</p>
Gokhan will be starting at Delaware State University</a> as an Assistant Professor.</p>
Gourab starts at Datometry</a> today (yesterday?).</p>
update Sept 15:#</span></a>
</h4>
... and Lisa starts at Wells Fargo</p>


Review: Purism Librem 13v3
2018-08-12T00:00:00+00:00
About two years ago, I decided that Apple had given up on the geek audience (which'll probably be the subject of another post).  Since then, I've been gradually switching over to Linux for my various systems.  So, when it came time to replace my (still surprisingly ageless) 6-year-old MacBook Air, I decided to explore my options with Linux-based systems.  I'd originally been drooling over Purism's Librem 11</a>, a tablet/transformer laptop with PureOS, a Linux distro custom designed to work with the hardware.  According to the internet, setting up a Linux-based tablet was an exercise in frustration, so the promise of one specially built for Linux was too good to be true.  Turns out it might still be...  About a year in now, and the 11 is still in the "Pre-order" stage.</p>
Still, one of the things I loved about Apple products was the hardware/software co-design.  The hardware "just-worked".  So, when delaying for the 11 was no longer reasonable (and when I realized that I'd have to carry around a MiniHDMI adapter), I decided to get a Librem 13v3</a>.  It's now about 2 months in, and I feel like I've given the laptop a good workout, so I wanted to share my experiences.</p>

TL;DR: After a rough start, I'm happy with the laptop.  It has quirks, to the point where I still wouldn't recommend it for anyone who doesn't factor ethics into their purchasing decisions (i.e., most people).  Still, the company has made a great start and is moving in the right direction.</p>

Overview#</span></a>
</h1>
Purism bills itself as a security-, freedom-, and privacy-oriented hardware/software platform.  To that end, all of their devices come standard with external toggle switches, one for the Camera/Mic and one for the BT/WiFi, that physically disconnect the hardware from the motherboard.  All of their devices have the Intel Management Engine neutered, and the company's hardware and software are endorsed by the Free Software Foundation.  In short, using this makes you morally superior to all other computer users</sarcasm>.  Seriously though, they put a lot of effort into respecting their customers' freedom and privacy and setting secure defaults.  I think that's great to see from a company and I want to see more of it.</p>
A secondary benefit, albeit not one that they really advertise, is that they're building a linux distro specifically for the hardware that they sell.  One of my concerns with Linux in general is the degree of tinkering that it seems is often required to get something like sleep mode, wifi, or sound working.  Having the same company developing both makes it far more likely that the hardware will just work, and that's something that is incredibly valuable.  I like tinkering, but I like not needing to tinker when there's something else on my plate.</p>

The Good#</span></a>
</h1>

The laptop is boring</strong>: This is actually a good thing for a piece of hardware.  I can do what I need to do and I don't get surprised by stuff randomly breaking.  It does what it needs to do, and stays out of the way.  Sure, it's sleek, but it works, and that's the important thing.</li>
Great battery life</strong>: I regularly get the advertised 6-7 hours out of it, even when doing some heavy web browsing, video, or code/LaTeX compilation.</li>
Hardware kill switches</strong>: I was a little afraid that repeatedly plugging/unplugging the wifi card or camera might drive Linux nuts, but this feature has been well implemented.  Turning the camera or wifi on and off works exactly as advertised, and has been surprisingly useful.</li>
Gnome 3 is good</strong>: Admittedly, Linux isn't quite up to Apple UI standards yet, but it's getting there.  PureOS (Gnome 3 specifically) is the first Linux experience that I'd recommend to a non-technical person.</li>
</ul>

The Not So Good#</span></a>
</h1>

Finicky Keyboard</strong>: The 13v3's keyboard can sometimes be finicky.  The tactile feel of the keyboard is great, but it sometimes misses keystrokes.</li>
The '' key</strong>: They messed up the 13v3 US keyboard's firmware and the '' key sends the wrong keycode.  Worse, it seems that they can't fix the problem in the OS without breaking all of the 13v3 EU keyboards.  There's a 2-line patch that is easily found on the internet that fixes it, but given the price point of the Librems, it's something you'd expect to "just work".</li>
WiFi</strong>: I've been spoiled by Apple.  Antenna design is voodoo, and Apple employs some of the best witch doctors on the planet.  The antenna in the 13v3 isn't quite as good as any of my Apple devices, and there are now some rooms in my house where I can't use Skype.</li>
Hot Sleep</strong>: This issue doesn't appear to be that widespread, but periodically when I wake my Librem from sleep by opening the case it decides that it's overheating, throttles the CPU to min speed, and pushes the fans to full blast.  Putting the laptop back to sleep and waking it up again fixes the problem, but (1) not something I'd expect to see at this price point, and (2) something that has had an un-addressed forum post</a> up for several months now.</li>
</ul>

The Linux#</span></a>
</h1>
Switching from OSX to Linux was surprisingly painless.  The Gnome app suite is a surprisingly good stand in for the Apple app suite.</p>

Mail</strong>: Evolution</a> is competitive with Mail.app.  I do have a few nits, but it's actually a really nice app. I'd like to see a more streamlined search that defaults to searching everything and that resets after you change views (more in line with how Mail.app works).  Threading support is also a little awkward: Indenting by the same amount for every response in a thread makes viewing long threads incredibly painful.  Also, I've hacked together a unified mailbox using search folders, but it would be nice to see that sort of thing directly available through the UI.  It would also be great to have the option to automatically sync an IMAP server for offline use, but there are apparently some tools that will do that for you.</li>
Calendar/Contacts</strong>: I've been really happy with Gnome's Calendar</a> and Contacts</a> apps.  Native integration with Nextcloud</a> works great with these apps, as does integration with the Gnome shell.  I'm a little annoyed that Contacts doesn't offer a "Copy Email" button (just a "Send Email"), but that's small potatoes.</li>
Music</strong>: iTunes has been getting steadily worse over the past half decade.  I've been using Clementine</a> on the mac for a while, so switching to that on Linux actually improved the user experience.  I still wasn't happy with music search, the way the playlist just kept growing until I manually deleted something, or the fact that it took a right click and menu selection to play anything.  I tried out the Gnome default, Rhythmbox</a>, but wasn't super happy with the text-only album search.  I'm a visual person and I like browsing through my album covers.  I finally settled on a fantastic app called Cantata</a>, which provides a great UX for a barebones music playing daemon called MPD</a>.  MPD can sometimes be finicky, but Cantata can set up a built in version that works splendidly.</li>
Task Management</strong>: I was addicted to OmniFocus</a>.  It was an Amazing</strong> tool, in no small part due to the really low-friction task entry window that they put together.  Nothing else that I've found quite replicates that, but I've found a reasonably close replacement called TaskWarrior</a>.  It uses the command line rather than a special pop-up window, but it's pretty slick.</li>
Code Editor</strong>: It's not free in either sense of the word, but I'm a devoted SublimeText user.  Apart from the need to tweak a few key combinations, my text editing experience did not change with the platform.</li>
</ul>
I have a few gripes that are specific to linux:</p>

Linux presently lacks support for USB video adaptors. Not a huge issue since there's HDMI out on the Librem, but it means no USB-C Video dongles, or docking stations with DVI out.</li>
Linux apps are developed with a mix of different key conventions.  This used to drive me nuts in webapps, where I'd expect emacs keybindings like ctrl-d</code>.  I'd hit some emacs key (which any OSX text window will respond to), and instead get some app-specific behavior.  Now I get the same thing across the entire OS.  There are some</em> conventions like ctrl-q for quit, but not every app respects them, and sometimes you get alt-f4 to quit instead.  It's taken a lot of getting used to.</li>
</ul>

The Confusing#</span></a>
</h1>
Finally, there are a few oddities about the Librem that aren't really issues, just confusing.</p>

There's a software wifi kill switch (fn-f3) in addition to the hardware one.  This is something that would be super-easily ignored, except for the fact that the keyboard also has a WiFi LED that is controlled by the software rather than the hardware kill switch.</li>
There's no num-lock LED.  This can lead to very confusing password-entry sessions, as the entire right third of the keyboard becomes a numpad with NumLock on, and there's no way to tell when NumLock is on.</li>
The hinge seems like it'll open with one hand, but actually requires 2 to open (or some interesting contortions).</li>
The three volume buttons ("mute", "volume+", and "volume-") are scattered across opposite sides of the keyboard.</li>
</ul>

The Summary#</span></a>
</h1>
Now that I've got the thing properly set up, it's easy to adjust my workflow around the Librem's remaining quirks.  In spite of those quirks, I'm still happy with my purchase.  Admittedly, part of this is viewing my purchase from a moral/ethical standpoint: I'm supporting a company that actively tries to respect my freedom, security, and privacy.  I want to see them succeed, and help them get to a point where I can also recommend them to someone making the purchase solely from a usability standpoint.  They're not there yet, but I see it happening.</p>


Vizier Workflows (rant)
2018-03-14T00:00:00+00:00
I'd like to talk a little about abstractions for communication.  In particular, I want to talk about a favorite workhorse of the data science community these days: Jupyter notebook.  For those unfamiliar with it, Jupyter users work with blocks of code called "cells." Each cell has an opportunity to produce a result, which is then displayed inline, immediately after the cell.  This makes it a lot easier for users to break up complex tasks, showing intermediate results inline with the rest of the code.</p>
Let's spend a little time digging into how this works.  For each language that supports Jupyter (Python, Scala, Ruby, and more...), the developers have created a way to snapshot the language's state: global variables, runtime information, file handles and more.  They call this a kernel</em>.  When you execute a cell, Jupyter loads the kernel, runs code against it, and saves the result.</p>
This means that if you want to have code in one cell talk to code in another cell, the natural way to do it is to create a global variable.  The fundamental communication abstraction in Jupyter is the kernel.  On the one hand, this is a very powerful abstraction: anything that you can represent using a global variable in Python can be sent.
On the other hand, it also means that the main way to communicate is through language-specific binary blobs.</p>
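As a toy illustration of that communication model (not Jupyter's actual machinery), the sketch below stands in for a kernel with a single Python dictionary of global variables. The `run_cell` helper and the cell contents are hypothetical.

```python
# A toy stand-in for a kernel: one dictionary holding the global variables.
kernel = {}

def run_cell(source, kernel):
    """Run one cell's source code against the shared kernel state."""
    exec(source, kernel)

# Cell 1 leaves a global variable behind...
run_cell("rows = [('alice', 3), ('bob', 5)]", kernel)

# ...and cell 2 can only find that data by knowing the variable's name.
run_cell("total = sum(count for _, count in rows)", kernel)

total = kernel["total"]
```

Nothing outside the Python process can tell what `rows` is or how the two cells depend on each other. The hand-off is just a language-specific object sitting in the namespace.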
At UB, we're working with NYU and IIT on a data exploration tool called Vizier</a>.  Expect to hear more about Vizier here in the coming weeks and months, but what I want to focus on right now is the fact that cells in Vizier talk through tables</strong> (or DataFrames or Relations, if you like).  The fact that they're tables isn't even all that important; what we care about is the fact that they're in a standardized format that Vizier understands.  This is why data debugging in Vizier is easier, and why we expect to be able to provide some powerful query optimization down the line.  Again, more on each of those as they develop.</p>
What I want to focus on today is interoperability. Because all communication in Vizier happens through tables, you can write a Python script that transforms data in one cell and a SQL query over the same data in the next.  Better still, it means that we can allow direct manipulation of data: For Vizier, we're developing a new language called Vizual.  Every expression in Vizual corresponds to an action in a spreadsheet (rewriting a cell, adding a formula, etc...).  So, you can write a Python script, manually fine-tune the output table as a spreadsheet, and then query the results.  None of that would have been possible if the underlying communications abstraction were opaque to Vizier.</p>
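To give a feel for that workflow, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the shared table store. This is not Vizier's API or the Vizual language. The table names, columns, and values are made up for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database plays the role of the shared table store.
store = sqlite3.connect(":memory:")
store.execute("CREATE TABLE readings (sensor TEXT, celsius REAL)")
store.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 20.0), ("b", 25.0), ("a", 30.0)],
)

# "Cell 1" (Python): read a table, transform it, write a new table.
rows = store.execute("SELECT sensor, celsius FROM readings").fetchall()
store.execute("CREATE TABLE readings_f (sensor TEXT, fahrenheit REAL)")
store.executemany(
    "INSERT INTO readings_f VALUES (?, ?)",
    [(sensor, celsius * 9 / 5 + 32) for sensor, celsius in rows],
)

# "Cell 2" (spreadsheet-style edit): overwrite a single cell of the output table.
store.execute("UPDATE readings_f SET fahrenheit = 70.0 WHERE rowid = 1")

# "Cell 3" (SQL): query the edited table.
result = store.execute(
    "SELECT sensor, AVG(fahrenheit) FROM readings_f GROUP BY sensor ORDER BY sensor"
).fetchall()
```

Each step only needs to know the table's name and columns, not which language produced it, which is what lets a script, a manual edit, and a query sit side by side.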


NEDB 2018
2018-01-08T00:00:00+00:00
Look for the ODIn lab at the North-East Database Day</a> poster session:</p>

Just-In-Time Data Structures (Darshana, Saurav)</li>
Data Synthesis for automatically generating Smartphone Database Benchmarks (Gourab)</li>
MESS: Meta-data Extraction System for Schemas (Will)</li>
Incoming Query Prediction on Mobile Databases with Probabilistic NFA (Gokhan)</li>
</ul>
And congrats to Poonam for getting a talk:</p>

"The Good and Bad Data", Session 4 at 3:10</strong>!</li>
</ul>


CSE 50 Conference!
2017-09-24T00:00:00+00:00
UB-CSE is celebrating 50 years of Computer Science and Engineering</a>.  The celebration will kick off on Thursday the 28th with a welcome reception and continue with events through Sunday.  Especially exciting are sessions on Thursday, Friday, and Saturday involving UB-CSE's latest and greatest research.</p>
The ODIn Lab will be showing up in force at the CSE50 Undergraduate and Graduate conferences</a>.</p>

Lisa and Olivia will demo Mimir at the Undergraduate Event during the Welcome Reception on Thursday.</li>
Poonam, Will, Aaron, Shivang, and Lisa will present on Mimir</a> at the Graduate Poster Session on Saturday.</li>
Saurav and Darshana will present on JITDs</a> at the Graduate Poster Session on Saturday.</li>
Duc, Ting, and Gokhan will present on The Insider Threats project</a> at the Graduate Poster Session on Saturday.</li>
Gourab, Gokhan, and Carl will present on The PocketData project</a> at the Graduate Poster Session on Saturday.</li>
</ul>
Congrats to everyone for their hard work in getting everything ready on time.</p>


Beta Probabilistic DBs at SIGMOD
2017-05-14T00:00:00+00:00
Niccolo Meneghetti is presenting his work on Beta-Probabilistic Databases</a> this week at SIGMOD.  Congrats to Niccolo and Wolfgang on this incredible piece of work.</p>


CIDR Recap
2017-01-31T00:00:00+00:00
How big is BIG and how fast is FAST? This seemed to be a recurring theme of
the CIDR 2017 conference. A general consensus, and a major point of many
presentations, was that the RDBMS was the king of scaling to large data twenty
years ago but for some inexplicable reason has lost ground to the ever-changing
scope of BIG and FAST. Multiple papers attempted to address this problem in
different ways, adding to the many tools already on the market for data
stream processing and large calculations (such as Spark), but there seemed to be
no silver bullet. Adding to the theme that big data is too big,
keynote talks by Emily Galt and Sam Madden drove this point home and
gave different real-world scenarios and outlooks on the problem.</p>
To break this theme apart I’ll split the papers into groups and explain the
different outlooks the authors took and how they addressed this common problem.</p>
The papers "Prioritizing Attention in Analytic Monitoring," "The Myria Big Data
Management and Analytics System and Cloud Services," "Weld: A Common Runtime for
High Performance Data Analytics," "A Database System with Amnesia," and "Releasing
Cloud Databases from the Chains of Performance Prediction Models" focused
on the theme that databases are not keeping pace with the rate at which data is
growing. Sam Madden brought up an interesting point: hardware
components like the bus are not the bottleneck in this system. With advances
in big data computing like Apache Spark, it feels like the RDBMS is the end of the
line, where data goes to die. These papers looked at different ways of
addressing this. "A Database System with Amnesia" looked at throwing out unused
data, since most data in an RDBMS gets put in and never used again. With the
increasing use of data streams, the problem of not being able to process and
store this data fast enough only gets worse.</p>
The second common problem is that even if you can efficiently store and
query your data lakes, humans often lack the ability to
efficiently create queries, or lack the necessary insight into how the data is
formatted. The papers "The Data Civilizer System," "Establishing Common Ground
with Data Context," "Adaptive Schema Databases," and "Combining Design and Performance
in a Data Visualization Management System" all try to address this problem, but
from slightly different angles. The Data Civilizer System and Adaptive Schema
Databases aim to aid an analyst in schema and table exploration, and to help
them discover unknown or desired qualities of their data source. These
papers approach user insight in a way that would otherwise exist only as internal
middleware in large companies. The problem is that big data and messy data
lakes are becoming more and more prevalent for other users: medium-sized
businesses can be buried in data following user surges or new product upgrades, and
government agencies can have large amounts of uncleaned sensor and user-submitted
data that they do not have the abilities or tools to manage.</p>
To me, a large takeaway from this conference was that databases need a better way to
handle big data. Databases are the hero big data needs AND the one it deserves.
To achieve these goals, databases are going to need to relax the constraints on
rigid schemas and ‘perfect’ data, which opens up a large number of research
opportunities, along with the realization that there might not currently be a ‘right’
answer to this problem. Either way, it should be interesting to see what
sacrifices RDBMSs make to compete with the growing amount of data, and whether they
are able to apply decades' worth of research to this hot field that is looking
for an answer.</p>


Stop the Truthiness and Just Be Wrong
2016-12-13T00:00:00+00:00
Note</strong>: This was originally an abstract submitted to CIDR. It's based on numerous discussions with lots of people, including but not limited to: Ying Yang, Niccolò Meneghetti, Poonam Kumari, Will Spoth, Aaron Huber, Arindam Nandi,
Boris Glavic, Vinayak Karuppasamy, Dieter Gawlick, Zhen Hua-Liu, Beda Hammerschmidt, Ronny Fehling, and Lisa Lu.</p>
Since their earliest days, databases have held themselves to a strict invariant: Never give the user a wrong answer.
So ingrained is it in the psyche of the database community that those who violate it really want you to be aware that you're committing sacrilege against Codd.  Some examples include adding features to SQL to support continuous data (e.g., MauveDB</a>), adding features to SQL to query Bayesian models (e.g., BayesStore</a>), adding features to SQL to tell the database how accurate you want your results to be (e.g., DBO</a>), or adding features to SQL to explicitly ask for specific types of summaries (e.g., MayBMS</a>).</p>
Sadly, by trying to enforce perfection in the database itself, database systems fail to acknowledge that the data being
stored is rarely precise, correct, valid, or unambiguous.  Emphasizing certain, deterministic data forces the use of
complex, hard-to-manage extract-transform-load pipelines that emit deceptively certain, “truthy” data rather than acknowledging ambiguity or error.  The resulting data is often (incorrectly) interpreted as fact by naive users who have no reason to believe otherwise.  The problem is getting worse: As more decisions are automated, even small truthiness errors can drastically impact people's lives.  Data errors in credit reports</a> can cause perfectly honest people to be denied access to credit.  Similarly, name matching errors combined with rigid protocols have led to an 8-year-old being identified as a terrorist</a>.</p>
System designers must decide between presenting erroneous data as truthful or risk discarding useful information, and many choose the former.  The database community has already begun treating uncertainty as a first class primitive in databases</a>.  Unfortunately, uncertainty also requires us to rethink how humans interact with data.</p>
Here, industry has done significantly better than the database research community.  For example, personal information managers like Apple Calendar and the iOS Phone App increasingly use facts data-mined from email to automatically populate databases in their contacts and calendar applications.  The OS X Calendar app, for instance, finds events in your email and schedules them.</p>

Similarly, the iOS Phone App makes use of phone numbers it finds in your email to predict who's calling you.</p>

Both examples illustrate a number of good design elements:</p>

The interface keeps uncertain facts distinct or clearly marks them as being guesses.

The Calendar App uses greyed out boxes and a special calendar for guessed events</li>
The Phone App explicitly prefixes guessed names with "Maybe: "</li>
</ul>
</li>
The interface includes intuitive provenance mechanisms that help to put the extracted information in context.

Both Apps provide a "Show In Mail" link in the detailed information view.</li>
</ul>
</li>
The interface includes overt feedback options to help the user correct or confirm uncertain data.

"Add To Calendar" or "Ignore"</li>
</ul>
</li>
</ol>
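As a rough sketch of how those three elements might look in a data model (not how Apple's apps are actually built), the hypothetical `Fact` class below carries a guess flag, a provenance string, and a way for the user to confirm a guess. All names and values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Fact:
    value: str
    guessed: bool = False          # element 1: keep guesses clearly marked
    source: Optional[str] = None   # element 2: provenance for the guess

    def label(self) -> str:
        """Render the fact, prefixing guesses the way the Phone App does."""
        return f"Maybe: {self.value}" if self.guessed else self.value

    def confirm(self) -> "Fact":
        """Element 3: user feedback promotes a guess to a confirmed fact."""
        return Fact(self.value, guessed=False, source=self.source)

# A caller name guessed from an email (hypothetical data).
caller = Fact("Alice Example", guessed=True, source="email received 2016-12-01")
shown_before = caller.label()
shown_after = caller.confirm().label()
```

The point of the sketch is that uncertainty and provenance travel with the value, so the presentation layer never has to pretend that a guess is a fact.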
We as a database community need to start adapting these techniques to more general data management settings.  The presentation layer isn’t the only problem, as identifying sources of uncertainty requires developers to invest lots of upfront effort rethinking how they write code. We need to make it worth their while. For example, we might provide infrastructure support to help developers draw generalizations from ambiguous choices</a>. We might streamline imperative language support for uncertainty</a>. Or, we might define higher-order</a> data transformation primitives</a>.</p>
In summary, the illusion of accuracy in database query results can no longer be maintained. Database systems must learn how to acknowledge errors in source data, and how to use this information to effectively communicate ambiguity to users. Moreover, this needs to happen without overwhelming users, without breaking the decades-old abstractions that people understand and use on a day-to-day basis in their work-flows, and without requiring a statistics background from all users.</p>


Mimir group at CSE50 Kickoff Poster Session
2016-11-07T00:00:00+00:00
The department's CSE50</a> kickoff event on Friday had a Huge</b> poster session.  It was great to see everyone's work, and congrats to the 3D Printing Elderly Care group for their win.  I'm biased of course, but the Mimir group had a great showing as well!  Thanks to the entire group for pulling through and getting the poster ready in time.</p>


</center>


Mimir at CIDR, EDBT
2016-10-17T00:00:00+00:00
Some great news for the Mimir project.  After picking up a massive $2.7m
grant</a>
this summer (in collaboration with NYU and IIT) to build an interactive data
curation system, we just got notified of two new paper accepts.</p>
Adaptive Schema Databases</a>#</span></a>
</h3>
Schemas are useful.  They give you a common language to use when talking to
your database, and they keep you from doing dumb things like putting data into
the wrong column.  Unfortunately, they're also hard... so many people avoid
using them.  In collaboration with IIT and Oracle, William Spoth and Ying Yang
outlined a system for dynamically generating schemas from semi-structured data,
and allowing systems to flexibly evolve schemas over time.</p>
Convergent Inference with Leaky Joins</a>#</span></a>
</h3>
Inference in graphical models requires a lot of hand tuning.  Approximation
algorithms are fast, but imprecise.  Exact algorithms work well, until they
don't.  In this paper, Ying Yang described a new "Leaky Join" operator
that allows for convergent-online inference.  In short, a query plan consisting
of leaky join operators behaves like an online algorithm in that it produces
(high quality) approximations prior to completion.  However, unlike classical
online algorithms, it is guaranteed to converge with only minimal overhead
compared to a standard classical join.</p>


Project Based Learning
2016-01-01T00:00:00+00:00
It is important for students to learn the ideas behind complex concepts like mathematics or computer science.  However, all too often, students are left without a good understanding of why those ideas are important.  Understanding the "why" of a topic serves not only to help students apply the abstract ideas in the real world, but helps motivate them to care about abstractions that would otherwise be opaque, isolated, and quickly forgotten after the exam.</p>
Project-based learning is a great way to help students understand why your subject matters.  However, it is important to avoid projects that railroad your students through.  Below, I've codified a set of guidelines that emerged over three years of teaching CSE-562 Database Systems.  Hopefully these can be of some use to you in the design of your own project-based courses.</p>

Simulate reality#</span></a>
</h2>
Express project goals in terms of real world problems with clearly defined measures for success.</p>
Why#</span></a>
</h3>
I find that students are often the most motivated when they can see a clear application of the material that they're learning.</p>
Example#</span></a>
</h3>
In databases, students build a SQL query processing engine -- exactly the same thing that is in Oracle, DB2, SQL Server, Postgres, MySQL, and every other database engine out there (albeit a lighter-weight version of the same).</p>

Use second-order metrics#</span></a>
</h2>
Do not express goals in terms of canonical correct solutions, but rather in terms of measurable properties like time taken, resources consumed, or adherence to a general end-to-end specification (the memory allocator allocates disjoint regions and has no memory leaks).  Numerical metrics are best for this purpose.</p>
Why#</span></a>
</h3>
Phrasing goals in terms of metrics with clear utility (time, etc...) provides a baseline for students to understand what they're trying to accomplish.  Using second-order metrics (vs e.g., examining the RA tree produced by the query optimizer) helps students to think in terms of how to apply the abstract concepts from class, instead of thinking in terms of how to replicate the teacher's solution.  Using numerical metrics creates a gradient of 'rightness' that helps students to appreciate how well they're doing.</p>
Example#</span></a>
</h3>
Database students are evaluated in terms of the time taken to evaluate a query.  Times are selected to encourage students to either use course materials or express a high degree of personal creativity.</p>

Allow multiple submissions#</span></a>
</h2>
Use automated online grading tools to allow students the opportunity to experiment and submit multiple times.  Be sure that the automated tools provide extensive feedback.</p>
Why#</span></a>
</h3>
Experimentation creates opportunities for emergent learning.  Concepts that students 'reinvent' on their own are internalized much better than material conveyed in a lecture, or even applied to a project.  Furthermore, multiple submissions give struggling students a reason to keep trying and learn from their failures, rather than being petrified of the possibility of failure.</p>
Example#</span></a>
</h3>
The database project has an online submission system that evaluates the performance of the student's code.  When the student receives any less than a perfect grade, it also provides a detailed report about where and how the student's project failed.  Although this results in much longer office hour queues, I've found students to be far more invested in learning how they can do better when there's an opportunity for grade improvement.  Also, it creates opportunities for emergent learning.  I'm occasionally approached by groups who propose 'improvements' to a project that amount to material that I haven't covered yet.  Discovering it rather than just hearing it allows these students to develop a much better understanding of that material.</p>

Grade the ends, not the means#</span></a>
</h2>
Do not penalize alternative or unexpected solutions.  If necessary, revise the project description for later offerings of the class.</p>
Why#</span></a>
</h3>
In the real world, the students' goal is to solve problems, so let them do that.  It's possible that a student may come up with a solution that does not achieve the pedagogical goals of the project.  This will often happen repeatedly.  When it does, ask yourself why the alternative strategy would not work in the real world and alter your evaluation strategy to be a more realistic simulation.  Often, you'll find that you learn something new about your subject area.</p>
Example#</span></a>
</h3>
Databases had a project evaluation that was designed to encourage students to use indexes, a type of data structure that uses precomputation to enable fast access to data.  Students were given an un-timed precomputation period, and a timed query period.  Some students noticed that the queries were all the same and had the bright idea of pre-computing the query results during the un-timed period.  In the following year, the evaluation system was modified to pose a different query with every test run.</p>
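A minimal sketch of that kind of evaluation loop is below. It is not the actual CSE-562 grading system: the `grade_run` function, the two toy "submissions," and the lookup-style queries are all hypothetical stand-ins for real SQL workloads.

```python
import time

def grade_run(precompute, answer, data, run_number):
    """Grade one test run: an un-timed precomputation step, then one timed query."""
    state = precompute(data)                  # un-timed period
    keys = sorted(data)
    key = keys[run_number % len(keys)]        # pose a different query on each run
    start = time.perf_counter()
    result = answer(state, key)               # timed period
    elapsed = time.perf_counter() - start
    return result == data[key], elapsed

data = {f"k{i}": i * i for i in range(100)}

# Intended solution: build an index up front, then answer queries with a lookup.
def build_index(table):
    return dict(table)

def lookup(index, key):
    return index[key]

# The shortcut: precompute the answer to the one query that used to be asked.
def cache_first_answer(table):
    return table["k0"]

def replay_cached(cached, key):
    return cached

index_ok, _ = grade_run(build_index, lookup, data, run_number=1)
shortcut_ok, _ = grade_run(cache_first_answer, replay_cached, data, run_number=1)
```

Varying the query per run leaves the index-based solution unaffected, while the cached-answer shortcut only passes on the single run that happens to ask the query it memorized.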

First grades, then ego#</span></a>
</h2>
Use grades to evaluate baseline</strong> student understanding of the material.  Encourage those who are interested in the subject to go further by using non-graded rewards like ego-boosters (top-10 lists), baked goods, or perks like exemption from class chores.  Friendly competitions or stretch goals can give students who are already excited a reason to push their limits, and to encourage their peers to push as well.</p>
Why#</span></a>
</h3>
Although it would be nice if everyone was as enthusiastic about your course as you, not every subject 'clicks' for everyone.  That's ok.  You can't force someone to enjoy a subject.  Ultimately, you have two goals: (1) Ensure that everyone</strong> in the class understands the material, and (2) Kindle a passion for the subject for some</strong> members of the class.  Often, teachers focus on one of these so much that they forget about the other.  Passionate students often don't need a strong reason to push their limits.  Non-graded rewards give those students direction and an outlet for their passion, without penalizing those students who just aren't interested.</p>
Example#</span></a>
</h3>
The requirements for an 'A' in the database project are comparatively relaxed, and I describe in pretty graphic detail the design of a query processing engine that will achieve an 'A' grade.

However, the online grading system also lists the top-performing groups with their query processing times in sorted order.  Depending on class composition, friendly competitions have emerged between the top-performing groups.  I sometimes reward the best performing groups with baked goods.</p>


CSE 662 Demo Day
2015-12-14T00:00:00+00:00
I'd like to give a shout out to the students in CSE-662 and CSE-749 for a fantastic demo day.  Luke, Karthik, and I have been receiving tons of positive feedback about your energy, enthusiasm, and the general awesomeness of what you did.  I hope y'all had as much fun with it as we did.</p>
Also, Aziz Mohainsen took some great photos of the event</a>.  Thanks Aziz!</p>


Stratos Idreos at UB-CSE
2015-10-30T00:00:00+00:00
For those asking about Stratos' slides, you can find them here</a>.</p>


In case iCloud login breaks your computer
2015-10-23T00:00:00+00:00
This was discovered through trial and error, rather than through web searches... so for anyone who's run into this problem, here's what worked.</p>
First, the problem.  I recently reset my iCloud password.  I was foolish enough to have my account set up for iCloud logins.  This... was a mistake.  After I changed my password, iCloud pushed a password update to all my computers... a broken update.  In other words, I lost the ability to log in or authenticate to anything.</p>
Take note!  If you have iCloud login enabled, disable it or risk being hosed like this.</p>
Now the fix.</p>

	You'll need a bootable USB key or flash disk.  You can use the hackintosh boot disk creator over at tonymac</a> to create one of these.  You may be able to use the recovery partition (hold down cmd-R while starting up)... but this seems to have mixed results.  It works on some hardware configurations, but not on mine.  Either way, you want to get to the installer/recovery partition boot screen.</li>
	
In the Utilities menu, choose 'Terminal'</li>
	
Type `resetpassword` without the quotes.</li>
	
Select your boot partition</li>
	
Select your account from the pop-up</li>
	
Pick a temporary 'recovery password'.  This can be anything, you'll only need it for the next ten minutes or so.</li>
	
Type in your recovery password twice (into the password and verify fields).</li>
	
Click save</li>
	
Now the important part.   Make sure your computer cannot connect to the internet</strong>.  Unplug your network cables, disconnect your router, go into a Faraday cage if you have to.  Avoid internet connectivity at all costs.</li>
	
Back in the terminal window type `reboot` without the quotes</li>
	
Use your recovery password to log in, regardless of what the login prompt tells you.</li>
	
Open the User Accounts preference pane, and change your password again</strong>, this time to a permanent password.  Be sure to unlink the password from iCloud if asked.  If asked for your iCloud password, use your recovery password instead.  This is your iCloud password as far as your computer knows.</li>
	
After you reset your password again, and after you're sure you're unlinked from iCloud, you can connect to the internet once again.</li>
</ol>


Oliver @ CSE 501
2015-09-17T00:00:00+00:00
Slides can be found here</a>.</p>


PocketData @ TPC-TC
2015-08-31T00:00:00+00:00
I presented our work on Pocket Data</a> today at TPC-TC</a>.  See the slides here</a>, and download the dataset here</a>.  Many thanks to Jerry, Geoff, Luke, and the entire PhoneLab project for making this paper happen.</p>


What if Databases Could Answer Incorrectly?
2015-08-13T00:00:00+00:00

(an open letter to the database community)</p>
For as long as databases have existed, they have held themselves to an invariant.  This invariant has become so ingrained into the psyche of database theoreticians, researchers, and designers that even the few who have tried to break it have only done so with cumbersome data models, by involving huge warning signs, or by using similarly obnoxious user interfaces.  The invariant that I'm talking about is that a database must never give the user an incorrect answer</strong>.  </span></p>
Admittedly, this invariant has been broken now and again: Approximate (née Online) Query Processing uses sampling to satisfy user-provided bounds, Probabilistic and Uncertain Databases work with underspecified data, while Model Databases allow users to query graphical models.  Yet, even in these cases, we as a community feel compelled to force the user to suffer immeasurable pain and anguish for the sin of working with uncertain data.  Probabilistic databases are impenetrable to anyone without a degree in statistics.  Every single AQP system and model database adds</span> arcane syntax to SQL that allows users to specify how much uncertainty they're willing to tolerate, or worse still, requires a magical frontend that screams at the top of its lungs about just how bad the results that it's producing are.</span></p>
Enough is enough!</strong></p>
</div>


Ontology for Insider Attacks @ MIST 2015
2015-08-11T00:00:00+00:00
(A much delayed) Congrats to Gokhan and Shambhu for getting their paper "A Preliminary Cyber Ontology for Insider Threats in the Financial Sector</a>" accepted at MIST 2015.  This paper is part of the Insider Threats</a> project.</p>


1 Month of SQLite Smartphone Logs at TPC-TC
2015-07-08T00:00:00+00:00
To appear at TPC-TC in Hawaii: An in-depth analysis</a> of 1 month of SQLite query logs on PhoneLab</a>.  We found quite a few surprising things... :)</p>
Great job Jerry, Geoff, and Luke, on a great paper!</p>


Lenses @ VLDB
2015-06-06T00:00:00+00:00
Ying and Niccolo's paper on Lenses (camera-ready here</a>) and intuitive uncertainty management was accepted at VLDB 2015. Congratulations to them, and everyone else involved in the Mimir</a> project, including co-authors Ronny Fehling and Zhen Hua Liu from Oracle.</p>


Conference Summary - CIDR
2015-03-13T00:00:00+00:00
CIDR</a> is a biennial conference focusing on new ideas and directions for the database community.  Topics presented range from early and mid stage systems efforts, to proposals for radical changes in the direction of database research.  A focus of CIDR is the Gong Show, a sequence of 5-minute talks about literally anything.</p>
Humanizing Data</h2>
One theme running through this year's CIDR was a recognition of the scale and scope of database research and technology moving further and further away from common use cases.  This idea was  especially evident in Jens Dittrich's Gong Show Talk "The Case for Small Data Management</a>", where he argued that the number of organizations actually dealing with petabytes of data in practice was incredibly small and that our efforts would have the biggest impact when targeted at realistic data sizes.  Brown mirrored this vision in  Tupleware</a>, noting that increasingly the limiting factor for most small-scale users was computational complexity and expressiveness rather than data sizes.</p>
There was a significant focus on areas where HCI and Databases could unify their efforts.  Trifacta</a> presented on some work on using predictive modeling</a> to simplify data transformation development, and Google presented their efforts to simplify data integration</a>.</p>
As always, abstractions for data management were quite popular, and we saw even</a> more</a> abstractions that treat probabilistic models as views.  There was even an entire</a> panel</a> on</a> managing and querying knowledge.</p>
Reabsorption of Specialized DBs</h2>
A similar apparent trend was the observation that specialized database systems were no longer needed.  To paraphrase one attendee, we've realized that working with graph data is basically doing lots of self-joins and recursive queries, and realizing that, we can optimize general-purpose database engines to be just as good.  This view manifested in several ways: EPFL's Vida</a> was one of several efforts to create an overlay on top of specialized database systems, creating an abstract, uniform view of the data.  Wisconsin made a case against specialized graph engines</a>, and Oracle presented their approach to dynamically indexing semistructured data</a>.</p>
Data 'Swamps' and Low-Quality Data</h2>
One subject that triggered quite a bit of discussion with the audience was the growing need to manage low-quality data.  The term "Data Lake" was particularly abrasive, as numerous attendees pointed out that without curation, a data lake can quickly turn into a data swamp</a>.  Numerous</a> efforts</a> to improve</a> this curation</a> process</a> were</a> presented.</p>
Other Directions</h2>
A few other directions stuck out.  Super-aggressive, bare-metal query compilation</a> to raw hardware</a> is becoming even more of a thing, and I noticed an increased interest in database security</a>,  access control</a>, and trust</a>.</p>


CSE 562 Syllabus is live
2015-01-14T00:00:00+00:00
Hot off the presses.  See what lies in store for graduate databases</a>.</p>


maybe we got a HotMobile Paper
2014-12-17T00:00:00+00:00
The maybe</tt> statement is a new code primitive that allows developers to harness the power of nondeterminism in their code.  Through a collaboration between three labs at UB, we are building compiler, infrastructure, and analytics support that will help developers to write safer, faster, and more adaptable mobile applications.  See our paper</a> at this year's HotMobile</a>.</p>
Congrats to Jerry, Nick, Anudipa, Anandatirtha, Guru, Sriram, Jinghao, Luke, and especially Geoff for putting together a great paper.</p>


Just in Time Data Structures @ CIDR
2014-11-21T00:00:00+00:00
Adaptive indexing is a promising alternative to classical offline index optimization. Under adaptive indexing, index creation and re-organization take place automatically and incrementally as a side-effect of query execution. Adaptive indexing implementations optimize the index's structure by progressively rewriting it until it converges to a single idealized form such as a sorted array or B-Tree. However, the ideal representation changes over time: An adaptive index that is initially optimal for one workload becomes suboptimal as the workload's characteristics change.</p>
In this paper recently accepted at CIDR</a> we generalize adaptive indexing, adding the ability to adjust the layout and behavior of the index to workload changes even after convergence. This radical just-in-time data structure approach to index construction and maintenance allows for indexes that dynamically adapt to changing workloads. Even with this generality, specialization is still possible. A just-in-time data structure emulates classical adaptive indexing schemes when appropriate, while also being able to adopt a hybrid stance tailored to a specific workload. We show that our approach is feasible and enables indexes that quickly pivot between different behaviors.</p>


Ying @ Systems Lunch
2014-11-02T00:00:00+00:00
Ying Yang will be reprising her VLDB PhD workshop presentation about  "On-Demand Data Cleaning" at the UB Systems Lunch on November 7th.  We hope to see everyone there.</p>


Tentative CIDR 2015 Program Posted
2014-10-22T00:00:00+00:00
The lineup's looking pretty sweet...</p>
http://www.cidrdb.org/cidr2015/CIDRSessions.pdf</span></a></p>


Congrats Niccolo and Jan!
2014-10-22T00:00:00+00:00
Congratulations to Niccolò and Jan on being accepted at SIGMOD 2015 for their paper "Output-sensitive Evaluation of Prioritized Skyline Queries"</strong>!</p>


CSTA @nalytics Workshop
2014-10-17T00:00:00+00:00
Oliver is presenting a workshop on Data @nalytics at The WNY-CSTA Fall Conference</strong>.  Hello to all the high-school teachers in attendance!</p>

	The Presentation</a></li>
	
The Files</a></li>
</ul>


JITDs @ CIDR
2014-10-14T00:00:00+00:00
Our paper on Just-In-Time Data Structures</a> (JITDs) has been accepted at CIDR 2015</a>!</p>


Gathering Data, Interactive Programming, and Analysis
2013-10-20T00:00:00+00:00
Data exploration is an interactive process. Let's say I have a dataset… I want to ask questions about it.  Often though, I'm not going to have a precise idea of what questions I want to ask, even if I do have a vague sense of them.  I want to be able to explore the data.</p>
So what's standing in the way of me doing that?</p>
Gathering the data:</strong> It's possible that the data is not immediately available and needs to be gathered.  Even if I know what I'm looking for, I might not immediately have access to the data that I'm looking for.  Before anything, I need to find the data that I'm interested in, and (if necessary) transport it to somewhere that allows me to compute over it.</p>
Structuring the data</strong>: Data pulled from the outside world needs to be put into a structured form before any sort of automated analysis.  This may be as simple as parsing (e.g., a CSV file), or more complex: I might be able to extract all manner of features from a log file, for example.  I might split based on records, based on lines, or even based on sets of records.  I might be interested in writing a parser that pulls out certain features from the log entry -- the timestamp, the message, or the component causing the alert.  This is a bit of an ad-hoc process -- I may only be interested in specific patterns and subsets of the data now, but that might change as I explore more of the data.</p>
Cleaning the data:</strong> Even after I've imposed some structure on the data, there's no guarantee that the data is 'correct'.  Strange entries, outliers, and missing or corrupted data will make any results I obtain useless.  At this stage, one typically goes through a set of sanity checks, examining schema warnings from the previous stage, asserting constraints like key dependencies, and validating against secondary data sources.  I may also want to apply my domain knowledge; Past experiences may have given me a sense of what could go wrong with my data collection process.</p>
Query processing:</strong> Finally, I'm ready to actually manipulate the data.  This means transforming the data into a form that matches what you need -- merging datasets, rotating/pivoting the data, and/or filtering out entries of interest, for example.</p>
Visualization:</strong> A step in the process that's often associated with this last query processing stage is summary and visualization; Obtaining aggregates, samples, and/or graphical representations of the data is a crucial part of the entire analytical development process.  (1) As I'm gathering the data, I need to be able to see bits and pieces of it so that I can be sure that it's what I'm looking for, (2) As  I'm structuring the data, I want to make sure that my regular expression and/or parsing scheme is correct, (3) As I'm cleaning the data, I want to see/visualize outliers, and (4) obviously, I want to see the results.</p>
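The structuring stage described above can be sketched in a few lines of Python.  This is only an illustration: the log format (`<ISO timestamp> [<component>] <message>`), the sample lines, and the names `LOG_LINE`, `parse_line`, and `structure` are all assumptions made for the example.  Lines that fail to parse are kept aside rather than dropped, so they can be inspected during the cleaning stage.

```python
import re
from datetime import datetime

# Hypothetical log format: "<ISO timestamp> [<component>] <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+) \[(?P<component>[^\]]+)\] (?P<message>.*)$"
)

def parse_line(line):
    """Return a dict of extracted features, or None if the line does not match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    record = match.groupdict()
    record["timestamp"] = datetime.fromisoformat(record["timestamp"])
    return record

def structure(lines):
    """Split raw lines into structured records and unparsed leftovers."""
    records, rejects = [], []
    for line in lines:
        record = parse_line(line)
        if record is None:
            rejects.append(line)   # keep for later inspection
        else:
            records.append(record)
    return records, rejects

# Made-up sample input for illustration only.
raw = [
    "2013-10-20T09:15:00 [disk] usage above threshold",
    "not a log entry",
    "2013-10-20T09:16:30 [net] link restored",
]
records, rejects = structure(raw)
print(len(records), "parsed;", len(rejects), "rejected")
```

Changing which features are extracted only means editing the pattern and re-running, which fits the ad-hoc, iterative nature of this stage.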
Really, each of these aspects of analysis is interrelated.  One bounces back and forth between different stages, gathering more data, parsing out more fields, cleaning, etc… A strong analytical pipeline relies on being able to see the data quickly, see results even if they're only estimates, and then go back to iterate on your analysis.  </p>
How do we achieve this?  What kind of interfaces can we build to improve feedback, and to anticipate the user's needs?  What infrastructures are needed to support this kind of anticipatory computation?</p>


Finding truth in the bits
2013-10-07T00:00:00+00:00
What is truth, and what is data?</p>
At the very least, they're different.  Ask any scientist, and they'll caution you about conflating the two: Data includes measurements and observations, mere points and samples of the whole of the universe.  </p>
This may seem a bit philosophical, but my point is that, while there is often a strong correlation between data and truth, the two are distinct.  Even in the best case, when working with data of perfect quality, it represents only a subset of a bigger picture.  And data is very infrequently of perfect quality.  Substantial massaging is often required to get data into a standardized form for analysis.  As data is being massaged, assumptions are often made about the data: Floats are cast to integers, Comment fields are dropped or ignored, Extenuating circumstances or outliers are rolled into the core data.  The data cleaner's assumptions are being applied to interpret the data.</p>
That's not to say that these transformations are bad.  Substantial effort goes into data cleaning efforts.  But the fact is that when you run a query on the database, it's important to realize that what you're getting back is data, and not truth.</p>
It might be nice to have a database that acknowledges this distinction.  </p>
What would such a database look like?</p>
I envision a database with two (or more) layers, each layer providing a view over the layer below it.  The bottom layer would consist of the base data, intact, unchanged, and as-gathered.  The uppermost layer would represent "truth".  The base data is completely deterministic; We know these values precisely, but the values themselves may be wrong or not representative.  As we travel up the levels, we get to progressively lower levels of determinism.  Queries run on the higher levels are guaranteed to provide "true" results, but may emit annotated results, ranges of possible results, probability distributions, or simply say "I don't know."  </p>
The crucial challenge then, is how do we make such a database usable?  How can this process be integrated into a normal data cleaning workflow with minimal changes and/or overheads?</p>


Why text editors are bad for programmers.
2013-09-21T00:00:00+00:00
Let me start with a bit of a history lesson.  For over a decade now, we've known a particularly annoying quirk of Moore's law.  That's the "law" that says that the number of transistors</em></strong> on a chip doubles roughly every one and a half years.  A lot of people, however, interpreted Moore's law as meaning that the speed</em></strong> of processors would double at that rate.  For a while, that was indeed the case.</p>
Then, somewhere around 2005 or so, we hit a roadblock.  The standard bag of tricks for converting more transistors to more speed (e.g., deeper pipelines, redundancy for overclocking) started to run dry.  Mind you, we were still getting more and more transistors on the die, but we couldn't use those to make things faster.</p>
Intel et al. had more transistors.  Since they couldn't make them do things faster, they made the transistors do more.  Enter multicore.  </p>
This scared a lot of people, many of whom had been banking on the false interpretation of Moore's law: if it runs slow now, it'll run just fine in 2-4 years.  With CPU designers shifting their emphasis to multicore, getting that kind of speedup meant reorganizing your code to run in parallel.  The natural speedups of yesteryear were no longer "free", and the research community shifted to ways of exploiting natural parallelism in user code.</p>
And that brings us to the present, as well as my thought of the week.  Text editors are fundamentally bad for programming.</em></strong>  </p>
I know that sounds a bit radical, but hear me out.  The fundamental data representation of a text editor is serial: A string of instructions.  For a serial program, this is perfect.  The order is there, and the computer knows exactly how to execute these instructions serially.  A text editor encourages people to think serially about their code.  For parallel programs, however, this is a horrible idea.</p>
What we as the research and software development communities need to explore are non-linear approaches to representing code. Graph-based data-flow diagrams are a start.  For example, one (admittedly crude from a full development standpoint) nonlinear programming environment is Yahoo Pipes.  </p>
Nonlinear features can be increasingly found in IDEs and programming models as well: Eclipse's "Go to {Definition, Call Sites, …}" features (now present in nearly every other major IDE) are canonical examples of this, making it easier to mentally trace nonlinear code execution paths. Models like Map Reduce compartmentalize parallel computations (each Mapper/Reducer is a separate class), forcing a developer to consider them as individual components of a bigger program.  </p>
Now, that level of serial thinking is still necessary.  CPUs still operate one instruction at a time, but can we do better?  Can we create a programming environment that actively encourages users to compartmentalize their computations?</p>
Consider the following simple program:</p>
A = input.right;</pre>
foreach(i in input.left){ B += i.left; C += (i.right > 0?i.right:i.left) }</pre>
B += input.right;</pre>
And compare to the following structure: </p>
</p>
The parallelism is inherently visible, and easy to follow -- even if the rest of the graph may not be.</p>
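To make the independence concrete, here is a rough Python rendering of the program above, split into three tasks.  It assumes that B and C start at zero, that `input.left` is a list of items with `left` and `right` fields, and that `input.right` is a number; the class and function names are invented for the sketch.  In CPython, threads will not actually speed up arithmetic like this, so the executor is only there to show the structure.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Item:
    left: int
    right: int

@dataclass
class Input:
    left: list   # list of Item
    right: int

def compute_a(inp):
    # A = input.right
    return inp.right

def compute_b(inp):
    # B accumulates i.left over the loop, then adds input.right
    return sum(item.left for item in inp.left) + inp.right

def compute_c(inp):
    # C accumulates i.right when positive, otherwise i.left
    return sum(item.right if item.right > 0 else item.left for item in inp.left)

def run(inp):
    # A, B, and C share no intermediate state, so each can be its own task.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(f, inp) for f in (compute_a, compute_b, compute_c)]
        return tuple(f.result() for f in futures)

example = Input(left=[Item(1, 5), Item(2, -3), Item(4, 0)], right=10)
a, b, c = run(example)
print(a, b, c)
```

Nothing in the serial text says that the three results are independent; the reader has to work that out by tracing which variables each statement touches.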
How can we replace the text editor?  How can we come up with better ways to represent data flow in a computation?  Can we take cues from programming environments like Alice or Apple's Automator?  Can we create a non-linear text editor… something that inherently displays the branching structure of a program?</p>


Log as a Service (Part 2 of 2)
2013-07-28T00:00:00+00:00
A few weeks ago, I started introducing Laasie, our new system for building powerful collaborative web applications.  We introduced the primitive interface for managing state -- state land.  This week, I'm going to provide a quick introduction to Laasie's more powerful abstraction for manipulating state -- log land.  </p>
Log Land</h3>
Laasie represents application state not just in terms of its precise value at any given point in time, but also as a DAG of state changes.  Any DAG can be resolved (reified) into a concrete state by treating the DAG as a partial order, and evaluating the state updates in a compatible total order, starting with an initial "empty" state.  This particular representation of application state is quite powerful, as it allows us to access the full history of the application's state. We can track who made a change, when it was made, and what it depends on.  </p>
More precisely, every time an update is written to Laasie, a new log entry is recorded for it.  Laasie then creates a set of pointers from the log entry to all log entries on which it depends, and can establish additional pointers if necessary (e.g., to an undo record, or to the last log entry for the value being modified).  The resulting set of log entries and pointers forms the Log DAG.</p>
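As a minimal sketch of reification (not Laasie's actual implementation), the Python below treats each log entry as a function from state to state, orders the entries with a topological sort so that every entry runs after the entries it depends on, and folds them over an empty initial state.  The `reify` name, the entry identifiers, and the dictionary-based state are assumptions made for illustration.  The sketch also assumes that entries with no ordering between them commute, since otherwise different compatible total orders could produce different states.

```python
from graphlib import TopologicalSorter

def reify(entries, depends_on, initial=None):
    """Resolve a log DAG into a concrete state.

    entries:    {entry_id: function(state) -> new state}
    depends_on: {entry_id: set of entry_ids that must be applied first}
    """
    state = dict(initial or {})
    graph = {entry_id: depends_on.get(entry_id, set()) for entry_id in entries}
    # static_order() yields each entry only after all of its dependencies.
    for entry_id in TopologicalSorter(graph).static_order():
        state = entries[entry_id](state)
    return state

# Hypothetical log: two independent chains of updates.
entries = {
    "e1": lambda s: {**s, "title": "draft"},
    "e2": lambda s: {**s, "count": 1},
    "e3": lambda s: {**s, "count": s["count"] + 2},
    "e4": lambda s: {**s, "title": s["title"].upper()},
}
depends_on = {"e3": {"e2"}, "e4": {"e1"}}

print(reify(entries, depends_on))
```

Because the dependency pointers are explicit, the same structure can answer history questions (for example, which entries a given update depends on) without replaying the log.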
So what does the Log DAG buy us?  Well, it's a more powerful way of doing log analysis.  Typical graph properties such as (conditioned) reachability, isomorphism, cyclicity, and min-cut arise quite frequently when discussing optimal management of application state.  Using this simple abstraction allows us to create a single data management system capable of encoding a broad range of application-specific optimizations.</p>
We're currently exploring analysis in log land using SPARQL.  It turns out that a surprising number of properties can be mapped directly into SPARQL with only a very small BarQL equivalent.  These include properties like reachability from the root (i.e., for garbage collection), commutativity, and recoverability (i.e., for operations like merges).  </p>
We're on the verge of releasing an analysis tool for Laasie-generated logs, the first step towards both online and offline optimization of application state in Laasie.</p>


Expressiveness vs Efficiency
2013-07-21T00:00:00+00:00
There's an odd dichotomy that hit me recently.  On the one hand, there's been a big recent push towards DSLs, or domain specific languages.  Examples include Bloom (distributed computation for monotonic programs), the many DSLs implemented in Delite (which include things like matrix computations, ML algorithms, etc…), GraphLab, GraphChi, and so forth.</p>
On the other hand, people continue to want more expressive languages.  We keep adding more features to things like SQL (which has been Turing-complete for the last few years).  I understand this drive.  We want to be able to efficiently capture more ideas.  This idea of abstracting concepts is what computer science is all about.</p>
As I was discussing the design of an indexing data-structure with one of my students the other day, the weight of this dichotomy really hit me.  We were discussing building more and more corner cases into the data-structure (or rather into the objects that we were indexing).  This struck me as a bad idea, since I really hate corner cases.  On the other hand, a critical feature of the indexing data-structure was the ability to perform set-containment on the objects we were indexing.  </p>
As many of you know, if you allow a set description language to get too complex, set containment can easily work its way into NP or even intractability.   So there it was: a conundrum.  On the one hand, a complex language would give us more flexibility, and on the other, if we made it too complex, using the indexing structure would cost more time than it saved.</p>
That got me thinking.  Many problems that are intractable on a Turing-complete language become feasible on certain well-defined subsets of the language.  In fact, they may even be tractable on multiple well-defined subsets, potentially multiple non-overlapping subsets.</p>
And that's where DSLs come in.  A DSL allows you to specify a restricted form of a language that's far more amenable to optimization, analysis, and other useful features than a fully general language like C, Java, Python or Ruby.  </p>
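As a toy illustration of how a restricted language keeps set containment cheap, suppose (purely as an assumption for this example, not the indexing structure discussed above) that a set description is a conjunction of closed ranges, one per attribute.  Containment then reduces to one comparison per attribute.  The names `contains`, `is_empty`, and the sample attributes are hypothetical.

```python
import math

UNBOUNDED = (-math.inf, math.inf)

def is_empty(box):
    """A description with an inverted range matches nothing."""
    return any(lo > hi for lo, hi in box.values())

def contains(outer, inner):
    """True if every point described by `inner` is also described by `outer`.

    Each description maps attribute -> (low, high); a missing attribute
    means that attribute is unconstrained.
    """
    if is_empty(inner):
        return True   # the empty set is contained in everything
    for attr, (outer_lo, outer_hi) in outer.items():
        inner_lo, inner_hi = inner.get(attr, UNBOUNDED)
        if inner_lo < outer_lo or inner_hi > outer_hi:
            return False
    return True

# Made-up descriptions for illustration.
adults = {"age": (18, 120)}
young_locals = {"age": (18, 25), "zip": (14200, 14299)}

print(contains(adults, young_locals))   # young_locals is the narrower description
print(contains(young_locals, adults))
```

Adding disjunction, negation, or arbitrary predicates to the description language would make this check progressively harder, which is the tradeoff the post describes.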
Often, the DSL doesn't even need to live outside the confines of a general language.  Bloom has a Ruby-based implementation that exposes the full (Turing-complete) power of Ruby for those program fragments that can't be easily expressed in Bloom's framework.  Scala has a SQL compatibility layer that transforms a specific fragment of Scala into equivalent relational operators (similar to .NET's LINQ, but more tightly coupled with the language).  </p>
This… this is super cool, because it suggests that different DSLs can live and cooperate in the same language (you see some of this in the Delite framework already).  It also suggests that certain fragments of the language might translate naturally into a corresponding DSL's infrastructure.  Why is this cool?  Because it means you might be able to get the best of both worlds -- expressiveness and efficiency.  </p>
Imagine a language that could automatically analyze your program to identify the specific language fragment best suited to encoding it.  Although there might be some cost-estimation factors to help decide between multiple different language fragments, this actually seems like it might be doable with pure static analysis.  Such an analysis tool might also be able to identify trouble spots in your program -- point to specific operations that prevent it from descending into a specific program fragment.</p>
Just a thought.</p>


SIGMOD Wrapup
2013-07-12T00:00:00+00:00
This year's SIGMOD/PODS was quite exciting.  Attended by over 800 students, researchers, and members of industry, the DB community is more vibrant than ever.</p>
The highlight for me was a new event at this year's PODS, a panel discussion on future trends in Database research</a>.  Many of the speakers discussed specialized forms of data processing where creative ideas were needed: a particularly impassioned plea came from Andrew McCallum, who argued for a tighter coupling between the database, machine learning, and data mining communities.  This sentiment was echoed by a number of the panelists, who suggested that database researchers had dropped the ball on the challenge of "Big Data", allowing it to be defined almost exclusively in terms of data-mining and systems challenges.  Social Graph Databases, Astronomy (e.g., Skyserver), and similar projects were put forth as areas where peta-scale (or larger) query processing is critical.  </p>
Joe Hellerstein made some interesting points that I saw echoed throughout the remainder of the conference: He mentioned the almost obvious parallel between communication and storage, namely that communication is a form of messaging to the future.  The primary distinction lies between who is responsible for what -- In storage, the sender is responsible for doing the work to put the message/signal someplace where the recipient can easily retrieve it.  Conversely, in communication, the recipient is responsible for listening and waiting for the message/signal to show itself.  Parallels exist throughout the DB community, query processing vs stream processing being the obvious example.  I saw this sentiment echoed throughout the conference, as papers like the latest PIQL offering</a> suggested the need to revisit the tradeoffs between pre-computation and online query processing.  </p>
A third theme that arose both at the panel discussion and throughout the conference was consistency.  Between Joe's CALM conjecture</a>, an excellent tutorial by Phil Bernstein and Sudipto Das, and other chatter throughout the conference, it seems clear that consistency and the CAP theorem are once again rearing their ugly heads.  The key takeaway from all of this seems to be that each application has different consistency requirements, and the underlying platform needs to establish a clear, understandable contract with the programmer about what "consistency" means.  Through DSLs and other platforms, we are once again talking about how to figure out what kind of consistency an application requires.</p>
Hardware continues to be a growing trend, and over the past few years, I've been seeing a shift towards (Eric Sedlar's prediction of) specialized hardware for databases.  An interesting point in this space is a measurement paper out of EPFL</a> where it is observed that instruction cache misses are a major bottleneck in query processing.  Pinar's suggested solution to this is that we devote individual cores to specific tasks that fit entirely into an instruction cache.</p>
I've been seeing a lot more effort on crowdsourcing.  In particular, the field seems to be shifting towards more specialized forms of crowdsourcing -- focusing the crowdsourcing efforts on domain specialists and data mining the results of such queries.  One paper on crowdmining</a> discussed efforts to infer causal connections and trends in data by querying users for instantiations of these trends.</p>
And that's all...  pretty jazzy if I do say so.</p>


Log as a Service (Part 1 of 2)
2013-06-27T00:00:00+00:00
Last week I introduced some of the hype behind our new project: Laasie.  This week, let me delve into some of the technical details.  Although for simplicity, I'll be using the present tense, please keep in mind that what I'm about to describe is work in progress.  We're hard at work implementing these, and will release when-it's-ready (tm blizzard entertainment).  </p>
So, let's get to it.  There are two state abstractions in Laasie: state land, and log land.  I'll address each of these independently.</p>
State Land</h3>
State land is what application developers interact with directly, and is most easily thought of as a big JSON object.  Those familiar with MongoDB, Pig Latin, or JaQL should feel right at home here.  However, Laasie provides a powerful set of abstractions for developing collaborative web applications.  That is, although it has a RESTful API, Laasie's true power lies in its state replication and programmatic update features.  Let's see what they can do.</p>
Reads</h4>
In a normal REST API, reads are performed by a client specifying a key (or equivalently a path) of interest.  The infrastructure looks up the value stored under that key, passes it to the client, and the interaction is complete.  </p>
Object read(path)</pre>
Laasie on the other hand, is designed for state replication.  In the Laasie model, reads operate (conceptually) in three stages:</p>

When a client first connects, it requests a session token by providing a path of interest and any relevant authentication tokens (e.g., username/password).  </li>
Using the session token, the client initializes its state.  This is analogous to a RESTful read, except that the requested value is returned along with a state token (effectively a timestamp).</li>
Using its session and state tokens, a client can request an update: a javascript function that, if executed, will transform one version of the state into the next.  This returns a new state token.</li>
</ol>
SessionTok createSession(path, client_identity)</pre>
{Object, StateTok} initSession(SessionTok)</pre>
{function(x) -> newx, StateTok} updateSession(SessionTok, StateTok)</pre>
The update function is typically going to be smaller than reading the entire state from scratch, making this an ideal way to keep clients up-to-date.  Also note that we can use blocking HTTP requests to support PULL-style functionality in updateSession, while keeping control over updates in the hands of the client.  This is crucial for disconnect-heavy settings like mobile computing, where browser-based apps are extremely common.</p>
We plan to develop client-side libraries (e.g., in Javascript) to simplify the task of state maintenance.  Such a library will essentially maintain a local copy of the requested object and manage updates. </p>
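To make the three-stage read concrete, here is a minimal sketch of what such a client-side replica might look like.  Everything in it is an assumption made for illustration: MockLaasieServer is an in-memory stand-in rather than the real Laasie service, the method names simply mirror the three signatures above, the state token is modeled as a position in an update log, and the write method exists only so the example has something to replicate.

```python
import copy


class MockLaasieServer:
    """In-memory stand-in for the three read calls (hypothetical, not the real API)."""

    def __init__(self, state):
        self.state = state      # current value of the shared object
        self.log = []           # update functions, in arrival order
        self.sessions = {}      # session token -> path of interest

    def create_session(self, path, client_identity):
        # The mock ignores authentication and always serves the whole object.
        token = f"session-{len(self.sessions)}"
        self.sessions[token] = path
        return token

    def init_session(self, session_tok):
        # Full read, plus a state token (here: the current length of the log).
        return copy.deepcopy(self.state), len(self.log)

    def update_session(self, session_tok, state_tok):
        # Bundle every update the client has not seen yet into one function.
        pending = self.log[state_tok:]

        def delta(x):
            for update in pending:
                x = update(x)
            return x

        return delta, len(self.log)

    def write(self, update):
        # Updates are pure functions: old state in, new state out.
        self.log.append(update)
        self.state = update(self.state)


class Replica:
    """Keeps a local copy of the object and applies deltas on request."""

    def __init__(self, server, path, identity):
        self.server = server
        self.session = server.create_session(path, identity)
        self.value, self.state_tok = server.init_session(self.session)

    def refresh(self):
        delta, self.state_tok = self.server.update_session(self.session, self.state_tok)
        self.value = delta(self.value)


server = MockLaasieServer({"x": 1, "y": 10})
replica = Replica(server, "/", "alice")
server.write(lambda s: {**s, "x": 5})
server.write(lambda s: {**s, "y": s["y"] + 2})
replica.refresh()
print(replica.value)
```

The replica never re-reads the full object after initialization; it only asks for the function that carries it from its own state token to the current one.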
Writes</h4>
Like reads, Laasie exposes a more powerful write API.  Laasie allows developers to express updates as functions.  Although we expect many of these functions to be simple (overwrite value X, add 2 to value Y), the API is actually quite powerful, and we plan to add more features, domain-specific extensions and DSLs over time.  The full extent of this language is more than I want to get into in this post, but if you're familiar with Pig Latin or JaQL, you should feel right at home.  </p>
An important feature of Laasie is that these update functions are transmitted and stored as-is in Laasie (Laasie doesn't typically evaluate them).  Instead, Laasie uses the update's semantics to identify and evaluate potential optimizations.  Next week, I'll get into how Laasie does this, and show how infrastructure managers can use Laasie's second abstraction: log land, to extend Laasie's optimization capabilities with application-specific optimizations.</p>


Laasie: Building the next generation of collaborative applications
2013-06-22T00:00:00+00:00
With the first Laasie paper</a> (ever) being presented tomorrow at WebDB (part of SIGMOD), I thought it might be a good idea to explain the hubbub.  What is Laasie?</p>
The short version is that it's an incremental state replication and persistence infrastructure, targeted mostly at web applications.  In particular, we're focusing on a class of collaborative applications, where multiple users interact with the same application state simultaneously.  A commonly known instance of such applications is the Google Docs office suite.  Multiple users viewing the same document can simultaneously both view and edit the document.</p>
For Developers</span></strong></p>
The goal of Laasie is to provide an infrastructure on which the next generation of collaborative applications can be built.  For developers, this means that the infrastructure should fade into the background.  The entire development process should proceed (almost) as if one were writing a single-site application.  To use the MVC paradigm as a basis, Laasie acts as the M(odel), persisting your data and making sure each client has a shared view of it, and making sure that clients can revive themselves after the fact.</p>
Not only does Laasie make it easier for you to get your collaborative application off the ground, it also provides a range of useful features.  In addition to some fun access control, sanity checking, and sandboxing capabilities, our eventual goal will be to provide support for distributed Laasie instances.  End users requiring offline support, added privacy, or similar features will be able to instantiate their own Laasie instances, which will "just work" with your application.  </p>
For Researchers</span></strong></p>
The primary challenge of providing such an infrastructure is the question of how we represent state updates.  The more general you get, the harder it is to be efficient.  </p>
To wit, we could transfer the full state on every single update (this is roughly what Dropbox does).  This is certainly quite general, and allows us to express any sort of state change that we like.  On the other hand, it's a bit hard to implement efficiently.  This is why you don't see many distributed applications that use Dropbox for this purpose (as a shared filesystem perhaps, but not for low-latency sharing).</p>
At the other end of the spectrum, there are a whole range of optimizations you can implement.  Knowing that two operations are commutative (or that there's an applicable operational transform) creates a simpler, leaner, more efficient consistency model.  Being able to subdivide an application's state allows client instances to pull only relevant data, or changes to fragments of the state.  Bulk changes to structured data (numbers, collections, matrices, images) can often be transmitted more efficiently as a description of the change (add 1 to every number in this collection).  You could create an infrastructure that was super-optimized and tailored specifically to your application.  Unfortunately, then you've tied the infrastructure to your application's semantics.  If those semantics change (e.g., you add features), you need to change the infrastructure.  </p>
The core insight of Laasie is that functions (aka procedures, aka monads, etc...) are a way of representing state updates that is both general (not Turing complete yet, but we're getting there) and still amenable to optimization.  Because the full application semantics are expressed in the update, it is possible to analyze each update, assert properties about updates, and more generally, to restructure and optimize the overall state representation.</p>
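As a rough illustration of why updates-as-data are amenable to optimization, the sketch below stores each update as a small description and rewrites a log of them into a shorter, equivalent log without ever evaluating it against a state.  The (op, key, arg) tuple format and the two operations ("set" and "add") are assumptions invented for this example; they are not Laasie's actual update language.

```python
def apply(state, update):
    """Evaluate one update description against a state, returning a new state."""
    op, key, arg = update
    new_state = dict(state)
    if op == "set":
        new_state[key] = arg
    elif op == "add":
        new_state[key] = new_state.get(key, 0) + arg
    else:
        raise ValueError(f"unknown op: {op}")
    return new_state


def replay(state, log):
    """Apply every update in the log, in order."""
    for update in log:
        state = apply(state, update)
    return state


def compact(log):
    """Rewrite the log into a shorter equivalent one using only update semantics."""
    out = []
    for op, key, arg in log:
        if op == "set":
            # A set overwrites whatever earlier updates did to the same key.
            out = [u for u in out if u[1] != key]
            out.append((op, key, arg))
        elif op == "add" and out and out[-1][1] == key:
            # Fold the add into the previous update on this key.
            prev_op, _, prev_arg = out[-1]
            out[-1] = (prev_op, key, prev_arg + arg)
        else:
            out.append((op, key, arg))
    return out


log = [("add", "y", 2), ("add", "y", 3), ("set", "x", 1), ("add", "x", 4), ("set", "y", 0)]
start = {"x": 7, "y": 1}
print(compact(log))
print(replay(start, log) == replay(start, compact(log)))
```

The rewrite is only valid because each update declares which key it touches and what it does there; an opaque blob of new state would offer no such handle.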
More on this next week, when I introduce the Log as a Service state representation.</p>


Are you sure?
2013-06-16T00:00:00+00:00
In the last 5 years or so, we've experienced a dramatic shift in how we interact with computers.  As early as the late 90s, we had fairly reasonable speech-to-text and speech-to-command software.  Now, though, we've seen tools from Yahoo, Microsoft, Google, and most recently (and publicly) Apple's Siri that allow us to make perfectly arbitrary verbal requests of our computers, and have them be answered.  </p>
Still, this interaction remains mostly unidirectional.  The user makes requests, Siri et al. go out and fetch the responses and present them to the user.  What could we do if we had more, if our computers had the ability to come up to us and ask us</span> for information?  For example, I could ask my computer to make me a reservation at a nearby restaurant at around 6, and to invite a few of my friends.  </p>
Granted, there are systems integration issues here -- the restaurant and all of my friends need to be using the same (or at least compatible) scheduling systems.  That's not necessarily out of the question -- CalDAV has evolved as a pretty reasonable scheduling exchange system, and there's room for some upstarts to come in and create a compatibility layer between iCloud, Exchange, Google Calendar, and other related systems (this would be really frigging cool if someone were to do it).  </p>
Let's put that aside for now, and look at the core challenge of answering the question itself.  Scheduling is a huge (worse than NP) problem, largely because it's hard to convey every nuanced detail of a person's preferences and expectations.  There's a degree of uncertainty that comes from everything a person says.  When I ask for a reservation "around 6", it may be reasonable for that reservation to occur at 7.  When I ask for a nearby restaurant, what does that mean?  Walking distance, biking distance, or driving distance?  </p>
How do I specify which friends I'm looking to meet?  Clearly I don't want my computer going through my entire address book.  Once I've specified them, there's uncertainty.  My friends might not be able to make specific times.  "Maybe" has to be a perfectly reasonable answer to the question of "Do you want to meet at 6?"  The computer now has to take this into account.  The computer can create a set of different possibilities.  Making it even worse, it may well be the case that none of the possibilities fulfill the stated objectives.  If two or three friends have mutually exclusive schedules, one of them will need to be dropped.  Now there are multiple possibilities for how the stated objectives can be relaxed.  </p>
Ultimately, this boils down to three significant problems:</p>

When the user asks the computer an open-ended question, how can degrees of freedom in the query be extrapolated?</li>
When the user asks the computer an open-ended question, how can the degrees of freedom be prioritized (i.e., can we extrapolate a cost curve for each degree of freedom)?</li>
When the user asks the computer an open-ended question with no possible answers (or the user asks for more possibilities), how can we infer additional degrees of freedom?</li>
</ul>
The field of preference databases tries to address efficient query processing when there are degrees of freedom like this, but most of this work assumes that a structured query (and cost curve) is (are) already available.  How do we impose this kind of cost model on the query?  How do we infer it from the user's verbal statement?  </p>
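For a sense of what the easy half of the problem looks like, here is a minimal sketch that assumes the cost curves are already given.  The function name, the linear cost per hour of deviation from the requested time, and the flat penalty per dropped friend are all made up for illustration; inferring such costs from a spoken request is exactly the open question.

```python
def best_plan(requested_hour, candidate_hours, availability,
              cost_per_hour=1.0, cost_per_dropped_friend=3.0):
    """Pick the cheapest relaxation of "everyone, at the requested hour".

    availability maps each friend to the set of hours they can make.
    Returns (cost, hour, friends_attending).
    """
    friends = sorted(availability)
    plans = []
    for hour in candidate_hours:
        attending = [f for f in friends if hour in availability[f]]
        dropped = len(friends) - len(attending)
        cost = (cost_per_hour * abs(hour - requested_hour)
                + cost_per_dropped_friend * dropped)
        plans.append((cost, hour, attending))
    # Ties on cost are broken by the earlier hour.
    return min(plans)


availability = {"ann": {18, 19}, "bob": {19, 20}, "cat": {17, 18}}
print(best_plan(18, [17, 18, 19, 20], availability))
```

With these particular weights, no hour suits all three friends, so the cheapest plan keeps the requested time and drops one friend rather than moving the reservation.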
Let's take this in another (related) direction.  What happens when the computer needs to know something from you?  Say you're one of the friends and are being asked whether you can make a 6:00 dinner appointment.  Maybe you're interacting with the computer to diagnose an issue (e.g., with your car).  What happens when the computer asks you a question, and you don't know what the answer is?  "I don't know" is a reasonable answer.  "I don't know, but I will know in 30 minutes" is another.  There is a range of answers ("Maybe", "Possibly", "I think so", "I don't think so", etc.), all meaning that there are two possible outcomes.  How can these possibilities be effectively communicated to the originator of the query?  If you're diagnosing car troubles, how does the computer deal with this?  It's a different class of information than "No".  There's some work here for the NLP community -- Can we quantify the level of uncertainty associated with a qualitatively uncertain statement of fact?</p>
There's another class of responses to such questions.  A reasonable response that a friend might give is "If I'm still available by 4:00."  In effect, the user has provided an uncertain answer, but one with a specific resolution strategy.  At the current moment, the answer is uncertain, but at 4:00, the database is triggered and springs into action, resolving the uncertainty and creating a new set of constraints. </p>
Anyhow, these are just some random thoughts on a pretty cool problem space.</p>


Never tell me the odds
2013-05-24T00:00:00+00:00
A while back, I had a series of articles on probabilistic databases, and shortcomings thereof.  As a quick recap, probabilistic databases are databases that allow you to express data in terms of probability distributions instead of precise values.  Such representations have a number of potential applications, such as developing and analyzing hypothetical "what-if" scenarios, or avoiding information loss due to errors in data (e.g., if the data comes from OCR software).</p>
One of the conclusions that I reached was that people don't like working with probabilities.  Qualitative results are typically more meaningful to an end-user than quantitative ones.  Worse still, unless your data comes from some sort of automated source (like OCR software), how probabilities should be assigned is often unclear.  This is something statisticians get paid big money to do.  Expecting end-users to arbitrarily assign probabilities to data that they're not completely certain about is silly.</p>
So... where does this leave us?  Well, fortunately, a lot of work in the probabilistic database area (especially more recent stuff like [1,2,3]) leaves the exact nature of the underlying probability distributions open to the end-user.  Conceptually, there's nothing to stop us from sticking something more qualitative in its place.  The question is what?</p>
Here's one thought.  Users may not have a good sense of assigning precise probabilities, but they can certainly tell you whether a data value is definitively correct, or just a guess (maybe even something more, like an "educated guess" or a "possibly incorrect fact", but let's keep things simple).  In fact, you can get lots of users to give you this kind of information -- different users might even have differing guesses or "definitive" values.  When queries are posed on the data, you might get many possible outputs -- different guesses (or definitive values) can each produce a different query output.  Now each output can be annotated with the set of users who support (or contradict) it.</p>
This effectively forms a lattice of outputs, providing at least a partial order over outputs.  We can do things like give a skyline of the most likely answers.  We can use techniques like web of trust to find answers from people a user is likely to support, or use various measurements of past accuracy to identify users who are likely to provide accurate guesses.  If we have a way of validating guesses (e.g., ground truth eventually becomes available), users can also be ranked.  Low performing users might even be identified and contacted with suggestions about how to improve their guesses.</p>
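One way to read the partial order, sketched below under a simplifying assumption: an output dominates another when its set of supporting users strictly contains the other's.  The skyline function and the sample data are invented for illustration, and contradicting users, trust, and past accuracy are all left out.

```python
def skyline(outputs):
    """Return the outputs that no other output dominates.

    outputs maps each candidate query output to the set of users supporting it.
    An output is dominated when its supporters are a strict subset of another's.
    """
    return sorted(
        out for out, supporters in outputs.items()
        if not any(supporters < other for other in outputs.values())
    )


outputs = {
    "42": {"ann", "bob"},
    "43": {"ann"},
    "57": {"cat"},
}
print(skyline(outputs))
```

Outputs "42" and "57" are incomparable (neither supporter set contains the other), which is why the order is only partial and the skyline can hold more than one answer.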
----------</p>
[1] Green, T.J. et al. 2007. Provenance Semirings. (New York, New York, USA, 2007), 31–40.</p>
[2] Huang, J. et al. 2009. MayBMS: a probabilistic database management system. (New York, New York, USA, 2009), 1071.</p>
[3] Kennedy, O. and Koch, C. 2010. PIP: A database system for great and small expectations. (2010), 157–168.</p>


Languages with first-class ORM primitives
2013-05-04T00:00:00+00:00
I was at a seminar on Object-Relational Mappers</a> (ORMs) recently.  The idea behind these is actually quite simple: they're a persistence layer for object-oriented languages.  Through a little bit of glue code injected into the language's runtime engine, and some introspection tricks, object instances are transparently mirrored to a persistence layer like a database engine.  </p>
Having a database sitting behind an ORM actually provides some nifty functionality.  In particular, you can pose queries over object instances, classes, and so forth.  Often, these queries can be posed in a database-agnostic way (i.e., without using SQL).  </p>
This is quite handy, since it gives object-oriented developers the power and optimization tricks of a declarative query processor.  For example, in an application that manages a school's student population, you might have a relation that represents all of the students mapped to instances of a "student" object.  The student object exposes functionality that might be performed on a student (i.e., register, etc...), and has access to all of the data available about the student.  The developer can actually pose queries, and get all of the object instances that satisfy some predicate (e.g., all students with a GPA > 3.5).</p>
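As a toy version of the student example, the sketch below maps rows of a table to instances of a Student class using Python's standard sqlite3 module.  The class, its where helper, and the table layout are assumptions made for illustration; a real ORM would generate the SQL itself rather than accept a predicate string.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Student:
    name: str
    gpa: float

    def register(self, course):
        # Stand-in for behavior that lives on the object, not in the database.
        return f"{self.name} registered for {course}"

    @classmethod
    def where(cls, conn, predicate, params=()):
        # The predicate is spliced into the SQL, so it must come from trusted
        # code; values are still passed separately as bound parameters.
        rows = conn.execute(
            f"SELECT name, gpa FROM student WHERE {predicate} ORDER BY name",
            params,
        )
        return [cls(*row) for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("Ada", 3.9), ("Bob", 3.2), ("Cy", 3.7)],
)

honors = Student.where(conn, "gpa > ?", (3.5,))
print(honors)
print(honors[0].register("Databases"))
```

The filtering happens inside the database engine, and what comes back are ordinary objects with methods.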
That got me thinking.  I've seen this before.</p>
set myFiles to the documents of the application where the name of the owner is "Oliver"</pre>
This is an example of a language called Applescript, Apple's answer to shell scripting back in the early 90s.  The language still exists, and is occasionally used for automating tasks on OS X -- most often as a wrapper around shell scripts (if you've ever set up a Raspberry Pi on a Mac, you know what I mean).  </p>
The clever thing about Applescript is that the language includes first class query primitives.  A large fragment of the language actually has a direct correspondence to relational algebra.  For example, the above Applescript code fragment could be rewritten in SQL as:</p>
CREATE TEMPORARY VIEW myFiles AS</pre>
  SELECT * FROM application.documents WHERE owner.name = 'Oliver';</pre>
Applescript is designed to work very closely with applications.  Each application and/or system component provides what's called a "Dictionary" which includes nouns (object classes) and verbs (object methods).  That is, Applescript allows applications and system components to expose objects via predefined schemas.  These objects can be queried just as easily as a normal language would operate on them.</p>
I'd like to see more such things.  Even now, ORMs feel like they're bolting query operations onto the language, as sort of a hack.  This is true even for ORM-like functionality in DSL-friendly languages like Ruby (e.g. Ruby on Rails).  It seems like this sort of query functionality needs to appear in the language from the ground up -- all the way from the design of the grammar.  </p>
Getting a new language, a sort of successor to Applescript that supported this kind of functionality, would be awesome, especially if it could tie into an existing language like Java, Python, etc...  Especially if it could tie into ORM functionality, connect to a database, and do all sorts of other tricks like that.  That... would be really cool.</p>


Semantics as Data
2013-04-09T00:00:00+00:00
Something I've been getting drawn to more and more is the idea of computation as data.  </p>
This is one of the core precepts in PL and computation: any sort of computation can be encoded as data.  Yet, this doesn't fully capture the essence of what I've been seeing.  Sure you can encode computation as data, but then what do you do with it?  How do you make use of the fact that semantics can be encoded?</p>
Let's take this question from another perspective.  In Databases, we're used to imposing semantics on data.  Data has meaning because we chose to give it meaning.  The number 100,000 is meaningless, until I tell you that it's the average salary of an employee at BigCorporateCo.  Nevertheless, we can still ask questions in the abstract.  Whatever semantics you use, 100,000 < 120,000.  We can create abstractions (query languages) that allow us to ask questions about data, regardless of their semantics.</p>
By comparison, an encoded computation carries its own semantics.  This makes it harder to analyze, as the nature of those semantics is limited only by the type of encoding used to store the computation.  But this doesn't stop us from asking questions about the computation.</p>
 </p>
The Computation's Effects</h3>
The simplest thing we can do is to ask a question about what it will compute.  These questions span the range from the trivial to the typically intractable.  For example, we can ask about…</p>

… what the computation will produce given a specific input, or a specific set of inputs.  </li>
… what inputs will produce a given (range of) output(s).  </li>
… whether a particular output is possible.  </li>
… whether two computations are equivalent.</li>
</ul>
One particularly fun example in this space is Oracle's Expression type [1].  An Expression stores (as a datatype) an arbitrary boolean expression with variables.  The result of evaluating this expression on a given valuation of the variables can be injected into the WHERE clause of any SELECT statement.  Notably, Expression objects can be indexed</strong> based on variable valuations.  Given 3 such expressions: (A = 3), (A = 5), (A = 7), we can build an index to identify which expressions are satisfied for a particular valuation of A.</p>
I find this beyond cool.  Not only can Expression objects themselves be queried, it's actually possible to build index structures to accelerate those queries.</p>
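To illustrate the idea of indexing expressions by the valuations that satisfy them, here is a minimal sketch restricted to equality tests.  It is not how Oracle implements the Expression type; the class and its methods are hypothetical and only handle expressions of the form "variable = constant".

```python
from collections import defaultdict


class EqualityExpressionIndex:
    """Index expressions of the form `variable = constant` (illustrative only)."""

    def __init__(self):
        # (variable, constant) -> ids of the expressions that test for it
        self._by_value = defaultdict(set)

    def add(self, expr_id, variable, constant):
        self._by_value[(variable, constant)].add(expr_id)

    def satisfied_by(self, valuation):
        """Return the ids of all stored expressions that hold under `valuation`."""
        matches = set()
        for variable, value in valuation.items():
            # One hash lookup per variable, instead of evaluating every expression.
            matches |= self._by_value.get((variable, value), set())
        return matches


index = EqualityExpressionIndex()
index.add("e1", "A", 3)
index.add("e2", "A", 5)
index.add("e3", "A", 7)
print(index.satisfied_by({"A": 5}))
```

The lookup cost depends on the number of variables in the valuation rather than on the number of stored expressions, which is the point of indexing them.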
Those familiar with probabilistic databases will note some convenient parallels between the expression type and Condition Columns used in C-Tables.  Indeed, the concepts are almost identical.  A C-Table encodes the semantics of the queries that went into its construction.  When we compute a confidence in a C-Table (or row), what we're effectively asking about is the fraction of the input space that the C-Table (row) produces an output for.</p>
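The "fraction of the input space" reading can be sketched directly if we assume small finite domains and treat every valuation as equally likely.  Both assumptions are mine, made to keep the example short; the confidence function and the sample condition are likewise hypothetical, and real probabilistic databases attach actual distributions to the variables.

```python
from fractions import Fraction
from itertools import product


def confidence(condition, domains):
    """Fraction of variable valuations for which `condition` holds.

    domains maps each variable name to the list of values it may take;
    every combination is assumed to be equally likely.
    """
    names = sorted(domains)
    valuations = [
        dict(zip(names, values))
        for values in product(*(domains[name] for name in names))
    ]
    satisfied = sum(1 for valuation in valuations if condition(valuation))
    return Fraction(satisfied, len(valuations))


# Condition column for one row: the row exists when A > 3 and B = 1.
row_condition = lambda v: v["A"] > 3 and v["B"] == 1
print(confidence(row_condition, {"A": [3, 5, 7], "B": [0, 1]}))
```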
 </p>
Inter-Computation Relationships</h3>
Another class of questions is how different computations, or computation fragments relate or interact.  For example, we can ask about…</p>

… what the algebraic properties of a computation are (i.e., do two computations commute)</li>
… what the dependencies of a computation are.</li>
… given a sequence of computations, what does the information flow graph look like</li>
… given a sequence of computations, does a specific pattern exist, and if so on which computation fragments?</li>
</ul>
This is an area that has not been explored quite as extensively.  Distributed computing has looked long and hard at some of these questions (i.e., when do operations commute), but almost always in a specific context.  Probably the closest idea, spiritually, appears in systems like Delite [2]. These sorts of compiler generation tools allow users to establish semantic restrictions on a domain specific language that lead to powerful optimizations.  In a sense, these kinds of queries regarding computation interactions are also a form of optimization... but more general.</p>
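The commutativity question, at least, is easy to probe empirically.  The sketch below checks whether two state updates produce the same result in either order on a handful of sample states; the function and update names are invented for illustration, and passing the check is evidence rather than proof.

```python
def commute_on(f, g, sample_states):
    """Check that applying f then g equals g then f on every sample state."""
    return all(f(g(state)) == g(f(state)) for state in sample_states)


def add_two(state):
    return {**state, "y": state["y"] + 2}


def add_five(state):
    return {**state, "y": state["y"] + 5}


def set_zero(state):
    return {**state, "y": 0}


samples = [{"y": n} for n in range(-3, 4)]
print(commute_on(add_two, add_five, samples))
print(commute_on(add_two, set_zero, samples))
```

Two additions can be reordered freely, while an addition and an overwrite cannot, which is the kind of fact a replication layer can exploit.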
 </p>
Combining Computations</h3>
Ultimately, one of the biggest distinctions between computation and normal data, is that it's possible to easily combine computation.  Computation representations such as Monads</a> are explicitly designed for this, but even simple iterative programs can still be concatenated.  Computations can be broken apart, stitched together, sliced, diced, and sorted every which way... and the result of each is still more computation.</p>
 </p>
Summary</h3>
Where is this leading?  Nowhere specific.  We have a variety of tools and techniques for expressing computation, and now we need some tricks and techniques for effectively querying them as well.</p>
 </p>
References</h3>
[1] Gawlick, D. et al. Applications for expression data in relational database systems. 609–620.</p>
[2] Chafi, H. et al. 2010. Language virtualization for heterogeneous parallel computing. ACM Sigplan …</em>. (2010).</p>
 </p>
 </p>
 </p>


The Analizerificationist
2013-02-11T00:00:00+00:00
There's been a lot of talk lately about "wisdom of the crowd" and "tapping the collective consciousness" and the like, so I figure I might as well weigh in with my 2c, by expanding on an idea that came up in a conversation I had recently with one of my colleagues Jan Chomicki and his student Ying.  (Credit should also go to Dieter Gawlick and Zhen Hua Lu of Oracle, who provided inspiration for this discussion.)</p>
Recently, especially in high profile events like the US presidential election, classical political punditry has been getting supplemented (and even in some cases replaced) by data mining algorithms.  Powerful, and often quite accurate algorithms exist to predict anything from elections, to ball games, to the stock market, to what you will be doing next Tuesday evening at 6:41 PM.  </p>
Yet, in spite of the daunting array of algorithmic predictors that exist out there, there's still more to be done.  Data mining is almost more of an art form than a science -- Yes, there are practical, general purpose techniques for finding correlations, outliers, and other interesting features of datasets, but ultimately, you need to know (or at least have a general sense of) what you're looking for.  A lot of the beautiful work in data mining lies in finding clever ways to apply the general techniques to specific datasets.  </p>
So... where does the wisdom of the crowd come in?  Well, let's start with tools like Google Fusion Tables</a>, or Yahoo Pipes</a>.  Here, we have some pretty nifty mechanisms for doing data extraction and analysis, even dataset lookup and organization.  Can we do any better?</p>
What's missing from these systems is a way of organizing the derivation process.  So you've created a great visualization, and maybe you've even shared it with your friends.  Now how can we take your efforts and use them to benefit even more people?  </p>
Let's say you have an idea.  You think you know exactly how to predict the next election, but it will require a lot of data.  What do you need to do?  Well, first, you'll have to find and/or extract all that data from content on the internet.  Here, Fusion Tables and Pipes have you covered.  There are some fairly high-quality datasets available, as well as some nifty tools for getting useful data out of the interwebs.  But now that you have it, you'll still need to massage it a bit.  </p>
Fortunately for you, it's quite likely that someone else has had to do data manipulations on similar datasets.  It would be quite useful to have a system that could point you towards such efforts on the part of other people so that you might base your own efforts on theirs.  As an added benefit, it might be possible to piggyback on the computational efforts already expended for the prior attempt(s) at massaging similar datasets.  </p>
Now that the data is in the right form to be analyzed, there's still that pesky analysis to be done.  Here, once again, the system has the potential to help.  What questions have other people asked about similar data?  What kind of aggregate values might be useful?  What kind of visualizations might be appropriate?  Are there mash-ups that people have assembled out of similar data (Google Maps as the most general example)?  What even qualifies as "similar" data?</p>
In fact, this works from both directions.  Let's say you know what kind of information you're looking for.  How could you ask the system for strategies that other people have applied to get similar answers?  How would you even indicate what you're looking for to the system in the first place?</p>


Using Constraints to Define "Correctness"
2013-02-02T00:00:00+00:00
Data curation is the act of taking data and massaging it into a form that can be analyzed.  There's a common saying among DBAs and Librarians that data curation is the biggest time sink of data management.  I can certainly cite a number of examples of this.  My wildlife biologist girlfriend spends almost as much time organizing and inputting data as she does out in the field collecting it, or analyzing it.  The kind folks working on the DBLP do nothing but data curation.  If you squint a bit, data mining can be thought of as a specialized form of data curation, where signal (usable/analyzable data) is extracted from noisy, messy data.  </p>
In short, this is an area that a lot of people spend a lot of time worrying about.  It's also an area on which a lot of people have expended a considerable amount of effort.  </p>
Why is that?</p>
Although data curation is an extremely repetitive task (suggesting that it might be ideal for computers), the kernel of this repetition, the very heart of this task, is something entirely nontrivial: data validation.  </p>

Do both of these datasets contain information about John Doe?</li>
Is John Doe the same person as Johnny Doe?</li>
Is "The House at the End of the Row, Birminghamshire, England" a valid address?</li>
How do I deal with John Doe not having a home phone number?</li>
</ul>
When analyzing data, just like writing code, we make certain assumptions about the data.  For example, "This dataset contains one row for each unique individual".  If these assumptions are invalid, then our analysis will be incorrect.  In addition to getting the data into the right, readable format, data curation's primary task is to ensure that the assumptions that analysts make about the data are valid.</p>
Of course, this requires us to explicitly declare these assumptions.  Databases have a mechanism for this, called constraints (e.g., Primary Key constraints, Foreign Key constraints, Validation Triggers, etc...). However, even these are flawed.</p>
Let's take the example I mentioned just now: "This dataset contains one row for each unique individual".  This is a nontrivial example to encode.  How does a database figure out whether two individuals are identical?  "Joe" and "Joey" could be different names for the same person.  Deduplication is something people have studied for a very very long time, and even now they don't have a particularly good solution that's 100% correct all the time.</p>
Moreover, what happens when the database detects such a violation?  The typical solution is for the database to simply reject any insertion that would cause the constraints to be violated.  </p>
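The reject-on-violation behavior is easy to see with Python's standard sqlite3 module.  The table and the sample rows below are made up for illustration: a primary key stands in for "one row per individual", and a NOT NULL constraint stands in for "everyone has a phone number".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT PRIMARY KEY, phone TEXT NOT NULL)")
conn.execute("INSERT INTO person VALUES ('John Doe', '555-0100')")

rejected = []
candidates = [
    ("John Doe", "555-0199"),    # exact duplicate name: violates the primary key
    ("Jane Doe", None),          # no phone number: violates NOT NULL
    ("Johnny Doe", "555-0100"),  # possibly the same person, but passes both checks
]
for row in candidates:
    try:
        conn.execute("INSERT INTO person VALUES (?, ?)", row)
    except sqlite3.IntegrityError as err:
        # The database refuses the row outright; nothing is stored for it.
        rejected.append((row[0], str(err)))

names = [name for (name,) in conn.execute("SELECT name FROM person ORDER BY name")]
print(rejected)
print(names)
```

The constraint catches the exact duplicate and the missing phone number, but "Johnny Doe" is accepted, so the assumption the analyst actually cares about (one row per unique individual) may still be violated.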
This typically annoys users, who, at least in the short term, just want to load their data into the database.  Consequently, constraints are used quite infrequently, and then, usually only for values that the database itself generates (e.g., entity ids/counters).  Specifying more complex constraints is just out of the question.</p>
Although constraints automate the data validation process, they are insufficient.  There's a clear tension between how tightly we specify these constraints (e.g., An address is always a number, followed by a street, followed by a newline, followed by a city, etc ...), and how usable the database is.  Extremely tight constraints are convenient for the analysts and database programmers, since they can make stronger assumptions about the data... but make it difficult to insert data into the database (and hard to handle corner cases, like addresses in rural England).  Weak constraints are the exact opposite.  </p>
Finding the right balance between strong and weak constraints is hard.  It's a large part of data curation.  How much do you automate, and how much do you put on the analysts?</p>
Is there a middle ground?  Are there other ways of creating constraints that are strong enough to satisfy the analysts, but that don't make inserting data into the database a miserable experience?  How do we alert the analysts when a corner case appears in the database that violates their assumptions about the data?</p>


Procedural Stories and Constraint Satisfaction (Part 2 - Plot twists)
2013-01-22T00:00:00+00:00
After a 1 week absence due to school starting, this week we return with more on procedural story generators for games.  In my last post, I introduced the general idea of procedural story generation as a constraint satisfaction problem.  There, I introduced the idea of lazy evaluation -- where you generate only the information relevant to the current story.  As we'll see in a moment, this can actually be a lot of information.</p>
Nominally, one would envision interactions with a procedural generator as being entity-driven.  When the player(s) interact(s) with an entity (a town, a character, etc...), information about that entity is generated and extrapolated.  This is fine for a static world, but for the world to be interactive, things need to change.</p>
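A minimal sketch of the entity-driven, lazy approach for a static world: details about an entity are generated the first time a player interacts with it, and are derived from a world seed so that they come out the same on every playthrough.  The LazyWorld class and the attributes it generates are invented for illustration.

```python
import random


class LazyWorld:
    """Generates entity details on first interaction and remembers them."""

    def __init__(self, seed):
        self.seed = seed
        self.entities = {}   # only entities the players have touched so far

    def visit(self, name):
        if name not in self.entities:
            # Seed a private generator from the world seed and the entity name,
            # so the result does not depend on the order in which entities are visited.
            rng = random.Random(f"{self.seed}:{name}")
            self.entities[name] = {
                "population": rng.randint(50, 5000),
                "mood": rng.choice(["friendly", "wary", "hostile"]),
            }
        return self.entities[name]


world = LazyWorld(seed=7)
oakvale = world.visit("Oakvale")
print(len(world.entities), oakvale == LazyWorld(seed=7).visit("Oakvale"))
```

Nothing here handles change over time, which is where the trouble described next comes from.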
Furthermore, as with any game, players should be involved in the story.  They have the ability to interact with and affect the story.  However, just raw automated AI characters are unlikely to generate sufficient drama to keep players engaged.  We want stories to evolve in a way that keeps the players engaged -- generally by providing some direction to the story.  One possible approach to this is to inject certain drama-inducing plot fragments into the mix.  Plot twists, as it were, that spice up the story ("Your friend Bob is actually a foe" kind of things).  </p>
As any writer will tell you, a good plot-twist needs set-up.  It needs backstory.  Why is Bob angry at you?  Bob should have shown signs that, while maybe not immediately apparent, could indicate that he doesn't exactly have your best interests in mind. </p>
This is where the problem comes in.  This backstory (which I'll refer to as the groundwork for a plot twist), has to be injected into the story quite a bit before the actual twist occurs.  In other words, the generator needs to commit to a plot twist well in advance of the plot twist becoming apparent to the characters.  Worse still, a plot twist is generally going to be fairly open-ended; giving the procedural story generator an enormous number of possibilities (many of which could potentially conflict).  </p>
It gets even worse -- due to player actions, or other events within the story, the groundwork for a plot twist might be made entirely irrelevant.  For example, maybe one day when the players bring Bob on a hunt, he's accidentally mauled to death by a bear.  Balancing the groundwork for plot twists in a D&D game is tricky enough for a human to do... I'm not sure if it's even possible for a computer.</p>
Still, it would be interesting to see.</p>
(side note: I've been discussing this idea from the perspective that players interact with the story in a linear manner.  An interesting game mechanic that this sort of procedural story generator would enable is time-travel.  Allow players to provide a basic AI for their characters, and then bop back and forth, adjusting their character's behavior at various points in time to see how it changes the world)</p>


Procedural Stories and Constraint Satisfaction (Part 1)
2013-01-06T00:00:00+00:00
Ages ago, in the first issue of Dragon Magazine that I had ever read, there was an article on campaign preparation.  For those of you unfamiliar with it, Dungeons and Dragons is a game of cooperative storytelling.  One player, often referred to as the game- or dungeon-master, puts together the outline of a story, and the remaining players take up the role of characters in that story.  This outline, often referred to as a campaign, can also be thought of as the fragment of the story not under the direct control of the normal players' characters (PCs).  </p>
The point that the Dragon article was trying to make was that, like any story, a campaign must be self-consistent, or the players will lose their suspension of disbelief and stop being interested.  Over the course of the game, players will learn facts about the world, and the other characters in it (aka non-player characters, or NPCs).  For example, players might learn that a certain duke (let's call him Bob) lives in a particular city and hates ferrets.  If a later event calls for some duke to be present half a continent away, in a village known for its ferret breeders, then Bob is probably not the best choice for this particular role.</p>
If you squint hard, the problem of campaign construction begins to look a lot like a big constraint satisfaction problem.  You have certain entities in the world: NPCs, villages, organizations, etc..., and certain relationships between those entities.  Based on these relationships, entities in the world can take actions that change their respective relationships (e.g., one NPC leads an attack on a neighboring town and either succeeds or fails, changing the state of the world in the process).  </p>
The tricky bit is that a world is typically far too complex to try and simulate in real time.  I have yet to meet someone who runs a game of D&D by exhaustively deciding explicitly what all of his NPCs are doing at any given moment.  Rather, what I have seen most often is that a game master will put together an outline of a character's motivations, and maybe some general plans.  They might have a timeline (of varying complexity) that says what will happen and when.  However, especially since the story is meant to be driven by the other players, the exact details are never developed until they become relevant to the story.</p>
Think of this as a sort of Schroedinger's story.  The exact nature of the story can be entirely undetermined, until the players start to interact with it.  </p>
This can be a bit dangerous since, as with quantum particles, a game master can't afford to let the story get into an inconsistent state (not while preserving the players' suspension of disbelief).  In short, although the story is evaluated lazily, the lazy evaluation must avoid leading the story into an inconsistent state.</p>
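To make the lazy-but-consistent idea a little more concrete, here is a minimal sketch using the duke-and-ferrets example from above. Everything in it (the `LazyWorld` class, `cast_visiting_duke`, the attribute names, and the extra characters and places) is hypothetical and invented for illustration; a real generator would need a far richer notion of constraints.

```python
class LazyWorld:
    """Facts about the world are only pinned down when the story needs them."""

    def __init__(self):
        self.facts = {}  # (entity, attribute) -> committed value

    def commit(self, entity, attribute, value):
        """Record a fact, refusing anything that contradicts what is known."""
        key = (entity, attribute)
        known = self.facts.get(key)
        if known is not None and known != value:
            raise ValueError(f"inconsistent: {key} is already {known!r}")
        self.facts[key] = value


def cast_visiting_duke(world, dukes, village):
    """Pick a duke who can plausibly show up in `village`, then commit to it."""
    specialty = world.facts.get((village, "known_for"))
    for duke in dukes:
        if specialty is not None and world.facts.get((duke, "hates")) == specialty:
            continue  # e.g., Bob will not be visiting the ferret breeders
        world.commit(duke, "visited", village)
        return duke
    return None


world = LazyWorld()
world.commit("Bob", "home", "Capital")
world.commit("Bob", "hates", "ferrets")
world.commit("Ferretville", "known_for", "ferrets")

# Nothing is known about Alice yet, so she is free to fill the role.
chosen = cast_visiting_duke(world, ["Bob", "Alice"], "Ferretville")
```

Here `chosen` ends up as `"Alice"`: nothing about her was decided until the event needed a duke, and Bob was skipped because the facts the players already learned rule him out.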
 </p>


Happy New Year
2013-01-01T00:00:00+00:00
http://www.xthemage.net/newyear2013/</a></p>


Filesystems, Application Semantics, and Walled Gardens (Part 5)
2012-12-23T00:00:00+00:00
In the past weeks, I've been talking about how a persistence layer for web applications could be developed, informed by both the successes and failures of the iOS walled garden document model.  This week, I'll (hopefully) wrap up with a discussion of how three benefits of iOS can be incorporated into a web application filesystem.</p>
Interface Formats</span></p>
A significant virtue of the iOS document model is that applications are forced to get their data into standardized formats before exporting them out; Stripping off the application-specific metadata and passing only the stuff that another application is guaranteed to understand.  A similar phenomenon can be found, rooted in the oldschool idea of web mashups.  Each application builds on a standardized data representation (e.g., Google Maps, or Facebook Comments).  The data representation has a standardized interface, and allows users to put their own data on top of it.  </p>
There was once a part of the MacOS (back in the OS 7 days) called OpenDoc.  Although it didn't survive for political reasons, it actually featured some extremely nifty ideas.  The core concept was that of nested document types.  An application would register itself, not as a standalone component of the operating system capable of editing entire files, but rather as having the ability to edit data of a certain type.  For example, you would have a word processing application component.  These application components could be nested -- The word processing document could have graphical data embedded within it, and would allow the user to edit that graphical data through a fully-featured graphical editor.  </p>
Similar ideas have been brought up in the web world.  XML (and to a lesser extent HTML) are perfect examples of this.  XML's nested structure echoes this idea perfectly, and many a web application has been embedded into another through iframes.  </p>
The web is ideal for this sort of application design.  The persistence layer should encourage developers to structure their applications to work with this general layout.</p>
Presentation</span></p>
Filesystems are tricky.  People have different ways of identifying document data.  Although the filesystem has always forced us to use filenames, this archaic concept is rapidly being replaced in many contexts with internal identifiers that the user never needs to see or interact with.  Different types of data can be presented in different ways.  Short summaries of text (something commercial operating systems have automated the extraction of since the mid-90s) can be useful for paging through textual data.  Thumbnails are excellent for graphics.  Short previews work for video/audio data.  Header comments are useful overviews of code.  Even things like date/time last modified can be useful in the identification of what you're looking for.  In short, the filesystem needs to be able to work with the application in order to better visualize the data contained within.  The OSX quicklook feature is a great example of this, as each application can provide a plugin that quickly renders preview snapshots of individual files.  </p>
Non-Document Data</span></p>
Non-document data is, quite frankly, the hardest to work with.  Consider an address book, or a BibTeX bibliography manager.  You might have multiple individual address books, or multiple bibliography documents, but ultimately you want the data in these address books or bibliographies to be linked.  If your friend and coworker changes their phone, you want the phone number updated in both your personal and work contact lists.  If you find a typo in a bibliography entry, you want to fix the typo in all of your manuscripts.  Decentralization is critical.</p>
This in turn makes it hard to share data without running into the security concerns I discussed last week.  Logical groupings of entries are one way to manage access control, and specialized interface widgets are another.  What it also means is that to be efficient, the filesystem has to operate on a sub-document granularity.  For example, the HFS filesystem featured a quaint little idea called a resource fork.  In addition to the normal notion of a file as a sequence of bits (the data fork), each file also had a structured component.  The structured component contained lots of bits of data (resources), each individually addressable.  Moreover, there were standardized ways of accessing these bits of data.  For example, you could have a collection of icons in one file.  The operating system provided primitives that allowed any application to get into that file and access each of those icons as needed.  More recently, OSX's notion of Packages or Bundles achieves a similar end.  Applications are conceptually a single file, but have structured contents in standardized formats such as graphical data or XML/plist.</p>
In short, it is extremely helpful if the filesystem supports (securely) being able to drill down into data from other applications, no matter how it's structured.</p>
 </p>
Well, that's it for this thread.  On to new and wonderful things next week.  Happy Holidays, and qoSraj QI'lop jaj ghubDaQ.</p>


Filesystems, Application Semantics, and Walled Gardens (Part 4)
2012-12-17T00:00:00+00:00
For the past month, I've been writing about the similarity between the iOS document model, and that of most modern web applications.  After acknowledging the strengths of the iOS document model, last week I addressed the danger that such a similarity poses for the future of web applications.  </p>
Although we don't want to go wholeheartedly down the iOS route of walled garden filesystems, we can still learn from their efforts and successes.</p>
So... what is there to learn?</p>
Security</span></p>
Security is perhaps the most difficult to address, so let's start with it.  The strength of the iOS document model lies in getting users to actively indicate that they want to transfer control of a specific document between applications, rather than passively accepting that an application wants to access/change a document.  This is a critical distinction, because it forces the user to make a conscious decision to grant permissions instead of just rubber-stamping a request that pops up.</p>
Can we do something similar for web applications?  Part of the answer to this question depends on how we address the challenge of a web-application filesystem.  There's a huge space of possibilities here, so I'm going to adopt the most general form possible: There are three entities/trust domains in the system: your browser, one or more sites that hold your data, and one or more sites hosting the web applications you use.  They don't have to be separate, but I'll think about them as being so to keep things simple.</p>
Practically speaking, there are two separate levels of authorization to grant: access to read the data, and access to modify the data.  Let's break things down and address each in turn.</p>
Read access is easily the harder of the two to pin down, mostly due to limiting the scope of access.  Let's say I have a web application for email/messaging and a second application with my address book.  Clearly, the email client could make use of the address book data.  The absolute wrong way to do this is for the email/messaging application to put up a dialog box saying "Can I have access to your address book data?"  I'm not just talking about authorization issues here; Even gimmicks like Facebook/Twitter's application authorization tokens essentially boil down to the same thing: "Click button in order to use software".</p>
We need to get users to consciously decide that they want to transfer data between applications.  In its simplest form, this means, from within the address book application (or the filesystem storing the data) clicking on a standardized widget to "Open this data with email client".  </p>
Even this though, is somewhat awkward.  You don't want to have to do this every time your address book data changes, and you might not want to grant access to your entire address book.</p>
A second option would be to use the drag/drop metaphor.  Dragging contact information from one web application to another is a clear indication that a user wants to transfer access to the contact to the email application.  Still, this is somewhat awkward.  It would be nice to have address book support within the application itself.</p>
HTML/Javascript provide us with a third option.  Javascript provides us with a (securable) framework for importing widgets from one codebase into another.  I'm not sure how a secure implementation of this could be properly developed, but you could use javascript to modify the email application's text input field to pop up an autocomplete panel.  Selecting an autocompletion would be an explicit choice on the user's part to pass the contact information over to the email client.  Of course, now we've just reversed the problem -- This approach means that you need to find a way to get the user to agree to the address book modifying their email client.  Furthermore, browser security being what it is, you want some way to guarantee that the address book javascript code doesn't have access to the application state.</p>
Ok, that was a bunch of blathering about read security.  In most cases, an approach analogous to the traditional filesystem approach suffices: double click a document to open it, or bring up a widget that's part of the filesystem, which grants access to the selected data.  Any application-specific state can be kept separate, and inaccessible to other applications.</p>
So what about writes?</p>
I don't have a particularly good answer here.  The problem is that you don't want one application to do something to your data that breaks another application's functionality.  In part, this can be solved by keeping application-specific metadata separate from common state.  Most likely, the best approach here is to keep state in versioned form, like a distributed revision control system (e.g., Git).  If an application breaks something or deletes something, you provide the user with a way of going back and undoing some or all changes performed by the offending application.  This approach works reasonably well for Wikipedia, which is subject to a similar attacker model.  </p>
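As a rough sketch of what "versioned state with per-application undo" could look like, consider an append-only log in which every change is tagged with the application that made it. This is a deliberate simplification and not how Git (or Wikipedia) actually stores history; the `VersionedStore` class, the application names, and the keys are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    app: str       # which application made the change
    key: str
    value: object  # None means "deleted"


class VersionedStore:
    def __init__(self):
        self.log = []  # append-only history of every change

    def write(self, app, key, value):
        self.log.append(Change(app, key, value))

    def delete(self, app, key):
        self.log.append(Change(app, key, None))

    def snapshot(self, ignore_apps=()):
        """Replay the log, skipping changes made by the ignored applications."""
        state = {}
        for change in self.log:
            if change.app in ignore_apps:
                continue
            if change.value is None:
                state.pop(change.key, None)
            else:
                state[change.key] = change.value
        return state


store = VersionedStore()
store.write("addressbook", "bob.phone", "555-0100")
store.write("addressbook", "alice.phone", "555-0101")
store.write("rogue_app", "bob.phone", "000-0000")
store.delete("rogue_app", "alice.phone")

current = store.snapshot()
repaired = store.snapshot(ignore_apps={"rogue_app"})
```

Because nothing is ever overwritten in place, undoing everything the offending application did is just a matter of replaying the log without it. Undoing only some of its changes would need a finer-grained filter than the per-application one used here.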
And, as expected, I've gone pretty crazy with this discussion of security.  I'll wrap up for real next week with a discussion of the remaining three (smaller) points: Interface Formats, Presentation, and Non-Document Data</p>


Filesystems, Application Semantics, and Walled Gardens (Part 3)
2012-12-10T00:00:00+00:00
For the last two weeks, I've been discussing the beneficial aspects of the iOS walled garden document model.  As it turns out, there are quite a few.</p>
That said, let me emphatically state that this model is a bad idea.  It forces all stages of your document processing workflow to live in a single application (lest you suffer the pain and agony of tracking multiple document versions).  This in turn hurts a developer's ability to deliver functionality incrementally (e.g., a developer who wants to deliver a simple graphics filter has to develop a full graphics editing suite around it).  </p>
I regularly use two different text editors (SubEthaEdit and TeXShop) when editing LaTeX source, and the LaTeX compiler is a third application that needs to access the files.  Sometimes I script certain pieces of functionality (e.g. generating certain tables or graphs).  In a desktop processing environment, this is typically trivial.  Every bit of data is accessible through a shared filesystem.  The lack of a shared filesystem on iOS means that there might be five applications, each with 90% of the features required by your workflow, instead of one set of composable applications that provide all of those features.</p>
This is unacceptable.  However, iOS is Apple's sandbox, it's their prerogative to design it how they see fit.</p>
Unfortunately, this is not the only space where one sees the walled garden document model.</p>
Consider Google Docs, Office 365, and iCloud.  In theory, they're compatible.  They share compatible formats for word processing documents, spreadsheets, and presentations (even if it is the office format).  But, like a walled garden, each has its own domain.  If you want to edit a presentation in Google Docs, you upload it to Google Drive.  If you want to use it with iCloud, you import it into Keynote, and something similar has to happen if you want to use Office 365 or some other online presentation software (e.g., Presvo).</p>
Unlike with the iOS, this does not appear to have been a deliberate decision on the part of the application developers.  Every web application has its own mechanism for keeping persistent state, quite simply, because no user would use an application that deletes all your data whenever you quit your browser.  They persist state because it's a nice side effect of having to share data between multiple clients/browsers.  What they don't do is persist state because they expect another application to be able to start messing with that state.</p>
Let me put that another way.  Shared filesystems are (by design) a nice way for applications both to pass data between themselves and to maintain persistent state.  The mechanic behind both of these is identical (again, by design): Any application can write data to the filesystem, and any application can read data back out of the filesystem (modulo permissions).  In short, a desktop application that needs to store persistent state gets (essentially for free) the ability to automatically exchange data with other applications.</p>
Web applications have no shared filesystem.  WebDAV, the one contender that comes to mind, is frequently too unstable or slow for practical use, and made even harder to use by browser security models.  Worse still, while desktop application developers can typically assume the presence of some sort of disk drive in the device they're developing for, web application developers can't assume that all of their clients will have a WebDAV server somewhere.  </p>
Cloud storage solutions like Dropbox should present a potential solution, but I haven't seen a lot of uptake there either.  My guess is that this is a combination of the browser security model issue, and latency issues with realtime updates (If anyone reading this has a better idea, I'd love to hear it).  </p>
What this adds up to is that in order to store persistent state, web application developers have had to roll their own application-specific filesystems.  They need to persist state, but don't need to make their web application play nice with other web apps (admittedly, services like Google Drive now play nicely with desktop applications).  </p>
What we need is a filesystem for the web application world.  A system that extends an application's ability to persist state (and/or its ability to replicate and collaboratively edit state with multiple clients), into the ability to collaborate with other applications to form a much more powerful workflow.</p>
The logistics of deploying such a beast (to say nothing of the chicken/egg problem of getting both user and developer buy in) are beyond me at the moment, but it's something that I'd very much like to see happen.  Since this post is already getting fairly long, I'll wrap this segment up next week discussing the mechanics of such a filesystem (were it to exist), and how what we learned from iOS in the last two posts can be applied to it. </p>


Filesystems, Application Semantics, and Walled Gardens (Part 2)
2012-12-01T00:00:00+00:00
Last week, I started talking about some advantages of the iOS data model -- specifically its lack of a common filesystem.  I started by talking about the issue from a file format/metadata perspective.  Today, I look at three more benefits: Security, Presentation, and the Data Cloud</p>
Security</strong></span></p>
Perhaps the strongest argument for giving each application a walled garden is security.  As far back as I can remember, security has been a huge usability problem.  In this case, the problem being attacked is authorization of code: How can a user safely grant an application the right to access a user's secure data?  </p>
The traditional way has been to ask the user (i.e., "Can 'Irate Avians' access your contacts?").  Unfortunately, questions like this require the user to think.  The user has to sit there and ask themselves questions like</p>
"Why is Irate Avians asking me for my address book?"  </p>
"Do I trust the code of Irate Avians to not do anything I wouldn't want with my address book?"</p>
Odds are that a typical user won't have the background to be able to answer questions like this by themselves.  Worse still, most code running on a typical user's device(s) is perfectly trustworthy, so generally, the answer to these questions is yes and the user learns that it's ok to not think about these questions.  </p>
So how do you get the user to really think about these questions?  </p>
Quite simply, you don't. </p>
Instead of having the user try to figure out the application's intent, you design your user interface so that the user's</strong> intent is clear.  </p>
The walled garden model forces a user to push state from one application to another, rather than the filesystem based model where each application pulls state from the filesystem.  In both cases, the application sits between the user's intent and the filesystem.  </p>
However, in one case, a (potentially) untrusted application is acting on its own, while in the other a trusted application (one that already has access to the data) is indicating that it wishes to extend that trust to another application.  In the latter case, the user (through the trusted application) has explicitly given permission for their data to be accessed.</p>
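The push/pull distinction can be sketched in a few lines. In this toy model (all class and method names are hypothetical, and nothing here reflects an actual iOS API) the receiving application never holds a reference to the store of documents; the only way data reaches it is through a user gesture inside the application that already owns the data.

```python
class Document:
    def __init__(self, name, body):
        self.name = name
        self.body = body


class App:
    def __init__(self, name):
        self.name = name
        self.inbox = []  # documents the user has explicitly handed over

    def receive(self, document):
        self.inbox.append(document)


class OwningApp(App):
    """The trusted application that already holds the user's documents."""

    def __init__(self, name, documents):
        super().__init__(name)
        self.documents = {d.name: d for d in documents}

    def user_sends(self, doc_name, target):
        # Push: this runs only in response to a user gesture inside the
        # trusted app, and hands over a copy of exactly one document.
        original = self.documents[doc_name]
        target.receive(Document(original.name, original.body))


contacts = OwningApp("Contacts", [Document("bob", "555-0100"),
                                  Document("alice", "555-0101")])
game = App("Irate Avians")
contacts.user_sends("bob", game)  # the user chose to share Bob, and only Bob
```

There is no "may I read your contacts?" prompt anywhere in this model, because the game has no way to ask: it only ever sees what was pushed to it. The copy made in `user_sends` also mirrors the downside mentioned earlier, since there are now two versions of the document to keep track of.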
Presentation</strong></span></p>
Different types of documents can be presented in different ways.  For code, a nice hierarchical, alphabetical listing is often best.  For photos, you want thumbnails.  In short, the nature of the browser being used depends heavily on the type of work you're doing.  We're seeing this phenomenon with apps like iPhoto, Front Row, Eclipse and Xcode, each of which has a custom file browser (some of which don't even correspond to the underlying filesystem).  </p>
There's not really much to this point, just that different types of data need to be organized in different ways, and the entity best suited to managing this organization is the application that created/is responsible for managing the data.</p>
The Data Cloud</strong></span></p>
On a related note, some data doesn't fit into a neat little document (or similarly, into the filesystem) model.  Sometimes you want to keep different datatypes independent (e.g., the giant mess of files in a website).  Sometimes you have data that can't be structured exactly into a hierarchical model (e.g., email, or music).  </p>
A perfect example of data that doesn't fit into the document model is social network data (and graph data in general).  In this type of data model, you have lots of little nuggets of information, which can be sorted, organized, collected, distributed, and aggregated in any number of ways.  There's not generally a single concept that you can group each comment, post, or message into.  Sure, you can put this data into a file, but more likely than not, all of this data will go into a single file (or equivalently, into a single grouping of files).</p>
For this type of data, the walled garden is great.  You can present the user with an interface ideally suited to organizing, aggregating, and displaying all of these little nuggets of information.  Since such applications tend to be decentralized, you can easily interface with networked components as well.  You don't need to forcibly coerce this application-specific data presentation model into the filesystem model that everything else expects.</p>
 </p>
So there you have it.  Four particularly nice aspects of the iOS walled garden application state model.  Next week, I'll talk about how this all connects to web applications.</p>


Filesystems, Application Semantics, and Walled Gardens (Part 1)
2012-11-24T00:00:00+00:00
There's been a fundamental limitation of web applications that has bugged me for the longest time.  I don't think I got a good idea of what it was until I started looking at my iOS devices, and realized it was the same thing that bugged me about them.  There's no common filesystem on any of them.</p>
This is something incredibly frustrating about the iOS: Each application has its own filesystem.  Each application has its own way of listing your available documents.  Each application has its own way of interfacing with external document storage systems (e.g., Dropbox, iCloud, etc...).  It's possible to move documents between these filesystems: There's a shared photo repository, and a way for users to explicitly send documents from one application to another, but this functionality has to be enabled by the developers of both applications, has to be initiated by the user explicitly, and creates a copy of the document, which makes it a pain to keep track of which application currently has the "working" copy of your document.  </p>
This same thing has been showing up in web applications.  The scarcity of fully featured web applications makes it difficult to see this effect clearly, but try editing a document with both Office 365 and Google Docs, and see how sane you stay.</p>
This begs the question: why?  Why limit your applications in this way?</p>
In the case of web applications, it's a technical limitation.  I'll get back to that, but first let's have a look at the iOS.  There's an actual hierarchical filesystem under the hood.  App developers must deal with paths and file pointers.  Yet, there was a conscious decision on the part of iOS's designers to force application developers to build their own document management systems:</p>

There's no file management widget or application.</li>
There's no integration support for alternative hierarchical document management systems (e.g., Dropbox).</li>
Each application has (conceptually at least) its own independent filesystem.  </li>
</ul>
So why?  Why build each application into a walled garden?</p>
I don't have a singular answer to this question, but this model actually has a number of very nice benefits.  </p>
Specialized vs Standardized Formats</strong></span></p>
There are a number of formats out there, many of which are (explicitly or implicitly) standardized.  PDF, JPEG, PNG, TIFF, MP3, Matroska Video, PowerPoint, and Rich Text Format are all examples of formats for encoding a wide range of different document types.  Even among standards, there is some duplication.  JPEG, PNG, and TIFF are all formats for encoding image data, but each has a slightly different set of benefits, each applicable to different types of image data.  </p>
If there's this much duplication among standard image formats, imagine how much duplication there is among non-standards.  Let's have a look at some image formats used by the products of a single company: Adobe.  Photoshop, Illustrator, and InDesign each manage image data, but have their own distinct formats.  </p>
Each application is designed to do something different.  Photoshop is designed to edit raster data, Illustrator is designed to edit vector data, and InDesign is designed to manage page layouts.  Because the applications are designed with different functionality in mind, each has a different notion of what an ideal layout is for the data being managed.  If nothing else, different applications may find different metadata or index structures necessary on top of the core data.</p>
My blog editor is a good example.  At the heart of it, the editor just manages a list of rich text (HTML) files organized in a nice simple hierarchy.  But then for each directory it has some metadata (the blog that the directory corresponds to), and for each text file it has some metadata (title, tags, categories, server options).  There are inter-document relationships that it keeps track of (if a post has media/images/etc...), and it keeps itself synchronized with the blog posts already on the server.  The core content (the HTML text file) is enhanced by application-specific metadata that allows the editor to effectively interact with the blog.  It would be much harder to design such an application if the user was required to explicitly manage application state.  Inter document relationships in particular are extremely difficult to manage (as anyone who has tried to move HTML from one website to another can attest to).  </p>
Meanwhile, providing an explicit export function (à la iOS) forces the application developer to provide functionality for translating their own custom document format/metadata/etc... into a standardized format.  This clearly demarcated boundary, if used properly, could actually increase</strong> compatibility between applications, while allowing each to maintain its own custom data formats appropriate for its own specific application.</p>
This is running a little long, so I'll return next week with a few more benefits of the iOS document model.</p>


Uncertainty in Distributed Computation
2012-11-18T00:00:00+00:00
Probabilistic databases are a solution to a simple problem -- Sometimes you don't have all the data.  </p>
Probabilistic databases address this problem in the context of a specific domain: asking questions about data that is incomplete, imprecise, or noisy.  But this is only one domain that this problem occurs in; Noisy, incomplete data occurs everywhere.</p>
A prime example of this is distributed computation.  Each node participating in the distributed computation knows (for certain) what is going on locally, but not what's going on elsewhere.  If an update occurs on one node, it takes time to propagate to other nodes.  </p>
A good way to think of this is that the node has its own view of the state of the world.  Slowly, over time, this view diverges from the "real" state of the world.  As the node communicates with other nodes, the view reconverges.  </p>
Many early distributed protocols were designed to enforce this sort of convergence, at least to the point where certain properties (e.g., relative ordering) could be guaranteed.  For the past few years, the fashion has been to use eventual consistency, where the end-user is presented with results that are not guaranteed to be entirely accurate.  </p>
This doesn't have to be a binary choice; many such systems (Zookeeper[1], PNUTS[2], Percolator[3], etc...) offer a hybrid consistency model where end-users can choose to receive results guaranteed to be consistent, albeit at the cost of higher access times.  </p>
What I've been seeing lately is a tendency to take this even further: To actually try to capture the uncertainty in the computation in the distributed programming model itself.  The first instance that my quick (and quite incomplete) scan of deployed systems turned up was Facebook's Cassandra [4], which used a technique called φ-accrual [5] to get a running estimate of the likelihood of a particular server being up or down.  </p>
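For a feel of how a φ-accrual detector turns heartbeats into a running suspicion level, here is a minimal sketch. As I understand the idea in [5], φ is the negative base-10 logarithm of the probability that the next heartbeat would arrive even later than the time already waited, given the history of inter-arrival times. The normal-distribution model, the `min_std` floor, and the sample numbers below are assumptions made for illustration; this is not a description of Cassandra's actual implementation.

```python
import math
import sys
from statistics import mean, pstdev


def phi(now, last_heartbeat, intervals, min_std=0.1):
    """Suspicion level for a node, given past heartbeat inter-arrival times.

    Assumes inter-arrival times are roughly normally distributed.
    """
    mu = mean(intervals)
    sigma = max(pstdev(intervals), min_std)  # floor avoids division by zero
    elapsed = now - last_heartbeat
    # P(a heartbeat takes longer than `elapsed` to arrive) under the model.
    p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
    p_later = max(p_later, sys.float_info.min)  # avoid log10(0) on underflow
    return -math.log10(p_later)


history = [1.0, 1.1, 0.9, 1.0]  # seconds between recent heartbeats
on_time = phi(now=11.0, last_heartbeat=10.0, intervals=history)
overdue = phi(now=11.5, last_heartbeat=10.0, intervals=history)
```

The point is that the output is not a binary up/down answer: `overdue` is much larger than `on_time`, and each consumer of the detector can pick its own threshold for when to treat the server as down.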
More recently, a similar idea has appeared in Google's Spanner [6].  Here, the uncertainty was on the timing of specific events, and the goal was to determine relative ordering and to obtain guaranteed consistency by establishing a bound on how accurate (or inaccurate) the timestamps you're using are.</p>
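A rough sketch of that bounded-timestamp idea follows. This is not Spanner's API; `TimeInterval`, `stamp`, and `definitely_before` are hypothetical names, and the error bound is simply passed in as a number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """A timestamp known only to lie somewhere in [earliest, latest]."""
    earliest: float
    latest: float


def definitely_before(a, b):
    """True only if a precedes b wherever each falls within its interval."""
    return a.latest < b.earliest


def stamp(clock_reading, max_error):
    """Wrap a local clock reading with a (hypothetical) bound on its error."""
    return TimeInterval(clock_reading - max_error, clock_reading + max_error)


write_1 = stamp(100.0, max_error=4.0)   # [96, 104]
write_2 = stamp(105.0, max_error=4.0)   # [101, 109], overlaps write_1
write_3 = stamp(110.0, max_error=4.0)   # [106, 114], clear of write_1

ordered_12 = definitely_before(write_1, write_2)
ordered_13 = definitely_before(write_1, write_3)
```

When two intervals overlap, the ordering is genuinely unknown. A system can either wait out the uncertainty or report that it can't tell; tightening the error bound shrinks the window in which that happens.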
This idea can be taken a lot further.  Although I can't imagine programmers wanting to explicitly account for uncertainty in their code, they may be willing to work with a language that does this accounting for them.  Maybe I don't need a precise result to present to the user, maybe I just need something in the right ballpark.  Maybe I just need an order of magnitude!</p>
What would a language designed around this look like?</p>
How could the programmer specify the bounds on uncertainty that they were willing to accept?</p>
Could such a language be combined with online techniques (i.e., provide the end-user with a stream of progressively more accurate answers)?</p>
Can PL ideas such as promises be adapted to this context?  Here's an answer, it has accuracy X.  The result of the computation you want to do with it can also be computed, and the uncertainty of that computation (based on the uncertainty in the input) is Y.</p>
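As a toy illustration of "here's an answer, it has accuracy X", here's a sketch of a value type that carries a worst-case error bound through arithmetic. The `Approx` type and its methods are made up for this example, and worst-case interval bounds are only one of several ways the uncertainty could be modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approx:
    """A value together with a bound on how far off it might be."""
    value: float
    error: float  # the true value lies within value +/- error

    def __add__(self, other):
        # Worst-case errors simply add under addition.
        return Approx(self.value + other.value, self.error + other.error)

    def __mul__(self, other):
        # Bound on |(x + dx)(y + dy) - xy| = |x*dy + y*dx + dx*dy|.
        err = (abs(self.value) * other.error
               + abs(other.value) * self.error
               + self.error * other.error)
        return Approx(self.value * other.value, err)

    def good_enough(self, tolerance):
        """Has the answer reached the accuracy the programmer asked for?"""
        return self.error <= tolerance


users = Approx(1000.0, 50.0)        # "about 1000 users, give or take 50"
clicks_per_user = Approx(3.0, 0.5)  # "about 3 clicks each, give or take 0.5"
total_clicks = users * clicks_per_user
```

A runtime built around something like this could keep refining its inputs until `good_enough` holds for whatever tolerance the programmer declared, and stop there.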
This seems like it would be a really cool programming platform, if it could be made to be both usable and efficiently functional.</p>
Citations</h3>
[1] Hunt, P. et al. 2010. ZooKeeper: Wait-free coordination for Internet-scale systems. USENIX ATC. (2010).</p>
[2] Cooper, B.F. et al. 2008. PNUTS: Yahoo!'s hosted data serving platform. Proceedings of the VLDB Endowment. 1, 2 (Aug. 2008), 1277–1288.</p>
[3] Peng, D. and Dabek, F. 2010. Large-scale incremental processing using distributed transactions and notifications. (2010).</p>
[4] Lakshman, A. and Malik, P. 2010. Cassandra—A decentralized structured storage system. </span>Operating systems review</em>. (2010).</span></p>
[5] Hayashibara, N. et al. The φ accrual failure detector. 66–78.</p>
[6] Spanner: Google's Globally-Distributed Database: http://research.google.com/archive/spanner.html</p>
 </p>


Consistency through semantics
2012-11-11T00:00:00+00:00
When designing a distributed system, one of the first questions anyone asks is what kind of consistency model to use.  This is a fairly nuanced question, as there isn't really one right answer.  Do you enforce strong consistency and accept the resulting latency and communication overhead?  Do you use locking, and accept the resulting throughput limitations?  Or do you just give up and use eventual consistency and accept that sometimes you'll end up with results that are just a little bit out of sync?</p>
It's this last bit that I'd like to chat about today, because it's actually quite common in a large number of applications.  This model is present in everything from user-facing applications like Dropbox to SVN/Git, to back-end infrastructure systems like Amazon's Dynamo and Yahoo's PNUTS.  Often, especially in non-critical applications, latency and throughput are</strong> more important than dealing with the possibility that two simultaneous updates will conflict.  </p>
So what happens when this dreadful possibility does come to pass?  Clearly the system can't grind to a halt, and often just randomly discarding one of these updates is the wrong thing to do.  So what happens? The answer is common across most of these systems: They punt to the user.  </p>
Intuitively, this is the right thing to do.  The user sees the big picture.  The user knows best how to combine these operations.  The user knows what to do, so on those rare occurrences where the system can't handle it, the user can.</p>
But why is this the right thing to do?  What does the user have that the infrastructure doesn't? </p>
The answer is Semantics.</p>
Each update does something with the data. It increments, it multiplies, it derives, it computes.  It produces some new value of the data.  It has specific semantics, and the systems I enumerated above (and those like them) make no effort to try to understand those semantics.  I addressed this in part already, when I discussed intent vs effect a few weeks ago.  </p>
The user, conversely, does understand the semantics of an application.  Given two updated values (and a suitable visualization tool, like diff), a user can usually infer the intent of the updates and merge their effects appropriately.  </p>
Sometimes this is the best way to do things.  When writing source code or other text, where the user is directly modifying the files (i.e., the system never receives a representation of the intent in the first place), the overhead of manually merging periodically is typically lower than the overhead of having to encode edits in terms of intent (though this might be interesting if combined with bug/feature tracking systems).  </p>
Conversely, if an application is interacting with the data directly, the application can provide tools for resolution.  This is indeed the case in Dynamo, where the application provides a merge function for resolving inconsistent updates.  But this is only the first step.  What can you do to avoid creating two inconsistent versions of the data in the first place?  How do you infer the user/application's intent, while minimizing the burden of declaration placed on the user/app developer?  </p>
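Here's a small sketch of the difference between merging values after the fact and keeping the semantics around. Neither function is Dynamo's actual interface; the shopping-cart data and the "keep the larger quantity" policy are assumptions made up for illustration.

```python
def merge_carts(version_a, version_b):
    """Application-supplied merge: keep every item either replica saw.

    Each version maps item -> quantity; on disagreement, keep the larger
    quantity (a policy choice made by this hypothetical application).
    """
    merged = dict(version_a)
    for item, qty in version_b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged


def apply_increments(base, *increment_sets):
    """Semantics-aware alternative: replay intents (deltas) instead of merging values."""
    result = dict(base)
    for increments in increment_sets:
        for item, delta in increments.items():
            result[item] = result.get(item, 0) + delta
    return result


# Two replicas diverge from the same starting cart {"book": 1}.
replica_a = {"book": 1, "pen": 2}
replica_b = {"book": 3}
merged = merge_carts(replica_a, replica_b)

# If the store had seen the intents instead, they would simply commute.
replayed = apply_increments({"book": 1}, {"pen": 2}, {"book": 2})
```

Both routes happen to agree here, but only the second one avoids the conflict entirely: increments commute, so the order in which the replicas hear about them doesn't matter.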
In short, what can you do to both detect and leverage an application's semantics to help the application stay consistent?</p>


What's Wrong With Probabilistic Databases? (Part 3)
2012-11-04T00:00:00+00:00
Two weeks ago, I introduced the idea of probabilistic databases.  In this last installment in this little miniseries, I'm going to talk about the second major use of probabilistic databases: dealing with modeled data.</p>
Unlike last week where we talked about having missing or erroneous data, where there is some definitive ground-truth, a probabilistic model attempts to capture a spread of possible outcomes.  Not only is there no ground truth, there usually won't be (at least not so long as questions are being asked).</p>
That's not to say that there's no overlap between modeled and erroneous data, just that there's a different mentality about how this data is used.  In this case, queries encode scenarios rather than questions.  </p>
That is to say that a probabilistic database must take its uncertain inputs from somewhere.  At some level, there has to be a probabilistic model (or more likely, several) passed as input to the query.  Even if the probabilistic database is capable of filling in any parameters that the model needs, someone still had to sit down and figure out the general framework of the model.  This role generally falls to someone with a background in statistics.  </p>
This is where the problems come in.  The machinery required to get even a relatively simple model off the ground is usually pretty extensive.  Even something as simple as a gaussian distribution can require days, or even weeks of validation against test data.  So, if you want to ask questions about your nice, simple, elegant model, you're not going to want or need the complex machinery of a database.</p>
That said, where the machinery does come in handy, is when you need to integrate multiple models, or to integrate your model with existing data.  A simple example I used in a paper a while back was for capacity planning: One (simple) model gives you the expected capacity (e.g., CPU) of a server cluster at any given time over the next few months (e.g., accounting for the probability of failures), while a second (also simple) model gives you the expected demand on that cluster.  Each of these can be tested, analyzed, and independently validated.  Then, these models can be combined to provide a single model for the probability of having insufficient capacity on any given day.  This relationship can be represented as an extremely simple SQL query, and then executed efficiently on a probabilistic database.</p>
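To give a feel for what combining the two models means, here's a Monte Carlo sketch in plain Python rather than SQL. Both models and all of their parameters (server count, failure probability, demand distribution) are invented for illustration; they are not the ones from the paper.

```python
import random


def sample_capacity(rng, servers=100, failure_prob=0.05, cores_per_server=8):
    """Hypothetical capacity model: each server is independently up or down."""
    up = sum(1 for _ in range(servers) if rng.random() >= failure_prob)
    return up * cores_per_server


def sample_demand(rng, mean=700.0, stddev=40.0):
    """Hypothetical demand model: normally distributed demand for cores."""
    return rng.gauss(mean, stddev)


def prob_insufficient(trials=10_000, seed=0):
    """Estimate P(demand > capacity) by sampling both models together."""
    rng = random.Random(seed)
    short = sum(
        1 for _ in range(trials)
        if sample_demand(rng) > sample_capacity(rng)
    )
    return short / trials


estimate = prob_insufficient()
```

Each model stays independently testable; the only new logic is the comparison that joins them, which is the part a probabilistic database would let you write as a query.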
In short, probabilistic databases can be used as a sort of scaffolding to combine multiple data sources (both real and modeled) together, to build more complex models.  </p>
So what's missing?</p>

Language Support</strong>: Although many statisticians are comfortable working with SQL, this is not the case for everyone who uses probabilistic models.  Languages like R, Python, Java, and C++ are far more common, and less alien to researchers and model-builders.  There has already been some work on integrating these languages with database techniques.  The Scala guys are working on improvements that let you translate code written using certain fragments of Scala into equivalent database queries.  There's no reason that we can't do something similar with Python.  Similarly, there have been numerous efforts to translate R into some form of relational algebra.</li>
Efficiency</strong>: Over the years, database research has become synonymous with work on monolithic one-size-fits-all database systems.  Most work with nontrivial models already requires extremely expensive Monte Carlo methods, and model-builders are often reluctant to delegate the task of hand-optimizing their code to an automated system that they perceive (often correctly) as being less efficient.  We need ways to give them good performance out of the box, with a minimum amount of coding overhead and setup.  If this performance is insufficient, we need to make it possible for them to seamlessly transition to an environment where they can fine-tune the evaluation strategy, again, without needing to learn anything that they don't already know.</li>
Interfaces</strong>: R provides a number of useful analytic and visualization tools right out of the box.  While I suspect that no probabilistic database will be quite as complete in the short term, we need to get there before people will start looking seriously at probabilistic databases as an effective analytics and probabilistic modeling tool.</li>
</ul>
Given all of this, I think probabilistic database techniques could be adapted easily.  The only real challenge standing in our way at this time is interfaces.  How do we present end-users with an interface that is not only as powerful as the tools they're used to working with, but is also similar enough that the learning curve is minimized?</p>


What's Wrong With Probabilistic Databases? (Part 2)
2012-10-27T00:00:00+00:00
Last week, I introduced the concept of probabilistic databases: databases that store values characterized by probability distributions, and not (necessarily) by specific values.  Although a pretty cool, and potentially quite useful idea, there are a number of practical concerns that have prevented it from gaining traction.  This week, we explore one class of problems that probabilistic databases are ideally suited for: noisy data.  </p>
Data is only useful if you can analyze it -- ask questions about it.  The problem is that very few data-gathering pipelines are entirely perfect.  Data can be missing.  Data can contain typos.  Data can contain measurement error.  And on top of it all, even if a data source is perfect, god help you if you have to combine it with another data source.  Integrating multiple data sources means dealing not only with inconsistent formatting, but also inconsistent data values (The same person could be referred to as Mary Sue, Ms. M Sue, Mrs Sue, or any of a practically infinite number of variations on the same theme).  A whole research area (typically called entity resolution) has sprung up around this problem.</p>
In short, before using any dataset (or when merging two datasets), it's typically necessary to go through a (often time-consuming) data-cleaning process.  Often, this process can be automated.  You can call out obvious errors: duplicates of values that should be unique, values out of bounds, improperly formatted expressions, etc... Many of these issues can be fixed automatically.  Formatting mismatches between different datasets can be easily fixed by translating to a single common format.  </p>
Unfortunately, automated processes can only take you so far.  If two people appear with the same social security number, then clearly at least one of them is wrong.  But typically, an automated process can't decide which is correct, nor can that process decide what kind of number to assign to the person who now has no identifier.  Typically, one of three things happens at this point:</p>

Immediately punt to the user</strong>: A common example of this is key constraints in traditional databases.  The datastore simply won't allow data that it knows to be unclean to be entered into the system.  This approach ensures that data is correct before anyone asks any questions about it, but requires end-users to put a huge up-front effort into ensuring data quality before a single question can be asked.</li>
Guess</strong>: This happens often in situations where an automated system can efficiently compute the probability that a particular interpretation is correct, like handwriting recognition or sentence parsing.  The system settles on one specific way of interpreting the data (the one with the highest probability), and discards the rest.  Ironically, this can be just as much a source of data errors as any other data gathering process if the guessing algorithm isn't perfect.</li>
Ignore it</strong>: Failing all else, you can simply ignore the problem.  You implicitly accept that answers to your questions may be erroneous, but don't especially care.</li>
</ol>
In short, either you put a huge amount of effort in upfront to clean your data, or you deal with mistakes in your answers.  </p>
Probabilistic databases aren't a magic bullet.  They can't magically make your data clean, or fix the mistakes in your answers.  What they can do, however, is tell you how much of a mistake there is.  And you don't even have to do anything different.  You can just query your data as if it were normal, ordinary data.  All of the trickery for dealing with uncertainty happens under the hood, except that you get a probability value as your output.</p>
So where's the problem?  Why aren't probabilistic databases being used more aggressively?  </p>
As I see it, there are two issues at play here for the general populace:  </p>

People don't know what to do with probabilities</strong>.  Statisticians aside, very few people know how to deal with probabilities.  If someone gets a response that is 75% accurate, they're not going to generally want to perform a complex risk analysis.  Either they trust the result or don't.  In other words, guessing is usually sufficient here, because ultimately the user is interested in the most probable result anyhow (which isn't always guaranteed by guessing).  </li>
People don't know how to define probabilities to begin with</strong>.  Again, statisticians aside, very few people can build good statistical models.  Sometimes your data comes with probabilities already associated with it, but more likely than not, the average data-user won't have a good sense of how to define their automated data cleaning processes probabilistically.</li>
</ul>
In short, the problem with applying probabilistic databases to the challenge of noisy data is the probabilities.  People are used to dealing with fuzzier notions: "Certainly Not", "Unlikely", "Possibly", "Likely", "Certainly So".</p>
So what's the takeaway from all of this?  Well, I'm not entirely certain.  I think probabilistic database research needs to start looking at ways of isolating users from the specifics of the probabilistic distributions underlying the system.  Instead of presenting users with query results and probabilities, we need to give users a more intuitive way of visualizing what possible outputs there are.  Rather than giving users specific confidence values, we need to give users a more intuitive notion of how to interpret that confidence value.  </p>
Better still, we need to provide the user with things that they can do to improve the confidence level; Instead of immediately punting to the user when a data error occurs, let the user run queries on the noisy data, and then point them at the specific cleaning tasks that they need to perform in order to get better results.</p>
And of course, we need to give users better tools for automating their data cleaning processes -- tools that natively integrate with probabilistic database techniques.  Tools that know how to associate probabilities with the data they generate.  </p>
Next week, we look at a second class of problems that probabilistic databases can be used to address: modeling.  </p>


What's Wrong With Probabilistic Databases? (Part 1)
2012-10-21T00:00:00+00:00
A large chunk of my graduate work has to do with a subfield of database research called probabilistic databases.  </p>
The idea is simple: Most databases store precise values.  A row of a normal database might indicate that Bob's SS# is 199-..-....</p>
A probabilistic database allows users to provide data specified by a probability distribution.  Perhaps Bob was a little sloppy filling out a form, and the OCR software couldn't tell whether he intended to put down 199-... or 149-...</p>
It might not be able to determine a precise value for that slot, but it can tell you that Bob's SS# is either 199-... (with some probability) or 149-... (with some other probability).  The database can store both of these.  When you write queries over this data, you treat them as normal, ordinary queries.  When you get an answer, you get an answer with some probability distribution: If you're asking for someone with SS# 199-..., then you'll get the answer Bob (with the corresponding probability).  </p>
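A minimal sketch of that idea, using plain dictionaries instead of a real probabilistic database. The probabilities are made up, "Alice" is a hypothetical second row, and the SS# values stay truncated to the same prefixes used above.

```python
from fractions import Fraction

# Each row holds a list of (value, probability) alternatives for the SS# field.
# The probabilities below are invented for illustration.
people = {
    "Bob":   [("199", Fraction(7, 10)), ("149", Fraction(3, 10))],
    "Alice": [("149", Fraction(1, 1))],
}


def who_has(ssn_prefix):
    """Return {name: probability that this person's SS# matches}."""
    answer = {}
    for name, alternatives in people.items():
        p = sum((prob for value, prob in alternatives if value == ssn_prefix),
                Fraction(0))
        if p > 0:
            answer[name] = p
    return answer


result_199 = who_has("199")
result_149 = who_has("149")
```

The query itself is an ordinary lookup. The only visible difference is that every answer comes back with a probability attached.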
Probabilistic DBs are pretty cool.  Unfortunately, they haven't managed to get much traction beyond the research community.  They've been applied here and there (including in some of my own work), but as of this time, no major DB vendor supports ProbDB functionality.  Why?</p>
To answer that question, we first have to understand why people would use a probabilistic database.  We have to start with sources of uncertainty.  Most probabilistic database work attempts to address one (or both) of two types of uncertainty:</p>

Noisy Data</strong> - Your data gathering process is flawed (e.g., using OCR software or web-scraping techniques).  The data you have contains typos, omissions, or other mistakes.  A thorough data-cleaning could potentially fix these errors, but you lack the necessary manpower or resources.  That is to say that a hypothetical 'clean' version exists.  When the data is queried, you want to find the query results most likely to correspond to the query results on this clean version.</li>
Missing Knowledge</strong> - The data being queried is derived from a model, and has no corresponding 'clean' version.  There are many possible outcomes, each with a varying likelihood.  Queries over this type of data are typically asking the database to make a prediction, and you're typically looking for an expectation or a percentile result.</li>
</ul>
Although the underlying techniques used to query both types of data are extremely similar, the ways users approach these two types of data are quite different.  Over the next two weeks, I'll talk about each of these, and try to understand what's keeping people from using probabilistic databases to address these problems.</p>


Intent
2012-10-14T00:00:00+00:00
What is the difference between intent and effect?  </p>
It's tempting, especially for a computer scientist to consider both of these to be similar.  Intent is, after all, just an effect that hasn't happened yet.  </p>
On the other hand, an intent may never happen.  It might happen in one of a number of different ways.  Attempting to describe an intent in terms of its effect (or possible effects) is as like as not to be incredibly inefficient.  This is not at all a new concept -- Databases have, for ages now, supported an access mode that allows users to express their intent as a sequence of operations (i.e., transactions).  Even so, transactions are expressed in terms of effect.  The user says "UPDATE" and the database applies the relevant changes to its state (perhaps without committing them, but the update is still effected).</p>
This is quite helpful.  User code can test the database for its present state, and take actions dictated by the results of those tests.  User code can specify iterations over multiple entries in that state.  It is frequently possible to specify the user's intent far more compactly than the effects of that intent.  Better still, in this way, we can encode the full range of possible effects and outcomes of that intent.</p>
This too is not an entirely novel idea.  Modern, distributed database systems have noticed this nice, friendly, compact encoding, and started allowing users to package their intent into nice little snippets of code to be executed as a transaction.  The code executes on the database, which can guarantee that the user's intent is followed precisely, without the need for (slow) locking, or commit protocols (that may require the user to repeatedly restart their transaction).  The user's intent is seamlessly translated into an effect.</p>
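Here's a toy version of prepackaged intent, just to pin down the vocabulary. `TinyStore` and `transfer` are hypothetical; a real system would run such snippets as stored procedures on the database, and the single-threaded loop here stands in for whatever serialization that system provides.

```python
class TinyStore:
    """Toy single-threaded store that runs packaged intents one at a time."""

    def __init__(self, state):
        self.state = dict(state)
        self.log = []  # the intents themselves, in the order they ran

    def submit(self, intent, *args):
        """Run `intent` against the current state; it sees no interleaving."""
        self.log.append((intent.__name__, args))
        return intent(self.state, *args)


def transfer(state, src, dst, amount):
    """Intent: move money only if the source can cover it."""
    if state[src] < amount:
        return False  # the intent produced no effect
    state[src] -= amount
    state[dst] += amount
    return True


store = TinyStore({"alice": 100, "bob": 20})
first = store.submit(transfer, "alice", "bob", 70)
second = store.submit(transfer, "alice", "bob", 70)  # same intent, different effect
```

The same intent is submitted twice and has two different effects, because the effect depends on the state it runs against. The log records intents, not effects, and is far more compact than a description of every possible outcome.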
Prepackaged intent is good for interleaving transactions coming from multiple clients.  But what about for actually managing the data itself?  Even these database systems will eventually evaluate the intent, and transform it into a nice flat, easily readable form.  But what if you weren't concerned about reads?  What if you were simply interested in keeping data in synch?  </p>
We started by expressing intent in terms of effect.  Then we moved to a seamless transition between intent and effect.  Why not push things a step further?  Why not express effect in terms of intent?</p>


Collaborative web applications
2012-10-08T00:00:00+00:00
This week, I'm going to step back from AGCA and start venturing into some higher level topics.  This week, let's talk about web applications.  </p>
Not just any web applications mind you, let's talk about collaborative web applications.  The term is new, but the idea isn't.  You've probably heard of at least a few of the following: Google Docs (aka, Google Drive now), Google Wave, Office 365, Dropbox... </p>
These applications are pretty nifty.  They allow you to log in from wherever, and edit/view documents.  But not only that, they also allow you to interact with other users of the same system.  If someone else opens up the same document, they see any changes that you make as soon as you make them.  In other words, the state of the application is mirrored in realtime across all the browsers in which the application is running.</p>
Building these applications in a web-browser also forces you to rely heavily on the browser metaphor, HTML5 functionality, and on HTTP.  Integrating an application's design with this ecosystem is often kludgy, but can bring several extremely nice benefits:</p>

A distinction between communications channels</em> and communications sessions</em>: You usually can't rely on a single, stable connection to the server, so the communications protocol will typically place each message in a separate HTTP request.  This in turn means that the application is resilient to being suspended (i.e., switching apps on an iPhone, or putting your laptop to sleep), or moving across networks (switching from cellular to wifi, or plugging your laptop into an ethernet jack).</li>
Stateless servers: The use of HTTP generally encourages servers to be stateless -- each request is served in an identical manner.  This means that the server can be designed to scale, without sacrificing the client's ability to suspend/resume its participation in the computation.</li>
A Refresh Button: Web browsers have a refresh button.  The application has to be resilient to users pressing it if something is misbehaving.  This in turn makes error recovery much easier -- if the application is misbehaving, restarting it is trivial.</li>
Layout through the DOM: Web-based applications have been designed from the ground up to be GUI-oriented.  A text-based interface is possible, but frequently more work than a simple graphical user interface.</li>
</ul>
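The first two benefits can be sketched without any actual HTTP. In the toy below, each call to `poll` stands in for one self-contained request; the class names and the sequence-number scheme are assumptions of mine, not a description of how any of the applications above work.

```python
class DocumentServer:
    """Toy server: holds the shared edit log but no per-client session state."""

    def __init__(self):
        self.updates = []  # list of edits; index + 1 is the sequence number

    def post(self, edit):
        self.updates.append(edit)
        return len(self.updates)

    def poll(self, last_seen):
        """Everything the server needs to answer arrives in the request itself."""
        return self.updates[last_seen:], len(self.updates)


class Client:
    def __init__(self):
        self.last_seen = 0
        self.document = []

    def sync(self, server):
        new_edits, self.last_seen = server.poll(self.last_seen)
        self.document.extend(new_edits)


server = DocumentServer()
laptop = Client()
server.post("insert 'Hello'")
laptop.sync(server)
# The laptop goes to sleep; other users keep editing.
server.post("insert ' world'")
server.post("delete 'H'")
laptop.sync(server)  # one request catches it up; no session to re-establish
```

Because the client carries its own position in the log, suspending, switching networks, or hitting refresh all reduce to the same thing: ask again from wherever you left off.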
Applications like this are incredibly cool, and incredibly useful.  So why don't we see more of them?</p>

Why doesn't blogger have the same functionality?</li>
Where are the collaborative whiteboards?</li>
What other kinds of awesome applications could you implement in this space?</li>
</ul>
The short answer is that the infrastructure isn't there.  And it's hard.  Ask any first-year distributed systems student -- they'll tell you that state replication isn't easy, even when your data is only coming from one source.  Web developers are used to dealing with nice, simple, standardized backend infrastructures.  Apache, MySQL/Postgres, and maybe some sort of web framework like Django are reasonable expectations... but none of these support the sort of scalable realtime state replication required to implement a collaborative web application.</p>
Just a thought.</p>


Those Marvelous Lifts and Exists (Part 3)
2012-10-01T00:00:00+00:00
We've been talking for the last few weeks about how the Lift operator can be used to express nested subqueries.  This gives AGCA nearly the full power of (non-recursive) SQL.  </p>
That said, there's one thing that AGCA can't</strong> do with what I've said before: existential quantification.  For example, consider the query:</p>
SELECT COUNT(*) FROM R WHERE EXISTS (SELECT * FROM S WHERE R.A = S.A)</pre>
Now, if you stare at this query long enough you might come up with the following potential encoding:</p>
AggSum([], R(A) * S(A))</pre>
In a way, this makes sense.  You get the count of R(A), but only if there's a matching S(A).  Unfortunately, this encoding isn't correct.  Remember, we're dealing with bags, not sets.  What if S looks like:</p>
_<_A_>____#_</span></pre>
 < 1 > -> 2</pre>
And R looks like</p>
_<_A_>____#_</span></pre>
 < 1 > -> 1</pre>
That is, there are two copies of the tuple <1> in S, and our query result will be 2 (instead of 1).  We're not looking for the specific number of tuples in S (that match a particular pattern), we're just looking to test whether there are ANY tuples in S (that match a particular pattern).  </p>
We need an operator that can act as (not quite, but something sort of like) a step function.  Inside the operator is a nested query (just like Lift and AggSum).  If the nested query evaluates to 0, the operator evaluates to 0.  If the nested query evaluates to something other than 0, the operator evaluates to 1, regardless of what precisely the nested expression evaluates to.</p>
This operation is actually something that you can't do with AGCA as I've described it up to this point (try it yourself if you don't believe me).  Thus, we have the Exists operator (which I've just described), and we can express the example query as:</p>
AggSum([], R(A) * Exists(S(A)))</pre>
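To check the two encodings against the example tables, here's a small sketch that models bags as Python dictionaries from tuple to multiplicity. The function names are mine; this only mimics the semantics of the two expressions, not how AGCA is actually evaluated.

```python
# Bags as {tuple: multiplicity}; these are the example tables from the text.
R = {(1,): 1}
S = {(1,): 2}


def join_count(r, s):
    """AggSum([], R(A) * S(A)): multiply multiplicities of matching tuples."""
    return sum(mult * s.get(row, 0) for row, mult in r.items())


def exists(multiplicity):
    """Exists: 0 stays 0, anything else becomes 1."""
    return 1 if multiplicity != 0 else 0


def exists_count(r, s):
    """AggSum([], R(A) * Exists(S(A)))."""
    return sum(mult * exists(s.get(row, 0)) for row, mult in r.items())


naive = join_count(R, S)       # counts the match once per copy in S
correct = exists_count(R, S)   # counts the match once per row of R
```

With two copies of <1> in S, the plain product yields 2 while the Exists version yields 1, which is what the SQL query asks for.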
What about deltas?  Well, it turns out the exists operator is actually quite close to the lift operator.  The exists operator doesn't introduce any new columns into the schema (like the lift), but it is a non-linear operation (unlike AggSum).  Furthermore, unlike the only other non-linear operation (comparison), it can have a non-zero delta.  So, we get:</p>
∂Exists(Q) = Exists(Q+∂Q) - Exists(Q)</pre>
Again, the delta has the original query in it, but this can be addressed using the same materialization tricks we talked about last week.</p>
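The delta rule for Exists is easy to sanity-check on plain numbers. In this sketch `q` stands for the current value of the nested query and `dq` for its change; treating both as integers is my simplification.

```python
def exists(q):
    """Exists: 0 stays 0, anything else becomes 1."""
    return 1 if q != 0 else 0


def delta_exists(q, dq):
    """dExists(Q) = Exists(Q + dQ) - Exists(Q)."""
    return exists(q + dq) - exists(q)


first_match = delta_exists(0, 1)    # S goes from no matches to one
extra_match = delta_exists(2, 1)    # S already had matches
last_removed = delta_exists(1, -1)  # the only match is deleted
```

The delta is non-zero only when the nested query crosses zero, which is why the rule needs the old value of Q and not just the change.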
And that's it.  That's all there is to AGCA!</p>


Those Marvelous Lifts and Exists (Part 2)
2012-09-24T00:00:00+00:00
Last week, we started talking about using the Lift operation to express nested subqueries.  I ended on a bit of a cliffhanger: </p>
∂(X ^= A) = (X ^= A + ∂A) - (X ^= A)</pre>
There's something horribly wrong with this delta rule.  The expression A appears intact, in its entirety in the delta rule (it actually appears not once, but twice).  The delta of a lift is NOT simpler than the original.  </p>
Admittedly, for simple lifts, this isn't a problem.  In particular, when ∂A = 0, then we get</p>
∂(X ^= A) = (X ^= A) - (X ^= A) = 0</pre>
Which is, in fact simpler.  But, once we start putting relation terms into the expression being lifted, we get something nasty.  For example, let's say we wanted to compute the SQL query:</p>
SELECT COUNT(*) FROM R WHERE (SELECT COUNT(*) FROM S) = R.A;</pre>
This translates to the following AGCA expression.</p>
AggSum([], R(A) * (X ^= AggSum([], S(B))) * {X = A})</pre>
If we take the delta of this query with respect to the insertion S(1), we get:</p>
AggSum([], R(A) * ( (X ^= AggSum([], S(B) + (B ^= 1))) - (X ^= AggSum([], S(B))) ) * {X = A})</pre>
Messy... and it really doesn't help us much.  We could materialize this expression, but since the deltas aren't simpler, if we repeat the process recursively, we'll end up with an infinite number of materialized expressions.  Not good.  </p>
We deal with this problem using partial materialization.  First a little reorganization.   The lifts commute with the relation term:</p>
AggSum([], ( (X ^= AggSum([], S(B) + (B ^= 1))) - (X ^= AggSum([], S(B))) ) * R(A) * {X = A})</pre>
Now, rather than materializing the entire thing, we materialize the lifts separately.  More precisely, rather than materializing the lifts, we materialize the expression being lifted.  </p>
AggSum([], ( (X ^= M(AggSum([], S(B) + (B ^= 1)))) - (X ^= M(AggSum([], S(B)))) ) * M(R(A) * {X = A}))</pre>
Remember our materialization operator M().  Let's call these new datastructures Q1[] (= AggSum([], S(B) + (B ^= 1))), Q2[] (= AggSum([], S(B))), and Q3[A] (= AggSum([A], R(A))).  This gives us the expression</p>
AggSum([], ((X ^= Q1[]) - (X ^= Q2[])) * Q3[A] * {X = A})</pre>
Of course, we can do better.  If we were to apply the standard materialization optimization rules that we discussed several weeks ago to the expression AggSum([], S(B) + (B ^= 1)), we actually get two simpler expressions, one of which is constant, and the other of which is equivalent to Q2.  Thus, our full materialization decision becomes</p>
AggSum([], ((X ^= Q2[] + 1) - (X ^= Q2[])) * Q3[A] * {X = A})</pre>
And applying polynomial expansion, equality lifting and lift unification, we get the absolute simplest expression:</p>
AggSum([], (X ^= Q2[] + 1) * Q3[X]) - AggSum([], (X ^= Q2[]) * Q3[X])</pre>
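Read operationally, that last expression is just two lookups: Q3[Q2 + 1] - Q3[Q2]. Here's a sketch that checks it against recomputing the query from scratch. The contents of R and S are made up, and I'm using a `Counter` and an integer as stand-ins for the materialized maps Q3 and Q2.

```python
from collections import Counter

R = [3, 3, 2, 1]        # values of R.A (hypothetical data)
S = ["x", "y"]          # two rows in S; only the count matters


def full_query(r, s):
    """SELECT COUNT(*) FROM R WHERE (SELECT COUNT(*) FROM S) = R.A."""
    return sum(1 for a in r if a == len(s))


# Materialized views: Q2[] = COUNT(*) of S, Q3[A] = number of R rows per A.
Q2 = len(S)
Q3 = Counter(R)


def delta_on_insert_into_s(q2, q3):
    """Q3[Q2 + 1] - Q3[Q2]: two constant-time lookups, no scan of R or S."""
    return q3[q2 + 1] - q3[q2]


old_result = full_query(R, S)
delta = delta_on_insert_into_s(Q2, Q3)
new_result = full_query(R, S + ["z"])
```

Adding `delta` to the old result gives the new one without touching R or S, as long as Q2 and Q3 are themselves kept up to date.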
So that's it.  Don't materialize deltas of lift expressions in their entirety.  There are however, two corner cases that need to be considered.  First, it's often more efficient to recompute the entire expression from scratch than it is to compute the delta.  The precise definition of these cases is a bit subtle and nuanced, but basically, in any situation where there are no correlated variables (i.e., the example above), you're essentially computing the entire expression from scratch... twice.  In these situations, it's entirely reasonable just to recompute the entire expression from scratch, but just once.  If you make appropriate materialization decisions, it may still be possible to compute this in constant time.</p>
Second, in some situations, it actually pays to materialize the delta along with the rest of the expression.  For example, consider the query (note the inequality predicate):</p>
SELECT COUNT(*) FROM R, T WHERE (SELECT COUNT(*) FROM S) < R.A AND R.C = T.C; </pre>
Or in its AGCA form:</p>
AggSum([], R(A,C) * T(C) * (X ^= AggSum([], S(B))) * {X < A})</pre>
Consider the delta with respect to T(dC).  </p>
AggSum([], R(A,dC) * (X ^= AggSum([], S(B))) * {X < A})</pre>
You could materialize R and S separately, but you'd end up needing to compute a full iteration over all of the elements of R (to evaluate the aggregate over an inequality predicate) on every insertion into T.  Conversely, putting them together creates a new map that you need to maintain, but the new maps add only a constant factor cost to the time complexity of the existing maintenance costs.</p>
 </p>
Next week, I wrap up my discussion of lifts with some thoughts on a related operator (and the last operator in AGCA): The exists predicate.</p>


Those Marvelous Lifts and Exists (Part 1)
2012-09-17T00:00:00+00:00
We've been talking for the past few weeks about optimization of AGCA expressions.  So far, most of our optimizations have made one extremely significant simplifying assumption: They ignore nested expressions.  I was going to talk this week about techniques for un-nesting expressions, but before I get to that, I'm going to cover the two sources of nesting in AGCA expressions that I haven't covered yet: Lift in its full glory, and Exists.</p>
So far I've used Lift as a simple form of assignment.  </p>
X ^= {Y}</pre>
Computes the value of Y and assigns it to X.  I've used it for more complex expressions too:</p>
X ^= {2*Y + Z}</pre>
But again, it's a simple arithmetic expression being used in the assignment.  What if we want to do something more complex?  What if we want to express something like</p>
SELECT SUM(A)</pre>
FROM R</pre>
WHERE R.B = (SELECT COUNT(*) FROM S)</pre>
This is an example of a nested aggregate query, and it poses a bit of a problem, both in terms of AGCA, and more generally for incremental computation in the sense of delta operations.  Up to this point, whenever we added (or deleted) a value from one relation, we'd need to add (or subtract) something to (from) the result we were trying to compute.  Nested subqueries are different.  Let's have a look with the following example database</p>
_R_(_A__B_)____#_</span></pre>
   < 1, 1 > -> 1</pre>
   < 1, 2 > -> 1</pre>
   < 2, 2 > -> 1</pre>
_S_(_C_)____#_</span></pre>
   < 1 > -> 1</pre>
Let's evaluate the SQL query.  The COUNT(*) of S is 1, so we find all the rows of R where B = 1 (just the first one), and sum up their A columns for a total result of 1.</p>
Now what happens if we add the tuple <1> to S?  Well, the value of the nested aggregate changes from 1 to 2, so now the query result is based on a completely different set of rows (in this case summing to 3).  The delta isn't just a simple addition; we need to delete the existing value (-1), and then add in an entirely new and unrelated value (+3).  </p>
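To make the arithmetic concrete, here's a minimal Python sketch of the example above.  The dict-of-multiplicities encoding and the helper names (`count`, `query`) are my own illustration, not anything from DBToaster.

```python
# Multiset relations as dicts mapping tuple -> multiplicity, as in the tables above.
R = {(1, 1): 1, (1, 2): 1, (2, 2): 1}   # R(A, B)
S = {(1,): 1}                           # S(C)

def count(rel):
    """COUNT(*): the sum of all multiplicities."""
    return sum(rel.values())

def query(R, S):
    """SELECT SUM(A) FROM R WHERE R.B = (SELECT COUNT(*) FROM S)."""
    n = count(S)
    return sum(a * mult for (a, b), mult in R.items() if b == n)

before = query(R, S)                         # COUNT(*) of S is 1

S_after = dict(S)
S_after[(1,)] = S_after.get((1,), 0) + 1     # insert <1> into S
after = query(R, S_after)                    # COUNT(*) of S is now 2

delta = after - before                       # net effect of (-1) and (+3)
```

The net change is a single number, but it comes from two unrelated sets of rows: the row that used to match is retracted, and the two rows that match now are added.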
Put another way, conditionals are different.  They're not straight arithmetic, they actually trigger a different control flow.  This is part of why AGCA restricts conditionals to having only arithmetic expressions in them.  Yet, we still need a way to express these changes in control flow.  Lifts give us an ideal tool for this.  We can express the above query as:</p>
AggSum([], (B ^= AggSum([], S(C))) * R(A,B) * A)</pre>
Read the first term of this expression as "Compute COUNT(*) of S and assign it to B."  </p>
I originally said that the delta of a (simple) Lift was 0.  This is not true in the general case.  In particular, note that so far we've only been looking at lifts where the value being assigned is computed from a simple arithmetic expression.  As we've already covered, the delta of such an expression is always 0.  But what happens when you lift an expression that has a nonzero delta?  For example, what is the delta (with respect to an insertion into S) of:</p>
(B ^= AggSum([], S(C)))</pre>
Let's consider this in terms of the example data above.  The initial aggregate value is 1, so the table for this expression would be</p>
___B______#_</span></pre>
 < 1 > -> 1</pre>
After we add a tuple to S, the table becomes</p>
___B______#_</span></pre>
 < 2 > -> 1</pre>
We can only do arithmetic on the multiplicity column; we can't just add 1 to B (remember, this is supposed to represent a control flow decision).  So... we actually have to delete the old tuple and put in the new one.  In other words, the value computed by delta for this expression should be</p>
___B_______#_</span></pre>
 < 1 > -> -1</pre>
 < 2 > ->  1</pre>
The full delta rule for lifts reflects this insert/delete pair:</p>
∂(X ^= A) = (X ^= A + ∂A) - (X ^= A)</pre>
If you're paying attention, you should notice something horribly wrong with this.  More on that next week.</p>
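For reference, here's the rule exactly as stated, applied to the example data in a small Python sketch.  The encoding (a dict from tuples to multiplicities) and the helpers `lift` and `subtract` are assumptions made for illustration only.

```python
def lift(value):
    """Table for (B ^= value): a single tuple <value> with multiplicity 1."""
    return {(value,): 1}

def subtract(t1, t2):
    """Subtract multiplicities of t2 from t1, dropping entries that cancel to 0."""
    out = dict(t1)
    for key, mult in t2.items():
        out[key] = out.get(key, 0) - mult
    return {key: mult for key, mult in out.items() if mult != 0}

old_count = 1      # AggSum([], S(C)) before the insertion
delta_count = 1    # delta of AggSum([], S(C)) for one inserted tuple

# (X ^= A + dA) - (X ^= A)
lift_delta = subtract(lift(old_count + delta_count), lift(old_count))
```

`lift_delta` holds the same insert/delete pair as the table above: the old tuple with multiplicity -1 and the new tuple with multiplicity 1.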


Optimizing AGCA (Part 3: Unification)
2012-09-07T00:00:00+00:00
Last week we covered equality lifting, the first half of a two-part process for simplifying expressions.  The second part is commonly known in PL circles as Unification.  In some expressions, it's possible to eliminate a lift by inlining the expression being lifted into a variable.  </p>
For example, let's say you have the following expression:</p>
AggSum([], (A ^= B) * A)</pre>
For all practical purposes, that lift doesn't need to be there.  Instead, we can rewrite this expression as</p>
AggSum([], B)</pre>
Much simpler (and to make things even better, we can get rid of the AggSum too, since the inner expression now has no output variables).</p>
Also, keep in mind that if the expression being lifted has already been fully evaluated (down to a simple numeric value), unification might allow us to do even more evaluation down the line.</p>
Fundamentally, that's all there is to this week's theme.  Take lifts and propagate their values through the expression.  Unfortunately, as with many things, the devil is in the details.  There are a number of situations where unification is simply not possible, and some situations where it's possible, but only with a bit of a hack.  So, let's get to it.  What do you need to be aware of when unifying lifts in AGCA?</p>
Syntactic Restrictions</h3>
Simple lifts like (A ^= B) can be unified anywhere.  However, as you may have noticed, more complex expressions can appear on the right-hand side of a lift.  For example, in the expression</p>
AggSum([], (A ^= B+1) * R(A))</pre>
The syntax of AGCA doesn't allow us to write an expression like</p>
AggSum([], R(B+1))</pre>
Admittedly, this is a somewhat trivial case, but as you'll see when we get to nested subqueries, there's a good reason for this.</p>
Respecting the Scope and Schema of the Complete Expression</h3>
Recall the definitions of the scope and schema of an expression being evaluated.  The scope is the set of variables that are already bound when the expression is evaluated, and the schema is the set of output variables that we're expecting the expression to bind.  If the variable being lifted into appears in either the scope or the schema, it can not be unified.  For example, in the expression</p>
AggSum([], R(A) * ((A ^= B)+(A ^= C)))</pre>
We can't eliminate the lifts, because by the time we get to the two lifts, the variable A is in scope already.  That said, we can do a little bit of rearrangement.  For example, the expression</p>
AggSum([], ((A ^= B) * R(A)) + ((A ^= C) * R(A)))</pre>
is a legitimate rewriting of the first that can be unified.  I'll get into some of these rewritings next week, but most of them are really quite trivial.  Perhaps more challenging is when the variable is in the schema of an expression.  For example:</p>
AggSum([A], R(B) * (A ^= B))</pre>
Now, in this case, we're not allowed to unify A away because it's part of the schema (A must appear in the output).  Yet, there's still a possible simplification of this expression:</p>
AggSum([A], R(A))</pre>
Note, by the way, that simply replacing the lift with an equality and relying on equality lifting to resolve the issue won't work, since A is already out of scope -- we're not allowed to replace it with an equality.  Instead we need a special case to handle this.  If the lifted expression is a simple variable that's not in the scope then we have a chance!</p>
We start with the product expression that the lift is a part of.  In this case:</p>
(R(B) * (A ^= B))</pre>
From here, we backtrack, exactly like we do with equality lifting, until the lifted variable (B) falls out of scope, and then we can attempt to replace all instances of the lifted variable with the variable being lifted into (i.e., replacing all Bs with As).</p>
Respecting the Scope and Schema in which the Complete Expression is Evaluated </h3>
Of course, even with these rewritings, it's possible that neither of these conditions will be satisfied due to external forces.  When an AGCA expression is evaluated, the caller can provide an external scope, or an expected schema.  The most trivial case of this is an expression like </p>
(A ^= B)</pre>
If this is the entire expression being evaluated, the caller must (through external methods) provide a B.  Similarly, the caller expects to read out a result containing a single column: A.  </p>
Because all of this is dependent on the caller, there's very little that we can do about this inside the AGCA framework.  One technique that we've had success with is to use the standard transformations that I'll discuss next week to propagate all the lifts to the head (or as close to the head as possible) of the expression being evaluated, and then explicitly to pick out all the lift expressions that rename variables appearing in the schema.  </p>
For example, if we were preparing to evaluate the above expression with an externally defined scope containing only B, then we would note the presence of the rewriting (A ^= B) at the head of the expression.  We would eliminate this lift from the expression, and replace every instance of B with A.  Then, when evaluating the expression, we would bind A to the value that we would have previously bound to B.</p>
 </p>
And that's it for this week.  Next week, I'll be going over several simple rewrite rules that allow us to minimize the use of AggSum, and other forms of nesting in AGCA.</p>


Optimizing AGCA (Part 2: Lifting Equalities)
2012-09-01T00:00:00+00:00
I'm going to turn, this week, back to optimization of AGCA expressions, and in particular, one pair of optimizations that combine to substantially simplify AGCA expressions: Lifting Equalities, and Equality Unification.  </p>
Recall the four sets of variables that we work with when evaluating any AGCA expression:</p>

Scope variables are variables that are bound (assigned values) by the time the AGCA expression is evaluated (either earlier in the expression, or outside of it).</li>
Schema variables are variables that something outside of the expression being evaluated expects to be bound by this expression.</li>
Input variables are variables that are not bound in the expression we're evaluating (every input variable must be in the scope when the expression is evaluated, but not every variable in the scope must be an input variable)</li>
Output variables are variables that are bound in the expression we're evaluating (when evaluating the expression, every schema variable must be an output variable; if an output variable is in the scope, the expression is treated as a lookup or join)</li>
</ul>
Now, note that because any output variable may be in the scope when an expression is evaluated, the following three expressions are more/less equivalent.  All three have the same input and output variables, and react identically to scope/schema changes from the outside.</p>
R(A,B) * {A = B}</pre>
R(A,B) * (A ^= {B})</pre>
R(A,B) * (B ^= {A})</pre>
Note, by the way, that this is only possible due to the R(A,B).  In the following lift operation, B is an input variable and A is an output variable.</p>
(A ^= {B})</pre>
If we were to look at only the lift/comparison operation (without anything that binds both A and B), then A and B would be input variables for the equality comparison, and one of them would be an output variable in either lift.  In other words, the only difference between these three is which variables are bound and when.  </p>
Now, in general (and in one specific way that you'll see momentarily), output variables are good.  We like output variables, so when it comes to equality predicates, we want to transform them to lifts whenever possible.</p>
Let's look at an example:</p>
R(A) * S(B) * {A = B}</pre>
This expression is a simple, straightforward equi-join, but is somewhat inefficient.  For every row of R, we'll loop over every row of S, and then pick out only the pairs of A x B where the two variables are identical.  In other words, this is effectively a nested loop join.  Now, consider the following (equivalent) expression:</p>
R(A) * (B ^= {A}) * S(B)</pre>
We've gotten rid of the nested loop.  Now, every row in R is extended with a new column B with the same value as the A column, and we do a lookup on that one row of B.  This is effectively a hash join (assuming we have a hash index already built over S).  </p>
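The difference between the two evaluation strategies can be sketched in a few lines of Python.  This is only an illustration under my own assumptions: relations are `Counter`s from tuples to multiplicities, the `Counter` doubles as the hash index over S, and `probes` is a made-up counter for how many candidate rows each strategy touches.

```python
from collections import Counter

R = Counter({(1,): 1, (2,): 2, (3,): 1})   # R(A)
S = Counter({(2,): 1, (3,): 3, (4,): 1})   # S(B)

def nested_loop(R, S):
    """R(A) * S(B) * {A = B}: enumerate every pair, then filter."""
    out, probes = Counter(), 0
    for (a,), m_r in R.items():
        for (b,), m_s in S.items():
            probes += 1
            if a == b:
                out[(a, b)] += m_r * m_s
    return out, probes

def lifted(R, S):
    """R(A) * (B ^= {A}) * S(B): bind B from A, then look B up in S."""
    out, probes = Counter(), 0
    for (a,), m_r in R.items():
        b = a                       # the lift: B is assigned the value of A
        probes += 1
        m_s = S.get((b,), 0)        # hash lookup instead of a scan
        if m_s:
            out[(a, b)] += m_r * m_s
    return out, probes

nested_result, nested_probes = nested_loop(R, S)
lifted_result, lifted_probes = lifted(R, S)
```

Both strategies produce the same table; the nested loop touches |R| * |S| pairs while the lifted form does one lookup per row of R.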
 </p>
Equality Lifting</h3>
 </p>
So how do we generalize from this example?  Let's start with products of simple (relation, comparison, arithmetic expression, and lift) terms.  Our goal is to get an expression Q of this type into the form </p>
Q := X * Y * {A = B}</pre>
Where X and Y are both individual expressions with the following properties (this example works just as well if you swap A and B):</p>

A is bound in X, and may also be bound in Y</li>
B is bound in Y but not X</li>
(and for reasons that will become apparent next week) If possible, B should not be in the schema with which we evaluate Q.</li>
</ul>
We can replace this equality constraint with a lift in either direction:  (A ^= {B}) or (B ^= {A}), but given the first two constraints, it makes the most sense to replace it with (B ^= {A}).  That way, we can commute the lift all the way to the left and get</p>
Q := X * (B ^= {A}) * Y</pre>
As it turns out, working with simple products is not all that restrictive.  We can treat all the remaining operators as if they were simple terms, and use a handful of other transformations (that I'll get to in two weeks), to deal with the nesting structures inside them (e.g., each term in a sum, or the expression being aggregated).  In other words, this is pretty much the algorithm.  Partition (if possible) each expression into two independent subexpressions that each bind one of the variables on either side of the equality, and then substitute the equality with the relevant lift term.</p>
One other note: When dealing with an equality comparison with a more complicated expression in it, you might have additional restrictions on what you can lift.  For example, you might have:</p>
R(A) * S(B,C) * {A = B * C}</pre>
In which case, you could only substitute in (A ^= {B * C}).  Fortunately, in this case, we can also commute the earlier terms in the expression into an amenable form:</p>
S(B, C) * (A ^= {B * C}) * R(A)</pre>
Alrighty.  Next week, we cover an optimization designed to interact with equality lifting: Unification.</p>


The Viewlet Transform (Part 5: Hypergraph Partitioning)
2012-08-25T00:00:00+00:00
I've been talking for several weeks now about tools and techniques related to AGCA and the viewlet transform.  Most recently, I've been talking about optimization techniques for AGCA, but I'm going to take a quick detour this week and provide a quick overview of another technique: Hypergraph Partitioning.  In general, this technique is most suited for optimizing the materialization process, but there are applications to the optimization of aggregate computations as well.</p>
 </p>
 </p>
 </p>
The Query Hypergraph</h3>
 </p>
 </p>
Before I get into the technique though, we need to discuss an alternate representation of AGCA expressions (one that's actually used pretty frequently in query optimization): the query hypergraph</a> (basically a graph where an edge can connect any number of nodes).  This kind of hypergraph can be created for any product of terms (in the trivial case, we have a product of just one term).  Each node in the hypergraph is a variable/column of the query (both output and input variables are treated identically for this purpose).  Each hyperedge corresponds to one term in the product, and each edge connects all variables that appear in the term corresponding to the edge (regardless of whether they appear as inputs or outputs).  </p>
 </p>
 </p>
Hypergraph Partitioning</h3>
 </p>
Remember that the product operator corresponds to the natural join (and that comparisons are implemented as relations).  As a consequence, any disconnected components in the graph effectively correspond to cross products (a natural join with no shared columns).  For example, consider the following trivial example.</p>
R(A) * S(B)</pre>
R(A) is a hyperedge touching only A.  S(B) is a hyperedge touching only B.  Thus A and B are separate disconnected components.  Note, by the way, that there are no comparisons between A and B in this query.  This product is a pure cartesian cross-product.  The following query would not be:</p>
R(A) * S(B) * {A < B}</pre>
In this query, the term { A < B } connects both A and B.  </p>
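Finding the disconnected components is straightforward.  Here's a rough Python sketch, assuming each term has already been reduced to the set of variable names it touches; the function name `components` is hypothetical.

```python
def components(hyperedges):
    """Group variables into connected components.

    Each hyperedge is the set of variables touched by one term of the product.
    """
    comps = []
    for edge in hyperedges:
        edge = set(edge)
        if not edge:
            continue                                   # a term with no variables connects nothing
        touching = [c for c in comps if c & edge]      # components this edge merges
        comps = [c for c in comps if not (c & edge)]   # components it leaves alone
        comps.append(edge.union(*touching))
    return comps

# R(A) * S(B): two hyperedges with no shared variables
cross = components([{"A"}, {"B"}])

# R(A) * S(B) * {A < B}: the comparison is a hyperedge touching both A and B
joined = components([{"A"}, {"B"}, {"A", "B"}])
```

The first query falls apart into two components, while the comparison term in the second glues everything into one.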
Now, if we have disconnected components, it typically pays to materialize them separately.  For example, going with R and S above, we could materialize them as </p>
M( R(A) * S(B) )</pre>
But now we have to store |R| * |S| entries (where |R| is the number of tuples in R).  Worse, if we need to update the materialized view, it will cost us |S| after an update to R, and |R| after an update to S.  On the other hand, we could materialize as</p>
M(R(A)) * M(S(B))</pre>
Now we only store |R| + |S| tuples (between the two materialized views), and updating either can be done in constant time.  Better still, we lose nothing with this representation.  It costs us O(|R|*|S|) to iterate over every element of either materialization of the expression.</p>
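A quick back-of-the-envelope check in Python, with made-up relation sizes (100 and 50 tuples) and hypothetical helper names:

```python
from itertools import product

R = [(a,) for a in range(100)]
S = [(b,) for b in range(50)]

joint = {r + s: 1 for r, s in product(R, S)}   # M( R(A) * S(B) )
m_r = {r: 1 for r in R}                        # M(R(A))
m_s = {s: 1 for s in S}                        # M(S(B))

joint_size = len(joint)
separate_size = len(m_r) + len(m_s)

def insert_r_joint(joint, m_s, r):
    """Maintain the joint view after +R(r): one new entry per tuple of S."""
    touched = 0
    for s in m_s:
        joint[r + s] = joint.get(r + s, 0) + 1
        touched += 1
    return touched

def insert_r_separate(m_r, r):
    """Maintain the separate views after +R(r): a single entry changes."""
    m_r[r] = m_r.get(r, 0) + 1
    return 1

joint_cost = insert_r_joint(joint, m_s, (100,))
separate_cost = insert_r_separate(m_r, (100,))
```

With these sizes the joint view stores 5000 entries against 150, and a single insertion into R touches 50 entries against 1.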
You might say that this is a crazy corner case -- people almost never compute cross products.  That's usually true, but in DBToaster, this situation crops up quite frequently.  For example, consider the three way join query:</p>
R(A) * S(A,B) * T(B)</pre>
The (optimized) delta of this query with respect to +S(dA, dB) is</p>
R(dA) * T(dB)</pre>
Because each delta essentially removes a hyperedge in the query hypergraph, partitioned components are created extremely frequently.</p>
 </p>
Partitioning and Trigger Parameters</h3>
There's also one more situation where this is beneficial.  Consider the following query.</p>
R(A) * S(A) * T(A)</pre>
And its delta with respect to the insertion +S(dA)</p>
R(dA) * T(dA)</pre>
Even though dA is touched by both R and T, we lose nothing if we materialize them separately (as before, evaluation is O(1) either way), and materializing them separately results in more efficient maintenance.  In this case, dA is a trigger parameter -- one of the variables drawn from the relation being modified.  These trigger parameter variables can be excluded from the query hypergraph.</p>
 </p>
Applications to Query Optimization</h3>
In general, when computing aggregates, hypergraph partitioning can be used to select a more efficient computation order.  Each materialized component gets scanned independently, and the resulting aggregate can be computed.</p>
 </p>
And that's about it for now.  Next week, we return to AGCA optimization with a discussion of the interplay between equality and lifts, and how to optimize expressions of this form.</p>


Optimizing AGCA (Part 1: Ringing in the optimizations)
2012-08-20T00:00:00+00:00
Although AGCA is designed primarily for incremental query evaluation, it is a fully fledged query language (albeit only for non-aggregate queries and certain kinds of aggregates).  As such, it's useful to have a strategy for optimizing arbitrary query expressions.  As it turns out, optimization is relevant, even in the incremental case, as it can often produce simpler expressions that are easier to incrementally maintain.  Over the next few weeks, I'll discuss several techniques that we've developed for optimizing, simplifying, and generally reducing the cost of evaluating AGCA queries.</p>
But before I get into any of that, let me quickly bring up one point that I've been glossing over, mostly as a point of convenience.  </p>
By default, AGCA expressions are evaluated as the English language is read: Left to Right</strong></p>
This has two consequences.  First, ordering has an impact on query evaluation performance.  We'll be returning to that before long.  For now though, the important feature is that information flows left-to-right in AGCA as well.  Specifically, consider the following expression: </p>
R(A) * {A}</pre>
or in SQL</p>
SELECT SUM(R.A) FROM R</pre>
In the SQL query, you can think of information as flowing from the R table to the SUM operator.  This notion of information flow is pretty common in programming languages, and AGCA incorporates it as well.  I mentioned the idea of binding patterns when I first introduced the special tables used for value expressions (i.e., {A}).  The term R(A) binds the A variable, which is then used by the term {A}.  In short, information is flowing through the product operation from R(A) to {A}.  In AGCA, this information flow is always left to right</strong>.  This is more than just a matter of convenience.  It makes it possible to identify binding patterns in a single scan of the query, rather than an exponential search, which in turn makes many of the optimizations that I will discuss tractable.  </p>
Unfortunately, this also has the side effect of making certain expressions (sometimes) invalid.  For example, the expression</p>
{A} * R(A)</pre>
could not be evaluated in isolation.  However, the expression</p>
S(A) * {A} * R(A)</pre>
is perfectly valid.  In other words, AGCA's product operation is not generally commutative (A * B ≠ B * A).  Many terms do commute (e.g., R(A) and S(A)), but commutativity is not always possible.  Worse still, the commutativity of two terms can not be determined locally.  As in the above example, {A} and R(A) in isolation do not commute.  However, if the variable A is bound outside of those two terms, then commutativity is possible.  That said, commutativity can be determined locally if the scope in which an expression is evaluated is known (and this information is typically available during optimization).  </p>
From now on, I'll assume that commutativity can be easily determined.  </p>
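The single left-to-right scan can be sketched in Python as follows.  This is a simplification under my own assumptions: each term is described only by the variables it needs bound (inputs) and the variables it binds (outputs), and `is_valid` is a hypothetical name.

```python
def is_valid(terms, scope=frozenset()):
    """Check a product of terms in one left-to-right pass.

    terms: list of (inputs, outputs) pairs of variable-name sets.
    scope: variables already bound outside the expression.
    """
    bound = set(scope)
    for inputs, outputs in terms:
        if not set(inputs) <= bound:
            return False            # a term needs a variable nobody has bound yet
        bound |= set(outputs)       # information flows to the terms on the right
    return True

value_A = ({"A"}, set())   # {A}: reads A, binds nothing
rel_R = (set(), {"A"})     # R(A): binds A
rel_S = (set(), {"A"})     # S(A): binds A

isolated = is_valid([value_A, rel_R])                  # {A} * R(A)
prefixed = is_valid([rel_S, value_A, rel_R])           # S(A) * {A} * R(A)
with_scope = is_valid([value_A, rel_R], scope={"A"})   # {A} * R(A), A bound outside
```

The same two terms are rejected in isolation and accepted once A is bound, either by an earlier term or by the surrounding scope.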
As I present each of these rules, I'll briefly discuss the core ideas of each optimization, comment on how the rule interacts with both incremental and batch query evaluation, and then summarize the rules as a set of transformations over AGCA expressions.  </p>
Pre-evaluation</h3>
A relatively straightforward, practically braindead optimization that appears in nearly all compilers is constant folding.  If an expression such as 1+2 is fed into the compiler, the compiler silently turns that into a constant 3.  DBToaster does this, but there are several nuances that arise in the DBToaster case.  </p>
Firstly, as I've mentioned before, AGCA is a ring.  Thus, the constants 1 and 0 have some useful properties.  When an expression is multiplied by 1 or added to 0, the result is unchanged, so we can eliminate any appearance of {1} in a product, or {0} in a sum.  Furthermore, whenever {0} appears in a product, the entire product term can be replaced by {0}.  Although it might seem unlikely that such a query would ever arise, {0} terms actually appear quite often in the incremental processing case.  The delta rule produces a {0} for a large number of different terms, and queries with nested subqueries (which I'll get to one of these days) often produce queries with a ({1} + {-1}) term.</p>
A second nuance is value expressions themselves.  Practically speaking, the following two expressions are identical</p>
{A}+{2} == {A+2}</pre>
For a number of reasons, the latter form is considerably more efficient most of the time.  I'll get into the details next week when I talk about a materialization optimization called Hypergraph Factorization, but essentially, we (almost) always want to put value terms together.</p>
The Rules</span></p>
{0} + Q => Q</p>
{0} * Q => {0}</p>
{1} * Q => Q</p>
{X} + {Y} => {X + Y} (see caveat for Hypergraph Factorization)</p>
{X} * {Y} => {X * Y} (see caveat for Hypergraph Factorization)</p>
{f(X, Y, Z, …)} => {eval(f(X, Y, Z, …))}</p>
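Here's a minimal Python sketch of the constant-only subset of these rules.  The tuple encoding (`("const", v)`, `("rel", name)`, `("sum", [...])`, `("prod", [...])`) is something I made up for the example, and the sketch only folds numeric constants; it doesn't merge value expressions that contain variables.

```python
def simplify(e):
    """Apply the {0}/{1} identities and fold numeric constants, bottom-up."""
    kind = e[0]
    if kind not in ("sum", "prod"):
        return e
    terms = [simplify(t) for t in e[1]]
    identity = 0 if kind == "sum" else 1
    const, rest = identity, []
    for t in terms:
        if t[0] == "const":
            # Constants have no variables, so they can be gathered freely.
            const = const + t[1] if kind == "sum" else const * t[1]
        else:
            rest.append(t)
    if kind == "prod" and const == 0:
        return ("const", 0)                      # {0} * Q => {0}
    if const != identity or not rest:
        rest = [("const", const)] + rest         # keep a non-trivial constant
    return rest[0] if len(rest) == 1 else (kind, rest)

R = ("rel", "R")
one_times_r = simplify(("prod", [("const", 1), R]))
zero_plus_r = simplify(("sum", [("const", 0), R]))
r_times_zero = simplify(("prod", [R, ("const", 0)]))
nested = simplify(("prod", [R, ("sum", [("const", 1), ("const", -1)])]))
```

The last case is the ({1} + {-1}) pattern mentioned above: the inner sum folds to {0}, which then wipes out the whole product.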
Polynomial Factorization</h3>
One additional feature of rings is that the product operation distributes over union.  In other words</p>
A * (B + C) <=> (A * B) + (A * C)</pre>
This operation goes both ways, and we can take advantage of that to reduce our workload. Whenever we encounter an expression like (A * B) + (A * C), we can rewrite it as A * (B + C), and save ourselves a re-evaluation of A.  This is particularly important for incremental processing, as the delta operation frequently produces expressions of this form (and better still, B and C are typically value terms that can be further optimized by pre-evaluation).</p>
This optimization bears similarity to both a programming language technique called common subexpression elimination, as well as traditional arithmetic factorization (hence the name).  That said, there's a nuance in factorizing AGCA expressions that doesn't arise in arithmetic factorization: commutativity.  Let's say we have an expression of the form</p>
(A * B) + (C * A)</pre>
If this were an arithmetic polynomial, clearly we could factorize out the A.  Unfortunately, in AGCA, terms don't always commute.  This leaves us with two possibilities: Either we can commute the A with the C and produce the following factorized expression:</p>
A * (B + C)</pre>
Or we can commute the A with the B and produce the following factorized expression:</p>
(B + C) * A</pre>
Note that in the latter case, the A appears after the factorized terms in the expression.</p>
In general, factorization is a hard problem.  In any given polynomial, there might be several terms that could be factorized out of the expression.  For example</p>
(A * B) + (A * C) + (D * B)</pre>
This expression can be factorized out to one of the two following expressions</p>
(A * (B + C)) + (D * B)</p>
 ((A + D) * B) + (A * C)</p>
It's not always clear which of these will be more efficient to evaluate.  A simple heuristic is to factorize out terms that are guaranteed to involve a cost (e.g., table terms), but a cost-based optimizer is typically the most effective.   We'll get to that in a few weeks.</p>
The Rules</span></p>
 </pre>
(… * A * …) + (… * A * …) + … => A * ((… * {1} * …) + (… * {1} * …) + …) (if A commutes to the head of each term)</p>
(… * A * …) + (… * A * …) + … => ((… * {1} * …) + (… * {1} * …) + …) * A (if A commutes to the tail of each term)</p>
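As a rough illustration of the first rule, here's a Python sketch that factors out a shared head term.  It assumes each product is just a tuple of term names and that any commuting has already been done, so it only looks at the first position; `factor_head` is a hypothetical name.

```python
def factor_head(products):
    """Factor a common leading term out of a sum of products.

    products: list of non-empty tuples of term names, e.g. [("A", "B"), ("A", "C")].
    Returns (head, remainders), or None if the products don't share a head.
    """
    heads = {p[0] for p in products}
    if len(heads) != 1:
        return None
    head = heads.pop()
    # A product consisting only of the head leaves {1} behind.
    remainders = [p[1:] or ("{1}",) for p in products]
    return head, remainders

shared = factor_head([("A", "B"), ("A", "C")])      # (A * B) + (A * C)
leaves_one = factor_head([("A", "B"), ("A",)])      # (A * B) + A
not_at_head = factor_head([("A", "B"), ("C", "A")]) # (A * B) + (C * A)
```

The last call returns `None` because this sketch never commutes terms; a real implementation would first check whether A can be moved to the head (or tail) of each product.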
 </p>
That's it for this week.  Next week, I'll jump back to the optimized viewlet transform with Hypergraph Factorization.</p>


The Viewlet Transform (Part 4: Input Variables and Partial Materialization Continued)
2012-08-12T00:00:00+00:00
Last week, I talked about a variation on the core viewlet transform idea.  The delta operation introduces input variables into a query, which can not be properly materialized.  Often, these input variables can be eliminated through variable unification (something I'll start getting into in a week or two), but not always.  In these cases, it is necessary to materialize the delta query in parts.  </p>
We do this by splitting a query Q into two (or more) parts Qmain, and Q1, Q2, … etc.  We materialize Q1, Q2, … etc, and then whenever we need to evaluate Q, we compute Qmain(Q1, Q2, …).  We express this partitioning by a special materialization operator M, and recur through the query expression to find the exact bits we can materialize.</p>
Before I actually get into the partial materialization process, let me quickly introduce four bits of nomenclature regarding queries (that I'll define more thoroughly next week): inputs</strong>, outputs</strong>, scope</strong>, and schema</strong>.  The inputs and outputs of an expression are the unbound and bound (respectively) variables appearing in the expression.  The scope of an expression is the set of variables that are bound when the expression is evaluated.  The schema of an expression is the set of variables that an expression is expected to bind.  Note that while the inputs and outputs of an expression are uniquely identified by the expression, scope and schema are contextual, and can change depending on how the expression is evaluated.  Also note that any inputs must</strong> be in the scope, and that anything in the schema must</strong> be in the outputs of a query.</p>
That said, we eliminate input variables through a recursive process that starts with a fully materialized query</p>
M(Q)</pre>
and recursively descend into the expression using the following rules, until the expression chosen to be materialized (the materialization decision) has no inputs.</p>
 </p>
Materializing AggSums</h3>
M(AggSum([…], Q))</pre>
If we have an AggSum that we need to materialize, one of two things can happen.  If the AggSum has no inputs, we're done.  If the AggSum has inputs, then we need to recur, and push the materialization operator inside the AggSum. </p>
AggSum([…], M(Q))</pre>
As I mentioned last week, we can actually do better. As we push the materialization operator down into the AggSum, we keep track of the variables used by the AggSum.  Note that the schema of the query nested inside the AggSum (Q, that is) must be identical to the group-by variables of the AggSum.  As we push the materialization operator down into the query, we keep track of its schema.  When we finally settle on a location for the materialization operator, we look at both its schema and its outputs.  If there are more outputs than the schema calls for, we add an additional AggSum to trim the unnecessary outputs away.  </p>
Materializing Relations, Value Expressions, and Comparison Predicates</h3>
M(R(A, B, …))</pre>
M({f(A, B, …)})</pre>
M({f(A, B, …) θ g(A, B, …)})</pre>
Relations never have inputs and are always materialized.  Value expressions and comparisons are exclusively inputs, and are never materialized alone (i.e., unless bound by a relation)</p>
Materializing Unions and Joins</h3>
M(A + B + …)</pre>
M(A * B * …)</pre>
As with AggSums, unions or joins with no inputs are always materialized in their entirety.  For both unions and joins, we first partition the expression into a subset of the expression with no inputs (A, B, …), and multiple subsets with inputs (C, D, E, …).  We materialize the input-free bit as is, and recursively descend into the remaining components</p>
M(A + B + …) + M(C) + M(D) + …</pre>
M(A * B * …) * M(C) * M(D) * …</pre>
There are a few caveats for materializing joins.  Specifically, there can be ordering constraints (which I'll get into next week) over the terms of a join (they don't quite commute).  It may sometimes be necessary to partition the expression into multiple subsets for materialization.  If there is a term (let's call it C) that must occur to the right of (A*B), and to the left of (D*E), then we would choose to materialize it as</p>
M(A * B) * M(C) * M(D * E)</pre>
Materializing Lifts</h3>
M(A ^= {B})</pre>
A lift is tricky in that it involves both input and output variables.  Typically, the lift will get unified away (again, something I'll talk about in a few weeks).  Other times, it may be possible to include the lift in an input-free query if another relation binds the inputs of the lift (in which case it'll get caught by the Union/Join case).  In either case, if we've gotten to this point, the best we can do is to not materialize anything.</p>
When I get into nested subqueries, and we start using lifts for more complex things, this rule will need to be changed.</p>
 </p>
Alright.  I know this week was short (and a bit cheap), but the CIDR deadline's this weekend.  Next week, I'll get back to some more interesting stuff, with optimization techniques for AGCA.  Until then, cheers.</p>
 </p>
 </p>
 </p>


The Viewlet Transform (Part 3: Input Variables and Partial Materialization)
2012-08-06T00:00:00+00:00
For the past few weeks, I've been discussing the viewlet transform.  The key idea of this process is that because the delta transform is closed over AGCA (that is, it doesn't add any funny business to the query), it's possible to materialize and incrementally maintain the deltas of a query just as easily as we can maintain the original query.  Because the deltas of a query are materialized as their own views, practically no processing is required to incrementally maintain the query; we just read from the delta view and update the original query accordingly.</p>
This process continues recursively.  The delta queries each have their own deltas -- we'll call these second-order deltas of the original query. The second-order deltas, of course, each have third-order deltas, and so forth.  This continues, building up a hierarchy of auxiliary views, each used to efficiently maintain their parents in the hierarchy.  Even though this hierarchy has many views in it, it's still typically possible to maintain them efficiently.</p>
Of course, there are always corner cases.  This week, I'm going to discuss one of them: Input Variables.  Let's have a look at a relatively straightforward query:</p>
Q := AggSum([], R(A,B) * S(C,D) * {A < C})</pre>
Or in SQL</p>
SELECT COUNT(*) AS Q FROM R, S WHERE R.A < S.C</pre>
Innocuous as this query is, when we take its delta, we run into a problem.  Let's take the delta with respect to R (the delta with respect to S is nearly the same).  </p>
dR(<X,Y>) Q := AggSum([], S(C,D) * {X < C})</pre>
In other words, whenever we insert a tuple (row) into R, we need to run the above query, substituting X and Y with values from the tuple being inserted into R.  dR Q</span> is certainly simpler, but there's a problem.  Recall that special tables</em> (like {A < C} or {X < C}) have an infinite number of rows.  This was fine in the original query Q, because the query had terms that limited the number of distinct values of A and C that we were interested in.  The special table might have had an infinite number of rows, but the overall query did not.  </p>
In the delta query however, there's no term limiting the number of distinct values of X that we're interested in.  It's ok when we actually evaluate the delta query because we have a value of X (from the tuple being inserted into R), but until we get that value we can't actually compute a value for it.  In other words, we can't store the results of the query.  </p>
There's actually a term for this in programming languages and query processing: X is known as an unbound</strong> or unsafe</strong> variable (or sometimes a variable that is not range restricted).  AGCA calls it an input variable</strong> (or parameter) of dR Q</span>.  We don't know what it is, so we can't evaluate the expression.  </p>
There are a few things we can do in this situation.  If you're particularly familiar with query processing techniques, you might look at this query and say "But wait, we can actually materialize this using a range tree (or similar index structure)."  And you'd be right.  Of course, then I could give you a more complex query (e.g., replace the inequality with an arbitrary black box function f(A, B)) and we'd be right back where we started.  For now, let's assume that it's simply impossible to materialize the entire expression in one go.  </p>
So what else is left?  Well, if we can't materialize the entire thing, then what about materializing it in bits?  We can create one or more views that can</strong> be stored efficiently, and then do some (but not all) of the heavy lifting afterwards, once we actually know what X is.</p>
Let's make this a bit clearer and procedural.  We have a query</p>
AggSum([], S(C,D) * {X < C})</pre>
Now I'm going to introduce an extra little bit of syntax into AGCA.  We'll call it the materialization operator M.  Everything in the materialization operator is going to get materialized.  Everything outside of the materialization operator is going to be evaluated when the query results need to be accessed.  An AGCA query with a materialization operator in it is called a materialization decision</strong>.</p>
We arrive at a final materialization decision by starting with the default (naive) decision where we materialize everything.</p>
M(AggSum([], S(C,D) * {X < C}))</pre>
… and then iteratively refining the decision until we arrive at a satisfactory one.  As I've been saying, a materialization decision with an input variable is not valid, so we need to rewrite it.  Input variables only appear in these special tables (the ones in the curly braces), so the basic idea is actually pretty easy.  We'll start by pushing the materialization operator inside the AggSum:</p>
AggSum([], M(S(C, D) * {X < C}))</pre>
Now we're looking at a materialization decision applied to a product.  We can split the materialization operator across the elements of the product so that only the parts without input variables get materialized</p>
AggSum([], M(S(C,D)) * {X < C})</pre>
And we're there.  This materialization decision is valid, but not quite as efficient as it could be.  </p>
Specifically, look at what we're storing: S(C,D).  We care about the individual values of C (because they get applied to the predicate {X < C}), but D is never used, and will actually get aggregated away.  We can save ourselves a little trouble when we need to evaluate the delta query by storing only an aggregated value.  In other words, we can tack on an extra AggSum.</p>
AggSum([], M(AggSum([C], S(C,D))) * {X < C})</pre>
Note that C is a group-by variable of this AggSum because it is needed by the predicate.</p>
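Here is a rough sketch of that final materialization decision in Python (used purely for illustration; this is not how DBToaster implements it).  The materialized part, M(AggSum([C], S(C,D))), becomes a dictionary keyed by C, and the {X < C} predicate is only applied once X is known.  The function names and the contents of S are made up for the example.

```python
from collections import defaultdict

def materialize_s_by_c(s):
    """M(AggSum([C], S(C,D))): map each value of C to its total multiplicity."""
    view = defaultdict(int)
    for (c, d), mult in s.items():
        view[c] += mult  # D is aggregated away
    return dict(view)

def eval_delta(view, x):
    """AggSum([], M(...) * {X < C}), evaluated once X is known."""
    return sum(mult for c, mult in view.items() if x < c)

# S maps each (C, D) row to its multiplicity (hypothetical example data).
S = {(1, 10): 1, (3, 10): 2, (3, 20): 1, (5, 30): 1}

view = materialize_s_by_c(S)   # stored ahead of time, independent of X
delta = eval_delta(view, 2)    # computed when a row with X = 2 arrives
```

The stored view has one entry per distinct value of C rather than one per row of S, which is the saving the extra AggSum buys.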
Alright.  Hopefully that gives you a bit of the flavor of rewriting queries to support input variables.  The details of this process are actually quite messy, but I'll see if I can cover them in detail next week.  For the impatient, our VLDB paper "DBToaster: Higher-Order Delta Processing for Dynamic, Frequently Fresh Views" gives a reasonable overview of the process.</p>


The Viewlet Transform (Part 2: The Naive Viewlet Transform)
2012-07-29T00:00:00+00:00
Last week, I introduced you to how deltas work in AGCA.  To recap</p>
∂T(A,B,C,…)       = {+/- 1} * (A ^= {X}) * (B ^= {Y}) * (C ^= {Z}) * …</pre>
∂S(…)             = {0}</pre>
∂(Q1 + Q2)        = ∂Q1 + ∂Q2</pre>
∂(Q1 * Q2)        = (Q1 * ∂Q2) + (∂Q1 * Q2) + (∂Q1 * ∂Q2)</pre>
∂(AggSum([…], Q)) = AggSum([…], ∂Q)</pre>
∂({…})            = {0}</pre>
∂(V ^= {…})       = {0} (for the simplified Lift operation only)</pre>
Now, we've been down in the nitty gritty of AGCA for a while now.  Let's pop our heads up for a moment to remember where we're going with all of this.</p>
We have a query (let's call it Q) and we want to be able to incrementally maintain it.  That is, we want to store a copy of the results of evaluating Q on a database (stored on disk, in memory, anywhere really), and every time the database changes in some way, we want to update the stored results to match</strong>.</p>
Applying the Delta Transform</h3>
That's fairly easy to do somewhat efficiently if we have these deltas.  Let's say we have the query
Q := AggSum([], R(A,B) * S(B,C) * {A})</pre>
or in SQL
SELECT SUM(A) AS Q FROM R, S WHERE R.B = S.B;</pre>
Let's say R contains
___A__B______#__</span></pre>
 < 1, 1 > -> 1</pre>
 < 1, 2 > -> 1</pre>
 < 2, 2 > -> 1</pre>
and S contains
___B__C______#__</span></pre>
 < 1, 1 > -> 2</pre>
 < 2, 2 > -> 1</pre>
For this data, Q = 5.
Let's say we insert a new row: S(2,1).  We could certainly re-evaluate the entire query from scratch and discover that the new result is Q = 8, but this would be pretty inefficient.  Even in the best case, where everything fits in memory, re-evaluating the join requires O(|R| + |S|) work.  That's where the deltas come in.  The delta of Q tells us how the query results change with respect to a change in the table.  So let's take the delta of Q with respect to an insertion of tuple <@Y,@Z> into S.</p>
∂Q := ∂(AggSum([], R(A,B) * S(B,C) * {A}))</pre>
   := AggSum([], ∂(R(A,B) * S(B,C) * {A}))</pre>
   := AggSum([], R(A,B) * ∂(S(B,C) * {A})</pre>
                 + ∂R(A,B) * (S(B,C) * {A})</pre>
                 + ∂R(A,B) * ∂(S(B,C) * {A}))</pre>
   := AggSum([], R(A,B) * ∂(S(B,C) * {A})</pre>
                 + {0} * (S(B,C) * {A})</pre>
                 + {0} * ∂(S(B,C) * {A}))</pre>
   := AggSum([], R(A,B) * ∂(S(B,C) * {A}))</pre>
   := AggSum([], R(A,B) * (S(B,C) * ∂{A}</pre>
                           + ∂S(B,C) * {A}</pre>
                           + ∂S(B,C) * ∂{A}))</pre>
   := AggSum([], R(A,B) * (S(B,C) * {0}</pre>
                           + ((B ^= {@Y}) * (C ^= {@Z}) * {A})</pre>
                           + ∂S(B,C) * {0}))</pre>
   := AggSum([], R(A,B) * (B ^= {@Y}) * (C ^= {@Z}) * {A})</pre>
I'm not going to get into the details of optimizing AGCA expressions (yet), but trust me for now that the following (simpler) query is equivalent</p>
∂Q := AggSum([], R(A,@Y) * {A})</pre>
or in SQL</p>
SELECT SUM(A) FROM R WHERE R.B = @Y;</pre>
Note, by the way, that @Y is a parameter to this delta query.  When you evaluate a delta query (for example our delta for insertions into S), these parameters take their value from the tuple being modified (so when you insert <2,1> into S, then @Y = 2).  That said, @Y is just a normal variable/column.  There's nothing special about it (other than the @ in the name).</p>
The delta query tells us how the query results change.  If we insert <2,1> into S, then we evaluate the delta query for insertions into S (∂Q above), setting @Y to 2, and @Z to 1.</p>
AggSum([], R(A, 2) * {A})</pre>
… which, for our initial dataset above, gives us ∂Q = 3.  To figure out what Q will give us on the modified database (after inserting <2,1> into S), we just add ∂Q to our initial result (5 + 3 = 8).</p>
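The arithmetic above can be checked with a short Python sketch (Python is used only for illustration; the function names are made up).  Relations are dictionaries from rows to multiplicities, using the example data for R and S.

```python
# Relations as maps from row to multiplicity (the example data above).
R = {(1, 1): 1, (1, 2): 1, (2, 2): 1}   # columns (A, B)
S = {(1, 1): 2, (2, 2): 1}              # columns (B, C)

def eval_q(r, s):
    """Q := AggSum([], R(A,B) * S(B,C) * {A}), computed from scratch."""
    total = 0
    for (a, b), r_mult in r.items():
        for (b2, c), s_mult in s.items():
            if b == b2:                       # natural join on B
                total += r_mult * s_mult * a  # {A} scales by the value of A
    return total

def eval_delta_q(r, y):
    """dQ := AggSum([], R(A,@Y) * {A}) for an insertion of <@Y,@Z> into S."""
    return sum(mult * a for (a, b), mult in r.items() if b == y)

q_before = eval_q(R, S)       # 5
delta = eval_delta_q(R, 2)    # 3, for the insertion of <2,1> into S

S_after = dict(S)
S_after[(2, 1)] = S_after.get((2, 1), 0) + 1
q_after = eval_q(R, S_after)  # 8, the same as q_before + delta
```

Note that the delta only scans R, while the from-scratch evaluation has to touch both tables.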
Parameters and AggSums</h3>
I keep saying that parameters are just normal variables, and that there's nothing special about them.
That's mostly true.  I actually oversimplified a bit on the delta rules.</p>
We want these parameters to be visible from the outside so that evaluating ∂Q for a specific insertion (or deletion) essentially amounts to selecting a single row from the output of ∂Q.  In other words, the AggSums need to be rewritten slightly so that the parameters appear in the group-by variables (where appropriate).  That is, the correct delta with respect to an insertion into S : +S(@Y,@Z) is</p>
AggSum([@Y], R(A, @Y) * {A})</pre>
or in SQL</p>
SELECT R.B, SUM(R.A) FROM R GROUP BY R.B;</pre>
That little hiccup out of the way, let's get to the actual viewlet transform</p>
Auxiliary Views</h3>
The delta query is an improvement over evaluating the entire query from scratch.  For this particular example though, we still need to scan over multiple rows of R (even if it is only a small subset of R).  We can do even better.
Right now, every time the database changes we update Q with ∂Q.</p>
ON +S(@Y, @Z)</pre>
  Q += AggSum([@Y], R(A,@Y) * {A})</pre>
But recall that for any AGCA query Q, ∂Q is just an ordinary, simple, unexceptional AGCA query (no funny business is introduced by the ∂).  If we can store ('materialize' to use the technical term) the results of Q, what's to stop us from storing the results of ∂Q?  Nothing!</p>
Let's say we had another view materialized (let's call it M_S), this time with a group by variable:</p>
M_S[Y] := AggSum([Y], R(A, Y) * {A})</pre>
For the initial dataset above, this would contain</p>
___Y______#__</span></pre>
 < 1 > -> 1</pre>
 < 2 > -> 3</pre>
This view can help us substantially when we need to update Q after an insertion into S.  Expressing this update as a trigger:</p>
ON +S(@Y, @Z)</pre>
  Q += M_S[@Y]</pre>
In other words, to update Q, we just need to look up one row of M_S.  The update can be done in constant time!  That said, we now have an extra view that we need to maintain.  Fortunately, M_S is simpler than Q, and has only one table, in this case R.  Whenever R changes, we need to update M_S.  Since M_S is defined in terms of a normal, ordinary AGCA expression, we update it in exactly the same way that we update Q, using the delta of M_S.  For an insertion of <@X,@Y> into R, this would be:</p>
∂M_S[Y] := ∂(AggSum([Y], R(A,Y) * {A}))</pre>
        := AggSum([@X, @Y, Y], (A ^= {@X}) * (Y ^= {@Y}) * {A})</pre>
        := (Y ^= {@Y}) * {@X}</pre>
Or, expressed as a trigger</p>
ON +R(@X, @Y)</pre>
  M_S[Y] += (Y ^= {@Y}) * {@X}</pre>
This can be simplified a bit, since the update operation only produces one row</p>
ON +R(@X, @Y)</pre>
  M_S[@Y] += {@X}</pre>
(also a constant time operation)</p>
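To see the triggers working together, here is a minimal Python sketch (an illustration only, not DBToaster's generated code).  The class and method names are invented.  It also assumes a second auxiliary view, called M_R here, for the delta of Q with respect to insertions into R; the text above only spells out M_S, so treat M_R as an assumption derived by symmetry.

```python
from collections import defaultdict

class Views:
    def __init__(self):
        self.q = 0                   # Q := AggSum([], R(A,B) * S(B,C) * {A})
        self.m_s = defaultdict(int)  # M_S[Y] := AggSum([Y], R(A,Y) * {A})
        self.m_r = defaultdict(int)  # assumed: M_R[B] := AggSum([B], S(B,C))

    def insert_r(self, x, y):
        """ON +R(@X, @Y)"""
        self.q += x * self.m_r[y]    # assumed trigger: Q += {@X} * M_R[@Y]
        self.m_s[y] += x             # M_S[@Y] += {@X}

    def insert_s(self, y, z):
        """ON +S(@Y, @Z)"""
        self.q += self.m_s[y]        # Q += M_S[@Y]
        self.m_r[y] += 1             # assumed trigger: M_R[@Y] += {1}

views = Views()
for a, b in [(1, 1), (1, 2), (2, 2)]:
    views.insert_r(a, b)
for b, c in [(1, 1), (1, 1), (2, 2)]:   # <1,1> has multiplicity 2
    views.insert_s(b, c)
q_initial = views.q                      # 5
views.insert_s(2, 1)
q_final = views.q                        # 8
```

Every trigger body is a dictionary lookup plus an addition, so each insertion costs constant time no matter how large R and S are.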
Recursive Deltas</span></p>
What's happening here is that we're saving the results of Q and maintaining them with (several instantiations of) ∂Q.  The key idea of the viewlet transform is that we can also save the results of ∂Q and maintain them with (several instantiations of) ∂(∂Q).  This process repeats recursively, giving us ∂(∂(∂Q)), ∂(∂(∂(∂Q))), and so on.</p>
Every time we add another ∂, another table drops out, making the delta query simpler.  After enough repetitions, we end up with a query that doesn't depend on the database at all (e.g., ∂M_S above).  At this point, we can stop, since the query can be evaluated in constant time.</p>
This is the viewlet transform.  Start by materializing the original query, and then alternate computing its delta(s), and recursively materializing the delta(s)</strong>.</p>
I've obscured the issue a bit by not subscripting my ∂s.  Remember that each delta is taken with respect to a particular event.  Q has four deltas: for both insertion and deletion of both R and S (∂+R</span>, ∂-R</span>, ∂+S</span>, ∂-S</span>).  Similarly, M_S has two deltas: for both the insertion and deletion of R (M_S does not depend on S).</p>
The viewlet transform of Q produces five views: one for Q, and one for each of the first-tier deltas (the second-tier deltas are all constants).</p>
As it turns out, we can be even more efficient than that!  Furthermore, the procedure I describe above runs into problems with special tables (as a bit of a teaser for this, take a close look at the delta of Q with respect to insertions into R).  This naive</em> viewlet transform is insufficient.  Next week, I'll start discussing some changes to the naive viewlet transform that make it more practical and efficient.</p>


The Viewlet Transform (Part 1: Deltas in AGCA)
2012-07-19T00:00:00+00:00
Over the last few weeks, I've been covering various aspects of AGCA, the language for incremental processing behind DBToaster.  Now, I'm going to chat a bit about the heart of DBToaster: the viewlet transform.  </p>
The basic idea behind the viewlet transform is actually something that's been around for a very long time: delta queries, commonly used for Incremental View Maintenance.  Let's say you have a query Q</em>. If you need to evaluate Q</em> over and over again, it usually makes sense to evaluate it once and just store the results somewhere.  </p>
That's great, but if the data that goes into Q</em> changes, you need to update the stored results accordingly.  However, instead of re-evaluating the entire query from scratch, you can compute what's known as a delta query.  The Delta of Q</em> (with respect to table T</em>) is a simplified form of Q</em> that tells you how the results of Q</em> will change (when you apply some change ∂T</em> to table T</em>).  Put algebraically:</p>
Q(T+∂T) = Q(T) + ∂Q(T, ∂T)</em></p>
The idea is that computing ∂Q</em> is more efficient, and generally faster than computing Q</em> from scratch.  </p>
Let's have a look at how AGCA interacts with this delta operation.  Recall that everything in AGCA is multiplicities.  We just need to figure out which rows change, and what the change in their multiplicities will be.  To start, we're going to work with updates to one table at a time, and updates to one row of that table at a time (note that this may still result in multiple rows changing in the result, but the input data will only change by one row at a time).  </p>
Deltas of Tables</h3>
Concretely, what happens to the results of an AGCA query when we insert a single row, with values <X, Y, Z, …></em> into table T</em>?</p>
If the table is the one being updated, then the change is just the single row being inserted.  We can construct a singleton row in AGCA by using the lift operation, just like I described last week.</p>
∂T(A,B,C,…) = (A ^= X) * (B ^= Y) * (C ^= Z) * …</pre>
If the table isn't the one being updated, then there is no change at all.  In other words, we need a special table with every single row having a multiplicity of 0.  We know how to do that.</p>
∂S(…) = {0}</pre>
The delta for a deletion is similar.  Again, remember that AGCA uses multiplicities for everything.  If we're deleting a row from table T</em>, we want to reduce the multiplicity of that row by 1.  How do we do this?  We add a negative multiplicity</p>
∂T(A,B,C,…) = {-1} * (A ^= X) * (B ^= Y) * (C ^= Z) * …</pre>
Deltas of Bag Unions</h3>
What about bag unions?  What if we have an expression like </p>
Q = Q1 + Q2</pre>
Well, let's assume that we can figure out an expression for computing the deltas of Q1 and Q2.  Then we know that </p>
Q + ∂Q = (Q1 + ∂Q1) + (Q2 + ∂Q2)</pre>
AGCA is a ring, so the normal rules (distributivity, associativity, commutativity, etc…) for + and * apply to bag union and natural join as well.  So, reshuffling a bit, we get</p>
Q + ∂Q = (Q1 + Q2) + (∂Q1 + ∂Q2) = Q + (∂Q1 + ∂Q2)</pre>
In other words</p>
∂(Q1 + Q2) = ∂Q1 + ∂Q2</pre>
Deltas of Natural Joins</h3>
We can do something similar for natural joins.</p>
Q + ∂Q = (Q1 + ∂Q1) * (Q2 + ∂Q2) = (Q1*Q2) + (Q1*∂Q2) + (∂Q1*Q2) + (∂Q1*∂Q2)</pre>
So.</p>
∂Q = (Q1*∂Q2) + (∂Q1*Q2) + (∂Q1*∂Q2)</pre>
This one's a bit stranger, so let's look at it in a bit more detail.  If only Q1</em> changes (but not Q2</em>), then ∂Q2</em> = {0}.  So…</p>
∂Q = (Q1*{0}) + (∂Q1 * Q2) + (∂Q1*{0})</pre>
Again, we benefit from AGCA being a ring.  {0} is AGCA's additive identity (a fancy way of saying that {0} + X = X), and in any ring, multiplying by the additive identity gives it right back ({0} * X = {0}).  So...</p>
∂Q = {0} + (∂Q1 * Q2) + {0} = ∂Q1 * Q2</pre>
This kinda makes sense.  A join takes every row of one table and matches it against every row of the other table.  If you insert a row into one of the two tables (for example, inserting ∂Q1 into Q1), the change to the final result comes from joining it against every row of the other table (Q2 in this case).  The exact same thing happens if you only insert into Q2, but not Q1.</p>
So what if both Q1 and Q2 change?  For example, the query could be</p>
Q = T(A,B) * T(B,C) </pre>
This is also known as a self-join.  If we insert a row into T, then there'll be three parts to the update:</p>

The inserted row joined against all of the T(B,C)s (∂T(A,B) * T(B,C))</li>
The inserted row joined against all of the T(A,B)s (T(A,B) * ∂T(B,C))</li>
And there's a possibility that the inserted row might join against itself. (∂T(A,B) * ∂T(B,C))</li>
</ol>
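The three parts listed above can be checked mechanically.  The following Python sketch (an illustration with made-up data and helper names) computes the delta of the self-join using the three-term rule and compares it against recomputing the join from scratch.

```python
from collections import Counter

def join(t1, t2):
    """T(A,B) * T(B,C): match on B, multiply multiplicities."""
    out = Counter()
    for (a, b), m1 in t1.items():
        for (b2, c), m2 in t2.items():
            if b == b2:
                out[(a, b, c)] += m1 * m2
    return out

def add(*terms):
    """Bag union: sum multiplicities row by row."""
    out = Counter()
    for term in terms:
        for row, mult in term.items():
            out[row] += mult
    return out

T = Counter({(1, 2): 1, (2, 3): 1})   # hypothetical contents of T
dT = Counter({(2, 2): 1})             # the single inserted row <2,2>

# dQ = (Q1 * dQ2) + (dQ1 * Q2) + (dQ1 * dQ2), with Q1 = Q2 = T
delta = add(join(T, dT), join(dT, T), join(dT, dT))
recomputed = join(add(T, dT), add(T, dT))
```

The row (2, 2, 2) only shows up through the third term: it is the inserted row joining against itself.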
Deltas of Special Tables</h3>
The delta of one of these special tables we use for constants, numerical formulas, or comparisons is always {0}.  This may seem a little unintuitive, but it actually makes sense.  Let's say we have the query </p>
Q = T(A) * {A^2}</pre>
If we insert a row <3> into T, that's going to change T, but it won't change the fact that (3^2 = 9).  The row <3>->9 is always present in the special table {A^2}, regardless of whether or not it's in T.  So we'd get</p>
∂Q = {0} + (∂T(A) * {A^2}) + {0} = ∂T(A) * {A^2}</pre>
Which is precisely correct.</p>
Deltas of AggSums</h3>
The delta of an AggSum is the AggSum of the delta.</p>
∂AggSum([…], Q) = AggSum([…], ∂Q)</pre>
The reasoning behind this is identical to the reasoning behind the delta of bag union (+), since AggSum uses precisely the same mechanics.</p>
In Summary</h3>
I'm going to punt on the delta of a Lift expression for now.  There's a bit of hidden complexity there that I'll return to in two weeks.  For now, the simplified form of the Lift operation that I've described so far always has a delta of {0}.</p>
So the full list of delta rules (minus the lift) for the delta of an update to T is</p>
∂T(A,B,C,…)       = {+/- 1} * (A ^= X) * (B ^= Y) * (C ^= Z) * …</pre>
∂S(…)             = {0}</pre>
∂(Q1 + Q2)        = ∂Q1 + ∂Q2</pre>
∂(Q1 * Q2)        = (Q1 * ∂Q2) + (∂Q1 * Q2) + (∂Q1 * ∂Q2)</pre>
∂({…})            = {0}</pre>
∂(AggSum([…], Q)) = AggSum([…], ∂Q)</pre>
Note two things about this.  First, the delta of a query expressed in AGCA is itself a query in AGCA.  This is a huge deal. Prior to AGCA, delta queries had special funny business that required special logic in the database to process.  With AGCA, you don't need any special query-processing infrastructure to process the delta of a query; you just need support for AGCA itself.</p>
Second, the delta query is simpler.  Roughly speaking, every time you take the delta of a query in AGCA, you remove one table from each join.  In other words, if you have an AGCA query, and you take its delta, and the delta of that, and so forth, eventually you'll end up with {0}.  </p>
These two facts are critical to the Viewlet Transform, which I'll finally get to next week.</p>


AGCA Summary
2012-07-14T00:00:00+00:00
For the past month, I've been talking about AGCA, a language for incremental processing. Next week, I'll go into AGCA's primary application: The viewlet transform. But before I get to the transform, I'm going to do a quick overview of AGCA so that the basics of the language are all in one post.</p>
I also lied a bit... I want to introduce one more (extremely) concept: A simplified form of an operation AGCA calls Lift.</p>
Something that's required for any real query language is the ability to define tuples inline. This is something you might do in SQL as</p>
(SELECT 1 AS A)</pre>
AGCA uses the Lift operation for this purpose:</p>
(A ^= 1)</pre>
Think of the lift operation as variable assignment in a programming language. It creates a single-column relation with a single row in it</p>
___A______#__</pre>
 < 1 > -> 1</pre>
Lift can be combined with the Natural Join to construct arbitrarily wide single-row relations. For example:</p>
(SELECT 1 AS A, 2 AS B, 3 AS C)</pre>
would be expressed in AGCA as</p>
(A ^= 1) * (B ^= 2) * (C ^= 3)</pre>
Lift and Union combine similarly to create multiple row relations.</p>
(A ^= 1) + (A ^= 2) + (A ^= 3)</pre>
We'll need the lift when we talk about incrementality. That said, let's get to the summary.</p>
Relation (Table)</h3>
NAME(COL1, COL2, ...)</pre>
Represents the contents of a base relation (aka a Table). The output is a mapping from every distinct row of the relation to the tuple's multiplicity in the relation. If the same relation appears more than once in the same expression, each occurrence of the relation can have different column names.</p>
Natural Join</h3>
A * B</pre>
The natural join of the relations defined by expressions A and B. Every row in the output of A will be matched with every row in the output of B that has the same values for columns with the same name. If there are no columns with the same name, this is effectively the cartesian cross-product. For every row in A matched with a row in B, a single row with a multiplicity equal to the product of the multiplicities of the two matched rows will be output.</p>
Bag Union</h3>
A + B</pre>
The bag union of the relations defined by expressions A and B. In general, A and B should have the same schemas (although AGCA does support the case where they don't). Every row of the output has a multiplicity equal to the sum of the row's multiplicities in A and B.</p>
Sum/Count Aggregate & Projection</h3>
AggSum([col1, col2, ...], A)</pre>
The Sum aggregate (grouping by col1, col2, ...) of A. This is equivalent in AGCA to projecting away everything except for col1, col2, etc. The output rows have schema col1, col2, ..., and any given row in the output has a multiplicity equal to the sum of the multiplicities of all rows that got projected down to the output row.</p>
Value Expression</h3>
A * { f(var1, var2, ...) }</pre>
When applied to an expression A by the natural join, multiplies each row's multiplicity by an arbitrary single-valued function f over columns var1, var2, ... of A.</p>
Comparison Predicate</h3>
A * { f(var1, var2, ...) θ g(var1, var2, ...) }</pre>
When applied to an expression A by the natural join, filters out rows that do not satisfy the predicate (f θ g) where θ is a comparison operation, and f and g are arbitrary single-valued functions over columns var1, var2, ... of A.</p>
Lift</h3>
(var ^= value)</pre>
Outputs a single row with column named var, with the indicated value, and a multiplicity of 1.</p>
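To make the summary concrete, here is a toy Python model of the operators that map rows to multiplicities (a sketch for illustration only; it is not DBToaster's implementation, and the function names are made up).  A relation is a dictionary from rows to multiplicities, and a row is a sorted tuple of (column, value) pairs so that it can be used as a dictionary key.  The value expression and comparison predicate are left out because they are infinite tables and need a different representation.

```python
from collections import defaultdict

def row(**cols):
    """A row is a sorted tuple of (column, value) pairs."""
    return tuple(sorted(cols.items()))

def lift(var, value):
    """(var ^= value): one row, one column, multiplicity 1."""
    return {row(**{var: value}): 1}

def union(a, b):
    """A + B: add multiplicities row by row."""
    out = defaultdict(int)
    for rel in (a, b):
        for r, mult in rel.items():
            out[r] += mult
    return dict(out)

def join(a, b):
    """A * B: match rows that agree on shared columns; multiply multiplicities."""
    out = defaultdict(int)
    for ra, ma in a.items():
        for rb, mb in b.items():
            da, db = dict(ra), dict(rb)
            if all(da[c] == db[c] for c in da.keys() & db.keys()):
                out[row(**{**da, **db})] += ma * mb
    return dict(out)

def agg_sum(group_by, a):
    """AggSum([cols], A): project onto cols, summing multiplicities."""
    out = defaultdict(int)
    for r, mult in a.items():
        d = dict(r)
        out[row(**{c: d[c] for c in group_by})] += mult
    return dict(out)

# (A ^= 1) * (B ^= 2) * (C ^= 3): one wide row
wide = join(join(lift("A", 1), lift("B", 2)), lift("C", 3))
# (A ^= 1) + (A ^= 2) + (A ^= 3): three single-column rows
tall = union(union(lift("A", 1), lift("A", 2)), lift("A", 3))
# AggSum([], ...): the count of rows in tall, stored under the empty row
count = agg_sum([], tall)
```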
 </p>


DBToaster Released
2012-07-11T00:00:00+00:00
Go get your copy at http://www.dbtoaster.org</a></p>
What are you waiting for?  Go!</p>
Seriously, stop reading this.</p>


AGCA, The language of change (Part 4: The table special)
2012-07-08T00:00:00+00:00
For several weeks now, I've been describing AGCA, a language for incremental processing.  </p>
In part 1</a>, I covered the core idea behind AGCA: Instead of storing data as a list of rows, AGCA keeps only one copy of each unique row, and tags it with the number of times it appears.  In other words, data in AGCA is a function that maps each row to the number of times that it appears in the data (the row's multiplicity).  </p>
In part 2</a>, I showed you how AGCA handles unions (it sums multiplicities, so we call it +) and joins (it multiplies multiplicities, so we call it *).</p>
Most recently, in part 3</a>, I showed you how AGCA handles the COUNT(*) aggregate, and how COUNT(*) and projection are actually the same thing in AGCA.  After columns are projected away, the results might end up containing duplicate rows -- so the multiplicities of these rows get added together to produce the final result.</p>
This week, I plan to talk about the SUM() aggregate, and selection (filtering) in AGCA.  Let's start with SUM().  An interesting thing to note about COUNT(*) is that it's a special instance of SUM().  Specifically</p>
SELECT COUNT(*) FROM R;</pre>
is completely equivalent to </p>
SELECT SUM(1) FROM R;</pre>
In other words, for every row that appears in the result, we add the numerical value '1' to the final result.  Straightforward enough, right?  But what if we wanted to add a more complex/interesting value to the result?  Say we wanted to compute the following SQL query</p>
SELECT SUM(2) FROM R;</pre>
In other words, for every copy of every row that appears in the result, we add the numerical value '2' to the final result.  Just to recap, the AGCA for the COUNT(*) of R is:</p>
AggSum([], R(A,B))</pre>
This query takes the multiplicity of every row of R, and sums them, giving us SUM(1) of R.  In order to compute SUM(2), we'd have to multiply those multiplicities by 2.</p>
As I pointed out in week 2, joins multiply multiplicities.  So what we need is a special kind of table. We need a table with one row with a multiplicity of '2', which matches with every row of R.  We need a table that looks like this:</p>
________#__</pre>
 < > -> 2</pre>
Because there are no columns in this special table, it joins with every row in any table/query you can come up with.  We can call this special table {2} when writing down AGCA queries:</p>
R(A,B) * {2}</pre>
Which says "Give me two copies of every row of R".  Incidentally, as I mentioned before, AGCA forms an algebraic structure called a ring.  One of the properties of a ring is distributivity.  That is,</p>
R(A,B) * {2} = R(A,B) * ({1}+{1}) = R(A,B) + R(A,B)</pre>
Or in other words, two copies of everything in R is identical to the union of R with itself (which is, of course, true).  So in summary, the AGCA query to compute the SUM(2) of R is</p>
AggSum([], R(A,B) * {2})</pre>
What's cool about this is that one of these special tables can be created for any</strong> number that we want to sum up.  AGCA is not just limited to positive integers, so there's nothing wrong with the AGCA query:</p>
R(A,B) * {-1}</pre>
or even the query</p>
R(A,B) * {3.14159265}</pre>
Let that sink in for a moment, because if this doesn't weird you out, you're probably missing something</strong>.  I've been talking about multiplicities as the number of times a row occurs in a data set.  AGCA plays multiplicities a lot more fast and loose.  An AGCA query result can include fractions of rows, or even rows with negative multiplicities -- a sort of anti-row:</p>
R(A,B) + R(A,B) * {-1} = R(A,B) * ({1} + {-1}) = R(A,B) * {0}</pre>
Negative multiplicities will turn out to be super-useful for incremental maintenance, as you'll see.</p>
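A quick Python sketch of these constant special tables (for illustration; the helper names and the contents of R are made up).  Joining with {k} just multiplies every multiplicity by k, which makes it easy to check the distributivity claim and the anti-row cancellation.

```python
def scale(rel, k):
    """rel * {k}: multiply every row's multiplicity by the constant k."""
    return {r: mult * k for r, mult in rel.items()}

def union(a, b):
    """a + b: add multiplicities row by row."""
    out = dict(a)
    for r, mult in b.items():
        out[r] = out.get(r, 0) + mult
    return out

# R maps each (A, B) row to its multiplicity.
R = {(1, 1): 2, (2, 2): 1, (2, 5): 3}

doubled = scale(R, 2)                 # R(A,B) * {2}
cancelled = union(R, scale(R, -1))    # R(A,B) + R(A,B) * {-1}
sum_2 = sum(doubled.values())         # AggSum([], R(A,B) * {2})
```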
Moving on, the ability to multiply multiplicities by constants is useful, but what we really need is the ability to compute sums with variables in them.  For example:</p>
SELECT SUM(A) FROM R;</pre>
In other words, for every copy of every row that appears in the result, we want to take the value of column 'A' and add it to the final result.  AGCA actually has a special table for this as well.  If we call that table {A}, the AGCA for SUM(A) of R is</p>
AggSum([], R(A,B) * {A})</pre>
So what does {A} look like?  Well, let's say we have some example data in R:</p>
___A__B______#__</pre>
 < 1, 1 > -> 2</pre>
 < 2, 2 > -> 1</pre>
 < 2, 5 > -> 3</pre>
We want to turn this into:</p>
___A__B______________#__</pre>
 < 1, 1 > -> 2 * 1 = 2</pre>
 < 2, 2 > -> 1 * 2 = 2</pre>
 < 2, 5 > -> 3 * 2 = 6</pre>
{A}, when joined with R(A,B) should multiply the rows where A is 1 by 1, where A is 2 by 2, and so forth…  In other words, we want a table that can be joined with R(A,B) on A.  A table with one row for every possible value of A in its 'A' column, and the same value as each row's multiplicity.</p>
___A______#__</pre>
 < 1 > -> 1</pre>
 < 2 > -> 2</pre>
 < 3 > -> 3</pre>
     ...</pre>
Unlike the special table for constants, the special table for variables actually has an infinite number of rows in it.  As I've mentioned before, technically, all AGCA queries produce an infinite number of results, but only a (relatively) smaller number of 'interesting' ones (in technical terms, the query results have 'finite support').  That's not the case with this kind of special table.  There are an infinite number of rows. So a query like:</p>
{A}</pre>
Would produce an infinite number of rows.  Of course, such a query doesn't really make sense either.  You're always going to join this kind of special table with a table that doesn't produce an infinite number of (interesting) rows, and that has 'A' as a column.  When you do that, all but a few of the rows of {A} will get zeroed out.  Try it yourself on the SUM(A) example if you're not convinced.</p>
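If you'd rather not do the SUM(A) example by hand, here is a small Python sketch of it.  Since {A} has infinitely many rows, the sketch represents it as a function from a row to a multiplicity instead of a stored table; that representation is a choice made for this illustration.

```python
def special_a(row):
    """{A}: the multiplicity of any row is the value in its A column."""
    a, b = row
    return a

# The example data for R from above: (A, B) -> multiplicity.
R = {(1, 1): 2, (2, 2): 1, (2, 5): 3}

# R(A,B) * {A}: only rows present in R survive the join.
joined = {row: mult * special_a(row) for row, mult in R.items()}

# AggSum([], R(A,B) * {A}), i.e. SELECT SUM(A) FROM R
sum_a = sum(joined.values())
```

Only the three rows of R ever get looked up in {A}; every other row of the special table is multiplied by a multiplicity of 0 and drops out.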
By the way, what I am describing is identical to a concept in programming languages/logic called bound and free variables (also known as safe/unsafe variables in query processing).  'A' is a free variable in {A}, and a bound variable in R(A,B).  Just like in SQL, a query with any free/unsafe variables in it has to have an external assignment of values to those variables before it can be evaluated.  More on that later.</p>
On a related note, AGCA treats columns more as variables than anything else.  For this reason, I'll be using the terms 'column' and 'variable' interchangeably from now on.  </p>
One last thing before I move on to selection.  It's possible for AGCA to represent even more complex special tables.  For example, say I wanted to compute the following SQL query</p>
SELECT SUM(exp(2, A) + 2*B) FROM R;</pre>
Any computable expression can be turned into one of these special tables, regardless of how many variables appear in it.  To compute the SQL query above I could use</p>
AggSum([], R(A,B) * {exp(2, A) + 2 * B})</pre>
Which is equivalent to</p>
AggSum([], R(A,B) * ({exp(2, A)} + {2 * B})) = AggSum([], R(A,B) * ({exp(2, A)} + {2} * {B}))</pre>
Regardless, every variable that appears in the formula defining a special table will be one of the special table's columns.  </p>
AGCA uses the very same idea to implement selection (filtering).  Let's say we wanted to compute:</p>
SELECT COUNT(*) FROM R WHERE A < B;</pre>
In other words, we want to filter out some rows of R (the ones where A >= B), and keep others (where A < B).  Again, AGCA deals entirely in multiplicities.  Filtering out a row is equivalent to setting its multiplicity to 0.  Keeping a row means leaving its multiplicity unchanged.  </p>
___A__B______________#__</pre>
 < 1, 1 > -> 2 * 0 = 0</pre>
 < 2, 2 > -> 1 * 0 = 0</pre>
 < 2, 5 > -> 3 * 1 = 3</pre>
Just like a special table exists for any expression, any boolean predicate can be turned into a special table where each row has a multiplicity of either 0 or 1.  For example, for A < B:</p>
___A__B______#__</pre>
 < 1, 1 > -> 0</pre>
 < 1, 2 > -> 1</pre>
 < 1, 3 > -> 1</pre>
    ...</pre>
 < 2, 1 > -> 0</pre>
 < 2, 2 > -> 0</pre>
 < 2, 3 > -> 1</pre>
    ...</pre>
And there you have it.   Computing the filtered count of R is as simple as</p>
AggSum([], R(A,B) * {A < B})</pre>
Note that this makes the special tables {1} and {0} equivalent to the booleans TRUE and FALSE respectively.  Also note that while * is equivalent to a boolean AND, + is not equivalent to a boolean OR ({1} + {1} = {2}).  More on how we handle this later.</p>
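The filtered count can be sketched in Python by treating the predicate as a function that returns a multiplicity of 0 or 1. The table contents reuse the three sample rows shown earlier; the helper name `lt` is an assumption made for the example.

```python
# R(A, B) as a map from rows to multiplicities (the three sample rows above).
R = {(1, 1): 2, (2, 2): 1, (2, 5): 3}

def lt(a, b):
    """The special table {A < B}: multiplicity 1 where the predicate holds, else 0."""
    return 1 if a < b else 0

# R(A,B) * {A < B}: every row survives, but failing rows drop to multiplicity 0.
filtered = {row: mult * lt(*row) for row, mult in R.items()}

# AggSum([], ...): add up whatever multiplicities remain.
count = sum(filtered.values())
print(filtered, count)

# Product behaves like AND on 0/1 values; sum does not behave like OR.
both_true_product = 1 * 1
both_true_sum = 1 + 1
```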


</p>
With that, I've covered all of the basics of AGCA.  Next week, a quick reference summary of everything that I've covered so far, and the week after that, I'll dive into the viewlet transform itself.  </p>
By the by, I'm ignoring two features of AGCA for now, one needed to support nested aggregates, and one needed to support existential quantification (as well as certain kinds of nested aggregates).  I'll get to these once I've covered the viewlet transform and can better describe the challenges that nested aggregates create.</p>
And to anyone who reads this before tomorrow: Keep an eye on http://www.dbtoaster.org for the official DBToaster release on July 9.</p>


AGCA, The language of change (Part 3: The Sum Project)
2012-06-30T00:00:00+00:00
For a few weeks now, I've been talking about the AGCA query language, a language for incremental computation.  If you haven't already done so, you should probably read Part 1</a> and Part 2</a> before continuing on with this post.  </p>
Just to recap, in AGCA, queries are written down as algebraic formulas.  The most basic term in the language is the table (Relations, if you want to be fancy).  Multiplication is the natural join, and addition is bag union.  </p>
And of course, the most important thing about AGCA is that Everything is Multiplicities.  Unlike SQL, where the result of a query is simply a list of output rows, a query result in AGCA is more like a lookup table.  Each row of the output is associated with a multiplicity</em> (loosely speaking, the number of times that the row occurs in the result). Because the rows are unique, you can use the row to look up its multiplicity in the query result.  To use the technical terms, query outputs are Maps (a.k.a., Hashes, Dictionaries, HashMaps, etc…), each row is a key in the map, and multiplicities are the mapped values.  </p>
Note by the way, that this doesn't stop us from talking about empty rows.  Look at the following SQL query:</p>
SELECT </pre>
FROM ACTORS;</pre>
Or equivalently, in English, "give me an empty row for every actor."  There's a good chance your favorite SQL system won't approve of this query.  You can also certainly make a good argument that this query isn't especially useful.  That said, the query does have a meaning.  Give me one row (with nothing in it) for every actor.  </p>
Recall one more thing from last week: How addition/union works in AGCA.  "Duplicate" rows on either side are merged, and their multiplicities are added together.  If a row occurs twice on one side of the union, and three times on the other, then the final unioned output has five copies of the row (or as AGCA would put it, the row has a multiplicity of 5 in the output).</p>
Where am I going with this?  Well, empty rows are all identical.  So, if you have a result that contains only empty rows, the result is guaranteed to have exactly one row (or zero rows, that is, one row with multiplicity 0).  Let's see an example on this table of actor first names.  </p>
__FIRSTNAME______#__</span></pre>
< Steve     > -> 2</pre>
< Jim       > -> 1</pre>
< John      > -> 3</pre>
Let's say we get rid of the FIRSTNAME column (project it away, to use the technical term).  We end up with</p>
________#__</span></pre>
<  > -> 2</pre>
<  > -> 1</pre>
<  > -> 3</pre>
But that's wrong.  Every row is supposed to be unique.  All those duplicate empty rows need to be merged together.  So, just like we merge together rows when computing a UNION, we add up the multiplicities of these empty rows.  </p>
________#__</span></pre>
<  > -> 6</pre>
What exactly just happened here?  Well, by projecting away the FIRSTNAME column, we've essentially computed the COUNT(*) of the input.  Recall that when I first described AGCA, I mentioned that every query has an implicit COUNT(*) around it.  Instead of</p>
SELECT FROM ACTORS;</pre>
what AGCA actually computes is</p>
SELECT COUNT(*) FROM (SELECT FROM ACTORS);  </pre>
or, put more simply</p>
SELECT COUNT(*) FROM ACTORS;</pre>
The same idea can actually be taken a bit further.  Let's say you have the following table:</p>
__FIRSTNAME__LASTNAME_________#__</span></pre>
< Steve ,    Carell      > -> 1</pre>
< Steve ,    Coogan      > -> 1</pre>
< Jim   ,    Carrey      > -> 1</pre>
< John  ,    Depp        > -> 1</pre>
< John  ,    Galecki     > -> 1</pre>
< John  ,    Rhys-Davies > -> 1</pre>
What happens if we project away just the LASTNAME column?  We get 2 Steves, 1 Jim, and 3 Johns, exactly the same table that we started with.  In other words, by projecting away just the LASTNAME column, we end up computing a group-by COUNT(*) aggregate:</p>
SELECT FIRSTNAME, COUNT(*)</pre>
FROM ACTORS </pre>
GROUP BY FIRSTNAME;</pre>
This technique of using projection to compute the COUNT(*) aggregate also lets us compute group-by aggregates.  Projection and the COUNT(*) aggregate are the same thing in AGCA</strong>.  AGCA uses a special operator called AggSum to represent this operation.  For example, the above group-by COUNT(*) aggregate is written as:</p>
AggSum([FIRSTNAME], ACTORS(FIRSTNAME,LASTNAME))</pre>
Or in general:</p>
AggSum([{group by var 1}, {group by var 2}, …], {aggregated AGCA expression})</pre>
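To make the operator concrete, here is a small Python sketch of AggSum as "project onto the group-by columns and add up the multiplicities of rows that collide." The function name `agg_sum` and its argument order are assumptions made for the example, not an actual AGCA or DBToaster API.

```python
from collections import Counter

# ACTORS(FIRSTNAME, LASTNAME): each row maps to its multiplicity.
ACTORS = Counter({
    ("Steve", "Carell"): 1,
    ("Steve", "Coogan"): 1,
    ("Jim", "Carrey"): 1,
    ("John", "Depp"): 1,
    ("John", "Galecki"): 1,
    ("John", "Rhys-Davies"): 1,
})

def agg_sum(group_by, columns, table):
    """Keep only the group_by columns; merge identical rows by summing multiplicities."""
    keep = [columns.index(name) for name in group_by]
    result = Counter()
    for row, mult in table.items():
        result[tuple(row[i] for i in keep)] += mult
    return result

cols = ["FIRSTNAME", "LASTNAME"]
by_first_name = agg_sum(["FIRSTNAME"], cols, ACTORS)  # group-by COUNT(*)
total = agg_sum([], cols, ACTORS)                     # plain COUNT(*)
print(by_first_name, total)
```

With an empty group-by list every row projects to the empty tuple, so the result is a single empty row whose multiplicity is the overall count.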
And there you have it: How AGCA handles projection and the COUNT(*) aggregate.  Next week, the SUM() aggregate, and conditions/selection in AGCA.</p>


AGCA, The language of change (Part 2: Ringing in Change)
2012-06-24T00:00:00+00:00
Last week, I started talking about the AGCA (Aggregate Calculus) query language.  AGCA is a query language that makes an explicit distinction between the parts of data that can be easily managed incrementally, and the parts that cannot.  This makes it incredibly useful for incremental computation techniques like the viewlet transform.</p>
At the heart of AGCA is a simple mantra that I harped on extensively last week: Everything is multiplicities.  If I write down a query, such as:</p>
CUSTOMER(FIRSTNAME)</pre>
What I will get back is a table with one row for each distinct</em> customer first name.  </p>
__FIRSTNAME______#__</span></pre>
< John      > -> 8</pre>
< Alphonse  > -> 1</pre>
The table has two columns: the FIRSTNAME column, and a count, or multiplicity for each row.  In the example above, I have 8 customers named John, and 1 named Alphonse.  </p>
There's actually another way of looking at CUSTOMER(FIRSTNAME).  You can think of it as a function (for those who are interested, it's actually something called a monad</a>).  If I give it a value for FIRSTNAME, it gives me back the number of customers who have that particular first name.  </p>
And this leads me to this week's topic: JOINs (definition here</a>) and UNIONs (specifically what's known as a "Bag Union", defined briefly here</a>). As we'll see, there's a nice way of looking at these two common database operations.</p>
For those unfamiliar with the concept, a JOIN between two tables matches up rows of each table based on some rule.  For example, I might have two tables with information about bicycles available for purchase: FRAMES(COLOR, TIRESIZE) and TIRES(TIRESIZE, TIRETYPE).  Here's some sample data (along with multiplicities… ignore those for now):</p>
__COLOR____TIRESIZE______#__</span></pre>
< Blue ,   26" >    ->   1</pre>
< Red ,    26" >    ->   3 </pre>
< Red ,    20" >    ->   1</pre>
< Black,   20" >    ->   2</pre>
__TIRESIZE__TIRETYPE______#__</span></pre>
< 26" ,     Mountain > -> 1</pre>
< 26" ,     Road     > -> 1</pre>
< 20" ,     Road     > -> 2</pre>
Let's say I'm interested in all the possible options I have for a new bike.  In this case, I need to pick out a frame and a tire type.  Clearly, I want to make sure that the tire I get is appropriate for the frame: The TIRESIZE of the tire has to match up with the TIRESIZE of the bike I get.  So, to enumerate all the possible options, I can compute a JOIN between these two tables on the condition that the TIRESIZEs are identical (again, ignore the multiplicities for now):</p>
__COLOR____TIRESIZE__TIRETYPE______#__</span></pre>
< Blue ,   26" ,     Mountain > -> 1</pre>
< Blue ,   26" ,     Road     > -> 1</pre>
< Red ,    26" ,     Mountain > -> 3</pre>
< Red ,    26" ,     Road     > -> 3</pre>
< Red ,    20" ,     Road     > -> 2</pre>
< Black ,  20" ,     Road     > -> 4</pre>
This is actually a special kind of join (condition) called a natural join</em>.  A natural join matches up rows based on columns that share the same name (in this case, the TIRESIZE column).  As we'll see in a bit, AGCA uses only natural joins.  But how exactly do joins work in AGCA?  </p>
Well, let's start by looking at those multiplicities in our input data.  There is only a single type of Blue bike with 26" tires available for purchase, but two different types of 26" tires (Mountain and Road).  Together these produce 2 different options for purchase (the first 2 rows of the output).  </p>
The multiplicity of the Red, 26" bike row is 3.  What this means in practical terms is that there are 3 different types of Red, 26" bikes available.  If I wanted one of those, I would actually have 6 different options (2 types of tire as before, and now 3 different types of bike).  </p>
The same thing happens with the 20" bikes.  There are two different types of 20" tire (this time, both are road tires).  There are also 2 different types of Black, 20" bikes.  So, we have 4 different purchase options.</p>
You've probably seen the pattern by now.  When we JOIN two rows together, the multiplicities of the JOINed rows multiply</strong>.  Because of this, JOINs in AGCA are written down as products.  For example, we'd write down the above join as:</p>
FRAMES(COLOR, TIRESIZE) * TIRES(TIRESIZE, TIRETYPE)</pre>
Note that the join condition is not explicitly specified in this query.  This is because all joins in AGCA are natural joins.  In this case, TIRESIZE appears in both FRAMES and TIRES, so we know that the query is only asking to match up the TIRESIZEs of both inputs.</p>
Remember that I mentioned earlier that you can think of a table as a function (monad).  This holds for any AGCA expression.  The example join defines a function (monad) with three parameters: COLOR, TIRESIZE, and TIRETYPE.  Let's say I wanted to evaluate it with the parameters (Black, 20", Road):</p>
FRAMES(Black, 20") * TIRES(20", Road) = 2 * 2 = 4</pre>
Cool, right?</p>
By the way, if there are no overlapping column names, then the natural join is just a cartesian cross product</a>.</p>
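The bike example can be reproduced with a short Python sketch that stores each table as a map from rows to multiplicities. The function name `natural_join` is an assumption made for the example, and the join column is hard-coded to TIRESIZE rather than discovered from column names.

```python
# FRAMES(COLOR, TIRESIZE) and TIRES(TIRESIZE, TIRETYPE) from the tables above.
FRAMES = {("Blue", '26"'): 1, ("Red", '26"'): 3, ("Red", '20"'): 1, ("Black", '20"'): 2}
TIRES = {('26"', "Mountain"): 1, ('26"', "Road"): 1, ('20"', "Road"): 2}

def natural_join(frames, tires):
    """FRAMES * TIRES: pair up rows with equal TIRESIZE and multiply multiplicities."""
    result = {}
    for (color, frame_size), frame_mult in frames.items():
        for (tire_size, tire_type), tire_mult in tires.items():
            if frame_size == tire_size:
                result[(color, frame_size, tire_type)] = frame_mult * tire_mult
    return result

options = natural_join(FRAMES, TIRES)
for row, mult in options.items():
    print(row, "->", mult)

# Looking up one set of parameters, as in FRAMES(Black, 20") * TIRES(20", Road)
black_road = options[("Black", '20"', "Road")]
```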
Ok, so, what about UNIONs?  Well, let's say we have tables representing two different purveyors of coffee beans, ORENS(ROAST) and CTB(ROAST) (shown below, respectively):</p>
__ROAST_________#__</span></pre>
< Espresso > -> 2</pre>
< Light    > -> 1</pre>
__ROAST______#__</span></pre>
< Medium > -> 2</pre>
< Light  > -> 3</pre>
If I wanted to know all of my coffee-purchasing options, I could compute the union of these two tables.  Recall that we're working with multisets/bags, so to keep things simple let's assume that there's no overlap between the offerings of both stores.  If that's the case, then we can just merge the two tables together, adding together the multiplicities of rows that appear in both inputs:</p>
__ROAST_________#__</span></pre>
< Espresso > -> 2</pre>
< Medium   > -> 2</pre>
< Light    > -> 4</pre>
In short, UNION adds row multiplicities together</strong>. Just like AGCA uses product to represent joins, sum represents unions.  So our coffee shop query (let's call it Q) is written down as</p>
Q(ROAST) := ORENS(ROAST) + CTB(ROAST)</pre>
Again, we can look at this query as a function.  For example:</p>
Q(Light) = ORENS(Light) + CTB(Light) = 1 + 3 = 4</pre>
There are some caveats here.  First off, the column names of the things being UNIONed together typically have to be the same.  ORENS(ROAST) + FRAMES(COLOR, TIRESIZE) doesn't really make much sense.  I'll return to this assumption before long, but be aware of it for now.  </p>
Second, as I briefly mentioned last week, tables (and expressions in general) contain an infinite number of rows -- although only a small number have non-zero multiplicities.  This is something to keep in mind when thinking about AGCA expressions as functions.  For example:</p>
Q(Espresso) = ORENS(Espresso) + CTB(Espresso) = 2 + 0 = 2</pre>
Or, going back to our bike options example:</p>
Q(Black, 20", Mountain) = FRAMES(Black, 20") * TIRES(20", Mountain) = 2 * 0 = 0</pre>
Note how the zeroes propagate through joins.  This makes sense if you think about it.  JOINing against a row that doesn't exist doesn't produce any output rows (ignoring outer joins for the moment).</p>
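The union and the zero lookups above can be sketched with Python's `collections.Counter`, which adds counts key by key and returns 0 for keys it has never seen. This is only an analogy for the examples here: Counter addition discards entries whose total is not positive, which is harmless for these tables but is not how a ring would behave with negative multiplicities.

```python
from collections import Counter

# ORENS(ROAST) and CTB(ROAST) from the tables above.
ORENS = Counter({("Espresso",): 2, ("Light",): 1})
CTB = Counter({("Medium",): 2, ("Light",): 3})

# Q(ROAST) := ORENS(ROAST) + CTB(ROAST)
Q = ORENS + CTB

light = Q[("Light",)]        # appears in both inputs: 1 + 3
espresso = Q[("Espresso",)]  # appears only in ORENS: 2 + 0
decaf = Q[("Decaf",)]        # appears in neither input: 0 + 0

# Zeroes propagate through products: a missing row contributes multiplicity 0.
FRAMES = Counter({("Black", '20"'): 2})
TIRES = Counter({('20"', "Road"): 2})
black_mountain = FRAMES[("Black", '20"')] * TIRES[('20"', "Mountain")]

print(light, espresso, decaf, black_mountain)
```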
For those interested in algebraic structures, AGCA actually forms something called a Ring, with product and sum defined as above.  I'll talk about constants in a few weeks, but you can think of the "zero" and "one" values of the ring as special tables ZERO() and ONE() with no columns, and only a single row each: <> -> 0</span>, and <> -> 1</span>, respectively.  The set underlying the ring is a specific incarnation of the monads that I've been alluding to.  </p>
That's it for now.  Next week: Two typically unrelated operations, brought together by AGCA in a somewhat surprising way: PROJECTion and the COUNT aggregate.</p>


AGCA, The language of change (Part 1: Everything is Multiplicities)
2012-06-16T00:00:00+00:00
Before I can go into detail on the viewlet transform, I first need to talk about a language called the AGgregate CAlculus.  Over the coming weeks, I'll try to give a high-level overview of the language from a practical perspective, and hopefully give you some insight into why it is important.  </p>
Although the name "Aggregate Calculus" might sound imposing, AGCA is actually very close to SQL.  Its key feature (one of the reasons that it is crucial for the viewlet transform) is that it separates values that can be maintained incrementally from those that cannot.  For reasons that will become clear, we'll call incrementally maintainable values multiplicities</strong>, and non-incrementally maintainable values variables</strong>.</p>
The core mantra… the one thing that everyone who I've ever tried to teach AGCA to has struggled to wrap their head around (at first, anyway), is that Everything is Multiplicities</strong>.  Remember this phrase.  Everything is Multiplicities</strong>.  </p>
I'm being a little overly general.  If I wanted to be precise, I'd say that "Everything is a mapping from tuples of variables to multiplicities."  That's a bit of a mouthful (and I like short catch phrases), so let's stick with Everything is Multiplicities</strong>.</p>
And just to be sure you're following along: Everything is Multiplicities</strong>.</p>
So, how do we write queries in AGCA?  Well, to start, we need some way to refer to our input data.  In spite of trends in the corporate world, AGCA works with relational data.  So, all of our inputs will be tables (or if you want to get fancy, "relations").  If we want to refer to a table, we write down its name and give each of its columns a name.  For example, if we write down:</p>
R(A,B,…)</pre>
We mean "all of the rows (or tuples) of R, and we'll name the first column of R 'A', name the second column 'B', and so forth…"  This is pretty much the simplest possible SQL query you can think of:</p>
SELECT A, B, … FROM R;</pre>
Simple, right?  Well, ok, there's actually a little twist.  See, in SQL, you're allowed to have several identical rows in a table (by default anyway, keys have to be added explicitly).  To use the technical term, SQL works with what are called multisets (also known as "bags").  So, we're going to do something clever in AGCA.  Let's say you have a table of your customers' first names.  If you write the AGCA expression:</p>
CUSTOMER(FIRSTNAME)</pre>
You're not</em> going to get one row for every customer.  Instead, you're going to get something like this:</p>
__FIRSTNAME______#__</span></pre>
< John      > -> 8</pre>
< Joe       > -> 3</pre>
< Steve     > -> 5</pre>
< Alphonse  > -> 1</pre>
That is to say, you'll get one output row for every distinct</em> row in your table, together with the number of times that this row occurs in your table (that is, you get its multiplicity</strong>).  In the above example, you have 8 customers named John, but only 1 customer named Alphonse.  This is the core idea of AGCA: Everything is multiplicities.  If it helps, you can think of every AGCA expression as having an implicit group-by COUNT(*) aggregate around it.  </p>
CUSTOMER(FIRSTNAME)</pre>
is effectively the SQL query:</p>
SELECT</strong> *, COUNT(*) FROM</strong> CUSTOMER GROUP BY</strong> CUSTOMER.*</pre>
By the way, there's a little bit of weirdness here.  AGCA doesn't have precisely the same semantics for COUNT as SQL.  See, every AGCA expression describes an infinite number of rows.  Even CUSTOMER(FIRSTNAME) effectively has an infinite number of rows: There aren't any customers named Zardoz, but the expression does technically</em> contain the row < Zardoz > -> 0.  That said, in general, we're only interested in a few of those rows (the ones that aren't 0).  In the interest of keeping things intuitive, I'm going to sweep this issue under the rug for now and come back to it in a future post.</p>
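If it helps to see this in code, a Python `collections.Counter` is a reasonable stand-in for the idea: it maps each distinct row to a count, and it reports 0 for any row it has never seen. The list of customers below is made up to match the example table, and this is only an analogy, not how AGCA is implemented.

```python
from collections import Counter

# A bag of customer first names, with duplicates, as SQL would store it.
customers = ["John"] * 8 + ["Joe"] * 3 + ["Steve"] * 5 + ["Alphonse"]

# CUSTOMER(FIRSTNAME): one entry per distinct row, mapped to its multiplicity.
CUSTOMER = Counter((name,) for name in customers)

john = CUSTOMER[("John",)]
alphonse = CUSTOMER[("Alphonse",)]
zardoz = CUSTOMER[("Zardoz",)]  # never inserted, so its multiplicity is 0

print(john, alphonse, zardoz, len(CUSTOMER))
```

Only the four rows with non-zero multiplicity are actually stored, even though a lookup succeeds for any name.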
That's it for now.  Tune in next week for: JOINs and UNIONs in AGCA.  </p>
(If you want to learn more now, and have a good understanding of algebraic structures, have a look at the PODS2010 paper on AGCA: "Incremental Query Evaluation in a Ring of Databases</a>")</p>


DBToaster and the Viewlet Transform
2012-06-10T00:00:00+00:00
A big issue these days is large, rapidly changing data.  Users often need to keep a close eye on this data.  Algorithmic trading, scientific computing, network monitoring, and even things like data warehousing are all examples of areas that have lots of data, and that need to react very quickly to certain (potentially complex) conditions in that data.</p>
The overarching goal of the DBToaster project is to produce a tool chain capable of effectively performing these monitoring tasks.  Our latest paper (to be presented at VLDB 2012 this August) discusses one of the core ideas behind our approach: exploiting incrementality (through something we call the Viewlet Transform).</p>
To get the basic idea across, let me use a common task as an analogy: the monthly report.  If you're in a relatively stable business, the content of the report will probably be mostly the same from month to month.  Instead of rewriting the report from month-to-month, you might just take last month's report and update it with any new facts, figures, and other changes in the past month (in fact, your boss might be interested in only the changes… but that's getting a little off-track). This still requires a lot of work.  If the report has a lot of figures, you won't re-create the figures from scratch either.  Instead, you'll probably have a spreadsheet (also from last month) that you can just punch the new numbers into.</p>
Loosely speaking, you have a repeating task (writing your report) that is easier to perform if you only have to figure out what changed (the figures) since the last time you did it.  This idea has been around for a long time in databases (since the 80s at least) in the form of something called Incremental View Maintenance (or IVM for short).  Let's say you have a query that you want to repeatedly evaluate.  If you're smart, you'll just evaluate the query result once and save the result for the next time you need the answer.  </p>
But of course, the data you're querying might change in the meantime.  The core idea behind IVM is that you can evaluate what's known as a Delta Query, which is a simpler form of the original query.  Instead of giving you the full query answer, it looks at the changes to the input data, and tells you what changes in the query results.  This delta query is usually simpler and faster to evaluate than the original query, but can still be pretty slow (especially if the original query is a biggie), making IVM a poor choice for many realtime applications.</p>
Let's go back to our example of the monthly report that you update each month.  Even though this is faster than creating the report from scratch, you still have to update all the figures as well.  Of course, you don't create the figures from scratch either.  If you're smart, you'll just edit last month's spreadsheets.  The viewlet transform is based on exactly the same idea.  The delta query is a query that you evaluate over and over and over again.  We figure, why not just evaluate it once and then just update it when the data changes?</p>
So now you have your original query, and some delta queries.  Instead of re-evaluating the delta queries every time your data changes, you evaluate the delta query once and store the result.  Now, whenever your data changes, you only need to read the delta query result out of storage, instead of doing any expensive computations.  Of course, now you need to keep your delta queries up to date as well.  You do this by using delta queries of each of the delta queries (A "second-order delta").  This process continues (giving you third, fourth, fifth, etc… order deltas), with each delta query becoming progressively simpler than the last.  Eventually you reach a point where the delta query is incredibly simple, and you stop.  </p>
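To give a feel for the idea, here is a hand-written Python sketch for one very small query: the number of rows in the join of two single-column tables R(A) and S(A). The class and method names are assumptions made for the example, and this is not code that DBToaster generates. The delta of the count when a row is inserted into R is "how many S rows share that value," so that delta is itself stored and maintained rather than recomputed.

```python
from collections import defaultdict

class JoinCount:
    """Maintains q = COUNT(*) of R(A) joined with S(A) under single-row inserts."""

    def __init__(self):
        self.q = 0                       # the original query's stored result
        self.r_count = defaultdict(int)  # stored delta of q for an insert into S
        self.s_count = defaultdict(int)  # stored delta of q for an insert into R

    def insert_r(self, a):
        self.q += self.s_count[a]  # first-order delta: just a lookup
        self.r_count[a] += 1       # second-order delta: a constant, so we stop here

    def insert_s(self, a):
        self.q += self.r_count[a]
        self.s_count[a] += 1

view = JoinCount()
for a in [1, 1, 2]:
    view.insert_r(a)
for a in [1, 2, 2, 3]:
    view.insert_s(a)

print(view.q)
```

Each insert costs a lookup and two additions, no matter how large R and S have grown, and the second-order deltas are constants, which is the point where the recursion stops.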
You might have a lot of these queries sitting around and needing to be kept up-to-date, but each of them reduces the cost of maintaining another query by enough that it's incredibly worth it.  Combined with other techniques, we've gotten a typical performance improvement of 3-4 orders of magnitude over several commercial data management systems.  </p>
And that's the basic idea of the viewlet transform.  More soon.</p>
-Oliver</p>


Hello world!
2012-05-29T00:00:00+00:00
Welcome.</p>
My name is Oliver Kennedy, and starting this Fall, I'll be junior faculty at UB.  I'm starting this blog to give people a bit of an idea of the stuff I'm working on.  My goal is to keep things relatively short.  I'm not sure exactly what I'll be posting here, but for general posts I'll be keeping the language at a level that's understandable by most people (not just those in CS).</p>
So, without further ado, let me say hello again, and welcome you to The Cutting Edge.</p>
(Oh, and why the name?  I like swords, and I like tech)</p>


The Computation's Effects</h3> The simplest thing we can do is to ask a question about what it will compute. These questions span the range from the trivial to the typically intractable. For example, we can ask about…</p>
