The ODIn Lab - Uncertainty in Distributed Computation

Probabilistic databases are a solution to a simple problem -- Sometimes you don't have all the data.

Probabilistic databases address this problem in the context of a specific domain: asking questions about data that is incomplete, imprecise, or noisy. But this is only one domain that this problem occurs in; Noisy, incomplete data occurs everywhere.

A prime example of this is distributed computation. Each node participating in the distributed computation knows (for certain) what is going on locally, but not what's going on elsewhere. If an update occurs on one node, it takes time to propagate to other nodes.

A good way to think of this is that the node has its own view of the state of the world. Slowly, over time, this view diverges from the "real" state of the world. As the node communicates with other nodes, the view reconverges.

Many early distributed protocols were designed to enforce this sort of convergence, at least to the point where certain properties (e.g., relative ordering) could be guaranteed. For the past few years, the fashion has been to use eventual consistency, where the end-user is presented with results that are not guaranteed to be entirely accurate.

This doesn't have to be a binary choice; many such systems (Zookeeper[1], PNUTS[2], Percolator[3], etc...) offer a hybrid consistency model where end-users can choose to receive results guaranteed to be consistent, albeit at the cost of higher access times.

What I've been seeing lately is a tendency to take this even further: To actually try to capture the uncertainty in the computation in the distributed programming model itself. The first instance that my quick (and quite incomplete) scan of deployed systems was Facebook's Cassandra [4], which used a technique called φ-accrual [5] to get a running estimate of the likelihood of a particular server being up or down.

More recently, a similar idea has appeared in Google's Spanner [6]. Here, the uncertainty was on the timing of specific events, and the goal was to determine relative ordering and to obtain guaranteed consistency by establishing a bound on how accurate (or inaccurate) the timestamps you're using are.

This idea can be taken a lot further. Although I can't imagine programmers wanting to explicitly account for uncertainty in their code, they may be willing to work with a language that does this accounting for them. Maybe I don't need a precise result to present to the user, maybe I just need something in the right ballpark. Maybe I just need an order of magnitude!

What would a language designed around this look like?

How could the programmer specify the bounds on uncertainty that they were willing to accept?

Could such a language be combined with online techniques (i.e., provide the end-user with a stream of progressively more accurate answers).

Can PL ideas such as promises be adapted to this context? Here's an answer, it has accuracy X. The result of the computation you want to do with it can also be computed, and the uncertainty of that computation (based on the uncertainty in the input) is Y.

This seems like it would be a really cool programming platform, if it could be made to be both usable and efficiently functional.

Citations

[1] Hunt, P. et al. 2010. ZooKeeper: Wait-free coordination for Internet-scale systems. USENIX ATC. (2010).

[2] Cooper, B.F. et al. 2008. PNUTS: Yahoo!'s hosted data serving platform. Proceedings of the VLDB Endowment. 1, 2 (Aug. 2008), 1277–1288.

[3] Peng, D. and Dabek, F. 2010. Large-scale incremental processing using distributed transactions and notifications. (2010).

[4] Lakshman, A. and Malik, P. 2010. Cassandra—A decentralized structured storage system. Operating systems review. (2010).

[5] Hayashibara, N. et al. The φ accrual failure detector. 66–78.

[6] Spanner: Google's Globally-Distributed Database: http://research.google.com/archive/spanner.html