2022-03-23: View Serializability

Did you mean view serializability, or view serializability?

I'm a bit ashamed that it took me as long as it did to have this realization — after about six years of teaching databases, inspiration finally struck in the middle of answering a student's question. It really is a subtle point, and a potential disconnect between folks teaching databases (or at least all of the database textbooks) and folks who actually use databases in practice.

TL;DR: What virtually all database textbooks call view serializability is not what virtually all databases call view serializability.

Background: Serializability

First, a quick bit of background: programs are sequences of steps. Often, though, not all of those steps have to happen in exactly the order they're written. Take the following pseudo-python code:

a = f(1, n)
b = f(2, n)
print(f"The result is {g(a, b)}")

We can't run line 3 until after we're done with lines 1 and 2, but we don't really care about the order in which lines 1 and 2 happen [1]. The following program should produce exactly the same result:

b = f(2, n)
a = f(1, n)
print(f"The result is {g(a, b)}")

When we say this reordering preserves "serializability", it just means that the end result is exactly the same: There's no externally visible consequence to changing the order of lines 1 and 2.

This style of reasoning is super helpful when we want to develop parallel code (or systems that run code in parallel). If we can prove to ourselves that it doesn't matter what order we run two specific steps in, it means that we can safely run those steps in parallel.

Alternatively, if we come up with some scheme that controls how the steps of two programs get interleaved, we can prove that scheme correct by showing that any order of operations it allows is guaranteed to produce an output that is exactly the same as the correct order we started with [2].

Unfortunately "exactly the same" is a misleadingly hard-to-pin-down phrase. Clearly, there will be differences. Even if the values of variables a and b are identical, changing the order of execution will affect subtle things like where in memory a and b live, the value of a.id and b.id, and other minor (and generally inconsequential details). This makes it difficult to use "serializability" for any sorts of proofs.

So, when trying to prove correctness of some concurrency scheme, folks tend to come up with more precise specifications. For example, a common form of serializability called "conflict serializability" says that we're allowed to reorder any two lines of code (call them A and B) as long as:

1. A does not read from a variable that B writes to.
2. B does not read from a variable that A writes to.
3. A and B do not write to the same variable.

If two programs differ only in reorderings that respect these rules, we call them conflict equivalent.

In our example above, all three conditions are satisfied: line 1 reads from f and n and writes to a, while line 2 reads from the same variables and writes to b. Reordering lines 1 and 2 therefore preserves conflict equivalence.
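To make the check concrete, here's a minimal sketch in Python (the Op record and its read/write sets are my own illustration, not anything out of a textbook or a real system):

from dataclasses import dataclass, field

@dataclass
class Op:
    """One step of a program, summarized by the variables it reads and writes."""
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)

def can_swap(a: Op, b: Op) -> bool:
    """True if swapping adjacent steps a and b preserves conflict equivalence."""
    return (not (a.reads & b.writes)        # rule 1: A doesn't read what B writes
            and not (b.reads & a.writes)    # rule 2: B doesn't read what A writes
            and not (a.writes & b.writes))  # rule 3: no writes to the same variable

# Lines 1 and 2 of the example: both read f and n, but write different variables.
assert can_swap(Op(reads={"f", "n"}, writes={"a"}),
                Op(reads={"f", "n"}, writes={"b"}))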

As it turns out, conflict serializability is a pretty powerful tool. You can prove to yourself that 2-phase locking and optimistic concurrency control, two of the most fundamental forms of concurrency control in databases, both guarantee that any sequence of steps they allow preserves conflict serializability [2].

Both of these schemes, however, are a bit expensive. 2-phase locking is really conservative and often acquires locks it doesn't need. Optimistic concurrency control, for its part, requires (i) the ability to do copy-on-write, and (ii) an expensive conflict checking and merge step.

View Serializability

Enter timestamp concurrency control. The idea here is that we attach two counters to each variable: a read timestamp and a write timestamp. Each program (transaction) also gets assigned a timestamp (not really wall clock time... just a number that keeps growing). The read timestamp is the timestamp of the newest (highest numbered) transaction to read from the variable. The write timestamp is the timestamp of the newest (highest numbered) transaction to write to the variable.

Timestamp concurrency control is a form of optimistic concurrency control: A transaction starts and then keeps plugging along until it discovers that it is about to perform an action out-of-order. If/when that happens, the database cleans up after the transaction and restarts it.

For example, if one transaction comes along and tries to read a variable that was already written by a newer (higher numbered) transaction, that's an out-of-order read and the transaction restarts [3]. Similarly, if one transaction tries to write to a variable that a newer transaction has already read from, the transaction is restarted.
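Here's a rough sketch of those checks (entirely my own illustration; real systems track this per record and also deal with commits, aborts, and atomicity, all of which I'm skipping):

class Restart(Exception):
    """The transaction is about to act out of order; clean up and retry it."""

class Variable:
    def __init__(self, value=None):
        self.value = value
        self.read_ts = 0    # newest transaction to have read this variable
        self.write_ts = 0   # newest transaction to have written this variable

def ts_read(var, txn_ts):
    # Out-of-order read: a newer transaction already wrote this variable.
    if var.write_ts > txn_ts:
        raise Restart()
    var.read_ts = max(var.read_ts, txn_ts)
    return var.value

def ts_write(var, txn_ts, value):
    # Out-of-order write: a newer transaction already read this variable,
    # or (handled conservatively for now) already wrote it.  Rule 3 and a
    # cheaper way to handle that second case are discussed below.
    if var.read_ts > txn_ts or var.write_ts > txn_ts:
        raise Restart()
    var.write_ts = txn_ts
    var.value = value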

So far, so good. Our target execution order is given by the timestamps, and after a little bit of thought you should be able to convince yourself that we're enforcing rules 1 and 2 of conflict serializability by restarting transactions that are about to violate them.

What about rule 3 though? It's easy enough to detect: a transaction trying to write to a variable whose write timestamp is newer than the transaction's own timestamp. Consider:

a = f(1, n)
a = f(2, n)
print(f"The result is {g(a)}")

Let's say that each line is one transaction, and that we're trying to run lines 1 and 2 in parallel. Line 1 gets timestamp 1 and line 2 gets timestamp 2. As it just so happens, line 2 finishes before line 1 gets a chance to write to a. Now a has a write timestamp of 2 when line 1 comes along and tries to write to it. This is clearly a violation of rule 3 (and we shouldn't allow it). On the other hand, it's all the same if we just quietly throw line 1's write away: the value it would have written is never actually read, since line 2's newer write would have overwritten it before line 3 ever looked at a.

Note that here we are violating rule 3, and so the resulting scheme cannot be called conflict serializable. On the other hand, this is a very controlled violation of the rule. We're only allowed to violate it because the value we're writing is never viewed. This is how view serializability (textbook edition) is defined: it's conflict serializability, but we're allowed to ignore writes that are never read. Moreover, you can prove to yourself that timestamp concurrency control guarantees view serializability.
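In the sketch from before, that turns the conservative restart on a newer write timestamp into a quiet skip (this relaxation is often called the Thomas write rule; again, just my illustration):

def ts_write(var, txn_ts, value):
    # A newer transaction already read this variable: its read can't be
    # taken back, so restart.
    if var.read_ts > txn_ts:
        raise Restart()
    # A newer transaction already wrote this variable: our value would
    # never be viewed anyway, so quietly drop the write instead of restarting.
    if var.write_ts > txn_ts:
        return
    var.write_ts = txn_ts
    var.value = value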

Textbooks and Databases

So far, this sounds perfectly reasonable: View serializability (textbook edition) sounds perfectly sane and safe. On the other hand, use the phrase "view serializability" in the vicinity of a DBA and you'll hear the explosion from miles away.

Simply put, view serializability (database edition) describes a concurrency mode in several production database systems that is an incomplete implementation of timestamp concurrency control: timestamp concurrency control without the read timestamps. This is understandable, since read timestamps are really, really, really expensive: every read now has to update a shared timestamp, which introduces a concurrency bottleneck where there wasn't one before. On the other hand, without read timestamps, you have no way to detect when a transaction writes to a variable after a newer transaction has already read from it. For example:

a = f(2, n)
b = g(a)
print(f"The result is {b}")

If we parallelize lines 1 and 2, it's possible for line 2 to read whatever value happened to be sitting in a from before. We'll never detect this situation, because there's no read timestamp on a for line 1 to check against.
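Concretely, in the sketch from earlier, dropping the read timestamp leaves the write rule with nothing to catch line 1's late write (nothing ever recorded that the transaction running line 2 read a):

def ts_write(var, txn_ts, value):
    # There is no var.read_ts to consult anymore, so a write that arrives
    # after a newer transaction has already read the variable goes
    # straight through undetected.
    if var.write_ts > txn_ts:
        return              # stale write, drop it as before
    var.write_ts = txn_ts
    var.value = value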

And that's pretty much it. View serializability (textbook edition) is a slight variation of conflict serializability that permits us to ignore unseen writes. View serializability (database edition) is a bastardized version of timestamp concurrency control with arguably useless ordering guarantees.


1: Assuming f is free of side effects.

2: In databases, inter-transaction order doesn't matter, so there are technically multiple "correct" orders. Still, the point holds.

3: The database can keep around older versions of the variable (this is the basis of multiversion concurrency control), so that an out-of-order read can be served from an older version instead of forcing a restart.

