- Transactions

Transactions

April 6, 2021

Garcia-Molina/Ullman/Widom: Ch. 18.1-18.2, 19.1

The running theme

If X and Y are equivalent and Y is better,
then replace all Xs with Ys

Today's focus: Updates

SQL DDL


            INSERT INTO Trees (location, species) 
              VALUES ('123 A Street', 'pin oak');
            
            DELETE FROM Trees WHERE date > now();
            
            UPDATE Trees SET species = 'pin oak' 
              WHERE species = 'pinoak';

What is "Correct"?

: My command gets executed fully, or not at all.
: The final state of the data is "sane".
: No concurrency glitches.
: If things break, my data is still safe.

My command gets executed fully, or not at all.

#	...	species
1	...	pin oak
2	...	pinoak
3	...	pinoak
4	...	pin oak
...


            UPDATE Trees SET species = 'pin oak' 
              WHERE species = 'pinoak';

Null Pointer Exception!

T-Rex Attack!

User-abort!

#	...	species
1	...	pin oak
2	...	pin oak
3	...	pinoak
4	...	pin oak
...

My command gets executed fully, or not at all.

The final state of the data is "sane".


      Bob.Balance -= 20
      Alice.Balance += 20

Bob's balance shouldn't drop below $0

Constraints

Primary Key / Unique: Unique column value for every row.
Foreign Key: Referenced value must exist.
Domain: Restrictions on allowable values
(e.g., Balance >= 0 or NOT NULL)
Check: Any boolean-valued SQL expression

No concurrency glitches.

If Alice and Bob both update the data at the same time,
nothing should break.

If things break, my data is still safe.

Once a write is confirmed, it should survive a power outage, crash, or marmot attack.

Atomicity: My command gets executed fully, or not at all.
Consistency: The final state of the data is "sane".
Isolation: No concurrency glitches.
Durability: If things break, my data is still safe.

Each use case may require different properties

Relational DBMSes (e.g., Oracle, DB2, Postgres): Full ACID (more or less)
Caches (e.g., Redis, Memcached): No need for Durability
Key-Value Stores (e.g., Riak, Aerospike, LevelDB): Only trivial Atomicity required

What if I want correctness over more than one operation?

Transactions


      START TRANSACTION
      UPDATE Ledger SET Balance = Balance - 20
        WHERE name = 'Bob' 
      UPDATE Ledger SET Balance = Balance + 20
        WHERE name = 'Alice' 
      COMMIT

Transactions

A "batch" of operations that should execute together

START/BEGIN TRANSACTION: Start batching
COMMIT: Conclude the transaction and apply changes
ABORT/ROLLBACK: Undo all changes applied as part of the transaction

Atomicity: A transaction is either applied fully (COMMITed) or not at all (ABORTed).
Consistency: Constraints are enforced on the final state of the transaction (and the transaction is ABORTed if they fail).
Isolation: Two transactions running in parallel don't interfere with each other
Durability: Once the database returns from a COMMIT successfully, the data is safe from marmots

A little abstraction...

Object

An object is a...

... a database "thing"

... a table
... a record
... a page
... a column

Each database is different,
so let's treat objects abstractly for now.

Observation: If two clients modify different objects, they can't possibly interfere with each other.

... but I'm getting ahead of myself

Transaction

A sequence of reads from and writes to objects.

... followed by a COMMIT or ABORT

What does it mean for a transaction to be Isolated?

Alice and Bob submit transactions at the same time!

Option 1: Alice's transaction executes to completion.; Bob's transaction executes to completion.
Option 2: Bob's transaction executes to completion.; Alice's transaction executes to completion.

Example


                      def T1:
                            A = A + 100
                            B = B - 100


                      def T2: 
                            A = A * 1.06
                            B = B * 1.06

Example

Time	T1	T2
↓	`A = A + 100`
↓	`B = B - 100`
↓		`A = A * 1.06`
↓		`B = B * 1.06`

A gets an extra $6

Also acceptable

Time	T1	T2
↓		`A = A * 1.06`
↓		`B = B * 1.06`
↓	`A = A + 100`
↓	`B = B - 100`

B gets an extra $6

Not Acceptable

Time	T1	T2
↓	`A = A + 100`
↓		`A = A * 1.06`
↓		`B = B * 1.06`
↓	`B = B - 100`

A and B both get an extra $6

Idea: Don't allow updates in parallel!

Observation Blocking transactions is sloooow

Less Bad Idea: Create/Enforce the illusion that we're not allowing updates in parallel.

A Schedule

A sequence of Reads and Writes

Time	T1	T2
↓		`R(A)`
↓		`W(A) A = A * 1.06`
↓		`R(B)`
↓		`W(B) B = B + 1/06`
↓	`R(A)`
↓	`W(A) A = A + 100`
↓	`R(B)`
↓	`W(B) B = B - 100`

Schedule: A sequence of read and writes from one or more transactions to objects
Serial Schedule: A schedule with no interleaving; ... but with an arbitrary transaction order
Serializable Schedule: A schedule that produces the same output as a serial schedule

Question: How do we ensure that transactions are only ever executed with serializable schedules?

Problem: We can't know if a schedule will be serializable until the transaction is done

Concurrency Strategies

Locking (Pessimistic): Block transaction execution before it risks non-serializable behavior
Snapshot Isolation (Optimistic): Execute each transaction on a snapshot of the database and abort it if a serializable merge is impossible.
Timestamp Concurrency Control (Optimistic): Annotate each object with timestamps and abort transactions that trigger non-serializable behavior.

Can we convince ourselves that these concurrency strategies are guaranteed to produce serializable schedules?

Observation 1: Reads can never interfere with reads

You can reverse the order of two reads without changing the serializability of a schedule.

Observation 2: Operations on different objects can't interfere with each other.

You can reverse the order of two operations on different objects without changing the serializability of a schedule.

Conflict Equivalence

Two schedules are conflict equivalent if there is a sequence of pairwise "flips" (of reads, or operations on different objects) that gets you from one schedule to the other.

Example 1

Time	T1	T2
\|		`W(B)`
\|	`R(B)`	↑
\|	↓	`W(A)`
↓	`W(A)`

Example 1

Time	T1	T2
\|		`W(B)`
\|		`W(A)`
\|	`R(B)`
↓	`W(A)`

Conflict equivalent to a serial schedule!

Example 2

Time	T1	T2
\|		`W(B)`
\|	`R(B)`
\|	`W(A)`
↓		`W(A)`

Can't rewrite!

Conflict Serializability

A schedule is conflict serializable if it is conflict equivalent to a serial schedule.

How do we determine if a schedule is conflict-serializable?

Observation: We can't reorder Write/Write (or Read/Write) operations on the same object.

Two transactions accessing the same object create a partial order on the serial schedule

Example 1

Time	T1	T2
\|		`W(B)`
\|	`R(B)`
\|		`W(A)`
↓	`W(A)`

T2's write to B "happens before" T1's read

T2's write to A "happens before" T1's write

No cycles in the partial order!

Example 2

Time	T1	T2
\|		`W(B)`
\|	`R(B)`
\|	`W(A)`
↓		`W(A)`

T2's write to B "happens before" T1's read

T1's write to A "happens before" T2's write

Cycle! No equivalent serial schedule!

Time	T1	T2	T3
\|	`W(A)`
\|		`W(A)`
\|		`W(B)`
\|	`W(B)`
↓			`W(B)`

T1's write to A "happens before" T2's write

T1, T2's writes to B "happen before" T3's write

T2's write to B "happens before" T1's write

Cycle! No equivalent serial schedule!

Time	T1	T2	T3
\|	`W(A)`
\|		`W(A)`
\|		`R(B)`
\|	`R(B)`
↓			`W(B)`

T1's write to A "happens before" T2's write

T1, T2's read on B "happens before" T3's write

T2's read on B "happens before" T1's read (but it doesn't count)

No Cycle! Has an equivalent serial schedule!

Time	T1	T2	T3
\|	`W(A)`
\|		`W(A)`
\|		`R(B)`
\|			`W(B)`
↓	`R(B)`

T1's write to A "happens before" T2's write

T2's read on B "happens before" T3's write

T3's write to B "happens before" T1's read

Cycle! No equivalent serial schedule!

"Happens Before" graphs

Create one node per transaction

For every W/W, W/R, or R/W pair of operations on the same object draw an edge from the first operation to the second.

An acyclic "Happens Before" or Dependency Graph is conflict serializable.

Forcing Acyclicity

Locking: Enforce a specific execution order.
Snapshot Isolation: Declare a serial order and abort transactions when they create reverse edges
Timestamp Concurrency Control*: Declare a serial order and abort transactions when they create reverse edges

Note

All three approaches simplify the problem by unilaterally picking an execution order and mandating that the schedule conforms to it.

We might get false positives: schedules that would ultimately end up being conflict equivalent to some schedule, but aren't equivalent to the one we picked (arbitrarily).

Next Time

Pessimistic Concurrency Control