April 10, 2019
What if the DB fails during a write.
IOs aren't atomic
Atomicity and Durability might be violated!
What if we need to page out some pages modified by a live transaction?
If the transaction aborts, the page state needs to be reverted.
Atomicity might be violated
When the system reboots/the power comes back on, we assume that successful writes are still there and failed writes are not.
W(A: 10)W(C: 8)W(E: 9)COMMIT| Page | Value | |
|---|---|---|
| A | 8 | 10 | 
| B | 12 | |
| C | 5 | 8 | 
| D | 18 | |
| E | 16 | 9 | 
COMMIT.COMMIT = don't replay the log.COMMIT succeeds, no DB writes succeedCOMMIT succeeds, at least one DB write succeedsCOMMIT succeeds, all DB writes succeedNote:  The transaction is durable as soon as we can guarantee the transaction's COMMIT is in the log.
We can consider the commit successful even if the data pages aren't on disk yet.
W(A: 10)W(C: 8)W(E: 9)COMMIT| Page | Buffer | Disk | |
|---|---|---|---|
| A | 10 | 10 | 10 | 
| B | 12 | ||
| C | 5 | 8 | 5 | 
| D | 18 | ||
| E | 9 | 16 | 9 | 
Using only one log is more efficient for writes!
... but recovery is slower.
COMMIT entry.COMMITed transactions only.Recovery gets slower with every transaction.
Idea: Periodically mark down the index of the earliest log entry still needed
How do we decide that a log entry is no longer needed.
| Page | Status | First Log Entry | Data | 
|---|---|---|---|
| 24 | DIRTY | 47 | 01011010... | 
| 30 | CLEAN | n/a | 11001101... | 
| 52 | DIRTY | 107 | 10100010... | 
| 57 | DIRTY | 87 | 01001101... | 
| 66 | CLEAN | n/a | 01001011... | 
Keep the first log entry to modify any dirty page
Log entries lower than a page's 'first log entry' aren't useful.
Periodically write the lowest 'first log entry' into the log (CHECKPOINT).
COMMIT entry.COMMITed transactions.Problem: There may not be enough memory to store pages modified by uncommitted transactions.
| Timestamp | Transaction | Object | Value | Prev | 
|---|---|---|---|---|
| 10 | T1 | Page 5 | 1010... | 00101... | 
| 11 | T2 | Page 3 | 1000... | 0111... | 
| 12 | T1 | Page 1 | 0011... | 0001... | 
| 13 | T3 | Page 5 | 1100... | 1010 | 
Idea: Record the page's previous value.
If a transaction aborts, go backwards through the log and undo all of the changes
| TS | Xact | Object | Value | Prev | 
|---|---|---|---|---|
| 10 | T1 | Page 5 | 1010... | 0101... | 
| 11 | T2 | Page 3 | 1000... | 0111... | 
| 12 | T1 | Page 1 | 0011... | 0001... | 
| 13 | T3 | Page 5 | 1100... | 1010... | 
| Page | Disk | 
|---|---|
| 1 | 0011 | 
| 2 | 1100 | 
| 3 | 1000 0111 | 
| 4 | 1001 | 
| 5 | 1100 1010 0101 | 
Observation: Going backwards through the log can be slow if you have a lot of concurrent transactions.
| Timestamp | Transaction | Object | Value | Prev | Last TS | 
|---|---|---|---|---|---|
| 10 | T1 | Page 5 | 1010... | 0101... | 7 | 
| 11 | T2 | Page 3 | 1000... | 0111... | 3 | 
| 12 | T1 | Page 1 | 0011... | 0001... | 10 | 
| 13 | T3 | Page 5 | 1100... | 1010... | 8 | 
Idea: Each log entry records the timestamp of the next most recent log entry for the same transaction.
Log entries form a linked list for each transaction
| Transaction ID | State | Last Log Entry | 
|---|---|---|
| T24 | VALIDATING | 99 | 
| T38 | COMMITTING | 85 | 
| T42 | ABORTING | 87 | 
| T56 | ACTIVE | 100 | 
| Page | Status | First Update | Data | 
|---|---|---|---|
| 24 | DIRTY | 47 | 01011010... | 
| 30 | CLEAN | n/a | 11001101... | 
| 52 | DIRTY | 107 | 10100010... | 
| 57 | DIRTY | 87 | 01001101... | 
| 66 | CLEAN | n/a | 01001011... | 
| Timestamp | Entry | 
|---|---|
| 0 | CHECKPOINT | 
| 1 | T1: UPDATE P5 | 
| 2 | T2: UPDATE P3 | 
| 3 | T1: ABORT | 
| 4 | T2: END | 
| 5 | T3: WRITE P1 | 
| 6 | T2: WRITE P5 | 
| CRASH | |
| Timestamp | Entry | |
|---|---|---|
| → | 0 | CHECKPOINT | 
| → | 1 | T1: UPDATE P5 | 
| → | 2 | T2: UPDATE P3 | 
| → | 3 | T1: ABORT | 
| → | 4 | T1: END | 
| → | 5 | T3: WRITE P1 | 
| → | 6 | T2: WRITE P5 | 
| CRASH | ||
| Xact | Status | LastTS | 
|---|---|---|
| T1 | ACTIVE ABORTING | 0 1 | 
| T2 | ACTIVE | 0 2 6 | 
| T3 | ACTIVE | 5 | 
As WAL recovery, before
Problem: Writing out a Checkpoint is SLOOOW (and blocking).
Idea: Snapshot in-memory state, then write in the background.
Record "START_CHECKPOINT" in log when the snapshot is taken.
Record "END_CHECKPOINT" in log when the snapshot is fully written.
Find the last "END_CHECKPOINT" and start recovery from the corresponding "START_CHECKPOINT" entry
Optimization: What if you crash during recovery?
Idea: Log "UNDO" operations as new writes
| Timestamp | Entry | 
|---|---|
| 0 | CHECKPOINT | 
| 1 | T1: UPDATE P5 | 
| 2 | T2: UPDATE P3 | 
| 3 | T1: ABORT | 
| 4 | T1: CLR Restore P5 | 
| 5 | T2: END | 
| 6 | T3: WRITE P1 | 
| 7 | T2: WRITE P5 | 
| Timestamp | Entry | |
|---|---|---|
| → | 0 | CHECKPOINT | 
| → | 1 | T1: UPDATE P5 | 
| → | 2 | T2: UPDATE P3 | 
| → | 3 | T1: ABORT | 
| → | 4 | T1: CLR P5 | 
| → | 5 | T1: END | 
| → | 6 | T3: WRITE P1 | 
| → | 7 | T2: WRITE P5 | 
| CRASH | ||
| Xact | Status | LastTS | 
|---|---|---|
| T1 | ACTIVE ABORTING | 0 1 0 | 
| T2 | ACTIVE | 0 2 6 | 
| T3 | ACTIVE | 5 |