Learning formal verification by writing a spec for verifying a zero-downtime database migration in PlusCal

posted on Dec 28, 2024

Formal verification is the process of checking that a system satisfies specific properties (which are specified by the user) across all possible states, and it can be used to analyze complex systems and ensure their correctness. There are two main types of properties: "safety properties" and "liveness properties". For the formal definitions, please refer here. Informally: if a property is violated in some state but still has the potential to be satisfied in a future state, it is a "liveness property". If, once violated, it can never be satisfied in any future state, it is a "safety property".

For example, let's imagine a bunch of servers serving a website. A property saying that "no server can ever be offline" would be a safety property: if any server goes offline in any state of the system, the property is violated and can never be satisfied in any future state after that. But a property saying that "all servers that go offline eventually come back online" would be a liveness property, because there is always a possibility of it becoming true in a future state.

TLA+ is a formal specification language in which we can model systems, enumerate every possible state a system can be in, and verify that specified properties hold across all of these states.
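For instance, the two server properties above could be written in TLA+ roughly as follows (Servers and Offline are hypothetical names used purely for illustration):

\* safety: in every state, no server is offline
NoServerEverOffline == [](\A s \in Servers: ~Offline(s))

\* liveness: a server being offline leads to it eventually being back online
OfflineServersRecover == \A s \in Servers: Offline(s) ~> ~Offline(s)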

PlusCal is a DSL for TLA+. It sits at a higher level of abstraction, and specifying complex systems in it is much easier than in raw TLA+. The trade-off is that some kinds of systems cannot be specified in it (i.e., it is more constrained than TLA+, but it is sufficient for most common use cases).
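As a taste of the notation, here is a tiny (hypothetical) PlusCal spec: two processes each increment a shared counter once. The algorithm lives in a comment block, and the PlusCal translator compiles it into plain TLA+ that the model checker can work with.

---- MODULE counter ----
EXTENDS Integers

(* --algorithm counter
variables x = 0;

\* two processes, each incrementing the shared counter once
process inc \in {1, 2}
begin
  Add:
    x := x + 1;
end process;
end algorithm; *)
====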

I went through Learn TLA+ in about 2 weeks (it was quite dense - I ended up learning some things about formal verification along the way in order to understand the material fully). I figured the best way to learn would be to write a spec that actually does something useful, and verifying a zero-downtime database migration algorithm sounded fun.

Let's assume a service exposes 3 APIs: Upsert, which either updates an object or creates a new one; Delete, which deletes the object; and Get, which fetches the object. We want to migrate these objects (for simplicity, let's assume it's a table in a SQL database) from one database (olddb) to another (newdb).

The pseudocode for the migration algorithm is as follows:

Note: the pseudocode assumes that the database is a mapping of primary keys to values (the objects in question are assumed to have a primary key and a value)

The API operations:

Upsert(key, val) {
    newdb[key] = val
}
Delete(key) {
  // we need to put a TOMBSTONE instead of actually
  // deleting it so that the background migration
  // process does not accidentally insert a 
  // deleted value
  newdb[key] = TOMBSTONE
}
Get(key) {
    if key in newdb {
        val = newdb[key]
        if val == TOMBSTONE {
            return NULL
        } else {
            return val
        }
    } else if key in olddb {
        return olddb[key]
    }
    return NULL
}

And a background migration process that moves all the objects from olddb to newdb:

migrate() {
  for key, val in olddb {
    // "if does not exist" is a conditional insert:
    // the check and the write happen atomically
    newdb[key] = val if does not exist
  }
}

This is the PlusCal spec for the above migration algorithm:

---- MODULE test ----
EXTENDS Integers, FiniteSets, TLC

Threads == {1,2} \* each thread represents a user
N == 1..4 \* primary keys of objects that user will use
Objects ==  [p \in N |-> p] \* the values that primary keys map to 

CONSTANTS 
    NULL,
    TOMBSTONE

(* --algorithm zero_downtime_migration

variables 
\* db is used to simulate a system without migration. 
\* olddb and newdb are used to simulate system with migration

db = [pkey \in 1..2 |-> pkey+1], \* some objects will already exist in db with different values when migration starts
olddb = [pkey \in 1..2 |-> pkey+1],
newdb = [pkey \in {} |-> NULL];

define 
  TypeOK == 
       \/ DOMAIN db \subseteq DOMAIN Objects
       \/ Cardinality(DOMAIN db) = 0

  GetObjectWOMigration(pkey) == 
    IF pkey \in DOMAIN db
    THEN db[pkey]
    ELSE NULL

  GetObjectWMigration(pkey) == 
    IF pkey \in DOMAIN newdb 
    THEN IF newdb[pkey] = TOMBSTONE 
         THEN NULL
         ELSE newdb[pkey]
    ELSE IF pkey \in DOMAIN olddb 
         THEN olddb[pkey]
         ELSE NULL

  \* the system with migration and without migration will be compared.
  \* this is our main invariant - we mimic a system where no migration is
  \* happening so that we can use it to check that the system where the
  \* migration is happening is correct
  ConsistentViews ==
    /\ \A pkey \in N: GetObjectWOMigration(pkey) = GetObjectWMigration(pkey)

end define;

\* WOMigration represent operations for the system without migration
\* WMigration represents operations for system with migration

macro UpsertWOMigration(db, key, val) begin
        db := (key :> val) @@ db;
end macro;

macro DeleteWOMigration(db, key) begin
  if key \in DOMAIN db then
    db := [p \in DOMAIN db \ {key} |-> db[p]]
  end if;
end macro;

macro UpsertWMigration(newdb, key, val) begin
        newdb := (key :> val) @@ newdb;
end macro;

macro DeleteWMigration(newdb, key) begin
  newdb := (key :> TOMBSTONE) @@ newdb
end macro;

\* each process represents 1 user
\* each will randomly pick an object and either upsert or delete it
process thread \in Threads

  variables
  i = 0,
  curPkey = NULL;

  begin
    Start:
      while i < 2 do
        PickObject:
          with pkey \in N do
            curPkey := pkey;
          end with;
        ChooseOperation:
          either 
            goto Upsert;
          or
            goto Delete;
          end either;
        Upsert:
          either
          \* we perform the same operation in the with and without migration systems atomically.
          \* at the end of each operation, both systems should return the same values for all the objects
            UpsertWOMigration(db, curPkey, Objects[curPkey]);
            UpsertWMigration(newdb, curPkey, Objects[curPkey]);
            goto IncCounter;
          or 
          \* represents an error - i.e., a 5xx to the user. we need to error out in both systems.
            goto IncCounter;
          end either;
        Delete:
          either
            DeleteWOMigration(db, curPkey);
            DeleteWMigration(newdb, curPkey);
            goto IncCounter;
          or
            goto IncCounter;
          end either;
        IncCounter:
          i := i + 1;
      end while;

  end process;

\* this is the background migration process that copies over values from olddb to newdb
process bg_migration = 3

  variables
  existingPkeys = DOMAIN olddb,
  \* local variable names and labels must be unique across the whole
  \* algorithm, so this process cannot reuse curPkey or Start
  migPkey = NULL;

  begin
    MigrateLoop:
      while existingPkeys /= {} do
        Pick:
          with pkey \in existingPkeys do
            migPkey := pkey;
            existingPkeys := existingPkeys \ {migPkey};
          end with;
        Migrate:
        \* *note*: this entire get and set is atomic (i.e., create if not exists).
        \* if you make the get and set non-atomic, the spec will fail (as it should)
          if migPkey \notin DOMAIN newdb
          then
            newdb := (migPkey :> olddb[migPkey]) @@ newdb;
          end if;
      end while;

end process;

end algorithm; *)
====

It's amazing how the main part of the spec looks very similar to the pseudocode.

The trickiest thing was figuring out the invariant - i.e., how to make sure that our migration algorithm is correct. For this, I simulate 2 services in the spec: one where the migration is happening and another where it is not. Our invariant is that Get operations on all the objects return the exact same value in both services.

Let's define the main variables involved:

Objects represents the objects that the users will work with. This is a function that maps primary keys of the objects to their values (each object is assumed to have a primary key and a value). The spec uses 4 objects. db is the database used by the service without migration; olddb and newdb are the databases used by the service with migration. All the database variables are functions mapping primary keys of objects to values.
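Concretely, with the definitions above, the initial state looks like this (writing the functions out by hand in TLA+'s :> / @@ notation):

Objects = (1 :> 1 @@ 2 :> 2 @@ 3 :> 3 @@ 4 :> 4)
db      = (1 :> 2 @@ 2 :> 3)     \* [pkey \in 1..2 |-> pkey+1]
olddb   = (1 :> 2 @@ 2 :> 3)
newdb   = [pkey \in {} |-> NULL] \* a function with an empty domain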

All the service APIs are written as macros. The APIs for the service with migration end with WMigration, and those for the service without migration end with WOMigration.

Note: the database updates are implemented as function updates in PlusCal, with the database appearing on the right-hand side. This is because @@ merges the keys of left_function @@ right_function and, for a key present in both functions, uses the value from left_function. The Get operation is implemented as an operator, because macros in PlusCal do not return values.
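For example, evaluating an upsert by hand:

\* :> constructs a single-entry function; @@ merges two functions,
\* keeping the left-hand value for keys present in both
(1 :> 10) @@ (1 :> 2 @@ 2 :> 3) \* = (1 :> 10 @@ 2 :> 3)

\* so (key :> val) @@ newdb overwrites key if it already exists
\* in newdb, and inserts it otherwise - exactly an upsert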

One thing I realized is that there is a lot of freedom in modelling complex systems in TLA+. This lets us represent complex interactions in a very simple manner (e.g., a database insert as a function update). This freedom might seem a bit weird when just starting with PlusCal/TLA+, but it's very powerful once we get used to it.

Let's talk about the processes involved in the spec now.

The spec simulates 2 users, each randomly picking one of the pre-defined objects (4 objects) and then randomly performing either an Upsert or a Delete operation on it. Each user does this 2 times. The database starts out with some objects already present (2 objects), and the objects that the users pick and operate on include these pre-existing objects. The number of users, objects, and repetitions might seem far too small to make sure the system is working properly, but since the TLC model checker builds a state graph that includes every possible state, we can be sure that all the edge cases are covered. Every action in our system with migration is mirrored in the system without migration atomically.

There is also a background migration process running which copies over the data from olddb to newdb.
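To actually check the spec, TLC needs a model. A minimal configuration for this spec might look something like the following, assuming NULL and TOMBSTONE are assigned model values (the usual choice) and that Spec is the formula generated by the PlusCal translator:

SPECIFICATION Spec
CONSTANTS
    NULL = NULL
    TOMBSTONE = TOMBSTONE
INVARIANTS
    TypeOK
    ConsistentViews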

The final migration step involves deleting the TOMBSTONEs in newdb, but for now the spec does not have it. In the future I might add it and revise this blog post.
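If I do, a rough sketch of that step (assuming it only runs after all user writes and the background migration have finished) could be a single action that drops every tombstoned key:

\* sketch only - not part of the spec above: keep just the
\* non-tombstoned keys of newdb
Cleanup:
  newdb := [p \in {q \in DOMAIN newdb : newdb[q] /= TOMBSTONE}
              |-> newdb[p]];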

The true value of a TLA+ spec shows up when the spec errors out. For example, when writing the spec I had written GetObjectWMigration incorrectly:

GetObjectWMigration(pkey) == 
    IF pkey \in DOMAIN newdb 
    THEN newdb[pkey]
    ELSE IF pkey \in DOMAIN olddb 
         THEN olddb[pkey]
         ELSE NULL

Compare this to the correct version:

GetObjectWMigration(pkey) == 
  IF pkey \in DOMAIN newdb 
  THEN IF newdb[pkey] = TOMBSTONE 
        THEN NULL
        ELSE newdb[pkey]
  ELSE IF pkey \in DOMAIN olddb 
        THEN olddb[pkey]
        ELSE NULL

The incorrect version does not handle the case where a TOMBSTONE is present in newdb. Because of this, the spec was failing.

This is a solid example of how TLA+ saves time: it lets us discover design flaws very early. Imagine debugging this error after writing out the entire migration algorithm: it would've been hard to do, and there's a very real chance it would have been missed altogether. Here, I just had to step through the state changes to figure out the exact cause of the spec failure.

One more way in which TLA+ helps is that it forces us to think about which steps must be atomic. If we come across any steps that cannot be atomic, it forces us to consider the implications of concurrent execution. For example, in the background migration process above, before moving an item from olddb to newdb we check whether the object already exists in newdb (it may exist because it was deleted; if it was, we should not overwrite the TOMBSTONE). Checking whether the object exists and creating it in newdb must be atomic (most databases support conditional inserts of the "if not exists" kind). While this may seem trivial, writing the TLA+ spec made it very clear and brought it into focus. Important information like this, discovered early in the design phase, can save a lot of time during implementation (e.g., here we would need to make sure that newdb supports conditional inserts when choosing the database).
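The spec makes this concrete: if the check and the insert in the Migrate step are split into two labels (i.e., two separate atomic steps - CheckExists and Insert below are just illustrative names), TLC finds an interleaving where a user's Delete writes a TOMBSTONE between the two steps and the migration then overwrites it, violating ConsistentViews:

\* a non-atomic variant of the Migrate step - this version fails
CheckExists:
  if migPkey \notin DOMAIN newdb
  then
Insert:
    newdb := (migPkey :> olddb[migPkey]) @@ newdb;
  end if;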