IC internals: orthogonal persistence

Published:



Orthogonal persistence

I often describe the Internet Computer (IC) as a distributed operating system that hosts WebAssembly programs. One of my favorite features of the IC is orthogonal (or "transparent") persistence. Orthogonal persistence creates an illusion that a program runs forever without crashing or losing state. There is no need for the program to persist the state in files explicitly: the runtime handles persistence transparently.

In the early research on orthogonal persistence, state modifications looked like database transactions. Programmers had to mark transaction boundaries so that the system could identify safe snapshot points.

A slightly modified snippet of code in PS-algol from the paper "An Approach to Persistent Programming" by Atkinson et al. Note explicit calls to commit and abort procedures.
structure person(string name, phone ; pntr addr)
structure address(int no ; string street, town)

let db = open.database("addr.db", "write")
if db is error.record do { write "Can't open database"; abort }

let table = s.lookup("addr.table", db)
let p = person("al", 3250, address(76, "North St", "St Andrews"))
s.enter("al", table, p)

commit

Why cannot we take a snapshot after every instruction? This granularity is not only inefficient but also does not help us recover from failures. If the program crashes because of a faulty input, rolling back the instruction right before the crash is pointlessObligatory Futurama reference: the program will behave like Fry time-traveling to the moment right after the jump from the Vampire State Building (video)..

Persistent programming improved over manual serialization, but it does not feel "orthogonal" enough. We can do better.

Actors

Actors are a model of concurrent computation that treats systems as interactions of state machines. All the programs running on the IC are actors.

Pseudocode of an actor.
fn main() {
  let mut state = State::new();                // The initial state.
  loop {
    let msg = receive();                       // Input of the state machine.
    let out_msgs = dispatch(&mut state, &msg); // State transition function.
    send(out_msgs);                            // Output: more messages.
  }
}

The actor model and orthogonal persistence are a match made in heaven. Actors have a natural persistence boundary: the system can grab a snapshot of the actor's state at the moment when it is ready to process the next message. The programmer does not need to call commit explicitly to mark execution points where the system should persist the state.

Pseudocode of an IC actor with orthogonal persistence.
fn main() {
  let mut state = State::new();
  loop {
    let snapshot = state.clone();

    let input_message = receive();

    match dispatch(&mut state, &input_message) {
      Ok(out_msgs) => send(out_msgs),
      Trap(_) => {
        // If the message execution trapped, roll back to the last snapshot.
        state = snapshot;
        continue;
      }
    }
  }
}

The bulkiest component of an actor state on the IC is the WebAssembly instance state. A naive approach to snapshotting is to make a deep copy of all the data structures representing the state before each message execution. Most of the state data structures are cheap to clone; mutable globals are a good example. However, snapshotting and restoring Wasm memories pose a challenge: each memory can hold gigabytes of data. Copying the memory in full before each message execution is prohibitively expensive.

Let us see how the IC implementation responds to this challenge.

Snapshots and deltas

The implementation divides the contents of each memory into 4096-byte chunks called pages. When the actor executes a message, the system automatically detects the memory pages that the actor modifies or dirties. The system uses low-level UNIX APIs to detect page modifications. Since most operating systems operate on 4096-byte memory pages, using the same page size in the replica is the most natural choice.

The memory snapshot of an actor is a map from a page index to the page contents. We call this data structure a page map. There are many ways to implement this data structure. The primary assumption that guided the current implementation is that each message execution modifies only a small number of pages. This assumption holds in practice: as of April 2022, 95% of message executions change at most seven memory pages.

If the expected number of dirtied pages is small, it is natural to apply some delta-encoding scheme. Thus we chose to represent page maps as a combination of on-disk checkpoint files and in-memory page deltas. A checkpoint file is a flat binary file with a full copy of the actor memory that the system creates at the end of some execution rounds. Once created, checkpoint files are immutable. A page delta is a persistent map that contains pages dirtied by the actor since the last checkpoint.

Snapshots of an actor memory over several execution rounds. Dotted pages represent pages inherited from the checkpoint, dotted vertical lines — message execution boundaries, bold shapes — page deltas. Note that the system shares the contents of dirtied page among snapshots.

A
A
B
B
C
C
D
D
E
E
checkpoint C1
checkpoint C1
snapshot 1
snapshot 1
B'
B'
C1 + delta
C1 + delta
snapshot 2
snapshot 2
D'
D'
C1 + delta
C1 + delta
snapshot 3
snapshot 3
C1 + delta
C1 + delta
B''
B''
snapshot 4
snapshot 4
A
A
B''
B''
C
C
D'
D'
E
E
checkpoint C4
checkpoint C4
A
A
C
C
D
D
E
E
A
A
C
C
E
E
A
A
C
C
E
E
dirty page B
dirty page B
dirty page D
dirty page D
dirty page B
dirty page B
checkpoint
checkpoint
Viewer does not support full SVG 1.1

The runtime uses page maps in the following way:

  1. The system creates a "scratchpad" memory by constructing a patchwork from the checkpoint file and the running page deltas.
  2. The system executes a message on the actor state and detects the pages that the actor dirtied on the scratchpad.
  3. If the message execution succeeds, the system copies the contents of the dirty pages into the running delta.
  4. Otherwise, the system throws away the scratchpad.

Once in a while, the system creates a new checkpoint by cloning the original checkpoint file and flushing dirty pages to the new file. This procedure speeds up scratchpad construction and reduces memory pressure.

Detecting memory writes

Detecting touched and dirtied memory pages is another challenging aspect of orthogonal persistence. The system compiles actor's WebAssembly modules to native code. How do you know which pages arbitrary code touched? There are multiple approaches to this problem:

The signal handler

The memory protection API allows us to set read and write permissions on page ranges. If the process violates the permissions by reading a read-protected region or writing to a write-protected region, the operating system sends a signal (usually SIGSEGV) to the thread that triggered the violation. The signal handling API allows us to intercept these signals and inspect the address that caused them. We can build a robust and efficient memory access detection mechanism by combining these APIs.

Pseudocode of the memory access detection state machine.
let mut mem = construct_actor_memory();
let mut dirty_pages = bit_vector();
let mut touched_pages = bit_vector();

// Mark actor's memory as non-readable and non-writeabe.
mprotect(mem.start(), mem.len(), PROT_NONE);

set_signal_handler(SIGSEGV, |address, is_read| {
  touched_pages.set(address.page_index());

  if is_read {
    // Allow the process to read from the page.
    // Subsequent read from the page will not trigger the
    // signal handler, but writes to the page will.
    mprotect(address.page_start(), PAGE_SIZE, PROT_READ);
  } else {
    // Allow the process to read from and write to the page.
    // Subsequent reads from and writes to the page will not trigger
    // the signal handler.
    mprotect(address.page_start(), PAGE_SIZE, PROT_READ | PROT_WRITE);
    dirty_pages.set(address.page_index());
  }
});

actor.execute_message(&mut mem, message);

// observe dirty_pages and touched_pages

Let us see the state machine in action on a simple example: an actor stores an array of integers; it needs to find a specific integer in the array and replace it with another.

An example of a function triggering the memory access mechanism.
fn replace(array: &mut [u32], src: u32, dst: u32) {
  for i in 0..array.len() {
    if array[i] == src {
      array[i] = dst;
      break;
    }
  }
}

To keep the example simple, let us assume that the array starts at page zero, and the integer to replace is at position 2000 in the array. Before the execution begins, the entire memory has protection flags PROT_NONE. As the actor executes the replace function, the following events occur:

  1. The actor accesses array[0]. Since the corresponding memory page is read-protected, the load instruction generates a SIGSEGV signal. The system invokes the signal handler that marks the page as touched and sets the memory protection to PROT_READ. The context switches back to the actor code, which can now repeat the read operation successfully.
  2. The actor accesses array elements array[1] through array[1023]A 4096-byte memory page can fit 1024 32-bit integers. without interruption.
  3. The actor accesses array[1024]. This element lives on the next page, which is still read-protected. The system behaves just as in step ①.
  4. The actor accesses elements array[1025] through array[2000] without interruption.
  5. The actor writes to array[2000]. The corresponding memory page is write-protected, so the store instruction generates a SIGSEGV signal. The context switches to the signal handler. The signal handler marks the page as dirty and removes the write protection by settings the protection flags to PROT_READ | PROT_WRITE. The actor code resumes and repeats the write, this time without interruption.
States of the memory mapping during replace function execution. Rectangles represent memory pages, hollow arrows — reads, solid arrows — writes.

PROT_NONE
PROT_NONE
PROT_NONE
PROT_NONE
 if array[0] == src
① if array[0] == src
PROT_READ
PROT_READ
PROT_NONE
PROT_NONE
PROT_READ
PROT_READ
PROT_READ
PROT_READ
 if array[1024] == src
③ if array[1024] == src
 array[2000] = dst
⑤ array[2000] = dst
PROT_READ
PROT_READ
PROT_READ | PROT_WRITE
PROT_READ | PROT_WRITE
PROT_NONE
PROT_NONE
PROT_NONE
PROT_NONE
PROT_NONE
PROT_NONE
PROT_NONE
PROT_NONE
touched: {}, dirty: {}
touched: {}, dirty: {}
touched: {0}, dirty: {}
touched: {0}, dirty: {}
touched: {0, 1}, dirty: {}
touched: {0, 1}, dirty: {}
touched: {0, 1}, dirty: {1}
touched: {0, 1}, dirty: {1}
Viewer does not support full SVG 1.1

The signal handler approach is quite efficient in practice, and we can optimize it further:

Surviving upgrades

Orthogonal persistence creates an illusion that the program runs forever without interruption. But what if we want to fix a bug in that program or add a new feature? Can we do the upgrade without losing the program state?

At first sight, it might seem safe to swap the WebAssembly module of an actor without changing the contents of its memory. Unfortunately, this seldom works in practice.

The only known way to solve this issue is to rely on stable data representations. There is a proposal that enables WebAssembly programs to access multiple disjoint memories. Andreas Rossberg suggested using the multi-memory feature to refine the persistence model with memory lifetimes. For example, an actor could use the default memory (the memory with index zero) as mid-term storage that depends on the compiler's memory representation. The actor could use other memories as long-term storage with a stable (ideally, language-independent) data layout. The system would wipe out the transient memory during the upgrade but persist the long-termAndreas also proposed short-term memories with a lifetime of a single message execution. Such memories open up attractive optimization opportunities, but we did not have enough pressure to implement them. storage.

Since the IC needed to support upgrades before the multi-memory proposal was implemented, the runtime emulates one extra memory via the stable memory API. This API intentionally mimics WebAssembly memory instructions to facilitate future migration to multi-memory modules. Internally, the main memory and the stable memory share the representation.

Currently, most actors serialize their entire state to stable memory in the pre-upgrade hook and read it back in the post-upgrade hook. There is an ongoing effort to build tools that make it easier to use stable memory as the primary data storage.

Conclusion

We examined the classical approach to orthogonal persistence and its synergy with the actor model. We learned about the page map, the data structure that IC replicas use to store multiple versions of an actor's memory efficiently, and the SIGSEGV-based memory access detection system. Lastly, we saw that orthogonal persistence is not the final solution to state persistence and why we need better tools to handle program upgrades.

References

The IC replica code is open source. In case you want to read the code that implements ideas from this article, here are a few pointers: