IC internals: orthogonal persistence
I often describe the Internet Computer (IC) as a distributed operating system that hosts WebAssembly programs. One of my favorite features of the IC is orthogonal (or "transparent") persistence. Orthogonal persistence creates an illusion that a program runs forever without crashing or losing state. There is no need for the program to persist the state in files explicitly: the runtime handles persistence transparently.
In the early research on orthogonal persistence, state modifications looked like database transactions. Programmers had to mark transaction boundaries so that the system could identify safe snapshot points.
Why cannot we take a snapshot after every instruction? This granularity is not only inefficient but also does not help us recover from failures. If the program crashes because of a faulty input, rolling back the instruction right before the crash is pointlessObligatory Futurama reference: the program will behave like Fry time-traveling to the moment right after the jump from the Vampire State Building (video)..
Persistent programming improved over manual serialization, but it does not feel "orthogonal" enough. We can do better.
Actors are a model of concurrent computation that treats systems as interactions of state machines. All the programs running on the IC are actors.
The actor model and orthogonal persistence are a match made in heaven.
Actors have a natural persistence boundary: the system can grab a snapshot of the actor's state at the moment when it is ready to process the next message.
The programmer does not need to call
commit explicitly to mark execution points where the system should persist the state.
The bulkiest component of an actor state on the IC is the WebAssembly instance state. A naive approach to snapshotting is to make a deep copy of all the data structures representing the state before each message execution. Most of the state data structures are cheap to clone; mutable globals are a good example. However, snapshotting and restoring Wasm memories pose a challenge: each memory can hold gigabytes of data. Copying the memory in full before each message execution is prohibitively expensive.
Let us see how the IC implementation responds to this challenge.
Snapshots and deltas
The implementation divides the contents of each memory into 4096-byte chunks called pages. When the actor executes a message, the system automatically detects the memory pages that the actor modifies or dirties. The system uses low-level UNIX APIs to detect page modifications. Since most operating systems operate on 4096-byte memory pages, using the same page size in the replica is the most natural choice.
The memory snapshot of an actor is a map from a page index to the page contents. We call this data structure a page map. There are many ways to implement this data structure. The primary assumption that guided the current implementation is that each message execution modifies only a small number of pages. This assumption holds in practice: as of April 2022, 95% of message executions change at most seven memory pages.
If the expected number of dirtied pages is small, it is natural to apply some delta-encoding scheme. Thus we chose to represent page maps as a combination of on-disk checkpoint files and in-memory page deltas. A checkpoint file is a flat binary file with a full copy of the actor memory that the system creates at the end of some execution rounds. Once created, checkpoint files are immutable. A page delta is a persistent map that contains pages dirtied by the actor since the last checkpoint.
The runtime uses page maps in the following way:
- The system creates a "scratchpad" memory by constructing a patchwork from the checkpoint file and the running page deltas.
- The system executes a message on the actor state and detects the pages that the actor dirtied on the scratchpad.
- If the message execution succeeds, the system copies the contents of the dirty pages into the running delta.
- Otherwise, the system throws away the scratchpad.
Once in a while, the system creates a new checkpoint by cloning the original checkpoint file and flushing dirty pages to the new file. This procedure speeds up scratchpad construction and reduces memory pressure.
Detecting memory writes
Detecting touched and dirtied memory pages is another challenging aspect of orthogonal persistence. The system compiles actor's WebAssembly modules to native code. How do you know which pages arbitrary code touched? There are multiple approaches to this problem:
- Compare the full memory before and after execution. This approach is easy to implement and works on all platforms, but it performs poorly on sizable memories and does not give us accessed pages.
- Instrument the WebAssembly module before compiling it to native code.
For example, we could add a system call before all
storeinstructions to mark the corresponding pages as accessed or dirtied. This approach works on all platforms but can slow down execution by an order of magnitude.
- Use memory protection and signal handlers to detect memory reads and writes at runtime. This approach works well on UNIX platforms and is reasonably efficient. We shall discuss this approach in more detail shortly.
- Implement a custom filesystem backend as a FUSE library, then memory-map actor memory as a virtual file. The operating system will call into our library when the actor code reads from or writes to the memory-mapped file, allowing us to perform the required bookkeeping. This approach works great on platforms that support FUSE (Linux and macOS), but it has some administrative disadvantages. For example, the replica will need priveledged access to be able to mount a virtual filesystem.
- Use Linux pmap utility to extract memory access statistics right after the message execution. This approach is very efficient, but it does not always produce deterministic results.
The signal handler
The memory protection API allows us to set read and write permissions on page ranges.
If the process violates the permissions by reading a read-protected region or writing to a write-protected region, the operating system sends a signal (usually
SIGSEGV) to the thread that triggered the violation.
The signal handling API allows us to intercept these signals and inspect the address that caused them.
We can build a robust and efficient memory access detection mechanism by combining these APIs.
Let us see the state machine in action on a simple example: an actor stores an array of integers; it needs to find a specific integer in the array and replace it with another.
To keep the example simple, let us assume that the array starts at page zero, and the integer to replace is at position
2000 in the array.
Before the execution begins, the entire memory has protection flags
As the actor executes the
replace function, the following events occur:
- The actor accesses
array. Since the corresponding memory page is read-protected, the load instruction generates a
SIGSEGVsignal. The system invokes the signal handler that marks the page as touched and sets the memory protection to
PROT_READ. The context switches back to the actor code, which can now repeat the read operation successfully.
- The actor accesses array elements
arrayA 4096-byte memory page can fit 1024 32-bit integers. without interruption.
- The actor accesses
array. This element lives on the next page, which is still read-protected. The system behaves just as in step ①.
- The actor accesses elements
- The actor writes to
array. The corresponding memory page is write-protected, so the store instruction generates a
SIGSEGVsignal. The context switches to the signal handler. The signal handler marks the page as dirty and removes the write protection by settings the protection flags to
PROT_READ | PROT_WRITE. The actor code resumes and repeats the write, this time without interruption.
The signal handler approach is quite efficient in practice, and we can optimize it further:
- Instead of constructing the full actor memory in advance, we can populate it lazily when the actor first touches it. This optimization works best for fragmented memories with lots of page deltas.
- Sequential memory access is a pervasive pattern worth optimizing. We can remove read protection from multiple consequent pages at once, removing the need to switch the execution context to the signal handler for each page. This optimization makes the mechanism less precise but can significantly improve execution speed.
- We can speculatively remove write protection from multiple consequent pages to speed up sequential writes. We can later compare the contents of dirtied pages with the snapshot to validate our speculation.
Orthogonal persistence creates an illusion that the program runs forever without interruption. But what if we want to fix a bug in that program or add a new feature? Can we do the upgrade without losing the program state?
At first sight, it might seem safe to swap the WebAssembly module of an actor without changing the contents of its memory. Unfortunately, this seldom works in practice.
- The actor code expects a specific memory layout; changing data type definitions affects the layout. The new version of the code will not be able to make sense of the old memory.
- The memory might contain pointers to functions. Changing the code might invalidate these pointers.
- The code might have other expectations about the data besides the memory layout. For example, the new version of the code might use a different hash function for hash tables. Hash table lookups might start returning invalid results after the code swap.
The only known way to solve this issue is to rely on stable data representations. There is a proposal that enables WebAssembly programs to access multiple disjoint memories. Andreas Rossberg suggested using the multi-memory feature to refine the persistence model with memory lifetimes. For example, an actor could use the default memory (the memory with index zero) as mid-term storage that depends on the compiler's memory representation. The actor could use other memories as long-term storage with a stable (ideally, language-independent) data layout. The system would wipe out the transient memory during the upgrade but persist the long-termAndreas also proposed short-term memories with a lifetime of a single message execution. Such memories open up attractive optimization opportunities, but we did not have enough pressure to implement them. storage.
Since the IC needed to support upgrades before the multi-memory proposal was implemented, the runtime emulates one extra memory via the stable memory API. This API intentionally mimics WebAssembly memory instructions to facilitate future migration to multi-memory modules. Internally, the main memory and the stable memory share the representation.
Currently, most actors serialize their entire state to stable memory in the pre-upgrade hook and read it back in the post-upgrade hook. There is an ongoing effort to build tools that make it easier to use stable memory as the primary data storage.
We examined the classical approach to orthogonal persistence and its synergy with the actor model. We learned about the page map, the data structure that IC replicas use to store multiple versions of an actor's memory efficiently, and the SIGSEGV-based memory access detection system. Lastly, we saw that orthogonal persistence is not the final solution to state persistence and why we need better tools to handle program upgrades.
The IC replica code is open source. In case you want to read the code that implements ideas from this article, here are a few pointers: