A swarm of replicated state machines

Published:


In this article, we shall view the Internet Computer (IC) through the lens of distributed system design. As most other blockchains, the IC achieves fault-tolerance using strategy called state machine replication. We shall take a close look at some design choices that make the IC fast, scalable, and secure.

The state machine

Before we dive into the internals of the protocol, let's first define the state machine that we'll be dealing with. Nodes participating in the Internet Computer are grouped into units called subnet blockchains, or simply subnets. Nodes in the same subnet run their own instance of the core Internet Computer Protocol (i.e., they form an independent peer-to-peer network and reach consensus independently from other subnets). We'll model as a state machine the computation that nodes in the same subnet perform.

Let's play by the book and define components of our state machine:

Inputs
The key product of the consensus protocol is a sequence of blocks containing messages for canisters hosted by the subnet. These blocks are the inputs of our state machine.
Outputs
In our model, the main artifact of the execution is a data structure called state tree. We'll learn more about state trees in a moment.
States
The single most important thing that the Internet Computer does is hosting canisters. Thus we'll define the state as a data structure containing everything we need to serve canisters installed on the subnet, including but not limited to:
  • WebAssembly modules and configuration of the canisters.
  • Memory and stable memory of those canisters.
  • Messages in canister mailboxes.
  • Results of the recent ingress messages.
Transition function
When a replica receives a block, it
  • Injects messages from the block into the canister mailboxes.
  • Picks some canisters for execution according to a deterministic scheduling algorithm.
  • Executes the messages on the selected canisters and records the execution results.
All of the above modifies the data structure that we call "state" and acts as a transition function. Note that we can call this procedure a function only if it's deterministic: given the same block and the same original state, the replica will modify the state in exactly the same way. Thanks to the careful design of execution algorithms and guarantees that WebAssembly provides, the procedure is indeed deterministic.
Output function
Once a replica processed a block, it computes a state tree. This tree can be used to inspect the results of the execution and validate the authenticity of those results. The deterministic procedure that constructs a state tree from the state is our output function.
Initial state
Each subnet starts its life with no canisters, no messages, and no results to inspect. It's as boring as it gets.

I call these state machines (one for each subnet) replicated because each honest node on a subnet has an exact copy of the machine.

Checkpoints

Let's say we want to add a new node to an existing subnet because a flood destroyed one of the data centers hosting the subnet. This new node cannot start processing and proposing new blocks until it has the right state, the state that results from execution of all the blocks produced by this subnet so far.

One way to bring the node up to date is to download all those blocks and "replay" them. This sounds simple, but if the rate of change is high and message execution is costly, the new node might need a lot of time to catch up. As the Red Queen put it: “My dear, here we must run as fast as we can, just to stay in place. And if you wish to go anywhere you must run twice as fast as that.”

Another solution is to create persistent snapshots of the state from time to time. The peers can fetch and load those snapshots when they need help. This method works really well for our state machine: it reduces the catch up time from days to minutes. Let's call those persistent snapshots checkpoints.

Components of the state machine: blocks as inputs, states, state trees as outputs, and checkpoints.

B1
B1
B2
B2
B3
B3
S0
S0
S1
S1
S2
S2
S3
S3
C3
C3
 
 
Blocks
Blocks
States
States
State Trees
State Trees
Checkpoints
Checkpoints
 
 
 
 
1
1
2
2
3
3
Viewer does not support full SVG 1.1

State trees

The transition function is complex, but its details aren't very important for our discussion. We can treat block processing as a black box. Let's now see how we get the data out of the state machine.

There are quite a few bits of information that we want to access, for example:

Furthermore, since we cannot trust any particular node, we want to have some authenticity guarantees for the data we get back. Sounds easy: hash all the relevant bits of the state, collect a threshold signature on that hash, and use the signature as a proof of state authenticity. Problem solved?

But how do can we validate a single request status if we have a signature on the full state? Collecting a separate signature for each request would solve the problem, but the cost of this approach is unacceptable from the performance point of view.

Wouldn't it be great to be able to "zoom" into different parts of the state to produce replies for different clients, while still having only a single hash to sign? Enter state trees.

The logical structure of a state tree.

reply
reply
"IC is fun!"
"IC is fun!"
<request_id>
<request_id>
<subnet_id>
<subnet_id>
<canister_id>
<canister_id>
canister
canister
stream
stream
request_status
request_status
messages
messages
...
...
certified_data
certified_data
cafe..beaf
cafe..beaf
S8
S8
state tree
state tree
root hash
root hash
threshold σ
threshold σ
certification
certificat...
Viewer does not support full SVG 1.1

The state tree is a data structure that contains all outputs of our state machine in a form of a merkle tree. Once the gears of the execution stopped, the system computes the root hash of the state tree corresponding to the newly computed state, starts collecting a threshold signature for that hash, and moves on to the next block.

Lookup

Let's look at an example to see how the state tree does its magical zooming. Assume that you sent a request with id 1355...48de to the IC and you want to get back the reply. As we now know, the system will put the reply into a state tree, so let's make a read_state request with path "/request_status/1355...48de/reply".

The replica processes our request in the following way

  1. Check that the caller has permission to look at the paths listed in the read_state request.
  2. Get the latest certified state tree (i.e. the state tree with a complete threshold signature on its root hash).
  3. Build the result tree that includes all the paths from the read_state request, with all the pruned branches replaced by their hashes.
  4. Combine the result tree with the threshold signature to form a full certified reply.

The tree that you'll get back will look something like this:

The logical structure of a tree containing a response to an ingress message.

reply
reply
"IC is fun!"
"IC is fun!"
1355..48de
1355..48de
<pruned>
<pruned>
<pruned>
<pruned>
request_status
request_status
pruned state tree
pruned state tree
<pruned>
<pruned>
<pruned>
<pruned>
Viewer does not support full SVG 1.1

Even though the pruned tree is much smaller than the full state tree, both trees have exactly the same root hash. So we can validate the authenticity of the pruned tree using the threshold signature that consensus collected for the root hash of the full state tree.

State transfer

State as an artifact

As we discussed in the checkpoints section, replicas periodically persist snapshots of its state to disk. The main purpose of these snapshots is to speed up state recovery. If a replica was out for a brief period of time, it can use its own checkpoint to recover more quickly than replaying all the blocks starting from the genesis. Load the checkpoint, replay a few blocks, and you're ready to rock. There is a more interesting case, however: a healthy replica can help other replicas catch up by sending them a recent checkpoint.

Replicas in a subnet communicate by exchanging artifacts using a peer-to-peer protocol. Most of these artifacts (e.g., user ingress messages, random beacons, state certifications) are relatively small, up to a few megabytes in size. But the machinery for artifact transfer is quite general: the protocol supports fetching arbitrary large artifacts by slicing them into chunks, provided that there is a way to authenticate each chunk independently. Furthermore, multiple chunks can be fetched in parallel from multiple peers. Sounds a lot like BitTorrent, isn't it?

Before advertising a checkpoint, replica computes a manifest for that checkpoint. Manifest is an inventory of files constituting a checkpoint. Files are sliced into chunks, and the manifest enumerates paths, sizes and cryptographic hashes of every file and every chunk of each file. In our BitTorrent analogy, manifest plays a role of a .torrent file. If we have a manifest, we know for sure how much data we need to fetch to construct a checkpoint, and how to arrange this data. Hashes of file chunks in the manifest allow us to validate each chunk independently before we put it on disk. Replicas use the hash of the manifest itself when they advertise a checkpoint in the peer-to-peer network.

A replica advertising a checkpoint as an artifact.

C9
C9
S9
S9
/canisters/.../queues.pb
/canisters/.../queues.pb
/canisters/.../memory.bin
/canisters/.../memory.bin
0
0
0
0
1
1
2
2
manifest (h=9)
manifest (h=9)
advertize(h=9, full state hash)
full state hash
full state hash
Node 2
Node 2
Node 3
Node 3
Node 4
Node 4
Viewer does not support full SVG 1.1

Triggering state transfer

Let's assume that we have a replica that needs to fetch the latest checkpoint. It listens to the peers, and discovers a few state artifacts with different hashes advertised by different peers. How does our poor replica decide which state it needs to fetch?

As you might have guessed, the consensus subsystem armed with threshold signatures comes to the rescue again. Replicas gather a threshold signature on a full state hash and use that signature as a proof of checkpoint authenticity. The result is an artifact containing a state height, a full state hash, and a threshold signature. We'll call this artifact a catch-up package.

The interaction between the replica consensus module and the state machine is something like the following

  1. Consensus sees a catch-up package for state 100 with a valid threshold signature and the state hash is H100. Consensus asks the state machine "Hey, what's your state height?".
  2. State machine: "It's nine. Why?"
  3. Consensus: "We're missing out. Fetch the checkpoint for state 100, but only if it has root hash H100."
  4. State machine: "Sure, I'm on it." The state machine starts looking for state artifact advertisements with a matching hash.

Yes, the consensus module can be a bit bossy sometimes, but it always acts with the best of intentions.

Fetching states incrementally

Let's now have a brief look at the most juicy part of state transfer, the actual state fetch protocol. Let's suppose that we have a replica that has state 9 and it wants to catch up to state 100 with hash H.

  1. The replica receives advertisements for checkpoint artifacts from other peers and picks the peers that advertize the state with the hash H.
  2. The replica fetches the manifest of checkpoint 100 from one of the peers and validates that the manifest hash is indeed H.
  3. The replica compares the manifest of checkpoint 9 that it has locally to the manifest of checkpoint 100.
  4. The replica copies all the chunks with the matching hashes from the old checkpoint into the new one. Why waste network bandwidth and fetch data you already have?
  5. The replica fetches all the missing chunks from the peers, validates them against the manifest, and puts on disk them where they belong.

When there are no more chunks to fetch, checkpoint 100 is complete, and the replica is ready to go.

A replica constructing a fresh checkpoint by re-using existing chunks and fetching the missing ones.

C100
C100
S100
S100
/canisters/.../queues.pb
/canisters/.../queues.pb
/canisters/.../memory.bin
/canisters/.../memory.bin
0
0
0
0
1
1
2
2
manifest (h=100)
manifest (h=100)
Network
Network
Up-to-date node
Up-to-date...
C100
C100
S100
S100
/canisters/.../queues.pb
/canisters/.../queues.pb
/canisters/.../memory.bin
/canisters/.../memory.bin
0
0
0
0
1
1
2
2
manifest (h=9)
manifest (h=9)
C9
C9
0
0
0
0
1
1
manifest (h=100)
manifest (h=100)
Catching-up node
Catching-up...
Viewer does not support full SVG 1.1

As you can see, the state transfer procedure is incremental: if the catching-up replica was offline for a brief period of time, it needs to fetch only the data that actually changed in the meantime. Of course, a replica that has no checkpoints at all will have to fetch all the chunks to construct its first checkpoint.

Conclusion

In this article, we

This concludes our overview of how the IC implements state machine replication on the scale of a single subnet. From that prospective, the IC as a whole is really a swarm of replicated state machines!

I made a few simplifications and omitted a lot of details to keep us focused on the replication aspect. For example, we didn't look at how different subnets communicate with one another, how individual worker bees form the swarm. This is a great topic that deserves an article of its own, so stay tuned!