IC internals: XNet protocol
Published: 2022-07-01 Last updated: 2022-07-01
- Message streams
- Block payloads
- Garbage collection
- Stream certification
- Code references
This article continues the previous post on state machine replication in the Internet Computer (IC), A swarm of replicated state machines. This time, we shall examine the protocol over which independent state machines (also known as subnets) communicate, the XNet protocol.
A subnet is a collection of nodes participating in a single instance of the consensus protocol. All the nodes in a subnet have the same state and apply the same blocks.
It might be tempting to think that nodes of a subnet are physically collocated. However, the opposite is usually true: subnet nodes are often geographically distributed across independent data centersThere are still good reasons to assign multiple nodes in a data center to the same subnet, such as increasing query call capacity and speeding up recovery in case of a replica restart.. The goal is to improve availability: a disaster in a single data center cannot take down the entire subnet. The registry canister maintains the assignment of nodes to physical machines, and the Network Nervous System governs all the changes to the registry.
Nodes from different subnets can live in the same data center. This setup might be counter-intuitive: nodes from different subnets might sometimes have better network connectivity than nodes in the same subnet.
XNet protocol is not limited to the same-datacenter communication; any node from one subnet can talk to any other node from another subnet. However, the replica prefers contacting closeThe replica uses network latency as the proximity measure. It randomly contacts nodes from other subnets, assigning higher weights to nodes with lower latency. peers to reduce the message delivery latency.
Messages start their journey in the output queues of canisters hosted on the source subnet. The subnet sorts these messages based on the destination subnet and interleaves them into flat message streams (one stream per destination subnet). Each message in the stream gets a unique monotonically increasing index.
The component merging the queues (aptly called stream builder) should satisfy a few constraints:
- All nodes in the subnet must agree on the exact contents of the stream. To do its job, the stream builder needs the mapping from canister identifiers to subnet identifiers, the routing table. The routing table comes from the registry and changes over time. To ensure determinism, each block pins the registry version that the state machine is using for message processing.
- If canister A sends two requests to canister B, R1, and R2, then R1 should appear before R2 in the stream.
- We do not want messages from a single chatty canister to dominate the stream. The stream builder tries to interleave messages so that each canister has the same bandwidth.
When a node from subnet X decides to get new messages from subnet Y, it picks a node from Y (the registry stores the entire network topology) and calls Y's XNet endpoint with the index of the first unseen message and the number of messages to fetch.
The job of the consensus protocol is to aggregate messages from the outside world and pack them into a neat block. Consensus includes several types of messages into blocks, such as user ingress messages, Bitcoin transactions (for subnets with enabled Bitcoin integration), and inter-canister messages from other subnets. We call messages paired with the data required for their validation a payload. We call components that pull payloads from the network payload builders.
XNet payload builder pulls messages from nodes assigned to other subnets using a simple HTTP protocol. XNet endpoint is a component that serves messages destined for other subnets over secure TLS connections, accepting connections only from other nodesThis measure does not imply privacy because malicious nodes can access the data; but ensures that network providers cannot read the messages.. XNet endpoint fetches the complete list of nodes, their subnet assignment, IP addresses, and public keys (required to establish a TLS connection) from the registry.
We now know how one subnet accumulates messages destined for another subnet. This knowledge begs another question: how do replicas remove messages that the destination subnet has already consumed? We need a feedback mechanism allowing the consumer subnet to tell the producer subnet that it does not need some stream prefix anymore. We call this mechanism signals.
Signals are a part of the XNet payload specifying the prefix of the reverse stream that the sending subnet can drop. When a node from subnet X fetches XNet payload from a node from subnet Y, in addition to actual messages, the Y node includes the stream header. The stream header describes the state of the X ↔ Y communication:
- The full range of message indices in the forward stream X → Y.
- The signals for the reverse stream (Y → X): for each message index in the reverse stream, Y tells whether X can garbage collect the message (an
ACKsignal) or should reroute the message (a
REJECTsignal indicates that the destination canister moved, so X should route the message into another stream.
Signals solve the issue of collecting obsolete messages, but they introduce another problem now we also need to garbage-collect signals! Luckily, we already have all the information we need: we keep signals only for messages that are still present in the reverse stream. Once we notice (by looking at the range of message indices) that the remote subnet dropped messages from its stream, we can remove the corresponding signals from our header.
From my experience, the interplay of message indices and signals is notoriously hard to grasp, so let us look at an example. The diagram below depicts two subnets, X and Y, caught in the middle of communication. Subnet X gets a prefix of Y's stream and the header, allowing X to induct new messages and garbage-collect its messages and signals. X also publishes signals for the newly received messages and updates its indices accordingly so that Y could garbage-collect its messages and signals in the next round of communication.
Signals and stream messages are like snakes eating each other's tails on the Auryn: the sender drops messages when it sees signals, and the receiver drops signals when it sees stream bounds advancing.
Since no two nodes in the network can blindly trust each other, we need a mechanism to validate the contents of XNet payloads. More specifically, we want proof that the stream prefix in the payload is the same on all honest nodes in the subnet that produced the stream. In the previous article, we have seen how nodes use threshold signatures on state trees as authenticity proofs for call replies. The XNet protocol relies on the same trick: nodes include streams destined for other subnets into their state trees and use threshold signatures as proofs of authenticity for nodes from other subnets.
Conceptually, a node requesting messages from a subnet is not different from a client requesting a response. This similarity allows us to use the sameCurrently, there is a minor difference in how we represent certificates in the XNet protocol and the IC HTTP interface. The HTTP interface certificates combine the data and the hashes required to check the authenticity in a single data structure, the hash tree. The XNet protocol separates the raw message data and the hashes (called the witness) into separate data structures. The reason for that distinction is purely historical. We implemented the XNet protocol significantly earlier than the response authentication, so we could not benefit from Joachim Breitner's brilliance. Joachim took the XNet authentication scheme as the starting point and simplified it for the IC Interface Specification. authentication mechanism in both cases.
The tree structure of the certificate comes in handy for buffering messages on the receiver: the receiver can maintain a slightly larger pool of messages than a single block can fit, asynchronously fetching new messages and appending them to the pool, merging the certificates appropriately. This optimization allows us to use the subnet's narrow XNet channel more efficiently and improve fairness when we include messages from multiple subnets into a single block.
The XNet protocol is a marvelous feat of engineering, but it has a few inherent limitations:
- All XNet messages have to go through consensus blocks. This constraint bounds the theoretical throughput of the protocol by the block rate and size. For example, if a subnet produces one block every second and the block size is 2MiB, the XNet throughput on this subnet can be at most 2MiB/sec.
- Delivering a message requires two rounds of consensus (one round on the sender, one round on the receiver). This constraint bounds the latency of the protocol by the block rate. If both the receiving and the sending subnet need a second to produce a block, a request will need at least two seconds to reach the destination. Delivering the response will need another two rounds, so the call round-trip time is at least four seconds.
A few pointers to the code implementing designs from this article:
- The routing table.
- The stream builder.
- The payload builder and the XNet message buffer.
- The XNet endpoint and the node proximity metric computation.
- The induction of messages into the state machine state.
- The garbage collection of messages and signals.
- The mapping of streams to the tree structure, tree encoding, and validation.
Finally, I shall give due credit to people who played an essential role in the protocol development.
- Allen Clement is the mastermind behind most of the features of the XNet protocol.
- David Derler refined many of Allen's ideas and wrote a formal mathematical specification for the protocol.
- Alin Sinpalean implemented the protocol in the IC replica and developed many optimizations.
My main contributions are stream certification and TLS integration.