IC internals: XNet protocol

Published:


Introduction

This article continues the previous post on state machine replication in the Internet Computer (IC), A swarm of replicated state machines. This time, we shall examine the protocol over which independent state machines (also known as subnets) communicate, the XNet protocol.

Subnets

A subnet is a collection of nodes participating in a single instance of the consensus protocol. All the nodes in a subnet have the same state and apply the same blocks.

It might be tempting to think that nodes of a subnet are physically collocated. However, the opposite is usually true: subnet nodes are often geographically distributed across independent data centersThere are still good reasons to assign multiple nodes in a data center to the same subnet, such as increasing query call capacity and speeding up recovery in case of a replica restart.. The goal is to improve availability: a disaster in a single data center cannot take down the entire subnet. The registry canister maintains the assignment of nodes to physical machines, and the Network Nervous System governs all the changes to the registry.

Nodes from different subnets can live in the same data center. This setup might be counter-intuitive: nodes from different subnets might sometimes have better network connectivity than nodes in the same subnet.

Two three-node subnets located in three data centers. Solid lines represent peer-to-peer communication within a subnet; dotted lines — cross-subnet communication. Not all XNet connections are present in the picture.

DC 1
xnet
xnet
X1
X1
Y1
Y1
DC 3
xnet
xnet
X3
X3
Y3
Y3
DC 2
xnet
xnet
X2
X2
Y2
Y2
consensus
consensus
consensus
consensus
consensus
consensus
Text is not SVG - cannot display

XNet protocol is not limited to the same-datacenter communication; any node from one subnet can talk to any other node from another subnet. However, the replica prefers contacting closeThe replica uses network latency as the proximity measure. It randomly contacts nodes from other subnets, assigning higher weights to nodes with lower latency. peers to reduce the message delivery latency.

Message streams

Messages start their journey in the output queues of canisters hosted on the source subnet. The subnet sorts these messages based on the destination subnet and interleaves them into flat message streams (one stream per destination subnet). Each message in the stream gets a unique monotonically increasing index.

The flow of messages within a subnet. Canisters push messages into their output queues; the stream builder picks up these messages and constructs a flat outgoing message stream for each subnet.

Routing Table
A → S1
A → S1
B → S2
B → S2
C → S3
C → S3
D → S1
D → S1
E → S3
E → S3
Queue A → B
AB2
AB2
AB3
AB3
AB1
AB1
Queue D → E
DE2
DE2
DE1
DE1
DE3
DE3
Queue A → C
AC1
AC1
AC2
AC2
AC3
AC3
Stream Builder
Stream Builder
Stream S1 → S2
AB1
AB1
1
1
AB2
AB2
2
2
AB3
AB3
3
3
Stream S1 → S3
DE3
DE3
6
6
AC3
AC3
5
5
DE2
DE2
4
4
AC1
AC1
1
1
DE1
DE1
2
2
AC2
AC2
3
3
A
A
D
D
Text is not SVG - cannot display

The component merging the queues (aptly called stream builder) should satisfy a few constraints:

Determinism
All nodes in the subnet must agree on the exact contents of the stream. To do its job, the stream builder needs the mapping from canister identifiers to subnet identifiers, the routing table. The routing table comes from the registry and changes over time. To ensure determinism, each block pins the registry version that the state machine is using for message processing.
Ordering
If canister A sends two requests to canister B, R1, and R2, then R1 should appear before R2 in the stream.
Fairness
We do not want messages from a single chatty canister to dominate the stream. The stream builder tries to interleave messages so that each canister has the same bandwidth.

When a node from subnet X decides to get new messages from subnet Y, it picks a node from Y (the registry stores the entire network topology) and calls Y's XNet endpoint with the index of the first unseen message and the number of messages to fetch.

Block payloads

The job of the consensus protocol is to aggregate messages from the outside world and pack them into a neat block. Consensus includes several types of messages into blocks, such as user ingress messages, Bitcoin transactions (for subnets with enabled Bitcoin integration), and inter-canister messages from other subnets. We call messages paired with the data required for their validation a payload. We call components that pull payloads from the network payload builders.

The consensus algorithm aggregates messages from the outside world into blocks.

Ingress payload
Ingress payload
XNet payload
XNet payload
Bitcoin payload
Bitcoin payload
Users
Users
Other subnets
Other subnets
Bitcoin adapter
Bitcoin adapter
Consensus
Consensus
Ingress payload
Ingress payload
XNet payload
XNet payload
Bitcoin payload
Bitcoin payload
Block
Block
Text is not SVG - cannot display

XNet payload builder pulls messages from nodes assigned to other subnets using a simple HTTP protocol. XNet endpoint is a component that serves messages destined for other subnets over secure TLS connections, accepting connections only from other nodesThis measure does not imply privacy because malicious nodes can access the data; but ensures that network providers cannot read the messages.. XNet endpoint fetches the complete list of nodes, their subnet assignment, IP addresses, and public keys (required to establish a TLS connection) from the registry.

The data flow in the XNet protocol. Subnet Y produces a signed stream of messages for subnet X and exposes this stream via an HTTP endpoint. Subnet X pulls messages from one of the nodes on subnet Y.

Subnet X
Payload Builder
Payload Builder
XNet payload
XNet payload
Consesus
Consesus
Subnet Y
XNet endpoint
XNet endpoint
Certified stream
Certified stream
State Machine
State Machine
1
1
2
2
3
3
4
4
5
5
6
6
https
https
Text is not SVG - cannot display

Garbage collection

We now know how one subnet accumulates messages destined for another subnet. This knowledge begs another question: how do replicas remove messages that the destination subnet has already consumed? We need a feedback mechanism allowing the consumer subnet to tell the producer subnet that it does not need some stream prefix anymore. We call this mechanism signals.

Signals are a part of the XNet payload specifying the prefix of the reverse stream that the sending subnet can drop. When a node from subnet X fetches XNet payload from a node from subnet Y, in addition to actual messages, the Y node includes the stream header. The stream header describes the state of the X ↔ Y communication:

Signals solve the issue of collecting obsolete messages, but they introduce another problem now we also need to garbage-collect signals! Luckily, we already have all the information we need: we keep signals only for messages that are still present in the reverse stream. Once we notice (by looking at the range of message indices) that the remote subnet dropped messages from its stream, we can remove the corresponding signals from our header.

From my experience, the interplay of message indices and signals is notoriously hard to grasp, so let us look at an example. The diagram below depicts two subnets, X and Y, caught in the middle of communication. Subnet X gets a prefix of Y's stream and the header, allowing X to induct new messages and garbage-collect its messages and signals. X also publishes signals for the newly received messages and updates its indices accordingly so that Y could garbage-collect its messages and signals in the next round of communication.

Two subnets, X and Y, communicating over the XNet protocol. In previous communications, subnet Y consumed messages [0,10) from X's stream and has recently received messages 10 and 11. Now, subnet X receives a prefix of Y's stream and the matching header and removes messages 10 and 11 from its stream because Y will not need them anymore. X also includes signals for the newly received messages into the X → Y stream header and removes an obsolete signal for message Y1 it consumed before.

Stream Y → X
Y2
Y2
Y3
Y3
Y4
Y4
Stream X → Y
X12
X12
X13
X13
X10
X10
X11
X11
Y → X header
indices = [2, 5)
indices = [2, 5)
signals = { 10: ACK, 11: ACK }
signals = { 10: ACK, 11: ACK }
X → Y header
indices = [10, 14)
indices = [10, 14)
signals = { 1: ACK }
signals = { 1: ACK }
XNet
XNet
Stream Y → X prefix
Y2
Y2
Y3
Y3
Y → X header
indices = [2, 5)
indices = [2, 5)
signals = { 10: ACK, 11: ACK }
signals = { 10: ACK, 11: ACK }
XNet payload
XNet paylo...
Stream X → Y
X12
X12
X13
X13
X → Y header
signals = { 2: ACK, 3: ACK}
signals = { 2: ACK, 3: ACK}
indices = [12, 14)
indices = [12, 14)
State Machine
State Machine
Text is not SVG - cannot display

Signals and stream messages are like snakes eating each other's tails on the Auryn: the sender drops messages when it sees signals, and the receiver drops signals when it sees stream bounds advancing.

Stream certification

Since no two nodes in the network can blindly trust each other, we need a mechanism to validate the contents of XNet payloads. More specifically, we want proof that the stream prefix in the payload is the same on all honest nodes in the subnet that produced the stream. In the previous article, we have seen how nodes use threshold signatures on state trees as authenticity proofs for call replies. The XNet protocol relies on the same trick: nodes include streams destined for other subnets into their state trees and use threshold signatures as proofs of authenticity for nodes from other subnets.

The encoding of subnet message streams in the state tree.

header
header
indices
signals
indices...
<subnet_id>
<subnet_id>
<pruned>
<pruned>
<pruned>
<pruned>
stream
stream
messages
messages
M12
M12
<pruned>
<pruned>
12
12
M13
M13
13
13
Text is not SVG - cannot display

Conceptually, a node requesting messages from a subnet is not different from a client requesting a response. This similarity allows us to use the sameCurrently, there is a minor difference in how we represent certificates in the XNet protocol and the IC HTTP interface. The HTTP interface certificates combine the data and the hashes required to check the authenticity in a single data structure, the hash tree. The XNet protocol separates the raw message data and the hashes (called the witness) into separate data structures. The reason for that distinction is purely historical. We implemented the XNet protocol significantly earlier than the response authentication, so we could not benefit from Joachim Breitner's brilliance. Joachim took the XNet authentication scheme as the starting point and simplified it for the IC Interface Specification. authentication mechanism in both cases.

The tree structure of the certificate comes in handy for buffering messages on the receiver: the receiver can maintain a slightly larger pool of messages than a single block can fit, asynchronously fetching new messages and appending them to the pool, merging the certificates appropriately. This optimization allows us to use the subnet's narrow XNet channel more efficiently and improve fairness when we include messages from multiple subnets into a single block.

Limitations

The XNet protocol is a marvelous feat of engineering, but it has a few inherent limitations:

Code references

A few pointers to the code implementing designs from this article:

Credits

Finally, I shall give due credit to people who played an essential role in the protocol development.

My main contributions are stream certification and TLS integration.