Effective Rust canisters

✏ 2021-10-25 ✂ 2022-10-16

How to read this document
Code organization
Optimization
- Reducing cycle consumption
- Reducing module size
Infrastructure
References
Changelog

How to read this document

This document is a compilation of useful patterns and typical pitfalls I observed in the Rust code running on the Internet Computer (IC). Take everything in this document with a grain of salt: solutions that worked well for my problems might be suboptimal for yours. Every piece of advice comes with explanations to help you form your judgment.

Some recommendations might change if I discover better patterns or if the state of the ecosystem improves. I will try to keep this document up to date.

Code organization

Canister state

The standard IC canister organization forces developers to use a mutable global state. Rust intentionally makes using global mutable variables hard, giving you a few options for organizing your code. Which option is the best?

☛Use thread_local! with Cell/RefCell for state variables.

This option is the safest. It will help you avoid memory corruption and issues with asynchrony.

thread_local! {
    static NEXT_USER_ID: Cell<u64> = Cell::new(0);
    static ACTIVE_USERS: RefCell<UserMap> = RefCell::new(UserMap::new());
}

Let us look at other options you might find in the wild and see what is wrong with them.

```
let state = ic_cdk::storage::get_mut<MyState>;
```
The Rust CDK used to provide a storage abstraction that allows you to get a value indexed by a type. The interface lets you obtain multiple non-exclusive mutable references to the same object, breaking language guarantees. The CDK team removed the storage API in version 0.5.0.
```
static mut STATE: Option<State> = None;
```
Plain old global variables. This approach forces you to write boilerplate code to access the global state and suffers from the same safety issues as legacy storage API.
```
lazy_static! {
    static STATE: RwLock<MyState> = RwLock::new(MyState::new());
}
```
This approach is memory-safe, although I find it confusing. Canisters cannot run multiple threads, so it is not obvious what happens if you try to obtain a lock twice. A failure to obtain a lock makes your canister trap, not block, as you might expect on most platforms. This distinction means your program’s meaning changes depending on the compilation target (wasm32-unknown-unknown vs. native code). We do not want that.

Let us see how non-exclusive mutable references can lead to hard-to-track bugs.

#[update]
fn register_friend(uid: UserId, friend: User) -> Result<UserId, Error> {
    let mut_user_ref = storage::get_mut<Users>() ①
                           .find_mut(uid)
                           .ok_or(Error::NotFound)?;

    let friend_id = storage::get_mut<Users>().add_user(&friend); ②

    mut_user_ref.friends.insert(friend_id); ③

    Ok(friend_id)
}

The example shows a function that uses the storage API, but plain old mutable globals cause the same issue.

We get a mutable reference pointing into our data structure.
We call a function that modifies the data structure. This call might have invalidated the reference we obtained in step ①. The reference could now be pointing to garbage or into the middle of another valid object.
We use the original mutable reference to modify the object, potentially corrupting the heap.

The real-life code might be more complicated. The undesired mutation might happen deep in the function call stack. The issue can stay undetected until your canister is in active use, storing (and corrupting) user data. If we used a RefCell, the code would panic before we shipped it.

It should now be clear how to declare global variables. Let us discuss where to put them.

☛Put all your globals in one basket.

Consider making all the global variables private and placing them in a single file, the canister main file. This approach has a few benefits:

Testing becomes easier because most of your code does not touch the globals.
You can see global state usage patterns at a glance. For example, you can quickly validate that the canister persists all the stable data across upgrades.

Consider also adding comments clarifying which variables are stable, like in the following example:

thread_local! {
    /* stable   ① */ static USERS: RefCell<Users> = ... ;
    /* flexible ② */ static LAST_ACTIVE: Cell<UserId> = ...;
}

I borrowed Motoko terminology here:

The system preserves stable variables across upgrades. For example, a user database should probably be stable.
The system discards flexible variables on code upgrades. For example, you can make a cache flexible if it is not crucial for your canister.

If you have tried to test canister code, you probably noticed that this part of the development workflow is not polished yet. A quick way to make your life easier is to piggyback on the existing Rust infrastructure. This trick is possible only if you can compile the same canister code to a native target and WebAssembly.

☛Make most of the canister code target-independent.

It pays off to factor most of the canister code into loosely coupled modules and packages and to test them independently. Most of the code that depends on the System API should live in the main file.

You can also create thin abstractions for the System API and test your code with a fake but faithful implementation. For example, you could use the following trait to abstract the Stable Memory API

pub trait Memory {
    fn size(&self) -> WasmPages;
    fn grow(&self, pages: WasmPages) -> WasmPages;
    fn read(&self, offset: u64, dst: &mut [u8]);
    fn write(&self, offset: u64, src: &[u8]);
}

Asynchrony

If a canister traps or panics, the system rolls back the state of the canister to the latest working snapshot This system behavior is part of the orthogonal persistence feature. . If a canister makes a call and then traps in the callback, the canister might never release the resources allocated for the call.

☛Avoid panics after await.

Let us start with an example.

#[update]
async fn update_avatar(user_id: UserId, pic: ByteBuf ① ) {
    let key = store_async(user_id, &pic)
                  .await      ②
                  .unwrap();  ③
    USERS.with(|users| set_avatar_key(user_id, key));
}

The method receives a byte buffer with an avatar picture.
The method issues a call to the storage canister. The call allocates a future on the heap, capturing the byte buffer.
If the call fails, the canister panics. The system rolls back the canister state to the snapshot it created right before the callback invocation. From the canister’s point of view, it still waits for the reply and keeps the future and the buffer on the heap.

Note that there is no memory corruption. The canister is still in a valid state but will not release the buffer memory until the next upgrade.

The System API provides the ic0.call_on_cleanup function to address this issue. Rust CDK versions 0.5.1 and higher take advantage of this mechanism and release resources across await boundaries. I still recommend using explicit error handling instead of panics whenever possible.

Another problem you might experience with asynchrony and miss in tests is a future that has exclusive access to a resource for a long time.

☛Don’t lock shared resources across await boundaries.

#[update]
async fn refresh_profile_bad(user_id: UserId) {
   let users = USERS_LOCK.write().unwrap(); ①
   if let Some(user) = users.find_mut(user_id) {
       if let Ok(profile) = async_get_profile(user_id).await { ②
           user.profile = profile;
       }
   }
}

#[update]
fn add_user(user: User) {
    let users = USERS_LOCK.write().unwrap(); ③
    // …
}

We obtain exclusive access to the users map and make an async call.
The system commits the canister state after the call suspends. The user map stays locked.
Other methods accessing the map will panic until the call started in step ② completes.

This issue becomes quite nasty when combined with panics. If you lock a resource and panic after the await, the resource might stay locked foreverAs noted in the previous section, Rust CDK version 0.5.1 addresses this issue..

We’re now ready to appreciate another benefit of using thread_local! for global variables. The code above wouldn’t have compiled if we used thread_local!. You cannot await in closure accessing thread-local variables:

#[update]
async fn refresh_profile(user_id: UserId) {
    USERS.with(|users| {
        if let Some(user) = users.borrow_mut().find_mut(user_id) {
            if let Ok(profile) = async_get_profile(user_id).await {
                // The closure is synchronous, cannot await ^^^
                // …
            }
        }
    });
}

The compiler nudges you to write a less elegant but correct version:

#[update]
async fn refresh_profile(user_id: UserId) {
    if !USERS.with(|users| users.borrow().has_user(user_id)) {
        return;
    }
    if let Ok(profile) = async_get_profile(user_id).await {
        USERS.with(|users| {
            if let Ok(user) = users.borrow_mut().find_user(user_id) {
                user.profile = profile;
            }
        })
    }
}

Canister interfaces

Many people enjoy the Motoko compiler’s code-first approach: you write an actor with public functions, and the compiler automatically generates the corresponding Candid file. This feature is indispensable in the early stages of development.

Canister with clients should follow the reverse pattern: the Candid file should be the source of truth, not the canister implementation.

☛Make your .did file the source of truth.

Your Candid file is the primary documentation source for people interacting with your canister (including your team members working on the front end). The interface should be stable, easy to find, and well-documented.

type TransferError = variant {
  // The debit account didn't have enough funds
  // for completing the transaction.
  InsufficientFunds : Balance;
  // …
};

type TransferResult =
  variant { Ok : BlockHeight; Err : TransferError; };

service {
  // Transfer funds between accounts.
  transfer : (TransferArgs) -> (TransferResult);
}

The Candid package provides tools to help you keep your implementation and the public interface in sync:

Annotate your canister methods with the candid_method macro.
Use the export_service macro to extract your canister’s effective Candid interface.
Call the service_compatible function to check whether the effective interface is a subtype of the interface from the .did file.

use candid::candid_method;
use ic_cdk_macros::update;

#[update]
#[candid_method(update)] ①
async fn transfer(arg: TransferArg) -> Result<Nat, TransferError> {
  // …
}

#[test]
fn check_candid_interface() {
  use candid::utils::{service_compatible, CandidSource};
  use std::path::Path;

  candid::export_service!(); ②
  let new_interface = __export_service();

  service_compatible( ③
    CandidSource::Text(&new_interface),
    CandidSource::File(Path::new("interface.did")),
  ).unwrap();
}

☛Use variant types to indicate error cases.

Just as Rust error types simplify error handling, Candid variants can help your clients gracefully handle edge cases. Variant types are also the preferred way of reporting errors in Motoko.

type CreateEntityResult = variant {
  Ok  : record { entity_id : EntityId; };
  Err : opt variant {
    EntityAlreadyExists : null;
    NoSpaceLeftInThisShard : null;
  }
};

service : {
  create_entity : (EntityParams) -> (CreateEntityResult);
}

Note that even if a service method returns a result type, it can still reject the call. There is not much benefit from adding error variants such as InvalidArgument or Unauthorized. There is no meaningful way to recover from such errors programmatically. In most cases, rejecting malformed, invalid, or unauthorized requests is the right thing to do.

So you followed the advice and represented your errors as a variant. How do you add more error constructors as your interface evolves?

☛Make your variant types extensible.

Candid variant types are tricky to evolve in a backward-compatible manner. One approach is to make the variant field optional:

type CreateEntityResult = variant {
  Ok : record { /* */ };
  Err : opt variant { /* * /}
};

If some clients of your canister use an outdated version of your interface, the Candid decoder could replace unknown constructors with a null. This approach has two main issues:

The Candid decoder does not yet implement this magic (see dfinity/candid#295).
Diagnosing a problem if all you see is null is daunting.

An alternative is to make your error type immutable and rely on a loosely typed catch-all case (and documentation) for extensibility.

type CreateEntityResult = variant {
  Ok : record { /* */ };
  Err : variant {
    EntityAlreadyExists : null;
    NoSpaceLeftInThisShard : null;
    // Currently defined errors
    // ========================
    // error_code = 401 : Unauthorized.
    // error_code = 429 : Too many requests.
    // error_code = 503 : Canister overloaded.
    Other : record { error_code : nat; error_message : text }
  }
};

If you follow this approach, your clients will see a nice textual description if they experience a newly introduced error. Unfortunately, programmatically handling generic errors is more cumbersome and error-prone than well-typed extensible variants.

Optimization

Reducing cycle consumption

The first step towards an optimized system is profiling.

☛Measure the number of instructions your endpoints consume.

The instruction_counter API will tell you the number of instructions your code consumed since the last entry point. Instructions are the internal currency of the IC runtime. One IC instruction is the quantum of work that the system can do, such as loading a 32-bit integer from a memory address. The system assigns an instruction cost equivalent to each WebAssembly instruction and system call. It also defines all its limits in terms of instructions. As of July 2022, these limits are:

One message execution: 5 billion instructions.
One roundEach block produced by consensus initiates a round of execution.: 7 billion instructions.
Canister upgrade: 200 billion instructions.

Instructions are not cycles, but there is a simple linear function that converts instructions to cycles. As of July 2022, ten instructions are equivalent to four cycles on an application subnet.

Note that the value that performance_counter returns has meaning only within a single execution. You should not compare values of the instruction counter measured across async boundaries.

#[update]
async fn transfer(from: Account, to: Account, amount: Nat) -> Result<TxId, Error> {
  let start = ic_cdk::api::instruction_counter();

  let tx = apply_transfer(from, to, amount)?;
  let tx_id = archive_transaction(tx).await?;

  // BAD: the await point above resets the instruction counter.
  let end = ic_cdk::api::instruction_counter();
  record_measurement(end - start);

  Ok(tx_id)
}

☛Encode byte arrays using the serde_bytes package.

Candid is the standard interface definition language on the IC. The Rust implementation of Candid relies on a popular serde framework and inherits all of serde’s quirks. One such quirk is the inefficient encoding of byte arrays (Vec<u8> and [u8]) in most serialization formats. Due to Rust limitations, serde cannot treat byte arrays specially and encodes each byte as a separate element in a generic array, increasing the number of instructions required to encode or decode the message (often by a factor of ten or more).

The HttpResponse from the canister http protocol is a good example.

#[derive(CandidType, Deserialize)]
struct HttpResponse {
    status_code: u16,
    headers: Vec<(String, String)>,
    // BAD: inefficient
    body: Vec<u8>,
}

The body field can be large; let us tell serde to encode this field more efficiently using the with attribute.

#[derive(CandidType, Deserialize)]
struct HttpResponse {
    status_code: u16,
    headers: Vec<(String, String)>,
    // OK: encoded efficiently
    #[serde(with = "serde_bytes")]
    body: Vec<u8>,
}

Alternatively, we can use the ByteBuf type for this field.

#[derive(CandidType, Deserialize)]
struct HttpResponse {
    status_code: u16,
    headers: Vec<(String, String)>,
    // OK: also efficient
    body: serde_bytes::ByteBuf,
}

I wrote a tiny canister to measure the savings.

⊕ A canister endpoint measuring the number of instructions required to encode an HTTP response. We have to use a ManualReply to measure the encoding time.

#[query(manual_reply = true)]
fn http_response() -> ManualReply<HttpResponse> {
    let start = ic_cdk::api::instruction_counter();
    let reply = ManualReply::one(HttpResponse {
        status_code: 200,
        headers: vec![("Content-Length".to_string(), "1000000".to_string())],
        body: vec![0; 1_000_000],
    });
    let end = ic_cdk::api::instruction_counter();
    ic_cdk::api::print(format!("Consumed {} instructions", end - start));
    reply
}

The unoptimized version consumes 130 million instructions to encode one megabyte, and the version with serde_bytes needs only 12 million instructions.

In the case of the Internet Identity canister, this change alone reduced the instruction consumption in HTTP queries by order of magnitude. You should apply the same technique for all types deriving serde’s Serialize and Deserialize traits, not just for types you encode as Candid. A similar change boosted the ICP ledger archive upgrades (the canister uses CBOR for state serialization).

☛Avoid copying large values.

Experience shows that canisters spend a lot of their instructions copying bytes Spending a lot of time in memcpy and memset is a common trait of many WebAssembly programs. That observation led to the bulk memory operations proposal included in the WebAssembly 2.0 release.. Reducing the number of unnecessary copies often affects cycle consumption.

Let us imagine that we work on a canister that serves a single dynamic asset.

thread_local!{
    static ASSET: RefCell<Vec<u8>> = RefCell::new(init_asset());
}

#[derive(CandidType, Deserialize)]
struct HttpResponse {
    status_code: u16,
    headers: Vec<(String, String)>,
    #[serde(with = "serde_bytes")]
    body: Vec<u8>,
}

#[query]
fn http_request(_request: HttpRequest) -> HttpResponse {
    // NOTE: we are making a full copy of the asset.
    let body = ASSET.with(|cell| cell.borrow().clone());

    HttpResponse {
        status_code: 200,
        headers: vec![("Content-Length".to_string(), body.len().to_string())],
        body
    }
}

The http_request endpoint makes a deep copy of the asset for every request. This copy is unnecessary because the CDK encodes the response into the reply buffer as soon as the endpoint returns. There is no need for the encoder to own the body. The current macro API makes it unnecessarily hard to eliminate copies: the type of reply must have 'static lifetime. There are a few ways to work around this issue.

One solution is to wrap the asset body into a reference-counting smart pointer.

⊕ Using a reference-counting pointer for large values. Note that the type of the ASSET variable has to change: all copies of the data must be behind the smart pointer.

thread_local!{
    static ASSET: RefCell<RcBytes> = RefCell::new(init_asset());
}

struct RcBytes(Arc<serde_bytes::ByteBuf>);

impl CandidType for RcBytes { /* */ }
impl Deserialize for RcBytes { /* */ }

#[derive(CandidType, Deserialize)]
struct HttpResponse {
    status_code: u16,
    headers: Vec<(String, String)>,
    body: RcBytes,
}

With this approach, you can save on copies without changing the overall structure of your code. A similar change cut instruction consumption in the certified assets canister by 30%.

Another solution is to enrich your types with lifetimes and use the ManualReply API.

use std::borrow::Cow;
use serde_bytes::Bytes;

#[derive(CandidType, Deserialize)]
struct HttpResponse<'a> {
    status_code: u16,
    headers: Vec<(Cow<'a, str>, Cow<'a, str>)>,
    body: Cow<'a, serde_bytes::Bytes>,
}

#[query(manual_reply = true)]
fn http_response(_request: HttpRequest) -> ManualReply<HttpResponse<'static>> {
    ASSET.with(|asset| {
        let asset = &*asset.borrow();
        ic_cdk::api::call::reply((&HttpResponse {
            status_code: 200,
            headers: vec![(
                Cow::Borrowed("Content-Length"),
                Cow::Owned(asset.len().to_string()),
            )],
            body: Cow::Borrowed(Bytes::new(asset)),
        },));
    });
    ManualReply::empty()
}

This approach allows you to get rid of all the unnecessary copies, but it complicates the code significantly. You should prefer the reference-counting approach unless you have to work with data structures that already have explicit lifetimes (HashTree from the ic-certified-map package is a good example).

I experimented with a one-megabyte asset and measured that the original code relying on a deep copy consumed 16 million instructions. At the same time, versions with reference counting and explicit lifetimes needed only 12 million instructions The 25% improvement shows that our code does little but copy bytes. The code did at least three copies: ① from a thread_local to an HttpResponse, ② from the HttpResponse to candid’s value buffer, and ③ from candid’s value buffer to the call’s argument buffer. We removed ⅓ of copies and got ¼ improvement in instruction consumption. So only ¼ of our instructions contributed to work unrelated to copying the asset’s byte array. .

Reducing module size

By default, cargo spits out huge WebAssembly modules. Even the tiny counter canister compiles to a whopping 2.2MiB monster under the default cargo release profile. This section presents simple techniques for reducing canister sizes.

☛Compile canister modules with size and link-time optimizations.

The code that the Rust compiler considers fast is not always the most compact code. We can ask the compiler to optimize our code for size with the opt-level = 'z' option. Unfortunately, that option alone does not affect the counter canister module size.

Link-time optimization is a more aggressive option that asks the compiler to apply optimizations across module boundaries. This optimization slows down the compilation but its ability to prune unnecessary code is crucial for obtaining compact canister modules. Adding lto = true to the build profile shrinks the counter canister module from 2.2MiB to 820KiB. Add the following section to the Cargo.toml file at the root of your Rust project to enable size optimizations:

[profile.release]
lto = true
opt-level = 'z'

Another option you can play with is codegen-units. Decreasing this option reduces the parallelism in the code generation pipeline but enables the compiler to optimize even harder. Setting codegen-units = 1 in the cargo release profile shrinks the counter module size from 820KiB to 777KiB.

☛Strip off unused custom sections.

By default, the Rust compiler emits debugging information allowing tools to link back WebAssembly instructions to source-level constructs such as function names. This information spans several custom WebAssembly sections that the Rust compiler attaches to the module. Currently, there is no use for debugging information on the IC. You can safely remove unused sections using the ic-wasm tool.

$ cargo install ic-wasm
$ ic-wasm -o counter_optimized.wasm counter.wasm shrink

The ic-admin shrink step shrinks the counter canister size from 820KiB to 340KiB. ic-wasm is clever enough to preserve custom sections that the IC understands.

☛Use the twiggy tool to find the source of code bloat.

Some Rust language design choices (for example, monomorphization) trade execution speed for binary size. Sometimes changing the design of your code or switching a library can significantly reduce of the module size. As with any optimization process, you need a profiler to guide your experiments. The twiggy twiggy needs debug info to display function names. Run it before you shrink your module with ic-wasm. tool is excellent for finding the largest functions in your WebAssembly modules.

⊕ Top contributors to the size of the WebAssembly module of the counter canister. Custom sections with debugging information dominate the output, but we have to keep these sections to see function names in twiggy’s output. Serde-based candid deserializer is the worst offender when it comes to code size.

  $ cargo install twiggy
  $ twiggy top -n 12 counter.wasm
 Shallow Bytes │ Shallow % │ Item
───────────────┼───────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
        130610 ┊    16.42% ┊ custom section '.debug_str'
        101788 ┊    12.80% ┊ "function names" subsection
         75270 ┊     9.46% ┊ custom section '.debug_info'
         60862 ┊     7.65% ┊ custom section '.debug_line'
         52522 ┊     6.60% ┊ data[0]
         46581 ┊     5.86% ┊ custom section '.debug_pubnames'
         34800 ┊     4.38% ┊ custom section '.debug_ranges'
         15721 ┊     1.98% ┊ <&mut candid::de::Deserializer as serde::de::Deserializer>::deserialize_any::h6f19d3c43b6b4e95
         12878 ┊     1.62% ┊ <candid::binary_parser::ConsType as binread::BinRead>::read_options::hb957a7f286706947
         12546 ┊     1.58% ┊ candid::de::IDLDeserialize::new::h3afa758d80a71068
         11974 ┊     1.51% ┊ <&mut candid::de::Deserializer as serde::de::Deserializer>::deserialize_ignored_any::hb61449316ff3dae4
          9015 ┊     1.13% ┊ core::fmt::float::float_to_decimal_common_shortest::h1e6cfda96af3f1c0
        230729 ┊    29.01% ┊ ... and 1195 more.
        795296 ┊   100.00% ┊ Σ [1207 Total Rows]

Once you have identified the library that contributes to the code bloat the most, you can try to find a less bulky alternative. For example, I shrank the ICP ledger canister module by 600KiB by switching from serde_cbor to ciborium for CBOR deserialization.

☛GZip-compress canister modules.

The IC has the concept of a canister module, the equivalent of an executable file in operating systems. Starting from version 0.18.4 of the IC specification, canister modules can be not only binary-encoded WebAssembly files but also GZip-compressed WebAssembly files.

For typical WebAssembly files that do not embed compressed assets, GZip-compression can often cut the module size in half. Compressing the counter canister shrinks the module size from 340KiB to 115KiB (about 5% of the 2.2MiB module we started with!).

Infrastructure

Builds

People using your canister might want to verify that it does what it claims to do (especially if the canister moves people’s tokens around). The Internet Computer allows anyone to inspect the sha256 hash sum of the canister WebAssembly module. However, there are no good tools yet to review the canister’s source code. The developer is responsibile for providing a reproducible way of building a WebAssembly module from the published source code.

☛Make canister builds reproducible.

Getting a reproducible build by chance is about as likely as constructing a living organism by throwing random molecules together. At least two popular technologies can help you make your builds more reproducible: Linux containers and Nix. Containers are a more mainstream technology and are usually easier to set up, but Nix also has its share of fans. In my experience, Nix builds tend to be more reproducible. Use the technology with which you are most comfortable. It is the result that matters.

It also helps if you build your module using a public Continuous Integration system, making it easy to follow the module build steps and download the final artifact.

Finally, if your code is still evolving, make it easy for people to correlate module hashes with source code versions. You can mention the module hash in release notes, for example.

Read the Reproducible Canister Builds article for more advice on reproducible builds.

Upgrades

Let me remind you how upgrades work:

The system calls the pre_upgrade hook if your canister defines it.
The system discards canister memory and instantiates the new version of your module. The system preserves stable memory and makes it available to the next version.
The system calls the post_upgrade hook on the newly created instance if your canister defines it. The system does not execute the init function.

If the canister traps in any of the steps above, the system reverts the canister to the pre-upgrade state.

☛Plan for upgrades from day one.

You can live without upgrades during the initial development cycle, but even then losing state on each test deployment becomes annoying quickly. As soon as you deploy your canister to the mainnet, the only way to ship new code versions is to plan the upgrades carefully.

☛Version your stable memory.

You can view stable memory as a communication channel between your canister’s old and new versions. All proper communication protocols have a version. One day, you might want to change the stable data layout or serialization format radically. The code becomes messy and brittle if the stable memory decoding procedure needs to guess the data format.

Save your nerve cells and think about versioning in advance. It is as easy as declaring, the first byte of my stable memory is the version number.

☛Always test your upgrade hooks.

Testing upgrades is crucial. If they go wrong, you can lose your data irrevocably. Make sure that upgrade tests are an integral part of your infrastructure.

One approach to testing upgrades is to add an extra optional upgrade step before you execute the state validation part of your test. The following pseudo-code is in Rust, but the idea does not depend on the language.

let canister_id = install_canister(WASM);
populate_data(canister_id);
if should_upgrade { upgrade_canister(canister_id, WASM); }
let data = query_canister(canister_id);
assert_eq!(data, expected_value);

You then run your tests twice in different modes:

In the no upgrades mode, your tests run without executing any upgrades.
In the upgrade mode, your tests self-upgrade the canister before each assertion.

This pattern can give you some confidence that canister upgrades preserve the state: the users cannot tell whether there was an upgrade or not. Testing that you can safely upgrade the canister from the previous version is also a good idea.

☛Do not trap in the pre_upgrade hook.

The pre_upgrade and post_upgrade hooks appear to be symmetrical. The canister returns to the pre-upgrade state if either of these hooks traps. This symmetry is deceptive.

The hope is not lost if your pre_upgrade hook succeeds but the post_upgrade hook traps. You can figure out what went wrong and build another version of your canister that will not trap on upgrade. You might need to devise a complex multi-stage upgrade procedure, but at least there is a way out.

On the other hand, if your pre_upgrade hook traps, there is not much you can do about it. Changing canister behavior needs an upgrade, but that is what a broken pre_upgrade hook prevents you from doing.

The pre_upgrade hook will not let you down if you do not have one. The following advice will help you get rid of that hook.

☛Consider using stable memory as your main storage.

There is a cap on how many cycles a canister can burn during an upgrade. If your canister exceeds that limit, the system cancels the upgrade and reverts the canister state. If you serialize your whole state to stable memory in the pre_upgrade hook and the state grows large, you might not be able to upgrade your canister again.

One way of dealing with this issue is not to serialize the entire state in one go. You can use stable memory as your disk store, updating it incrementally with every update call. This way, you might not need the pre_upgrade hook, and your post_upgrade hook will burn few cycles.

There are a few downsides to this approach, however:

Organizing the flat address space of stable storage into a data structure is challenging, especially if your state consists of several interlinked data structures. The ic-stable-structures and ic-stable-memory packages attempt to alleviate the pain.
Changing the layout of your data might be infeasible. It will simply be too much work for a canister to complete the data migration in one go. Imagine writing a program that reformats an eight-gigabyte disk from fat32 to ntfs without losing any data. By the way, that program must complete in under 5 seconds.
You must think carefully about the backward compatibility of your data structures. The latest version of your canister might have to read data that the version installed a few months ago wrote.

There is a tough trade-off between service scalability and code simplicity. If you plan to store gigabytes of state and upgrade the code, consider using stable memory as the primary storage.

Observability

At dfinity, we use metrics extensively and monitor all our production services. Metrics are indispensable for understanding the behaviors of a complex distributed system. Canisters are not unique in this regard.

☛Expose metrics from your canister.

Let us look at two specific approaches you can take.

Expose a query call returning a data structure containing metrics. If you do not want to make the metrics public, you can reject queries based on the caller’s principal. The main benefit of this approach is that the response is highly structured and easy to parse. I often use this approach in integration tests.
```
pub struct MyMetrics {
  pub stable_memory_size: u32,
  pub allocated_bytes: u32,
  pub my_user_map_size: u64,
  pub last_upgraded_ts: u64,
}

#[query]
fn metrics() -> MyMetrics {
  check_acl();
  MyMetrics {
    // ...
  }
}
```

Expose the metrics in a format that your monitoring system can slurp through the canister HTTP gateway. For example, we use Prometheus for monitoring, so our canisters dump metrics in Prometheus text-based exposition format.

fn http_request(req: HttpRequest) -> HttpResponse {
  match path(&req) {
    "/metrics" => HttpResponse {
        status_code: 200,
        body: format!(r#"stable_memory_bytes {}
                         allocated_bytes {}
                         registered_users_total {}"#,
                      stable_memory_bytes, allocated_bytes, num_users),
        // ...
    }
  }
}

You do not have to link any heavy libraries, the format is brutally simple. The format macro will do if you need only simple counters and gauges. Histograms and labels require a bit more work, but you can get quite far with simple tools.

Some things you might want to keep an eye on:

The size of stable memory.
The size of the objects allocated on the heap (this size is relatively easy to get if you define a custom allocator).
The lengths of internal data structures (queques, maps, etc.).
The canister’s cycle balance.
The time of the last canister upgrade.

References

The following are some of the heavily used Rust canisters for inspiration:

Internet Identity Backend is an excellent example of a canister that uses stable memory as the primary storage, obtains secure randomness from the system, and exposes Prometheus metrics.
Certified Assets Canister is an example of a canister that produces certified HTTP responses.

Changelog

2022-10-16	Another complete editorial pass. Mentioned a few CDK improvements.
2022-07-18	Add a new section on canister optimization.
2022-02-19	Add notes on candid variant extensibility and panics in upgrade hooks.
2021-10-25	The first version.

Similar articles