Transposing tensor files

  2024-11-22

Here is a problem related to yours and solved before. Could you use it?

G. Pólya, How to Solve It, second edition, p. 9

I recently spent a lot of time working with machine learning serialization formats, especially onnx. This file format uses Protocol Buffers for its binary representation and thus inherits the two-gigabyte limit on the total message size. Bypassing this restriction requires storing raw tensor bytes in another file and referencing them from the onnx file.

But what should the tensor file format be? The safetensors library from Hugging Face is popular for representing tensors on disk, and its data layout is fully compatible with the onnx raw tensor data format.

This article describes the safetensors file structure, points out its minor design flaws, and explains how changing the metadata location can address them.

The safetensors file format

A safetensors file stores a collection of multi-dimensional arrays. Its first eight bytes encode the header size as an unsigned 64-bit little-endian integer, then comes the header describing each tensor’s type and shape, and finally the data section containing flat arrays.

The structure of a safetensors file. The first eight bytes indicate the header size in bytes. The header is a json object describing the tensor metadata. The last section contains raw array elements.

The header is a json object mapping each tensor name to an object describing the tensor’s shape, element type, and offsets from the start of the data section.

An example of the safetensors file header. The header is a json object mapping tensor names to their metadata: shape, element type, and offsets from the beginning of the data section.
{ "fc.weight": {
    "dtype": "F32",
    "shape": [10, 784],
    "offsets": [0, 31360]
  },
  "fc.bias": {
    "dtype": "F32",
    "shape": [10],
    "offsets": [31360, 31400]
  }
}
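
To make the layout concrete, here is a minimal Python sketch of a reader, assuming only the structure described above (the real safetensors library offers a richer API). Note the extra arithmetic needed to turn the relative offsets into absolute file positions:

import json
import struct

def read_header(path):
    with open(path, "rb") as f:
        # The first eight bytes encode the header size as an unsigned
        # 64-bit little-endian integer.
        (header_size,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_size))
        # The offsets are relative to the data section; convert them
        # to absolute file positions by adding the header size and the
        # eight bytes encoding it.
        data_start = 8 + header_size
        for info in header.values():
            begin, end = info["offsets"]
            info["offsets"] = [data_start + begin, data_start + end]
        return header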

This simple file organization makes the format easy to implement and use in most environments. Unfortunately, it also has a few flaws:

  1. Constructing a safetensors file requires two passes over the dataset: one pass to gather the tensor metadata and write the header, and another to append the raw tensor data to the file.
  2. The tensor data offsets in the metadata section are relative to the data section, not absolute within the file. This design choice makes working with these files more cumbersome: We must add the header size (plus the eight bytes encoding that size) to tensor offsets before we can read the tensor data. The reason for this inconvenience is a chicken-and-egg problem: the absolute offsets depend on the header size, and the header size depends on the offsets. The safetensors designers used relative offsets in the header to break this cycle.
  3. The work required to add a new tensor or change the header is proportional to the entire file size, not to the size of the change.

Tensor safes

I like to imagine a tensor file as a safe with gold ingots (tensors) inside. The metadata section is a slip of paper describing each ingot’s weight, purity, and location in the safe. The safetensors layout requires us to fill out the paper first, place it at the back of the safe, and then arrange the ingots as the paper describes.

Luckily, there is an easier way: put the ingots in first, jotting down each one’s size and location on the paper as you go. Once all the gold is in the safe, place the paper in front of it and seal the safe. In the world of binary files, this idea corresponds to placing the metadata block at the end of the file (I learned this trick from the LevelDB table format). Let’s call this derived format tensorsafe.

The format makes two minor adjustments to the safetensors structure:

  1. The metadata block lives at the end of the file, followed by the metadata size (the figure limits the size field to four bytes, which is enough to describe more than 10,000,000 tensors in a single file, assuming a single entry is under 400 bytes).
  2. The file starts with a fixed-size header containing magic bytes and the version. All self-respecting file formats need versioning.

The structure of a tensorsafe file. The header has a fixed size and includes magic bytes and the version. The variable-size metadata block moved to the end of the file.
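
A single-pass Python sketch of an encoder for this layout follows. The magic value, the version encoding, and the four-byte metadata size field are illustrative choices, not a fixed specification:

import json
import struct

MAGIC = b"TSAF"  # hypothetical magic bytes; pick any four you like

def write_tensorsafe(path, tensors):
    # tensors maps a tensor name to a (dtype, shape, raw bytes) triple.
    metadata = {}
    with open(path, "wb") as f:
        # Fixed-size header: magic bytes plus a format version.
        f.write(MAGIC + struct.pack("<I", 1))
        for name, (dtype, shape, data) in tensors.items():
            start = f.tell()
            f.write(data)  # stream the raw tensor bytes...
            # ...while accumulating metadata with absolute file offsets.
            metadata[name] = {"dtype": dtype, "shape": shape,
                              "offsets": [start, start + len(data)]}
        blob = json.dumps(metadata).encode("utf-8")
        f.write(blob)                          # the metadata block goes last,
        f.write(struct.pack("<I", len(blob)))  # followed by its size.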

These changes address all my issues with safetensors:

  1. The encoder needs only one pass over the data. It can accumulate the metadata while writing tensors to the file and add the metadata section at the end.
  2. The metadata section becomes self-contained and can use absolute offsets for tensor boundaries. The reader doesn’t need to massage the offsets anymore.
  3. We don’t have to move the data to append new tensors: We can write over the old metadata section and append the new one.
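
The third point deserves a sketch of its own: appending a tensor touches only the tail of the file. The code below assumes the same hypothetical layout as the writer above; the new tensor overwrites the old metadata block, and a fresh metadata block (plus its four-byte size) goes after it:

import json
import struct

def append_tensor(path, name, dtype, shape, data):
    with open(path, "r+b") as f:
        # The last four bytes hold the metadata size; the metadata
        # block sits right before them.
        f.seek(-4, 2)
        (meta_size,) = struct.unpack("<I", f.read(4))
        f.seek(-4 - meta_size, 2)
        start = f.tell()
        metadata = json.loads(f.read(meta_size))
        # Overwrite the old metadata block with the new tensor's bytes,
        # then append the updated metadata and its size.
        f.seek(start)
        f.write(data)
        metadata[name] = {"dtype": dtype, "shape": shape,
                          "offsets": [start, start + len(data)]}
        blob = json.dumps(metadata).encode("utf-8")
        f.write(blob)
        f.write(struct.pack("<I", len(blob)))
        f.truncate()  # drop leftovers if the file shrank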

Are there any new downsides to the tensorsafe approach? None that I can think of.

Alternative designs

Placing the entire metadata section at the beginning or the end of a file is not the only option available. This section explores other popular approaches to metadata encoding, such as spreading it across the file or letting it float freely.

Chunked metadata

We can represent a collection of items by packaging each item’s metadata and data into a distinct chunk. For example, png encodes various aspects of the image as separate chunks, and the WebAssembly binary format represents a module as a collection of sections (types, imports, memory, etc.).

When we apply this approach to tensor encoding, we interleave tensor attributes with the tensor data.

The chunked metadata approach: the tensor file contains a self-contained section for each tensor.

This approach also addresses the original design issues: the encoder needs only a single pass over the data, each chunk is self-contained (there are no global offsets to fix up), and appending a tensor means appending one more chunk.

Unfortunately, this design also has a severe disadvantage: The decoder has to scan the entire file even if the caller is interested in accessing only a specific tensor. Furthermore, metadata decoding becomes slower because it involves many file seeks.
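
A lookup sketch against a hypothetical chunk layout (a four-byte metadata size, a json metadata object, then the raw tensor bytes; the name and nbytes fields are made up for this example) shows the problem: the reader must hop from chunk to chunk until it finds the right name.

import json
import struct

def find_tensor(path, wanted):
    with open(path, "rb") as f:
        while True:
            prefix = f.read(4)
            if len(prefix) < 4:
                return None  # scanned the whole file without a match
            (meta_size,) = struct.unpack("<I", prefix)
            meta = json.loads(f.read(meta_size))
            if meta["name"] == wanted:
                return meta, f.read(meta["nbytes"])
            # Not the tensor we want: seek over its data to the next chunk.
            f.seek(meta["nbytes"], 1)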

Floating metadata

We explored options where the metadata occupies the file’s beginning or end, or is spread thinly across it. Are there any more options? There are! The metadata block can float freely within the file (I first encountered this idea when dealing with wad files; I later learned that pdf uses a similar trick for its cross-reference tables).

With this design, a fixed-size file header contains the offset of the metadata block. The encoder first writes a placeholder header, encodes the entire dataset, appends the metadata block, and finally goes back to the header to fill in the metadata offset.

The floating metadata approach: a fixed-size header contains the metadata offset.
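
Here is a single-pass Python sketch of such an encoder, again with made-up magic bytes and field widths. The header reserves room for the metadata offset and size, which the encoder backpatches once the metadata block is in place:

import json
import struct

def write_floating(path, tensors):
    metadata = {}
    with open(path, "wb") as f:
        f.write(b"TFLT" + struct.pack("<I", 1))  # hypothetical magic + version
        f.write(struct.pack("<QI", 0, 0))        # placeholder metadata offset + size
        for name, (dtype, shape, data) in tensors.items():
            start = f.tell()
            f.write(data)
            metadata[name] = {"dtype": dtype, "shape": shape,
                              "offsets": [start, start + len(data)]}
        blob = json.dumps(metadata).encode("utf-8")
        meta_offset = f.tell()
        f.write(blob)
        # Go back and point the header at the metadata block.
        f.seek(8)
        f.write(struct.pack("<QI", meta_offset, len(blob)))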

This design has all the benefits of the tensorsafe format, and the extra level of indirection adds another feature: atomic in-place updates. You can add, remove, or edit tensors using the following procedure:

  1. Append the new data and a new metadata section to the file.
  2. Sync the changes to disk.
  3. Update the header to point to the new metadata entry.

This approach guarantees that the file will stay consistent even if the writing process crashes before completing the update, but it can lead to extra space usage. The writer can detect and reuse that space in subsequent updates.
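
The same procedure as a Python sketch, against the hypothetical layout from the encoder above: the old metadata block becomes dead space, and only the final header write makes the new tensors visible to readers.

import json
import os
import struct

def update_tensors(path, new_tensors):
    with open(path, "r+b") as f:
        f.seek(8)
        meta_offset, meta_size = struct.unpack("<QI", f.read(12))
        f.seek(meta_offset)
        metadata = json.loads(f.read(meta_size))
        # Step 1: append the new data and a new metadata section.
        f.seek(0, 2)  # jump to the end of the file
        for name, (dtype, shape, data) in new_tensors.items():
            start = f.tell()
            f.write(data)
            metadata[name] = {"dtype": dtype, "shape": shape,
                              "offsets": [start, start + len(data)]}
        blob = json.dumps(metadata).encode("utf-8")
        new_offset = f.tell()
        f.write(blob)
        # Step 2: sync the changes to disk.
        f.flush()
        os.fsync(f.fileno())
        # Step 3: update the header to point to the new metadata entry.
        f.seek(8)
        f.write(struct.pack("<QI", new_offset, len(blob)))
        f.flush()
        os.fsync(f.fileno())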

Given modern advances in hardware and file systems, atomic updates might not be worth the extra complexity.

Conclusion

This article covered a few alternative designs for the safetensors file format and argued that moving the metadata section to the end of the file would make the format easier to use. The following table summarizes the design space.

                                    safetensors   tensorsafe   chunked   floating
Zero-copy decoding                      yes           yes        yes       yes
Fast metadata decoding                  yes           yes        no        yes
Data passes required for encoding        2             1          1         1
Absolute file offsets in metadata       no            yes        n/a       yes
Atomic in-place updates                 no            no         no        yes
