Interleaved deltas



A weave has a reputation of one of the most reinvented data structures in history.

Victor Grishchenko, Mikhail Patrakeev, Chronofold: a data structure for versioned text

Progress is often viewed as a transition from simple to complex. Although individual systems do become more complex over time, long-term progress happens when new ideas and advances in hardware make simple designs feasible.

Think of the simplest version control system you can imagine. It probably stores project snapshots in a backup directory. Maybe it compresses files for efficiency. Add hashing for content addressing, and the sketch starts to resemble Git. I don’t mean this as an insult to Git; quite the opposite:

the structure of a Git repository is stunningly simple. It’s so simple in fact, that you’ll be surprised at the sophistication of the things you can do with it. But that’s the way the best code will always be: simple, solid premises out of which complex applications arise.

Wincent Colaiuta, A look back: Bram Cohen vs Linus Torvalds

The Source Code Control System (sccs), developed by Marc J. Rochkind in the early seventies, had to be more sophisticated. It operated on one file at a time and stored all its revisions in a data structure called interleaved deltas or weaves.

If Git felt like something I could have invented myself, weaves captivated and intimidated me for years. When I finally found time to work out the algorithms, the resources were scarce BitKeeper wiki offers a good introduction. Bram Cohen wrote a reference implementation, but I found it impenetrable. , so I published my study for future explorers. You can find the full source code at roman-kashitsyn/weaver.

Weave structure

A weave is a sequence of instructions for reconstructing file revisions. An instruction consists of a type and an operand.

Type Operand Meaning
Line Line index Output a line.
BeginInsert Version number Start a block of lines added in a version.
BeginDelete Version number Start a block of lines deleted in a version.
End Version number Close a block for a version.
The weave instruction type.
type InstructionType int

const (
	Line InstructionType = iota
	BeginInsert
	BeginDelete
	End
)

// Instruction is a weave instruction.
type Instruction struct {
	Type InstructionType
	// Payload contains the instruction payload.
	// For Line instructions, it's the line number.
	// For other instructions, it's the version ID.
	Payload int
}

func (i Instruction) Line() int {
	return i.Payload
}

func (i Instruction) VersionID() VersionID {
	return VersionID(i.Payload)
}

All line instructions must belong to an Insert or Delete block. The integer in the line instruction points into the global line pool. Using line indices instead of strings shrinks the weave and makes instructions uniform.

Unlike in xml or json, weave blocks can overlap. For example, if v1 adds lines 1 and 2, v2 appends lines 3–6, and v3 removes lines 2–4, then v3’s delta overlaps with deltas of the first two versions.

   ⎧ Insert v1
   ⎪ 1
Δ1 ⎨ Delete v3  ⎫
   ⎪ 2          ⎪
   ⎩ End v1     ⎪
   ⎧ Insert v2  ⎬ Δ3
   ⎪ 3          ⎪
   ⎪ 4          ⎪
Δ2 ⎨ End v3     ⎭
   ⎪ 5
   ⎪ 6
   ⎩ End v2

Versions in a weave depend on one another. In the example above, v3 depends on both v1 and v2 because it deletes lines that they introduced. The relationship between v1 and v2 is ambiguous: v1 could be v2’s parent, or they could be independent branches that v3 merged.

The weave alone is not enough to deduce dependencies between deltas. Most weave-based systems express these dependencies using activation sets.

Active revision set

sccs has hierarchical revision names (1.4 and 2.1, for example). When you check out a revision, the system uses heuristics to compute its activation set—the collection of deltas that contribute to the file content.

I think of activation sets as switches controlling streetlamps on a dark avenue. Flipping them reveals or hides parts of the weave.

For my experiment, I chose flat sequential revision numbering. Each revision can have one or more parents Revisions with multiple parents represent merge commits. . Parent revisions are numerically smaller than their children, so no version can be its own ancestor. Activating a revision also activates all of its parents, so computing the active set requires a graph traversal.

An active set is a collection of deltas that together make a file revision. We compute it by traversing the version graph.
type VersionID int

type ActiveSet []bool

func (s ActiveSet) Contains(v VersionID) bool {
	return v >= 0 && int(v) < len(s) && s[v]
}

func (s ActiveSet) Deactivate(v VersionID) ActiveSet {
	out := slices.Clone(s)
	out[v] = false
	return out
}

type Version struct {
	ID          VersionID
	Parents     []VersionID
	Author      string
	Description string
	Date        time.Time
}

func activeSetForVersion(versions []Version, v VersionID) ActiveSet {
	activeSet := make(ActiveSet, len(versions)+1)
	stack := []VersionID{v}
	for len(stack) > 0 {
		w := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		activeSet[w] = true
		for _, p := range versions[w-1].Parents {
			if !activeSet[p] {
				stack = append(stack, p)
			}
		}
	}
	return activeSet
}

We’re now ready to tackle the version reconstruction algorithm.

Reconstructing revisions

To reconstruct a revision, we compute its active set and scan the instruction list, keeping track of currently open blocks in a priority queue. If the instruction opens an insert block—for any revision—or a delete block for an active revision, we push it onto the queue. We copy lines if the block at the top of the priority queue is active (that’s why we keep inactive inserts in the queue).

We need a priority queue because blocks can overlap (see the Weave structure section for an example). If blocks were properly nested, a stack would suffice.

My implementation uses a sorted slice as a priority queue for brevity (heaps require a lot of boilerplate in Go).

The Reconstruct function recovers the lines of any file version stored in a weave.
var (
	ErrBadWeave = errors.New("E001: malformed weave")
	ErrBadDelta = errors.New("E002: bad delta")
)

// Reconstruct locates the lines enabled in the given version set.
// It returns the boolean mask of enabled "Line" instructions
// and the IDs of versions that introduced the enabled lines.
//
// Returns [ErrBadWeave] if the instructions aren't a valid weave.
func Reconstruct(
	instructions []Instruction,
	activeSet ActiveSet,
) (mask []bool, versions []VersionID, err error) {
	mask = make([]bool, len(instructions))
	openBlocksByVersion := make(map[VersionID]int)
	// activeBlocks are sorted by VersionID (ascending).
	var activeBlocks []Instruction
	activeBlock := func(v VersionID) (int, bool) {
		return slices.BinarySearchFunc(activeBlocks, v,
			func(instr Instruction, target VersionID) int {
				return cmp.Compare(instr.VersionID(), target)
			})
	}

	for offset, instr := range instructions {
		switch instr.Type {
		case Line:
			if len(activeBlocks) == 0 {
				return nil, nil, ErrBadWeave
			}
			top := activeBlocks[len(activeBlocks)-1]
			v := top.VersionID()
			if activeSet.Contains(v) && top.Type == BeginInsert {
				mask[offset] = true
				versions = append(versions, v)
			}
		case BeginInsert, BeginDelete:
			v := instr.VersionID()
			if _, ok := openBlocksByVersion[v]; ok {
				return nil, nil, ErrBadWeave
			}
			if instr.Type == BeginInsert || activeSet.Contains(v) {
				at, _ := activeBlock(v)
				activeBlocks = slices.Insert(activeBlocks, at, instr)
			}
			openBlocksByVersion[v] = offset
		case End:
			v := instr.VersionID()
			if _, ok := openBlocksByVersion[v]; !ok {
				return nil, nil, ErrBadWeave
			}
			at, found := activeBlock(v)
			if found {
				activeBlocks = slices.Delete(activeBlocks, at, at+1)
			}
			delete(openBlocksByVersion, v)
		default:
			return nil, nil, ErrBadWeave
		}
	}
	if len(openBlocksByVersion) > 0 {
		return nil, nil, ErrBadWeave
	}
	return mask, versions, nil
}

Returning the bit mask of enabled line instructions (instead of their contents) simplifies many operations, including weave extension and delta reconstruction.

Before we can start adding changes to weaves, we need diffs.

Computing deltas

Given two sequences of items, we want to find a transformation that turns the first sequence into the second and uses the fewest steps.

The transformation is a sequence of deltas (also known as hunks) that specify an action (Insert, Delete, or Keep) and the items it acts upon. I called such a transformation a diff script.

The diff algorithm I’m most familiar with is the lcs algorithm See, for example, the first section of An Algorithm for Differential File Comparison by J. W. Hunt and M. D. McIlroy. . It computes an optimal diff script in two stages:

  1. Use dynamic programming to build the lcs table. The lcs[i][j] entry is the length of a longest common subsequence of a[i:] and b[j:]. Fill the table back-to-front in both dimensions.
  2. Traverse the lcs table to recover the actions, moving toward lcs increase when a and b don’t align.
The DiffScript function computes an optimal edit script for two sequences of integers.
type Action int

const (
	Insert Action = iota
	Delete
	Keep
)

type Delta struct {
	Action Action
	Items  []int
}

func appendDelta(script []Delta, action Action, item int) []Delta {
	n := len(script)
	if n > 0 && script[n-1].Action == action {
		script[n-1].Items = append(script[n-1].Items, item)
		return script
	}
	return append(script, Delta{Action: action, Items: []int{item}})
}

// DiffScript computes the optimal edit script for turning a into b.
func DiffScript(a, b []int) (script []Delta) {
	n, m := len(a), len(b)

	// Build a table of lengths of LCS(a[i:], b[j:]).
	lcs := make([][]int, n+1)
	for i := range lcs {
		lcs[i] = make([]int, m+1)
	}
	for i := range slices.Backward(a) {
		for j := range slices.Backward(b) {
			if a[i] == b[j] {
				lcs[i][j] = 1 + lcs[i+1][j+1]
			} else {
				lcs[i][j] = max(lcs[i+1][j], lcs[i][j+1])
			}
		}
	}

	// Traverse the LCS table and reconstruct the optimal edits.
	for i, j := 0, 0; i < n || j < m; {
		switch {
		case i < n && j < m && a[i] == b[j]:
			script = appendDelta(script, Keep, a[i])
			i++
			j++
		case i < n && (j == m || lcs[i+1][j] >= lcs[i][j+1]):
			script = appendDelta(script, Delete, a[i])
			i++
		default:
			script = appendDelta(script, Insert, b[j])
			j++
		}
	}

	return script
}

A practical system would use Myers diff or Bram Cohen’s patience diff. James Coglan, the author of Building Git, wrote a series of blog posts explaining Myers diff and patience diff in great detail.

Now that we have deltas, let’s add them to a weave.

Extending weaves

The Interleave function takes a weave and a version within that weave (identified by an active set) and incorporates new deltas, producing a new weave.

It calls the Reconstruct function to annotate each active line in the weave and then traverses the instructions and deltas in parallel, adding new instructions that match delta actions. The parallel loop body is the following procedure:

  1. Copy unmarked instructions to the output as is.
  2. Upon finding the first marked line, stop and consult the delta action.
  3. If the action is Insert, append its items as lines to the output, surrounding them by a BeginInsert/End block.
  4. If the action is Delete, open a BeginDelete block, copy the input stream as long as its active lines match the delta items, then close the block.
  5. If the action is Keep, copy the input until all items have appeared as active lines.
The Interleave function adds new deltas to a weave.
// Interleave adds new deltas to a weave.
//
// Returns [ErrBadWeave] if the instructions aren't a valid weave.
// Returns [ErrBadDelta] if the deltas don't apply cleanly.
func Interleave(
	instructions []Instruction,
	activeSet ActiveSet,
	deltas []Delta,
	v VersionID,
) (out []Instruction, err error) {
	mask, _, err := Reconstruct(instructions, activeSet)
	if err != nil {
		return nil, err
	}

	i, j, n, m := 0, 0, len(instructions), len(deltas)

	// consumeLines outputs instructions
	// until it sees all expected lines.
	consumeLines := func(expected []int) error {
		skipped := 0
		for skipped < len(expected) {
			if i >= n {
				return ErrBadDelta
			}
			out = append(out, instructions[i])
			if mask[i] {
				if expected[skipped] != instructions[i].Line() {
					return ErrBadDelta
				}
				skipped++
			}
			i++
		}
		return nil
	}

	for i < n || j < m {
		for i < n && !mask[i] { 
			out = append(out, instructions[i])
			i++
		}
		if j < m {
			delta := deltas[j]
			switch delta.Action { 
			case Insert: 
				out = append(out, Instruction{BeginInsert, int(v)})
				for _, item := range delta.Items {
					out = append(out, Instruction{Line, item})
				}
				out = append(out, Instruction{End, int(v)})
			case Delete: 
				out = append(out, Instruction{BeginDelete, int(v)})
				if err := consumeLines(delta.Items); err != nil {
					return nil, err
				}
				out = append(out, Instruction{End, int(v)})
			case Keep: 
				if err := consumeLines(delta.Items); err != nil {
					return nil, err
				}
			}
			j++
		} else if i < n {
			return nil, ErrBadDelta
		}
	}
	return out, nil
}

Now, we can recover any version stored in a weave and detect and apply changes. The last missing piece is extracting deltas.

Reconstructing deltas

How do we know what changed in version v? We could recover v and its predecessor and diff the outputs, but it would be wasteful, given that the weave already stores deltas. A more direct approach is to use the bit masks returned by the Reconstruct function to detect lines that become inactive if we deactivate v.

The ReconstructDelta function diffs two active revision sets.
// ReconstructDelta extracts the changes between two active version sets.
func ReconstructDelta(
	instructions []Instruction,
	beforeSet, afterSet ActiveSet,
) (deltas []Delta, err error) {
	before, _, err := Reconstruct(instructions, beforeSet)
	if err != nil {
		return nil, err
	}
	after, _, err := Reconstruct(instructions, afterSet)
	if err != nil {
		return nil, err
	}

	for i, instr := range instructions {
		if instr.Type != Line {
			continue
		}
		switch {
		case before[i] && after[i]:
			deltas = appendDelta(deltas, Keep, instr.Line())
		case !before[i] && after[i]:
			deltas = appendDelta(deltas, Insert, instr.Line())
		case before[i] && !after[i]:
			deltas = appendDelta(deltas, Delete, instr.Line())
		}
	}
	return deltas, nil
}

ReconstructDelta can compute deltas between any two versions. If we need the delta of a particular version v, we diff ActiveSet(v){v} and ActiveSet(v).

vSet := activeSetForVersion(versions, v)
vDelta, err := ReconstructDelta(instructions, vSet.Deactivate(v), vSet)

Now we have enough machinery to build a simple version control system that can support most of sccs’s features.

Conclusion

Nobody in their right mind would use sccs today, yet it shaped modern software as few systems did. Marc J. Rochkind’s paper won the first icse Most Influential Paper Award in 1989, but as David Parnas noted See A Retrospective on the Source Code Control System by Marc J. Rochkind. , the system was influential, but the paper itself didn’t matter. What mattered most was the idea that software evolution can and should be tracked.

Git is ubiquitous today, and its connection to sccs is more than ideological. Git borrowed many ideas from BitKeeper, a distributed version control system that powered Linux kernel development between 2002 and 2005. BitKeeper, in turn, was a direct descendant of sccs and used weaves to track changes.

Conflict-free Replicated Data Types and the Pijul version control system employ data structures that are strikingly similar to weaves because their history representation makes conflicts easier to resolve Ironically, sccs didn’t support merges. .

Weaves always return.

Exercises

  1. Explain why Reconstruct pushes inactive BeginInsert instructions onto the queue but ignores inactive BeginDelete instructions.
  2. Assemble weave algorithms into a simple single-file version control system.
  3. Extend the system to detect history file corruptions.
  4. Implement two-way merge for any two versions.
  5. Extend the system to handle repositories, not just single files.

Similar articles