Interleaved deltas
✑dsa- Weave structure
- Active revision set
- Reconstructing revisions
- Computing deltas
- Extending weaves
- Reconstructing deltas
- Conclusion
- Exercises
A weave has a reputation of one of the most reinvented data structures in history.
Progress is often viewed as a transition from simple to complex. Although individual systems do become more complex over time, long-term progress happens when new ideas and advances in hardware make simple designs feasible.
Think of the simplest version control system you can imagine. It probably stores project snapshots in a backup directory. Maybe it compresses files for efficiency. Add hashing for content addressing, and the sketch starts to resemble Git. I don’t mean this as an insult to Git; quite the opposite:
the structure of a Git repository is stunningly simple. It’s so simple in fact, that you’ll be surprised at the sophistication of the things you can do with it. But that’s the way the best code will always be: simple, solid premises out of which complex applications arise.
The Source Code Control System (sccs), developed by Marc J. Rochkind in the early seventies, had to be more sophisticated. It operated on one file at a time and stored all its revisions in a data structure called interleaved deltas or weaves.
If Git felt like something I could have invented myself,
weaves captivated and intimidated me for years.
When I finally found time to work out the algorithms,
the resources were scarce
BitKeeper wiki offers a good introduction.
Bram Cohen wrote a reference implementation, but I found it impenetrable.
,
so I published my study for future explorers.
You can find the full source code at roman-kashitsyn/weaver.
Weave structure
A weave is a sequence of instructions for reconstructing file revisions. An instruction consists of a type and an operand.
| Type | Operand | Meaning |
|---|---|---|
Line | Line index | Output a line. |
BeginInsert | Version number | Start a block of lines added in a version. |
BeginDelete | Version number | Start a block of lines deleted in a version. |
End | Version number | Close a block for a version. |
type InstructionType int
const (
Line InstructionType = iota
BeginInsert
BeginDelete
End
)
// Instruction is a weave instruction.
type Instruction struct {
Type InstructionType
// Payload contains the instruction payload.
// For Line instructions, it's the line number.
// For other instructions, it's the version ID.
Payload int
}
func (i Instruction) Line() int {
return i.Payload
}
func (i Instruction) VersionID() VersionID {
return VersionID(i.Payload)
}
All line instructions must belong to an Insert or Delete block.
The integer in the line instruction points into the global line pool.
Using line indices instead of strings shrinks the weave and makes instructions uniform.
Unlike in xml or json,
weave blocks can overlap.
For example, if v1 adds lines 1 and 2,
v2 appends lines 3–6,
and v3 removes lines 2–4,
then v3’s delta overlaps with deltas of the first two versions.
⎧ Insert v1
⎪ 1
Δ1 ⎨ Delete v3 ⎫
⎪ 2 ⎪
⎩ End v1 ⎪
⎧ Insert v2 ⎬ Δ3
⎪ 3 ⎪
⎪ 4 ⎪
Δ2 ⎨ End v3 ⎭
⎪ 5
⎪ 6
⎩ End v2
Versions in a weave depend on one another.
In the example above,
v3 depends on both v1 and v2
because it deletes lines that they introduced.
The relationship between v1 and v2 is ambiguous:
v1 could be v2’s parent,
or they could be independent branches that v3 merged.
The weave alone is not enough to deduce dependencies between deltas. Most weave-based systems express these dependencies using activation sets.
Active revision set
sccs has hierarchical revision names (1.4 and 2.1, for example).
When you check out a revision,
the system uses heuristics to compute its activation set—the collection of deltas that contribute to the file content.
I think of activation sets as switches controlling streetlamps on a dark avenue. Flipping them reveals or hides parts of the weave.
For my experiment, I chose flat sequential revision numbering. Each revision can have one or more parents Revisions with multiple parents represent merge commits. . Parent revisions are numerically smaller than their children, so no version can be its own ancestor. Activating a revision also activates all of its parents, so computing the active set requires a graph traversal.
type VersionID int
type ActiveSet []bool
func (s ActiveSet) Contains(v VersionID) bool {
return v >= 0 && int(v) < len(s) && s[v]
}
func (s ActiveSet) Deactivate(v VersionID) ActiveSet {
out := slices.Clone(s)
out[v] = false
return out
}
type Version struct {
ID VersionID
Parents []VersionID
Author string
Description string
Date time.Time
}
func activeSetForVersion(versions []Version, v VersionID) ActiveSet {
activeSet := make(ActiveSet, len(versions)+1)
stack := []VersionID{v}
for len(stack) > 0 {
w := stack[len(stack)-1]
stack = stack[:len(stack)-1]
activeSet[w] = true
for _, p := range versions[w-1].Parents {
if !activeSet[p] {
stack = append(stack, p)
}
}
}
return activeSet
}
We’re now ready to tackle the version reconstruction algorithm.
Reconstructing revisions
To reconstruct a revision, we compute its active set and scan the instruction list, keeping track of currently open blocks in a priority queue. If the instruction opens an insert block—for any revision—or a delete block for an active revision, we push it onto the queue. We copy lines if the block at the top of the priority queue is active (that’s why we keep inactive inserts in the queue).
We need a priority queue because blocks can overlap (see the Weave structure section for an example). If blocks were properly nested, a stack would suffice.
My implementation uses a sorted slice as a priority queue for brevity (heaps require a lot of boilerplate in Go).
Reconstruct function recovers the lines of any file version stored in a weave.
var (
ErrBadWeave = errors.New("E001: malformed weave")
ErrBadDelta = errors.New("E002: bad delta")
)
// Reconstruct locates the lines enabled in the given version set.
// It returns the boolean mask of enabled "Line" instructions
// and the IDs of versions that introduced the enabled lines.
//
// Returns [ErrBadWeave] if the instructions aren't a valid weave.
func Reconstruct(
instructions []Instruction,
activeSet ActiveSet,
) (mask []bool, versions []VersionID, err error) {
mask = make([]bool, len(instructions))
openBlocksByVersion := make(map[VersionID]int)
// activeBlocks are sorted by VersionID (ascending).
var activeBlocks []Instruction
activeBlock := func(v VersionID) (int, bool) {
return slices.BinarySearchFunc(activeBlocks, v,
func(instr Instruction, target VersionID) int {
return cmp.Compare(instr.VersionID(), target)
})
}
for offset, instr := range instructions {
switch instr.Type {
case Line:
if len(activeBlocks) == 0 {
return nil, nil, ErrBadWeave
}
top := activeBlocks[len(activeBlocks)-1]
v := top.VersionID()
if activeSet.Contains(v) && top.Type == BeginInsert {
mask[offset] = true
versions = append(versions, v)
}
case BeginInsert, BeginDelete:
v := instr.VersionID()
if _, ok := openBlocksByVersion[v]; ok {
return nil, nil, ErrBadWeave
}
if instr.Type == BeginInsert || activeSet.Contains(v) {
at, _ := activeBlock(v)
activeBlocks = slices.Insert(activeBlocks, at, instr)
}
openBlocksByVersion[v] = offset
case End:
v := instr.VersionID()
if _, ok := openBlocksByVersion[v]; !ok {
return nil, nil, ErrBadWeave
}
at, found := activeBlock(v)
if found {
activeBlocks = slices.Delete(activeBlocks, at, at+1)
}
delete(openBlocksByVersion, v)
default:
return nil, nil, ErrBadWeave
}
}
if len(openBlocksByVersion) > 0 {
return nil, nil, ErrBadWeave
}
return mask, versions, nil
}
Returning the bit mask of enabled line instructions (instead of their contents) simplifies many operations, including weave extension and delta reconstruction.
Before we can start adding changes to weaves, we need diffs.
Computing deltas
Given two sequences of items, we want to find a transformation that turns the first sequence into the second and uses the fewest steps.
The transformation is a sequence of deltas
(also known as hunks)
that specify an action (Insert, Delete, or Keep)
and the items it acts upon.
I called such a transformation a diff script.
The diff algorithm I’m most familiar with is the lcs algorithm See, for example, the first section of An Algorithm for Differential File Comparison by J. W. Hunt and M. D. McIlroy. . It computes an optimal diff script in two stages:
-
Use dynamic programming to build the
lcstable. Thelcs[i][j]entry is the length of a longest common subsequence ofa[i:]andb[j:]. Fill the table back-to-front in both dimensions. -
Traverse the
lcstable to recover the actions, moving towardlcsincrease whenaandbdon’t align.
DiffScript function computes an optimal edit script for two sequences of integers.
type Action int
const (
Insert Action = iota
Delete
Keep
)
type Delta struct {
Action Action
Items []int
}
func appendDelta(script []Delta, action Action, item int) []Delta {
n := len(script)
if n > 0 && script[n-1].Action == action {
script[n-1].Items = append(script[n-1].Items, item)
return script
}
return append(script, Delta{Action: action, Items: []int{item}})
}
// DiffScript computes the optimal edit script for turning a into b.
func DiffScript(a, b []int) (script []Delta) {
n, m := len(a), len(b)
// Build a table of lengths of LCS(a[i:], b[j:]).
lcs := make([][]int, n+1)
for i := range lcs {
lcs[i] = make([]int, m+1)
}
for i := range slices.Backward(a) {
for j := range slices.Backward(b) {
if a[i] == b[j] {
lcs[i][j] = 1 + lcs[i+1][j+1]
} else {
lcs[i][j] = max(lcs[i+1][j], lcs[i][j+1])
}
}
}
// Traverse the LCS table and reconstruct the optimal edits.
for i, j := 0, 0; i < n || j < m; {
switch {
case i < n && j < m && a[i] == b[j]:
script = appendDelta(script, Keep, a[i])
i++
j++
case i < n && (j == m || lcs[i+1][j] >= lcs[i][j+1]):
script = appendDelta(script, Delete, a[i])
i++
default:
script = appendDelta(script, Insert, b[j])
j++
}
}
return script
}
A practical system would use Myers diff or Bram Cohen’s patience diff. James Coglan, the author of Building Git, wrote a series of blog posts explaining Myers diff and patience diff in great detail.
Now that we have deltas, let’s add them to a weave.
Extending weaves
The Interleave function takes a weave
and a version within that weave (identified by an active set)
and incorporates new deltas,
producing a new weave.
It calls the Reconstruct function
to annotate each active line in the weave
and then traverses the instructions and deltas in parallel,
adding new instructions that match delta actions.
The parallel loop body is the following procedure:
- Copy unmarked instructions to the output as is.
- Upon finding the first marked line, stop and consult the delta action.
-
If the action is
Insert, append its items as lines to the output, surrounding them by aBeginInsert/Endblock. -
If the action is
Delete, open aBeginDeleteblock, copy the input stream as long as its active lines match the delta items, then close the block. -
If the action is
Keep, copy the input until all items have appeared as active lines.
Interleave function adds new deltas to a weave.
// Interleave adds new deltas to a weave.
//
// Returns [ErrBadWeave] if the instructions aren't a valid weave.
// Returns [ErrBadDelta] if the deltas don't apply cleanly.
func Interleave(
instructions []Instruction,
activeSet ActiveSet,
deltas []Delta,
v VersionID,
) (out []Instruction, err error) {
mask, _, err := Reconstruct(instructions, activeSet)
if err != nil {
return nil, err
}
i, j, n, m := 0, 0, len(instructions), len(deltas)
// consumeLines outputs instructions
// until it sees all expected lines.
consumeLines := func(expected []int) error {
skipped := 0
for skipped < len(expected) {
if i >= n {
return ErrBadDelta
}
out = append(out, instructions[i])
if mask[i] {
if expected[skipped] != instructions[i].Line() {
return ErrBadDelta
}
skipped++
}
i++
}
return nil
}
for i < n || j < m {
for i < n && !mask[i] {
out = append(out, instructions[i])
i++
}
if j < m {
delta := deltas[j]
switch delta.Action {
case Insert:
out = append(out, Instruction{BeginInsert, int(v)})
for _, item := range delta.Items {
out = append(out, Instruction{Line, item})
}
out = append(out, Instruction{End, int(v)})
case Delete:
out = append(out, Instruction{BeginDelete, int(v)})
if err := consumeLines(delta.Items); err != nil {
return nil, err
}
out = append(out, Instruction{End, int(v)})
case Keep:
if err := consumeLines(delta.Items); err != nil {
return nil, err
}
}
j++
} else if i < n {
return nil, ErrBadDelta
}
}
return out, nil
}
Now, we can recover any version stored in a weave and detect and apply changes. The last missing piece is extracting deltas.
Reconstructing deltas
How do we know what changed in version ?
We could recover and its predecessor and diff the outputs,
but it would be wasteful, given that the weave already stores deltas.
A more direct approach is
to use the bit masks returned by the Reconstruct function
to detect lines that become inactive if we deactivate .
ReconstructDelta function diffs two active revision sets.
// ReconstructDelta extracts the changes between two active version sets.
func ReconstructDelta(
instructions []Instruction,
beforeSet, afterSet ActiveSet,
) (deltas []Delta, err error) {
before, _, err := Reconstruct(instructions, beforeSet)
if err != nil {
return nil, err
}
after, _, err := Reconstruct(instructions, afterSet)
if err != nil {
return nil, err
}
for i, instr := range instructions {
if instr.Type != Line {
continue
}
switch {
case before[i] && after[i]:
deltas = appendDelta(deltas, Keep, instr.Line())
case !before[i] && after[i]:
deltas = appendDelta(deltas, Insert, instr.Line())
case before[i] && !after[i]:
deltas = appendDelta(deltas, Delete, instr.Line())
}
}
return deltas, nil
}
ReconstructDelta can compute deltas between any two versions.
If we need the delta of a particular version ,
we diff
and
.
vSet := activeSetForVersion(versions, v)
vDelta, err := ReconstructDelta(instructions, vSet.Deactivate(v), vSet)
Now we have enough machinery to build a simple version control system that can support most of sccs’s features.
Conclusion
Nobody in their right mind would use sccs today,
yet it shaped modern software as few systems did.
Marc J. Rochkind’s paper won the first icse Most Influential Paper Award in 1989,
but as David Parnas noted
See A Retrospective on the Source Code Control System by
Marc J. Rochkind.
, the system was influential, but the paper itself didn’t matter
.
What mattered most was the idea
that software evolution can and should be tracked.
Git is ubiquitous today, and its connection to sccs is more than ideological. Git borrowed many ideas from BitKeeper, a distributed version control system that powered Linux kernel development between 2002 and 2005. BitKeeper, in turn, was a direct descendant of sccs and used weaves to track changes.
Conflict-free Replicated Data Types and the Pijul version control system employ data structures that are strikingly similar to weaves because their history representation makes conflicts easier to resolve Ironically, sccs didn’t support merges. .
Weaves always return.
Exercises
- Explain why
Reconstructpushes inactiveBeginInsertinstructions onto the queue but ignores inactiveBeginDeleteinstructions. - Assemble weave algorithms into a simple single-file version control system.
- Extend the system to detect history file corruptions.
- Implement two-way merge for any two versions.
- Extend the system to handle repositories, not just single files.