Enumerating fixed-weight bitstrings

Daniel Beer

6 Nov 2015

Given the set of all bitstrings of length \(n\) with exactly \(k\) bits set, you can arrange them in lexicographical order. You can then convert between bitstrings and indices into this list. Can you do this efficiently when \(n\) is large and the list is too large to fit into memory?

Yes you can, as it turns out. The number of bitstrings of length \(n\) with \(k\) bits set (which we'll refer to as \((n, k)\)-bitstrings) is given by:

\[ \binom{n}{k} = \frac{n!}{(n-k)! k!} \]

We can divide \((n, k)\)-bitstrings (with \(n, k > 0\)) into two sub-groups which do not overlap lexicographically. They are:

We can then think of all the bitstrings as being arranged in a binary tree. Each level of the tree corresponds to a bit position (leftmost bit at the top of the tree). Each leaf corresponds to a completed \((n, k)\)-bitstring. Nodes in this tree will have up to two children, provided there exist completions for the given choice of bit at that level. An in-order traversal of this tree results in the enumeration of all bitstrings in lexicographical order.

Now, we can adapt ordinary binary-tree algorithms for finding the index of a given bitstring. We know the size of each subtree (each subtree corresponds to the set of \((n,k)\)-bitstrings for some choice of \(n, k\)). We just need to trace the path from the root to the leaf corresponding to our chosen bitstring, and at each step add the size of all subtrees lexicographically prior to the one we descend into. That is, add the size of the left subtree if we take the right branch. There's nothing to add if we take the left branch.

We know that the size of the whole tree is \(\binom{n}{k}\), and we can easily keep track \(n\) and \(k\) as we descend. If we know these things, we can efficiently calculate the sizes of each of the left and right subtrees. The size of the left subtree is:

\[ \binom{n-1}{k} = \frac{n-k}{n}\binom{n}{k} \]

The size of the right subtree is:

\[ \binom{n-1}{k-1} = \frac{k}{n}\binom{n}{k} \]

Using these identities, we can efficiently keep track of the size of the current node and its children as we descend.

Calculating the inverse -- that is, producing a given \((n, k)\)-bitstring from its index into the list of all such bitstrings -- is done by a similar tree-descent algorithm. We keep track of node and subtree sizes in exactly the same way, but we have a slightly different criteria for deciding which branch to take: if our target index, \(i\), is greater than or equal to the size of the left subtree, then the string must be located in the right subtree (and its index within that subtree is \(i-L\), where \(L\) is the size of the left subtree).

The following program implements and tests the algorithm described: