Fast multi-word software shift registers

Daniel Beer

14 Sep 2014

This article describes a method of implementing fixed-size bit-queues, or shift registers, for very fast real-time operation. The method described is far faster and more flexible than shifting through multiple machine words.

Recently, I implemented a microcontroller system with a PS/2 keyboard input. The microcontroller didn't have a PS/2 peripheral module, so this interface had to be implemented using bit-bashing.

The PS/2 protocol is a synchronous serial stream. In order to implement it, I would have to have an interrupt handler triggered by an edge on the CLOCK line, which then sampled and processed the state of the DATA line.

On average, the data rate of the protocol is very slow: probably no more than about 100 bits per second at most. However, the data typically comes transmitted in bursts of 20-30 bits clocked by a 10 kHz clock. This means that the interrupt handler has only 100 uS to store the incoming bit (which I would process later in the main thread). On a 1 MHz MSP430, this meant about 30 or so instructions!

The maximum burst length is just over 30 bits, so I'd need to store about 40 bits in a queue. The obvious way to do this would be to declare a uint64_t and shift data in:

uint64_t queue;

void sample_slow(void)
        queue = (queue << 1) | (P1IN & 1);

But this results in surprisingly expensive code. The reason is that on a 16-bit machine, we have the following sequence of actions to perform:

  1. Load 4 words from memory into registers.
  2. Bit-shift left through 4 words.
  3. Store the incoming bit in the LSB of one of the words.
  4. Store 4 words from registers into memory.

Step 2 is particularly expensive. We can make an improvement by changing the sequence of actions:

  1. Load 4 words from memory into registers.
  2. Bit-shift left through one word.
  3. Store the incoming bit in the LSB of one of the words.
  4. Store 4 words from registers into memory in a rotated order.

In code:

uint16_t q0, q1, q2, q3;

void sample_fast(void)
        const uint16_t tmp = (q0 << 1) | (P1IN & 1);

        q0 = q1;
        q1 = q2;
        q2 = q3;
        q3 = tmp;

To see how this works, consider just three 8-bit words capturing the sequence of bits abcdefg.... Suppose that we've captured the first 24 bits already, and now we want to shift in the bit Y. The states of the stored words before and after using the first method are:

[a b c d e f g h]     [b c d e f g h i]
[i j k l m n o p] ==> [j k l m n o p q]
[q r s t u v w x]     [r s t u v w x Y]

With the second method, we end up spreading the sequence of bits in a round-robin fashion across the three words. The states before and after are:

[a d g j m p s v]     [b e h k n q t w]
[b e h k n q t w] ==> [c f i l o r u x]
[c f i l o r u x]     [d g j m p s v Y]

Notice that the three words have rotated upwards, and that two of the words haven't changed their content at all. The savings from this method come from the fact that no carry-through is necessary.

Compare the generated code (MSP430) for each method:

0000804a <sample_slow>:
    804a:   0b 12           push    r11
    804c:   0a 12           push    r10
    804e:   09 12           push    r9
    8050:   08 12           push    r8
    8052:   07 12           push    r7
    8054:   06 12           push    r6
    8056:   05 12           push    r5
    8058:   04 12           push    r4
    805a:   58 42 20 00     mov.b   &0x0020,r8
    805e:   4c 48           mov.b   r8,     r12
    8060:   0d 43           clr     r13
    8062:   0e 43           clr     r14
    8064:   0f 43           clr     r15
    8066:   08 4c           mov     r12,    r8
    8068:   09 4d           mov     r13,    r9
    806a:   0a 4e           mov     r14,    r10
    806c:   0b 4f           mov     r15,    r11
    806e:   18 f3           and     #1,     r8      ;r3 As==01
    8070:   09 f3           and     #0,     r9      ;r3 As==00
    8072:   0a f3           and     #0,     r10     ;r3 As==00
    8074:   0b f3           and     #0,     r11     ;r3 As==00
    8076:   1c 42 02 02     mov     &0x0202,r12
    807a:   1d 42 04 02     mov     &0x0204,r13
    807e:   1e 42 06 02     mov     &0x0206,r14
    8082:   1f 42 08 02     mov     &0x0208,r15
    8086:   0c 5c           rla     r12
    8088:   0d 6d           rlc     r13
    808a:   0e 6e           rlc     r14
    808c:   0f 6f           rlc     r15
    808e:   04 48           mov     r8,     r4
    8090:   05 49           mov     r9,     r5
    8092:   06 4a           mov     r10,    r6
    8094:   07 4b           mov     r11,    r7
    8096:   04 dc           bis     r12,    r4
    8098:   05 dd           bis     r13,    r5
    809a:   06 de           bis     r14,    r6
    809c:   07 df           bis     r15,    r7
    809e:   82 44 02 02     mov     r4,     &0x0202
    80a2:   82 45 04 02     mov     r5,     &0x0204
    80a6:   82 46 06 02     mov     r6,     &0x0206
    80aa:   82 47 08 02     mov     r7,     &0x0208
    80ae:   34 41           pop     r4
    80b0:   35 41           pop     r5
    80b2:   36 41           pop     r6
    80b4:   37 41           pop     r7
    80b6:   38 41           pop     r8
    80b8:   39 41           pop     r9
    80ba:   3a 41           pop     r10
    80bc:   3b 41           pop     r11
    80be:   30 41           ret

000080c0 <sample_fast>:
    80c0:   5e 42 20 00     mov.b   &0x0020,r14
    80c4:   1e f3           and     #1,     r14     ;r3 As==01
    80c6:   1f 42 0e 02     mov     &0x020e,r15
    80ca:   0f 5f           rla     r15
    80cc:   92 42 0c 02     mov     &0x020c,&0x020e
    80d0:   0e 02
    80d2:   92 42 00 02     mov     &0x0200,&0x020c
    80d6:   0c 02
    80d8:   92 42 0a 02     mov     &0x020a,&0x0200
    80dc:   00 02
    80de:   0e df           bis     r15,    r14
    80e0:   82 4e 0a 02     mov     r14,    &0x020a
    80e4:   30 41           ret

Of course, having captured the data, you need some way of processing it later. The method I used was to trigger a wakeup from within the interrupt handler. The main thread would disable interrupts, copy the queue words, reset the sampler state, and then re-enable interrupts.

The captured data can then be shifted out in using a procedure similar to that used for shifting in.

If you're doing something like this, the last problem you need to solve is to know how many bits the queue contains at the time the main thread copies it. One solution is to keep a separate counter variable which is incremented whenever a bit is shifted in. This counter is reset to 0 when the queue is copied.

Another method, which reduces the amount of work to be done in the interrupt handler, is to place a sentinel bit in the queue. When the queue is reset by the main thread, we set all queue words to 0, except for q3, which is set to 1. This is equivalent to clearing all bits and then shifting in a single 1.

When we process the content of the copied queue, we shift out, counting the number of bits until a 1 pops out the end. Subtract this count from the size of the queue (minus 1), and you know how many bits were shifted in by the interrupt handler.