Fast multi-word software shift registers
14 Sep 2014
This article describes a method of implementing fixed-size bit-queues, or shift registers, for very fast real-time operation. The method described is far faster and more flexible than shifting through multiple machine words.
Recently, I implemented a microcontroller system with a PS/2 keyboard input. The microcontroller didn’t have a PS/2 peripheral module, so this interface had to be implemented using bit-bashing.
The PS/2 protocol is a synchronous serial stream. In order to implement it, I would have to have an interrupt handler triggered by an edge on the CLOCK line, which then sampled and processed the state of the DATA line.
On average, the data rate of the protocol is very slow: probably no more than about 100 bits per second at most. However, the data typically comes transmitted in bursts of 20-30 bits clocked by a 10 kHz clock. This means that the interrupt handler has only 100 uS to store the incoming bit (which I would process later in the main thread). On a 1 MHz MSP430, this meant about 30 or so instructions!
The maximum burst length is just over 30 bits, so I’d need to store about 40 bits in a queue. The obvious way to do this would be to declare a uint64_t and shift data in:
uint64_t queue;
void sample_slow(void)
{
queue = (queue << 1) | (P1IN & 1);
}
But this results in surprisingly expensive code. The reason is that on a 16-bit machine, we have the following sequence of actions to perform:
- Load 4 words from memory into registers.
- Bit-shift left through 4 words.
- Store the incoming bit in the LSB of one of the words.
- Store 4 words from registers into memory.
Step 2 is particularly expensive. We can make an improvement by changing the sequence of actions:
- Load 4 words from memory into registers.
- Bit-shift left through one word.
- Store the incoming bit in the LSB of one of the words.
- Store 4 words from registers into memory in a rotated order.
In code:
uint16_t q0, q1, q2, q3;
void sample_fast(void)
{
const uint16_t tmp = (q0 << 1) | (P1IN & 1);
q0 = q1;
q1 = q2;
q2 = q3;
q3 = tmp;
}
To see how this works, consider just three 8-bit words capturing the sequence of bits abcdefg...
. Suppose that we’ve captured the first 24 bits already, and now we want to shift in the bit Y
. The states of the stored words before and after using the first method are:
[a b c d e f g h] [b c d e f g h i]
[i j k l m n o p] ==> [j k l m n o p q]
[q r s t u v w x] [r s t u v w x Y]
With the second method, we end up spreading the sequence of bits in a round-robin fashion across the three words. The states before and after are:
[a d g j m p s v] [b e h k n q t w]
[b e h k n q t w] ==> [c f i l o r u x]
[c f i l o r u x] [d g j m p s v Y]
Notice that the three words have rotated upwards, and that two of the words haven’t changed their content at all. The savings from this method come from the fact that no carry-through is necessary.
Compare the generated code (MSP430) for each method:
0000804a <sample_slow>:
804a: 0b 12 push r11
804c: 0a 12 push r10
804e: 09 12 push r9
8050: 08 12 push r8
8052: 07 12 push r7
8054: 06 12 push r6
8056: 05 12 push r5
8058: 04 12 push r4
805a: 58 42 20 00 mov.b &0x0020,r8
805e: 4c 48 mov.b r8, r12
8060: 0d 43 clr r13
8062: 0e 43 clr r14
8064: 0f 43 clr r15
8066: 08 4c mov r12, r8
8068: 09 4d mov r13, r9
806a: 0a 4e mov r14, r10
806c: 0b 4f mov r15, r11
806e: 18 f3 and #1, r8 ;r3 As==01
8070: 09 f3 and #0, r9 ;r3 As==00
8072: 0a f3 and #0, r10 ;r3 As==00
8074: 0b f3 and #0, r11 ;r3 As==00
8076: 1c 42 02 02 mov &0x0202,r12
807a: 1d 42 04 02 mov &0x0204,r13
807e: 1e 42 06 02 mov &0x0206,r14
8082: 1f 42 08 02 mov &0x0208,r15
8086: 0c 5c rla r12
8088: 0d 6d rlc r13
808a: 0e 6e rlc r14
808c: 0f 6f rlc r15
808e: 04 48 mov r8, r4
8090: 05 49 mov r9, r5
8092: 06 4a mov r10, r6
8094: 07 4b mov r11, r7
8096: 04 dc bis r12, r4
8098: 05 dd bis r13, r5
809a: 06 de bis r14, r6
809c: 07 df bis r15, r7
809e: 82 44 02 02 mov r4, &0x0202
80a2: 82 45 04 02 mov r5, &0x0204
80a6: 82 46 06 02 mov r6, &0x0206
80aa: 82 47 08 02 mov r7, &0x0208
80ae: 34 41 pop r4
80b0: 35 41 pop r5
80b2: 36 41 pop r6
80b4: 37 41 pop r7
80b6: 38 41 pop r8
80b8: 39 41 pop r9
80ba: 3a 41 pop r10
80bc: 3b 41 pop r11
80be: 30 41 ret
000080c0 <sample_fast>:
80c0: 5e 42 20 00 mov.b &0x0020,r14
80c4: 1e f3 and #1, r14 ;r3 As==01
80c6: 1f 42 0e 02 mov &0x020e,r15
80ca: 0f 5f rla r15
80cc: 92 42 0c 02 mov &0x020c,&0x020e
80d0: 0e 02
80d2: 92 42 00 02 mov &0x0200,&0x020c
80d6: 0c 02
80d8: 92 42 0a 02 mov &0x020a,&0x0200
80dc: 00 02
80de: 0e df bis r15, r14
80e0: 82 4e 0a 02 mov r14, &0x020a
80e4: 30 41 ret
Of course, having captured the data, you need some way of processing it later. The method I used was to trigger a wakeup from within the interrupt handler. The main thread would disable interrupts, copy the queue words, reset the sampler state, and then re-enable interrupts.
The captured data can then be shifted out in using a procedure similar to that used for shifting in.
If you’re doing something like this, the last problem you need to solve is to know how many bits the queue contains at the time the main thread copies it. One solution is to keep a separate counter variable which is incremented whenever a bit is shifted in. This counter is reset to 0 when the queue is copied.
Another method, which reduces the amount of work to be done in the interrupt handler, is to place a sentinel bit in the queue. When the queue is reset by the main thread, we set all queue words to 0, except for q3
, which is set to 1. This is equivalent to clearing all bits and then shifting in a single 1.
When we process the content of the copied queue, we shift out, counting the number of bits until a 1 pops out the end. Subtract this count from the size of the queue (minus 1), and you know how many bits were shifted in by the interrupt handler.