C switch statement efficiency

MQ V8 introduced a header file, cmqstrc.h, that maps constants in the MQI to string values. I’ve mentioned it a few times recently, both here and in some other forums. One question came in asking how well that code worked. So I’ve recreated some tests I wrote back in 2015 to demonstrate what compilers do with that pattern, and to show how efficiently C compilers handle large switch statements.

There’s a phrase I’ve seen a few times but whose source I can’t discover: “you have to write a compiler three times before you can call yourself a compiler writer”. I’ve never needed to write a compiler or assembler, unless you count a Core Wars implementation in FORTRAN (the course requirement language). And the last time I did anything serious with writing machine code I was using this book (and I wasn’t programming a Terminator):

[Image: an 8-bit 6502 assembler programming book]

But very little specialised knowledge was needed to do this investigation, and (hopefully) to explain what I found.

The switch pattern

The repeated pattern of the MQ header file is a set of switch statements, one for each “prefix” in the MQI, with each prefix given a separate function. For example, to convert MQCC values into the corresponding strings, there is

 char *MQCC_STR (MQLONG v)
  {
    char *c;
    switch (v)
    {
    case         -1: c = "MQCC_UNKNOWN"; break;
    case          0: c = "MQCC_OK"; break;
    case          1: c = "MQCC_WARNING"; break;
    case          2: c = "MQCC_FAILED"; break;
    default: c = ""; break;
    }
    return c;
  }
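As a usage sketch: including cmqstrc.h (along with cmqc.h for the constants) is enough to call these functions directly. The program below is my own illustration, not one of the MQ samples, and it assumes the MQ SDK headers are on the include path.

#include <stdio.h>
#include <cmqc.h>      /* MQI constants such as MQCC_WARNING */
#include <cmqstrc.h>   /* the constant-to-string functions   */

int main(void)
{
  MQLONG cc = MQCC_WARNING;   /* a completion code from some MQI call */
  printf("CompCode %d is %s\n", (int)cc, MQCC_STR(cc));
  return 0;
}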

What actually happens when you call that function? And how might it behave in a much larger function such as MQRC_STR – nearly 600 cases? Should I split the MQRC_STR function into subgroups, or reorder the statements in some way?

The test program

I wrote two pieces of code. One is a trivial bit of C that calls a switch block:

#include <stdio.h>
char *switchTest(int i) {
  switch (i) {
  #include "gen.txt" // A script-generated block
  default:
    return "N/A";
  }
}
int main()  {
  int i=2035;
  printf("Returned value of %d is \"%s\"\n",i,switchTest(i));
  return(0);
}

And the other is a shell script that generated the contents of that switch:

#!/bin/bash
function gen {
 start=$1; stop=$2; filter=$3
 for i in $(seq $start $stop); do
   mod=$(($RANDOM % $filter)) # Knock out a few lines
   if [ $mod -ne 0 ]
   then
     echo "case $i: return \"$i\";"
   fi
 done
}
# Filter can be made smaller to remove more random entries
gen 2001 2003 32767 > gen.txt

So we have this file as the main body of the test case:

$ cat gen.txt
case 2001: return "2001";
case 2002: return "2002";
case 2003: return "2003";

The small test

Rather than compiling and running the program (although I did first check that it worked), I stopped the compilation process once it had generated the assembler: cc -S testcase.c, which creates testcase.s. I’ve removed a few framing pieces from the output, but you can see the implementation of the switch code here:

.LC0: .string "2001"
.LC1: .string "2002"
.LC2: .string "2003"
.LC3: .string "N/A"
 switchTest:
         .cfi_startproc
         pushq   %rbp
         .cfi_def_cfa_offset 16
         .cfi_offset 6, -16
         movq    %rsp, %rbp
         .cfi_def_cfa_register 6
         movl    %edi, -4(%rbp)
         cmpl    $2003, -4(%rbp)
         je      .L2           // Jump if equal ...
         cmpl    $2003, -4(%rbp)
         jg      .L3           // Jump if greater than ...
         cmpl    $2001, -4(%rbp)
         je      .L4
         cmpl    $2002, -4(%rbp)
         je      .L5
         jmp     .L3           // Unconditional jump
 .L4:
         movl    $.LC0, %eax
         jmp     .L6
 .L5:
         movl    $.LC1, %eax
         jmp     .L6
 .L2:
         movl    $.LC2, %eax
         jmp     .L6
 .L3:
         movl    $.LC3, %eax
 .L6:
         popq    %rbp
         ret

Translating that back into a more readable style, we get something like

if i == 2003 GOTO L2
else if i > 2003 GOTO L3     // the 'jg' instruction
else if i == 2001 GOTO L4
else if i == 2002 GOTO L5
else GOTO L3

L2: return "2003"
L3: return "N/A"
L4: return "2001"
L5: return "2002"

There’s some interesting reordering going on here, but you can see how it’s basically testing each value in turn. Which doesn’t bode well for a large switch. But let’s see what really happens.

The large test

I generated something closer to the MQRC situation using the script. There were numbers from 2000-2600 and 6100-6150 with a sprinkling of missing lines.

This time the generated code looks very different. There is only a single direct comparison.

.LC0: .string "2000"
.LC1: .string "2001"
....
         subl    $2000, %eax  // Normalise to 0 base
         cmpl    $4150, %eax  // The only compare instruction used
         ja      .L2
         movl    %eax, %eax          // Zero-extend the index into %rax
         movq    .L4(,%rax,8), %rax  // Load the target address from the table
         jmp     *%rax               // Jump to it
       
 .L4:
         .quad   .L615
         .quad   .L614
         .quad   .L613
 ....
 .L615:
         movl    $.LC0, %eax
         jmp     .L616
 .L614:
         movl    $.LC1, %eax
         jmp     .L616
....

Again doing a translation, renumbering the labels for clarity:

t = i-2000          // normalise to a zero offset
if t > 4150 GOTO L2 // out of bounds - direct jump
else label = table[t]
GOTO label 

table[0] = L2000
table[1] = L2001
…
table[27] = L2 // a "missing" case
…
table[4150] = L6150

L2: return "N/A"
L2000: return "2000"
…

What’s been done is called a jump table. Rather than testing every value in turn, the compiler has recognised that there is a sequence and can build a table of locations so we can step directly to the right address. The cost of returning the right string is the same regardless of the size of the switch block and the input value (ignoring processor cache lines etc). In this case, the single jump table is large enough to cover the entire range of options from 2000-6150, with all the unused entries being set to the address of the default label. One of the aspects of the C language that makes this strategy relatively obvious is that switch statements can only be done based on integer values; other languages where matches might be made on strings or patterns could make it more complex to implement. But still not usually as bad as testing every case in turn.
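To make the idea concrete, here is a hand-written C sketch of the same approach, using a table of strings rather than a table of code addresses. The names and ranges are illustrative, not taken from the generated code.

#include <stdio.h>

static const char *table[4151];   /* covers 2000..6150 once normalised */

static const char *lookup(int v)
{
  int t = v - 2000;               /* normalise to a zero offset */
  /* Two separate tests here; the real generated code folds them into one,
     as described in the next paragraph. */
  if (t < 0 || t > 4150)
    return "N/A";
  return table[t] ? table[t] : "N/A";   /* an empty slot is a "missing" case */
}

int main(void)
{
  table[2001 - 2000] = "2001";    /* fill in a couple of entries */
  table[6150 - 2000] = "6150";
  printf("%s %s %s\n", lookup(2001), lookup(2027), lookup(10));
  return 0;
}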

One further optimisation is visible here. In the small test, the generated code used the jg instruction (Jump If Greater); in the large test, there’s a ja (Jump Above) instruction. The difference between them is whether the comparison uses signed or unsigned values. If you had wondered what happened if I passed the value 10 to the long switch, and why there was no "if less than 0" test after the subtraction, then this explains it. The result of "10 - 2000" is -1990, which, treated as an unsigned number, is far bigger than 4150. So the single ja handles numbers outside the valid range at either end.
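A small demonstration of that trick (my own example, not part of the test program): once the subtracted value is treated as unsigned, a single comparison rejects inputs below the base of the range as well as above it.

#include <stdio.h>

int main(void)
{
  int v = 10;
  unsigned int t = (unsigned int)(v - 2000);   /* -1990 wraps to a very large number */
  printf("t = %u, out of range: %s\n", t, (t > 4150) ? "yes" : "no");
  return 0;
}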

Some variations

I played with a few other scenarios. One of them was to see at what point gcc changed its behaviour from comparing each case directly to using a jump table. It seemed that going from 4 to 5 entries was the break point.

I also did a couple of tests to investigate whether all the undefined cases got their own jump table entry when there are very large gaps in the sequences. So one generated version used numbers 200-2000, 6100-6150, and 20000-21000 in a single block, with again a few random missing entries.

In that situation, gcc seemed to build two jump tables with a single extra comparison to work out which table to refer to. One table covered 200-6150 and the other covered 20000-21000.
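As a rough sketch of the shape of that code (my own illustration, with the table contents mostly elided): one extra range comparison picks the table, then the same normalise-and-index step applies as before.

#include <stdio.h>

static const char *lowTable[5951];    /* covers 200..6150, normalised to 0   */
static const char *highTable[1001];   /* covers 20000..21000, normalised to 0 */

static const char *lookup2(int v)
{
  if (v >= 20000) {                             /* the single extra comparison */
    unsigned int t = (unsigned int)(v - 20000);
    return (t <= 1000 && highTable[t]) ? highTable[t] : "N/A";
  }
  unsigned int t = (unsigned int)(v - 200);
  return (t <= 5950 && lowTable[t]) ? lowTable[t] : "N/A";
}

int main(void)
{
  lowTable[300 - 200] = "300";          /* a couple of representative entries */
  highTable[20500 - 20000] = "20500";
  printf("%s %s %s\n", lookup2(300), lookup2(20500), lookup2(7000));
  return 0;
}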

Implementation differences

The fragments of assembler code here came from using gcc on an x64 Linux system. I also tried clang on the same machine. Although the code was slightly different, it showed the same fundamental pattern. I’d also expect to see similar behaviour on other platforms, which proved to be the case when I tried both gcc and xlc on AIX.

I considered looking at the gcc source code to see how it made its decisions, and if it built other patterns for other kinds of distribution. But I decided that the empirical evidence shown by the assembler output was good enough.

Optimisations and Speculations

These tests were done with “unoptimised” code. I did try turning on the optimiser (-O3), but it then removed nearly all the code, as it could tell immediately what the correct returned string was going to be. Once the program runs on real hardware, there will be further optimisations from all the tricks that processor designers come up with, including speculative execution based on what might have been returned on a previous run of the same function (though perhaps less of that since Spectre).
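One way to keep the switch visible in the optimised output (a suggested variation on the test program, not part of the original tests) is to make the input unknown at compile time, for example by reading it from the command line, so the optimiser cannot simply fold the whole call away:

#include <stdio.h>
#include <stdlib.h>

char *switchTest(int i) {
  switch (i) {
  #include "gen.txt"   /* the script-generated cases, as before */
  default:
    return "N/A";
  }
}

int main(int argc, char **argv)
{
  int i = (argc > 1) ? atoi(argv[1]) : 2035;   /* value not known until run time */
  printf("Returned value of %d is \"%s\"\n", i, switchTest(i));
  return 0;
}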

Summary

Compiler writers are usually clever people! They can spot the patterns and generate suitable code. It doesn’t mean we should write deliberately inefficient code in applications, but it does mean that we don’t need to play too many tricks to get “closer” in some sense to the underlying operations.

