A Better Varargs

I love those days when I come across a really good idea, something simple and helpful that I can use time and time again.

Today, on Stack Overflow, I read about a different approach to supplying a variable number of arguments to a function (a varargs or variadic function), described by Geo Carncross.^[1]

Now, there are many ways to skin a cat, and the traditional stdarg(3) does the job. So, what’s striking about this different approach? Well, it’s type safe, and it permits zero arguments, neither of which is possible with stdarg(3).

Examples

Using `stdarg(3)`:

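#include <stdarg.h>
#include <stdio.h>

/* Print each string argument; a NULL argument ends the list. */
static void foo(const char *p, ...)
{
	va_list l;

	va_start(l, p);
	for (; p; p = va_arg(l, char *))
		printf("%s\t", p);
	va_end(l);

	putchar('\n');
}

/* Append the terminating NULL so callers need not remember it. */
#define foo(x, ...) foo(x, __VA_ARGS__, NULL)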

This shows the traditional use of stdarg(3). A NULL signifies the end of arguments. A convenient macro supplies the NULL automatically.

Using an Array:

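/* Print the n strings in the array p. */
static void bar(int n, const char *p[])
{
	for (int i = 0; i < n; i++)
		printf("%s\t", p[i]);
	putchar('\n');
}

/*
 * Collect the arguments into a typed local array and pass its length.
 * The ({ ... }) block is a GNU C statement expression.
 */
#define bar(...) ({					\
	const char *_args[] = {__VA_ARGS__};		\
	bar(sizeof(_args)/sizeof(_args[0]), _args);	\
})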

Here’s the array-based approach Geo Carncross described. Amazingly, there’s nothing special about the bar() function. Its first argument is the number of elements in its second argument, an array. The bar() macro creates a local array inside a GNU statement expression and initializes it from the macro’s arguments, much as a C99 compound literal would.^[2] The C preprocessor expands __VA_ARGS__ to all the arguments supplied to the macro, and the macro calls the bar() function, using the sizeof operator to compute the number of arguments.

The _args array comes with a type, so the compiler can check the arguments.
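The same idea can also be written with a genuine C99 compound literal, which avoids the GNU statement expression. The following is only a minimal sketch: total_len() and NARGS() are hypothetical names invented for this illustration, not part of the original answer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical example function: total length of the n strings in p. */
static size_t total_len(int n, const char *const p[])
{
	size_t sum = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		sum += strlen(p[i]);
	return sum;
}

/*
 * Hypothetical helper: count the arguments by measuring an array built
 * from them. sizeof does not evaluate its operand here, so the
 * arguments are evaluated only once, in the real call below.
 */
#define NARGS(...) \
	((int)(sizeof((const char *[]){__VA_ARGS__}) / sizeof(const char *)))

/* Wrap the arguments in a typed compound literal and pass the count. */
#define total_len(...) \
	total_len(NARGS(__VA_ARGS__), (const char *const []){__VA_ARGS__})
```

With these definitions, total_len("abc", "de", "f") passes a count of 3 and a three-element array. Because every element must initialize a const char *, a stray integer argument draws the same kind of diagnostic as with bar().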

If I call the stdarg(3)-based foo() with a stray integer, as in foo("abc", "def", 9);, there’s no warning at compile time. And at runtime? A segmentation fault, because printf() is handed the 9 as if it were a string pointer.

If I call the array-based bar() the same way, bar("abc", "def", 9);, the compiler warns me at compile time:

bar.c: In function 'main':
bar.c:51:2: warning: initialization makes pointer from integer without a cast [enabled by default]
  bar("abc", "def", 9);

Now, it’s just a warning, so if I ignore it and run the code as-is, I get the same segmentation fault. Building with -Werror turns the warning into a hard error.

Taking a Closer Look

GCC has supported C99 compound literals since at least 2002;^[3] and yet, I’ve never seen this approach before. So, surely, this must be too good to be true, right? There must be some performance penalty.

When in doubt, read the source - or in this case, the assembly code.

`foo()` disassembled: ^[4]

0x00000000004006c0 <+0>:     sub    $0x58,%rsp
0x00000000004006c4 <+4>:     test   %rdi,%rdi
0x00000000004006c7 <+7>:     lea    0x60(%rsp),%rax
0x00000000004006cc <+12>:    mov    %rsi,0x28(%rsp)
0x00000000004006d1 <+17>:    mov    %rdx,0x30(%rsp)
0x00000000004006d6 <+22>:    mov    %rcx,0x38(%rsp)
0x00000000004006db <+27>:    mov    %r8,0x40(%rsp)
0x00000000004006e0 <+32>:    mov    %rdi,%rsi
0x00000000004006e3 <+35>:    mov    %rax,0x10(%rsp)
0x00000000004006e8 <+40>:    lea    0x20(%rsp),%rax
0x00000000004006ed <+45>:    mov    %r9,0x48(%rsp)
0x00000000004006f2 <+50>:    movl   $0x8,0x8(%rsp)
0x00000000004006fa <+58>:    mov    %rax,0x18(%rsp)
0x00000000004006ff <+63>:    jne    0x40071e <foo+94>
0x0000000000400701 <+65>:    jmp    0x400749 <foo+137>
0x0000000000400703 <+67>:    nopl   0x0(%rax,%rax,1)
0x0000000000400708 <+72>:    mov    %edx,%eax
0x000000000040070a <+74>:    add    0x18(%rsp),%rax
0x000000000040070f <+79>:    add    $0x8,%edx
0x0000000000400712 <+82>:    mov    %edx,0x8(%rsp)
0x0000000000400716 <+86>:    mov    (%rax),%rsi
0x0000000000400719 <+89>:    test   %rsi,%rsi
0x000000000040071c <+92>:    je     0x400749 <foo+137>
0x000000000040071e <+94>:    xor    %eax,%eax
0x0000000000400720 <+96>:    mov    $0x4007f0,%edi
0x0000000000400725 <+101>:   callq  0x400460 <printf@plt>
0x000000000040072a <+106>:   mov    0x8(%rsp),%edx
0x000000000040072e <+110>:   cmp    $0x30,%edx
0x0000000000400731 <+113>:   jb     0x400708 <foo+72>
0x0000000000400733 <+115>:   mov    0x10(%rsp),%rax
0x0000000000400738 <+120>:   mov    (%rax),%rsi
0x000000000040073b <+123>:   lea    0x8(%rax),%rdx
0x000000000040073f <+127>:   mov    %rdx,0x10(%rsp)
0x0000000000400744 <+132>:   test   %rsi,%rsi
0x0000000000400747 <+135>:   jne    0x40071e <foo+94>
0x0000000000400749 <+137>:   mov    0x200460(%rip),%rsi        # 0x600bb0 <stdout@@GLIBC_2.2.5>
0x0000000000400750 <+144>:   mov    $0xa,%edi
0x0000000000400755 <+149>:   callq  0x400470 <_IO_putc@plt>
0x000000000040075a <+154>:   add    $0x58,%rsp
0x000000000040075e <+158>:   retq

`bar()` disassembled: ^[5]

0x00000000004007e0 <+0>:     push   %rbp
0x00000000004007e1 <+1>:     push   %rbx
0x00000000004007e2 <+2>:     sub    $0x8,%rsp
0x00000000004007e6 <+6>:     test   %edi,%edi
0x00000000004007e8 <+8>:     jle    0x400810 <bar+48>
0x00000000004007ea <+10>:    sub    $0x1,%edi
0x00000000004007ed <+13>:    mov    %rsi,%rbx
0x00000000004007f0 <+16>:    lea    0x8(%rsi,%rdi,8),%rbp
0x00000000004007f5 <+21>:    nopl   (%rax)
0x00000000004007f8 <+24>:    mov    (%rbx),%rsi
0x00000000004007fb <+27>:    xor    %eax,%eax
0x00000000004007fd <+29>:    mov    $0x400960,%edi
0x0000000000400802 <+34>:    add    $0x8,%rbx
0x0000000000400806 <+38>:    callq  0x400460 <printf@plt>
0x000000000040080b <+43>:    cmp    %rbp,%rbx
0x000000000040080e <+46>:    jne    0x4007f8 <bar+24>
0x0000000000400810 <+48>:    mov    0x200539(%rip),%rsi        # 0x600d50 <stdout@@GLIBC_2.2.5>
0x0000000000400817 <+55>:    add    $0x8,%rsp
0x000000000040081b <+59>:    mov    $0xa,%edi
0x0000000000400820 <+64>:    pop    %rbx
0x0000000000400821 <+65>:    pop    %rbp
0x0000000000400822 <+66>:    jmpq   0x400470 <_IO_putc@plt>

Even if I knew nothing about assembly, it’d still be pretty obvious that foo() is more than twice as large as bar(): 159 bytes of code versus 71. Since I do know a thing or two about it, I also see that foo() reserves 0x58 (88) bytes of stack, while bar() reserves only 0x8 (8) bytes beyond the two registers it pushes.

Still, that’s not the whole story. At a guess, I’d expect building the argument array at each call site to have a cost of its own.

Calling `foo()` from `main()` with three strings (plus a final `NULL`):

0x00000000004004c8 <+40>:    xor    %ecx,%ecx
0x00000000004004ca <+42>:    mov    $0x400b09,%edx
0x00000000004004cf <+47>:    mov    $0x400b0d,%esi
0x00000000004004d4 <+52>:    mov    $0x400b11,%edi
0x00000000004004d9 <+57>:    xor    %eax,%eax
0x00000000004004db <+59>:    callq  0x4009d0 <foo>

Calling `bar()` from `main()` with three strings:

0x00000000004005c0 <+288>:   lea    0x60(%rsp),%rsi
0x00000000004005c5 <+293>:   mov    $0x3,%edi
0x00000000004005ca <+298>:   movq   $0x400b11,0x60(%rsp)
0x00000000004005d3 <+307>:   movq   $0x400b0d,0x68(%rsp)
0x00000000004005dc <+316>:   movq   $0x400b09,0x70(%rsp)
0x00000000004005e5 <+325>:   callq  0x400930 <bar>

The call to foo() uses 24 bytes while the call to bar() uses 42 bytes, a difference of 18 bytes. If I call them with 16 strings, foo() uses 137 bytes and bar() uses 195 bytes, a difference of 58 bytes. The byte counts here refer to the size of the instructions and their operands, not to the stack used. For example, xor %eax,%eax takes two bytes: 31 c0.

Why the difference? Well, the x86-64 calling convention specifies that the first 6 integer or pointer arguments to a function are passed via registers. Arguments beyond the sixth are passed on the stack.^[6]

For foo() called with 16 strings, the code to put the first six arguments in registers is 32 bytes long, then 9 bytes each to put the remaining 11 arguments (the last is the NULL) on the stack, and a few more bytes to zero a register and call the function.

bar(), on the other hand, takes just 2 arguments: the integer count and the array pointer. The compiler does the sizeof computation itself, so the count is an immediate constant. The code to put those two arguments in registers is 10 bytes long. The array built by the macro lives on the stack. The code to put the 16 string addresses on the stack is 9 bytes each for 4 of them, with the rest taking 12 bytes each, presumably because their stack offsets no longer fit in a one-byte displacement. Add a final 5 bytes to call the function.

For stack usage, in the 16 string case, bar() stores 5 more pointers than foo() which is 40 more bytes. The difference is 5 pointers because foo() passes 6 arguments in registers, and needs one additional NULL argument.
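The arithmetic in the last few paragraphs can be restated as a toy model. This is only a sketch built from the figures quoted above for this particular build; the function names are hypothetical, and the byte sizes are observations, not a general rule for every compiler.

```c
#include <assert.h>

enum { REG_ARGS = 6, PTR_BYTES = 8 };

/* Call-site size of bar() in bytes, using the sizes quoted above. */
static int bar_call_bytes(int nstrings)
{
	int short_stores, long_stores;

	assert(nstrings >= 0);
	short_stores = nstrings < 4 ? nstrings : 4;
	long_stores = nstrings - short_stores;

	return 10			/* load the count and array pointer */
	     + 9 * short_stores		/* first four pointer stores */
	     + 12 * long_stores		/* remaining pointer stores */
	     + 5;			/* callq */
}

/* Pointers foo() passes on the stack: the strings plus the NULL,
 * minus the six arguments that travel in registers. */
static int foo_stack_ptrs(int nstrings)
{
	int total = nstrings + 1;

	return total > REG_ARGS ? total - REG_ARGS : 0;
}

/* Extra stack bytes bar() needs, since it stores every pointer. */
static int extra_stack_bytes(int nstrings)
{
	return (nstrings - foo_stack_ptrs(nstrings)) * PTR_BYTES;
}
```

The model reproduces the numbers in the text: 42 bytes for the three-string call, 195 bytes for sixteen strings, and 40 extra bytes of stack in the sixteen-string case.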

To compare performance, I made a microbenchmark^[7] (implemented as a kernel module hoarding a CPU and reading the TSC register^[8]). For these comparisons, the functions were modified to check each argument for an initial NUL character instead of printing, and to return an integer.
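The benchmarked variants might look something like the sketch below. The names foo_empty() and bar_empty() are hypothetical, and what the functions return is an assumption: here they count the arguments that are empty strings. The kernel module and TSC timing are not reproduced.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* stdarg(3) variant: count the arguments that are empty strings. */
static int foo_empty(const char *p, ...)
{
	va_list l;
	int empty = 0;

	va_start(l, p);
	for (; p; p = va_arg(l, char *))
		empty += (*p == '\0');
	va_end(l);

	return empty;
}

/* Array variant: the same check over n array elements. */
static int bar_empty(int n, const char *const p[])
{
	int empty = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		empty += (p[i][0] == '\0');

	return empty;
}

/* Supply the terminating NULL, cast to the type va_arg() reads. */
#define foo_empty(...) foo_empty(__VA_ARGS__, (char *)NULL)

/* Supply the count and a typed array built from the arguments. */
#define bar_empty(...) \
	bar_empty((int)(sizeof((const char *[]){__VA_ARGS__}) / \
			sizeof(const char *)), \
		  (const char *const []){__VA_ARGS__})
```

Returning a value that depends on every argument keeps the compiler from discarding the loop, which matters when the call being timed has no other side effects.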

On an Intel Xeon 2.4 GHz CPU, foo() ranged from 10 cycles (~4 nanoseconds) for 2 arguments to 84 ± 12 cycles (35 ± 5 ns) for 16 arguments. Interestingly, bar() appeared to take no time at all for the 2 and 4 argument cases. It turns out that gcc aggressively inlines bar(), but never inlines foo(). Disabling inlining made bar() take the same time as foo() for 2 and 4 arguments. For 16 arguments, bar() took 41 ± 5 cycles (17 ± 2 ns).
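The nanosecond figures follow from dividing the cycle counts by the clock rate. A minimal sketch, assuming the TSC ticks at the nominal 2.4 GHz (cycles_to_ns() is a hypothetical helper):

```c
#include <assert.h>

/* Convert a cycle count to nanoseconds; a clock of g GHz runs g cycles per ns. */
static double cycles_to_ns(double cycles, double ghz)
{
	assert(ghz > 0.0);
	return cycles / ghz;
}
```

At 2.4 GHz, 84 cycles comes to 35 ns and 41 cycles to about 17 ns, matching the figures above.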

Caveats

There is a shortcoming of the array-based approach to which I was completely oblivious until it was pointed out to me (thanks, Scott!). The programmer has to explicitly write the macro with the typed array, whereas the stdarg(3) approach doesn’t require a macro at all. So there’s one more thing to do, although foo() sans macro and bar() with macro have the same number of logical lines of code.^[9]

If a function takes arguments of more than one type, like printf(3), then stdarg(3) is still the way to go, perhaps with the format attribute.^[10]
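For the mixed-type case, here is a minimal sketch of a stdarg(3) function annotated with GCC’s format attribute. fmt_into() is a hypothetical helper made up for this illustration; the attribute tells the compiler that argument 3 is a printf-style format string and that the arguments to check start at position 4.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper: printf-style formatting into a caller's buffer. */
static int fmt_into(char *buf, size_t size, const char *fmt, ...)
	__attribute__((format(printf, 3, 4)));

static int fmt_into(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(buf != NULL && size > 0);

	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);	/* does the real formatting */
	va_end(ap);

	return n;	/* length the full output needs, as with snprintf(3) */
}
```

With the attribute in place and format warnings enabled (they are part of -Wall), a call such as fmt_into(buf, sizeof buf, "%s", 9) is flagged at compile time, which recovers some of the type checking that plain stdarg(3) lacks.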

Conclusion

I found that calling the array-based bar() takes more code bytes and more stack at the call site than calling the stdarg(3)-based foo(), but foo() itself is more than twice the size of bar(). My microbenchmark showed bar() running about twice as fast as foo() with 16 arguments, and compiler inlining of bar() can reduce the overhead even further.

Best of all, with the array-based approach I get type safety, so the compiler helps me to write better code! That’s what I call a really good idea.

Notes

[1] http://stackoverflow.com/questions/10533842/typesafe-varargs-in-c-with-gcc

[2] Read more about compound literals at http://www.drdobbs.com/the-new-c-compound-literals/184401404.

[3] http://gcc.gnu.org/c99status.html

[4] Compiled with gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9) and -O2.

[5] Amusingly, the disassembly of bar() is identical to this rather more quaint version:

static void baz(int n, const char *p[])
{
	while (n--)
		printf("%s\t", *p++);
	putchar('\n');
}

[6] Read more about the x86-64 calling convention at http://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64/

[7] All benchmarks, and especially microbenchmarks, should be viewed with a healthy amount of skepticism for their applicability to the real world.

[8] http://www.intel.com/…/ia-32-ia-64-benchmark-code-execution-paper.pdf

[9] Definitions for lines of code are notoriously ambiguous. Here, I’m counting any line containing a keyword or identifier.

[10] https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html

Thanks to Scott Gregory for reading a draft of this.

Code Acumen
