From Pascal strings to Python tuples

Pascal Cuoq - 31st Jul 2013

Quiz time

What does the program below do?

#include <stdio.h> 
int main(){ 
  struct { 
    int t[4]; 
    int u; 
  } v; 
  v.u = 3; 
  v.t[4] = 4; 
  printf(\v.u=%d"  v.u); 
  return 0; 
} 

Two answers are “it prints v.u=4” and “it prints v.u=3”:

$ gcc t.c && ./a.out  
v.u=4 
$ gcc -O2 t.c && ./a.out  
v.u=3 

The correct answer is of course that the program invokes undefined behavior. It is not that we are using at any time an lvalue of the wrong type to access memory breaking the so-called “strict aliasing rules”. It is not that v.t+4 is outside of object v. The problem is that v.t+4 is outside object v.t. So GCC does what it pleases and when compiling with -O2 optimizes brutally:

$ gcc -S -O2 t.c && cat t.s 
.LC0: 
	.string	"v.u=%d" 
… 
	movl 	$3  %edx 
	movl 	$.LC0  %esi 
	movl	 $1  %edi 
	xorl	%eax  %eax 
	call	__printf_chk 

Frama-C's value analysis warns for the above program:

$ frama-c -val t.c 
t.c:9:[kernel] warning: accessing out of bounds index {4}. assert 4 < 4; 

In general accessing t[i] when t is an array of size 4 is only valid when i < 4 but here the index is hard-coded as 4 so line 9 is only valid when 4 < 4. That is never: all executions that reach line 9 encounter undefined behavior there.

Second quiz same as the first quiz

What does the program below do?

#include "stdlib.h" 
typedef struct{ 
  int tab[1]; 
} ts; 
int main() { 
  ts *q = malloc(5*sizeof(int)); 
  q->tab[2]= 5; 
  return 1; 
} 

If you guessed “invoke undefined behavior” well done!

The program above was shown to me by facetious colleague Bernard Botella who is hard at work analyzing Python 2.7.4's runtime in the context of a project named SafePython. The snippet above is his reduced version of a larger piece of C code he found there. The issue Bernard was having started with the type definition below and I will let you guess the rest:

typedef struct { 
   PyObject_VAR_HEAD 
   PyObject *ob_item[1]; 
   /* ob_item contains space for 'ob_size' elements. 
    * Items must normally not be NULL  except during construction when 
    * the tuple is not yet visible outside the function that builds it. 
    */ 
} PyTupleObject; 

In C90 the “array of size 1 as last member of a struct” was a common idiom for implementing things like Pascal strings. And of course it is just as valid for variable-length tuples. The problem is that this is not 1990 any more: compilers now use undefined behavior as an excuse to optimize aggressively and the idiom is no longer valid at all for either tuples or Pascal strings. On the plus side in the C99 standard we got “incomplete types” a safe way to implement tuples and Pascal strings:

typedef struct { 
   PyObject_VAR_HEAD 
   PyObject *ob_item[]; 
… 

Conclusion

I have encouraged my colleague Bernard to report the above as a bug in Python. This kind of bug report is usually ignored because it denounces idioms that programmers have used for a long time and that they think they understand. Just remember: if you think you can predict what the program in the second quiz does you should be able to predict what the program in the first quiz does (or explain what is different about it).

Pascal Cuoq
31st Jul 2013