Robbert Krebbers

PhD thesis

thesis On December 1, 2015 at Radboud University Nijmegen, I have defended my PhD thesis entitled:

The C standard formalized in Coq

with cum laude distinction.

Download

Summary

This thesis describes a formal specification of the sequential fragment of the C programming language based on the official description of the C language, the C11 standard. Our formal specification of C, which is named CH2O, is used for the development of technology that enables verification of C programs in a standards compliant and compiler independent way.

The C programming language is addressed in this thesis because it is both among the most widely used and among the most bug-prone programming languages. C is widely used because of the performance of programs, the control over system resources such as memory, and the fact that C programs run on nearly any computer platform. On the other hand, C is bug-prone due to its weak type system and the absence of run-time checks. Our formal specification of C can be used as the basis for formal verification of C programs. This way, one can develop reliable C programs without sacrificing the benefits of C.

All parts of our formal specification of C are defined in the Coq proof assistant. Coq ensures that all definitions comprising the formal specification (the semantics) are mathematically well-formed and enjoy desirable properties. Most importantly, Coq allows us to define multiple versions of the C semantics and to prove that these versions correspond to each other. By developing multiple versions of the semantics, we have considered C from different perspectives and have thereby obtained a higher confidence in the correctness of our specification. We have, based on the C11 standard, defined the following kinds of semantics for C in Coq and proved that these correspond to each other.

Numerous metatheoretical properties of the formal semantics are proven in order to validate the formal definitions. The code extraction mechanism of Coq is used to turn the executable semantics into an interpreter, which is able to run the semantics on a test suite that comprises actual C programs.

Our formal version of C11 covers a large part of the C language including delicate features such as the C type system, arrays and pointers, structs and unions, typedefs and enums, implicit type conversions, object representations, expressions with side-effects in the presence of non-deterministic expression evaluation, local variables in the presence of non-local control (goto, break, continue, return and unstructured switch) and initializers.

Formalizing C is challenging not just because of the number of features, but also because C is oriented towards being efficiently implementable rather than being abstract in a mathematical sense. Not only does the behavior of each C program depend on implementation-defined properties such as sizes and endianness of integers, but in order to abstract even further from implementation specific choices, the C11 standard assigns a set of possible behaviors to each program instead of a unique behavior. In case of semantically illegal programs (those that have undefined behavior in C terminology) this set contains all possible behaviors (including letting the program crash). Surprisingly, correctly describing the set of programs that have undefined behavior in a formal manner is the hardest part of formalizing C.

Many difficulties in describing undefined behavior are due to the interaction between low-level and high-level data access in C. Low-level data access involves unstructured and untyped byte representations, and high-level data access involves abstract values such as structs and unions. Compilers often use a high-level view of data access to justify optimizations whereas many programmers expect data access to behave in a low-level way. Optimizations based on type-based alias analysis (effective types in C terminology) are examples thereof. The semantics in this thesis is therefore built on top of a novel typed memory model based on trees to correctly model the interaction between the low-level and high-level view of data access.

Our formal specification of C is faithful to the C11 standard, which means it describes all undefined behaviors of C11. As a consequence, when one proves something about a given program with respect to our semantics, it should behave that way with any ostensibly C11 compliant compiler such as GCC or Clang.