A tale of an impossible bug: big.LITTLE and caching

Rodrigo Kumpera and Bernhard Urban runtime

When someone says multi-core, we unconsciously think SMP. That worked out well for us until recently when ARM announced big.LITTLE. ARM’s big.LITTLE architecture is the first mass produced AMP architecture and as we’ll see next, it raises the bar for how hard multi-core programing is.

A tale of an impossible bug

It all started with a bug report against a phone with such a CPU, the Exynos chipset used on Samsung phones in Europe. Apps created with our software were dying with SIGILL at all completely random places. Nothing could reasonably explain what was happening, and the crash was happening with valid instructions. This immediately made us suspect bad instruction cache flushing.

After reviewing all JIT code around cache flushing we were sure that we were calling __clear_cache properly. That lead us to look around for how other virtual machines or compilers do cache flushing on ARM64, and we found out about some related errata on the Cortex A53. ARM’s description of those issues is both cryptic and vague, but we tried the workaround anyways. No luck there.

Next we went with the other usual suspects. A lying signal handler? Nope. Funky userspace CPU emulation? No. Broken libc implementation? Nice try. Faulty hardware? We reproduced it on multiple devices. Bad luck or karma? Yes!

Some of us could not sleep with such amazing puzzle in front of us and kept staring at memory dumps around failure sites. And there was this funny thing: the fault address was always on the third or fourth line of the memory dumps.

hexdump

This was our only clue, and there are no coincidences when it comes to this sort of byzantine bug. Our memory dumps were of 16 bytes per line and the SIGILL would always happen to be somewhere between 0x40-0x7f or 0xc0-0xff. We aligned the memory dump to help verify whether the code allocator was doing something funky:

$ grep SIGILL *.log
custom_01.log:E/mono           (13964): SIGILL at ip=0x0000007f4f15e8d0
custom_02.log:E/mono           (13088): SIGILL at ip=0x0000007f8ff76cc0
custom_03.log:E/mono           (12824): SIGILL at ip=0x0000007f68e93c70
custom_04.log:E/mono           (12876): SIGILL at ip=0x0000007f4b3d55f0
custom_05.log:E/mono           (13008): SIGILL at ip=0x0000007f8df1e8d0
custom_06.log:E/mono           (14093): SIGILL at ip=0x0000007f6c21edf0
[...]

With that we came to our first good hypothesis: Bad cache flushing was happening only on the upper 64 bytes of every 128-byte block. Those numbers, if you deal with low level programming, immediately remind you of cache line sizes. And that is where it all started to make sense.

Here is a pseudo version of how libgcc does cache flushing on arm64:

void __clear_cache (char *address, size_t size)
{
    static int cache_line_size = 0;
    if (!cache_line_size)
        cache_line_size = get_current_cpu_cache_line_size ();

    for (int i = 0; i < size; i += cache_line_size)
        flush_cache_line (address + i);
}

In the above pseudo-code get_current_cpu_cache_line_size is a CPU instruction that returns the line size of its caches, and flush_cache_line flushes the cache line that contains the supplied address.

At that point we were using our own version of this function, so we instrumented it to print the cache line size as returned by the CPU and, lo and behold, it printed both 128 and 64. We double verified that this was indeed the case. So we went to see that particular CPU manual and it turns out that the big core has a 128 bytes cache line but on the LITTLE core it is only 64 bytes for the instruction cache.

So what was happening is that __clear_cache would be called first on a big core and cache 128 as the instruction cache line size. Later it would be called on one of the LITTLE cores and would skip every other cache line when flushing. It doesn’t get simpler than that. We removed the caching and it all worked.

Summary

Some ARM big.LITTLE CPUs can have cores with different cache line sizes, and pretty much no code out there is ready to deal with it as they assume all cores to be symmetrical.

Worse, not even the ARM ISA is ready for this. An astute reader might realize that computing the cache line on every invocation is not enough for user space code: It can happen that a process gets scheduled on a different CPU while executing the __clear_cache function with a certain cache line size, where it might not be valid anymore. Therefore, we have to try to figure out a global minimum of the cache line sizes across all CPUs. Here is our fix for Mono: Pull Request. Other projects adopted our fix as well already: Dolphin and PPSSPP.