
BPF at Facebook (and beyond)


By Jonathan Corbet
October 10, 2019
Kernel Recipes
It is no secret that much of the work on the in-kernel BPF virtual machine and associated user-space support code is being done at Facebook. But less is known about how Facebook is actually using BPF. At Kernel Recipes 2019, BPF developer Alexei Starovoitov described a bit of that work, though even he admitted that he didn't know what most of the BPF programs running there were doing. He also summarized recent developments with BPF and some near-future work.

Kernels at Facebook

[Alexei Starovoitov]

Facebook, he began, has an upstream-first philosophy taken to an extreme: the company tries not to carry any out-of-tree patches at all. All work done at Facebook is meant to go upstream as soon as it practically can. The company also runs recent kernels, upgrading whenever possible; it can move to a new kernel in a matter of days, a process that could be faster, he said, except that it still takes some time to reboot thousands of servers. As of just before the talk, most of the Facebook fleet was running 4.16, with a few 4.11 machines hanging around and some at 5.2.

He pointed out the lack of long-term-support kernels in the above list. Facebook does not plan to stay with any given kernel for a long time, so the company doesn't care about long-term support. Instead, machines are simply upgraded to whichever kernel is available. Within a given version, though, there can be a fair amount of variation across the fleet; the kernel team evidently backports features into older kernels when the need arises. That can create challenges for applications and, especially, BPF-based applications.

The first rule of kernel development is "don't break user space"; anything that might cause a user-space program to fail becomes part of the kernel ABI. Performance regressions are included in this rule. Performance problems are easy to create, so Facebook needs a team to track them down. Often, it seems, BPF is fingered as the cause of these problems.

Starovoitov asked the audience to guess how many BPF programs were running on their laptops, then to run this command:

    ls -la /proc/*/fd | grep bpf-prog | wc -l

The answer on your editor's system is six, all running from systemd. He was surprised by the answer at Facebook: there are about 40 BPF programs running on each server, with another 100 that are demand loaded. There are many teams within the company writing and deploying these programs; the kernel team doesn't even know about all of these BPF programs. These programs are about evenly split between those attached to kprobes, those attached to tracepoints, and network scheduling-class helpers; about 10% fall into other categories.
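
For readers wondering what one of these programs looks like, the following is a minimal sketch (not something Facebook runs) of a kprobe-attached program written in the restricted C that LLVM compiles to the BPF instruction set; the probed function and the message are chosen purely for illustration.

    /*
     * Minimal sketch of a kprobe-attached BPF program, written in the
     * restricted C dialect that LLVM compiles to BPF.  The probed function
     * and the message are purely illustrative.
     */
    #include <linux/bpf.h>
    #include <linux/ptrace.h>
    #include <bpf/bpf_helpers.h>

    SEC("kprobe/tcp_retransmit_skb")
    int count_retransmits(struct pt_regs *ctx)
    {
        /* A real program would update a map; this one just emits a trace line. */
        bpf_printk("tcp retransmit observed\n");
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";

Once loaded (normally via libbpf and the bpf() system call), such a program shows up as an anon_inode:bpf-prog file descriptor in the loader's /proc/PID/fd directory; those file descriptors are what the command above is counting.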

He gave a few examples of performance issues that, at least on the surface, were caused by BPF:

  • Facebook runs a packet-capture daemon that makes use of a network scheduling-class BPF program; it occasionally spits out a packet for inspection. On new kernels, running that daemon regressed overall system performance by about 1%. It turns out that this daemon uses another BPF program, attached to a kprobe, for a different purpose. The function that the probe attaches to didn't exist in the newer kernel, causing the daemon to conclude that BPF as a whole was broken; it then fell back to an older, slower method for packet capture. Kprobes are not a stable ABI, Starovoitov said, but when kernel developers change a function, kprobe usage can still require somebody to investigate the resulting breakage.
  • The number-one performance-analysis tool at Facebook is a profiling daemon that attaches BPF programs to tracepoints and kprobes in the scheduler and beyond. On new kernels, it caused a 2% performance regression, manifesting as an increase in software-interrupt time. It turns out that, in the 5.2 kernel, setting a kprobe causes the text section to be remapped from 2MB huge pages to normal 4KB pages, with a resulting increase in TLB misses and decrease in performance.
  • There is a security monitoring daemon that sets BPF programs on three kprobes and one kretprobe. It runs at low priority, waking up every few seconds and consuming about 0.01% of the CPU. This daemon was causing huge latency spikes for a higher-priority database application. Some tracing work showed that, on occasion, a memcpy() call in the database could stall for as much as 1/4 second while the monitoring daemon was reading the database process's /proc/pid/environ file. Much more tracing showed that this daemon was acquiring the mmap_sem lock when reading that /proc file, then being scheduled out for long periods of time, blocking page faults in the main application. The root cause was a basic priority-inversion issue; raising the security daemon's priority prevents this problem.

The takeaway from all of these episodes — and especially the last one — is that the best tool for tracking down BPF-related performance regressions is BPF.
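
As a rough illustration of that approach, the core of such a latency hunter can be a kprobe/kretprobe pair that timestamps entry into a suspect function and complains about calls that stall; the function probed and the 10ms threshold in this sketch are assumptions for illustration, not details of the tool Facebook used.

    /*
     * Hedged sketch: time a kernel function with a kprobe/kretprobe pair and
     * report unusually slow calls.  The probed function and the 10ms
     * threshold are illustrative only.
     */
    #include <linux/bpf.h>
    #include <linux/ptrace.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 10240);
        __type(key, __u64);      /* pid_tgid of the caller */
        __type(value, __u64);    /* entry timestamp, in ns */
    } start SEC(".maps");

    SEC("kprobe/access_remote_vm")
    int on_entry(struct pt_regs *ctx)
    {
        __u64 id = bpf_get_current_pid_tgid();
        __u64 ts = bpf_ktime_get_ns();

        bpf_map_update_elem(&start, &id, &ts, BPF_ANY);
        return 0;
    }

    SEC("kretprobe/access_remote_vm")
    int on_return(struct pt_regs *ctx)
    {
        __u64 id = bpf_get_current_pid_tgid();
        __u64 *tsp = bpf_map_lookup_elem(&start, &id);

        if (tsp) {
            __u64 delta = bpf_ktime_get_ns() - *tsp;

            if (delta > 10 * 1000 * 1000)    /* stalled for more than 10ms */
                bpf_printk("slow call: %llu ns\n", delta);
            bpf_map_delete_elem(&start, &id);
        }
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";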

Current and future BPF improvements

Another kind of problem results from how BPF programs are built. A user-space application will contain one or more BPF programs to be loaded into the kernel. These programs are written in C and compiled to the BPF virtual machine instruction set; this compilation happens on the target system. To ensure that the compilation can be done consistently, a version of the LLVM compiler is embedded in the application itself. This makes the applications big, and the compilation process can perturb the main workload on the target system. The compilation can also take a long time, since it is done at a low priority; several minutes to compile a 100-line program is not unusual. The system headers needed to understand kernel data structures may be missing from the target system, creating compilation failures. It is a pain point, he said.

The solution to this problem is to be found in the "compile once, run everywhere" work that reached a milestone with the 5.4 kernel. It uses the BPF type format (BTF) data describing kernel data structures that was created for just this purpose. With BTF provided by the kernel, there is no longer any need to ship kernel headers around; instead, the bpftool utility just extracts the BTF data and creates a "monster header file" on the target system. An LLVM built-in function has been added to preserve the offsets into structures used by BPF programs; those offsets are then "relocated" at load time to match the version of the structure used in the target kernel.
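
From a program author's point of view, the result looks roughly like the sketch below (the probed function and the fields read are illustrative only): vmlinux.h is the BTF-derived header produced by bpftool, and libbpf's BPF_CORE_READ() macro uses the LLVM built-in mentioned above to record the offsets to be relocated at load time.

    /*
     * Sketch of a "compile once, run everywhere" field access.  vmlinux.h is
     * the BTF-derived header generated with:
     *     bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
     * BPF_CORE_READ() records the offsets it uses via the LLVM built-in, so
     * that libbpf can relocate them against the running kernel's BTF.
     */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>

    SEC("kprobe/do_exit")
    int on_task_exit(struct pt_regs *ctx)
    {
        struct task_struct *task = (void *) bpf_get_current_task();
        int parent_tgid;

        /*
         * This chain of dereferences keeps working even if task_struct's
         * layout differs between the build and target kernels.
         */
        parent_tgid = BPF_CORE_READ(task, real_parent, tgid);
        bpf_printk("exiting task's parent tgid: %d\n", parent_tgid);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";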

A number of other interesting projects have made progress in 2019, he said. Support for bounded loops in the verifier was added to 5.3 after two years of work. BPF programs can now manage concurrency with spinlocks, with the verifier proving that these programs will not deadlock. Dead-code elimination has been added, and scalar precision tracking as well.
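
As a hedged sketch of what those two features look like at the source level, consider the (entirely illustrative) program below; since spinlocks are not available to tracing programs, it is written as a traffic-classifier program, and its map layout and loop bound are invented for the example.

    /*
     * Illustrative sketch of a bounded loop (accepted by the verifier since
     * 5.3) and a bpf_spin_lock-protected map value.  The map contents and
     * the loop bound are made up; spinlocks are only permitted in certain
     * program types, so this example is a traffic classifier, not a kprobe.
     */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct counters {
        struct bpf_spin_lock lock;
        __u64 total;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct counters);
    } stats SEC(".maps");

    SEC("classifier")
    int bounded_and_locked(struct __sk_buff *skb)
    {
        struct counters *c;
        __u32 key = 0;
        __u64 sum = 0;
        int i;

        /* A bounded loop: the verifier can prove it ends after 64 rounds. */
        for (i = 0; i < 64; i++)
            sum += i;

        c = bpf_map_lookup_elem(&stats, &key);
        if (!c)
            return 0;

        /* The verifier checks that the lock is released on every path. */
        bpf_spin_lock(&c->lock);
        c->total += sum;
        bpf_spin_unlock(&c->lock);

        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";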

Starovoitov said that people often complain that the BPF verifier is painful to deal with. But, he said, it is far smarter than the LLVM compiler, and a number of advantages come from that, starting with the ability to prove that a program is safe to load into the kernel. The verifier is also able to perform far better dead-code elimination than LLVM can.

In the future, the verifier is set to get better by making more use of the available BTF data. Every program type, for example, must implement its own boilerplate functions to provide (and check) access to the context object passed to the programs themselves. This code bloats the kernel, he said, and tends to be prone to bugs. With BTF, those functions will no longer be necessary; the verifier can use the BTF data to check programs directly. That will enable the removal of 4,000 lines of code, he said.

He concluded by saying that BPF development is "100% driven by use cases"; the way to shape its future direction is to show the ways in which new features can be useful. Even better, of course, is to hack new extensions and to share them with the community.

[Your editor thanks the Linux Foundation, LWN's travel sponsor, for supporting his travel to this event.]

Index entries for this article
Kernel: BPF
Conference: Kernel Recipes/2019



BPF at Facebook (and beyond)

Posted Oct 16, 2019 17:31 UTC (Wed) by nilsmeyer (guest, #122604) [Link]

It would be very interesting to hear why they run those specific kernel versions and prefer manual backports vs. tracking upstream.

BPF at Facebook (and beyond)

Posted Oct 18, 2019 5:51 UTC (Fri) by zblaxell (subscriber, #26385) [Link]

I can't answer for Facebook, but I know why _we_ do those things:

We have QA that tests kernels with production workload, and rejects upgrades that are obviously worse than the current production kernel version. Nothing gets to production operations without QA signing off.

QA testing and operations data both produce metrics like "4.14 kernels are crashing 5x more often or running 30% slower than the 4.13 kernels did before." There would have to be something horribly and provably wrong with the 4.13 kernel that can *only* be fixed by upgrading to 4.14 to proceed with a fleetwide kernel upgrade in spite of those results.

We use backports to select critical fixes for production without simultaneously introducing a whole new world of problems. We maintain local backports on each upstream release version that is newer than our current production kernel, and test them continually. We also work with upstream to get bugs reported and fixed, but that process can take months or years (QA can't tell us _why_ there are more crashes, we have to figure that out ourselves). Upstream kernels may be released with bugs still present while that happens. Even when fixes finally land upstream, new bugs have been added in the meantime, starting a new upstream bug discover-fix-merge cycle overlapping the previous one.

Most of the time we can't run current upstream kernels in production while they are current. Sometimes we can upgrade to an upstream version a few months after release if we've collected enough backported fixes to pass QA, but other times the candidate upstream version never passes QA because we abandon it for a later upstream version that does.

When we find an upstream kernel that has "only a few" bugs that we can fix with local backports, we rebase development for production use on that upstream kernel (dependencies on new kernel features are allowed, older backport branches are pruned, legacy workarounds are deprecated, etc). This is not a predictable event. We've had as many as 8 upstream kernel releases with backports trying to pass QA before the 9th upstream release finally succeeds. Other times there are as many as 4 usable upstream releases in a row.

Each new Linux kernel release adds new bugs and fixes for older bugs in roughly equal numbers. Long term analysis of past bug fixes finds that historically there were always hundreds of bugs that could potentially affect our workload in every Linux kernel release. Roughly half of upstream Linux kernel releases fail quickly in QA testing even with stable and local backports. None of the vx.y.0 tagged versions (without any backports) have passed QA since our records started in 2008. LTS kernels have the same random quality as the ordinary kernel releases: either they eventually accumulate enough backports to pass QA, or they get abandoned because a newer and better regular upstream release comes along.

Different workloads select for different kernel versions, so everyone who follows this approach ends up with a different list of specific kernel versions and backports. I'm not surprised by Facebook's kernel choices of 3 kernels spanning 3 years--it's a similar distribution to what we get from testing. When we compare our list with lists produced by other shops and Linux distro kernels, we find many of our kernels can't pass their QA or run their workloads, and many of their kernels can't pass or run ours. The first rule of QA, "whatever you're not testing doesn't work," seems to hold true.

BPF at Facebook (and beyond)

Posted Oct 24, 2019 16:49 UTC (Thu) by topimiettinen (guest, #133428) [Link]

Dear Editor,

I have 156 BPF programs, all running from systemd. Six means that you are not using systemd's numerous hardening features. I'd kindly propose that you and others with similarly low numbers run 'systemd-analyze security' and try the suggested directives. Of course, some features might not work for a given service and some problems may materialize only after a while, but many will work.

-Topi Miettinen

BPF at Facebook (and beyond)

Posted Oct 24, 2019 17:04 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

Shouldn't the application of these directives be pushed upstream? Are there any efforts underway among distributions to coordinate this across widely-available services?

BPF at Facebook (and beyond)

Posted Oct 25, 2019 10:22 UTC (Fri) by topimiettinen (guest, #133428) [Link]

Of course and I have tried to push them for a few services with various levels of success. My limited experience is that upstreams may be a bit too cautious in fear of breaking the downstreams with too much hardening, or they think that this is a task for distros. Some distro maintainers instead want upstream to handle this. I'm not aware of any coordination.

BPF at Facebook (and beyond)

Posted Oct 25, 2019 12:50 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

I think starting with the obvious ones is probably a better route. PrivateNetwork= seems like an easy one for things like `alsa-state.service` or `upower.service`. I don't know how easy it'd be to get a team together from distributions to look at this kind of stuff. But, this being security related, the set of people who know both the software itself and the controls well enough to know which apply seems like a small intersection :( .

BPF at Facebook (and beyond)

Posted Oct 29, 2019 23:23 UTC (Tue) by Brane212 (guest, #135235) [Link]

Is there a good explanation of why a virtual machine has to be used instead of native execution?

Surely, whatever needs to run as a filter would run faster native than through an emulation layer?


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds