
Ftrace: The hidden light switch


August 13, 2014

This article was contributed by Brendan Gregg

You may think, as I did, that analyzing the Linux kernel is like venturing through a dark dungeon: without the addition of advanced tracers like SystemTap, there's much that can't be seen, and can only be inferred. However, I've recently found hidden switches that turn on some bright lights, strategically placed by Steven Rostedt and others since the 2.6.27 release. These are the ftrace profilers. I haven't even tried all the switches yet, but I'm stunned at what I've seen so far, and I'm having to rethink what I previously believed about Linux kernel performance analysis.

Recently at Netflix (where I work), a Cassandra database was performing poorly after a system upgrade, and disk I/O inflation (a massive increase in the number of I/O operations submitted) was suspected. There can be many causes for this: a worse cache-hit ratio, record-size inflation, readahead inflation, other applications, even other asynchronous kernel tasks (file system background scrubs). The question was: which one, and how do we fix it?

1. iosnoop

I started with basic server health checks, and then my iosnoop tool. iosnoop is a shell script that uses the /sys/kernel/debug ftrace facilities, and is in my perf-tools collection on GitHub, along with other ftrace hacks. These work on Netflix's existing servers (which often run Linux 3.2 with security patches) without any other addition to the system, and without requiring kernel debuginfo. In this case, iosnoop was run both with and without -Q to see the effect of queuing:

  # ./iosnoop -ts
  STARTs          ENDs            COMM         PID    TYPE DEV      BLOCK        BYTES     LATms
  13370264.614265 13370264.614844 java         8248   R    202,32   1431244248   45056      0.58
  13370264.614269 13370264.614852 java         8248   R    202,32   1431244336   45056      0.58
  13370264.614271 13370264.614857 java         8248   R    202,32   1431244424   45056      0.59
  13370264.614273 13370264.614868 java         8248   R    202,32   1431244512   45056      0.59
  [...]
  # ./iosnoop -Qts
  STARTs          ENDs            COMM         PID    TYPE DEV      BLOCK        BYTES     LATms
  13370410.927331 13370410.931182 java         8248   R    202,32   1596381840   45056      3.85
  13370410.927332 13370410.931200 java         8248   R    202,32   1596381928   45056      3.87
  13370410.927332 13370410.931215 java         8248   R    202,32   1596382016   45056      3.88
  13370410.927332 13370410.931226 java         8248   R    202,32   1596382104   45056      3.89
  [...]

I didn't see anything out of the ordinary: a higher disk I/O load was causing higher queue times.
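
For the curious, here is a rough sketch of the kind of raw ftrace plumbing that a wrapper like iosnoop automates (paths assume debugfs is mounted at /sys/kernel/debug; the real script does much more, including pairing issue and completion events in user space to calculate the latency column):

  # cd /sys/kernel/debug/tracing
  # echo 1 > events/block/block_rq_issue/enable       # trace I/O issue ...
  # echo 1 > events/block/block_rq_complete/enable    # ... and I/O completion
  # cat trace_pipe        # stream events; issue and completion are matched in user space
  ^C
  # echo 0 > events/block/block_rq_complete/enable    # always disable tracing when done
  # echo 0 > events/block/block_rq_issue/enable
  # echo > trace                                      # clear the trace buffer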

The tools that follow are all from the same collection: they demonstrate existing capabilities of ftrace, and how they are useful for solving real problems.

2. tpoint

To investigate these disk reads in more detail, I used tpoint to trace the block:block_rq_insert tracepoint:

  # ./tpoint -H block:block_rq_insert
  Tracing block:block_rq_insert. Ctrl-C to end.
  # tracer: nop
  #
  #       TASK-PID    CPU#    TIMESTAMP  FUNCTION
  #          | |       |          |         |
          java-16035 [000] 13371565.253582: block_rq_insert: 202,16 WS 0 () 550505336 + 88 [java]
          java-16035 [000] 13371565.253582: block_rq_insert: 202,16 WS 0 () 550505424 + 56 [java]
          java-8248  [007] 13371565.278372: block_rq_insert: 202,32 R 0 () 660621368 + 88 [java]
          java-8248  [007] 13371565.278373: block_rq_insert: 202,32 R 0 () 660621456 + 88 [java]
          java-8248  [007] 13371565.278374: block_rq_insert: 202,32 R 0 () 660621544 + 24 [java]
          java-8249  [007] 13371565.311507: block_rq_insert: 202,32 R 0 () 660666416 + 88 [java]
  [...]

I was checking for anything obviously odd, but the I/O details looked normal. The -H option prints column headers.

Next, I traced the code path that led to this I/O by printing stack traces (-s), to see if they contained an explanation. I also added an in-kernel filter to match reads only (when the "rwbs" flag field contains "R"):

  # ./tpoint -s block:block_rq_insert 'rwbs ~ "*R*"' | head -1000
  Tracing block:block_rq_insert. Ctrl-C to end.
          java-8248  [005] 13370789.973826: block_rq_insert: 202,16 R 0 () 1431480000 + 8 [java]
          java-8248  [005] 13370789.973831: <stack trace>
   => blk_flush_plug_list
   => blk_queue_bio
   => generic_make_request.part.50
   => generic_make_request
   => submit_bio
   => do_mpage_readpage
   => mpage_readpages
   => xfs_vm_readpages
   => read_pages
   => __do_page_cache_readahead
   => ra_submit
   => do_sync_mmap_readahead.isra.24
   => filemap_fault
   => __do_fault
   => handle_pte_fault
   => handle_mm_fault
   => do_page_fault
   => page_fault
          java-8248  [005] 13370789.973831: block_rq_insert: 202,16 R 0 () 1431480024 + 32 [java]
          java-8248  [005] 13370789.973836: <stack trace>
   => blk_flush_plug_list
   => blk_queue_bio
   => generic_make_request.part.50
  [...]

Great! The output is similar to the previous example, but with stack traces beneath each tracepoint event. I limited the output using head as it is verbose.

tpoint is another ftrace-based tool. It's usually better to use perf events (the perf command) for this particular use case, as it can handle multi-user access to performance data and a higher event rate, although it is more time-consuming to use. I just wanted to quickly eyeball a few dozen stack traces for a given tracepoint.

The stacks were mostly the same as the example above, which provided a couple of leads: page faults and readahead. This Ubuntu system was using 2MB direct-mapped pages, instead of 4KB like the old system. It also had readahead set to 2048KB, instead of 128KB. Either of these differences could be causing the inflation, although tuning readahead had already been tested, and found to make no difference.
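
For reference, the raw ftrace steps that a wrapper like tpoint performs look roughly like this (a sketch, again assuming debugfs at /sys/kernel/debug; the script adds option handling, sanity checks, and cleanup on exit):

  # cd /sys/kernel/debug/tracing
  # echo 'rwbs ~ "*R*"' > events/block/block_rq_insert/filter   # in-kernel filter: reads only
  # echo 1 > options/stacktrace                                 # stack trace per event (-s)
  # echo 1 > events/block/block_rq_insert/enable                # enable the tracepoint
  # cat trace_pipe
  ^C
  # echo 0 > events/block/block_rq_insert/enable
  # echo 0 > options/stacktrace
  # echo 0 > events/block/block_rq_insert/filter                # clear the filter
  # echo > trace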

3. funccount

I wanted to understand that stack trace better, so I started by counting calls using funccount, which uses ftrace function profiling. Starting with the per-second rate of submit_bio():

  # ./funccount -i 1 submit_bio
  Tracing "submit_bio"... Ctrl-C to end.

  FUNC                              COUNT
  submit_bio                        27881

  FUNC                              COUNT
  submit_bio                        28478
  [...]

This rate, about 28,000 calls per second, is on par with what the disks are doing as seen from iostat. funccount is counting events in the kernel for efficiency.
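
Under the hood, funccount drives the ftrace function profiler, roughly like this (a simplified sketch; the real script also sums the per-CPU counts and formats the output):

  # cd /sys/kernel/debug/tracing
  # echo submit_bio > set_ftrace_filter      # restrict profiling to this function
  # echo 1 > function_profile_enabled        # start counting calls in the kernel
  # sleep 1
  # echo 0 > function_profile_enabled        # stop
  # cat trace_stat/function*                 # per-CPU hit counts; funccount sums these
  # echo > set_ftrace_filter                 # clean up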

Now checking the rate of filemap_fault(), which is closer in the stack to the database:

  # ./funccount -i 1 filemap_fault
  Tracing "filemap_fault"... Ctrl-C to end.

  FUNC                              COUNT
  filemap_fault                      2203

  FUNC                              COUNT
  filemap_fault                      3227
  [...]

This is consistent with what we believed the application was requesting from the filesystem. There is about a 10x inflation between these calls and the issued disk I/O (as evidenced by the submit_bio() calls).

4. funcslower

Just to confirm that the database is suffering latency caused by the stack trace I was studying, I used funcslower (another ftrace-based tool, which uses in-kernel timing and filtering for efficiency) to measure filemap_fault() calls taking longer than 1000 microseconds (1ms):

  # ./funcslower -P filemap_fault 1000
  Tracing "filemap_fault" slower than 1000 us... Ctrl-C to end.
   0)   java-8210    | ! 5133.499 us |  } /* filemap_fault */
   0)   java-8258    | ! 1120.600 us |  } /* filemap_fault */
   0)   java-8235    | ! 6526.470 us |  } /* filemap_fault */
   2)   java-8245    | ! 1458.30 us  |  } /* filemap_fault */
  [...]

These latencies look similar to those seen from disk I/O (with queue time). I'm in the right area.
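
The in-kernel filtering comes from the function graph tracer and its tracing_thresh file, which takes a duration in microseconds; a minimal sketch of what funcslower sets up (the -P option corresponds to the funcgraph-proc trace option):

  # cd /sys/kernel/debug/tracing
  # echo filemap_fault > set_graph_function     # only time this function
  # echo 1000 > tracing_thresh                  # only emit calls slower than 1000 us
  # echo 1 > options/funcgraph-proc             # include the process name and PID (-P)
  # echo function_graph > current_tracer
  # cat trace_pipe
  ^C
  # echo nop > current_tracer                   # restore the default tracer
  # echo 0 > tracing_thresh
  # echo 0 > options/funcgraph-proc
  # echo > set_graph_function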

5. funccount (again)

I noticed that the stack has "readpage" calls and then "readpages". Tracing them both at the same time:

  # ./funccount -i 1 '*mpage_readpage*'
  Tracing "*mpage_readpage*"... Ctrl-C to end.

  FUNC                              COUNT
  mpage_readpages                     364
  do_mpage_readpage                122930

  FUNC                              COUNT
  mpage_readpages                     318
  do_mpage_readpage                110344
  [...]

Here's our inflation: mpage_readpages() is being called about 300 times per second, and then do_mpage_readpage() over 100k times per second. This still looks like readahead, although we did try to adjust readahead sizes as an experiment, and it didn't make a difference.

6. kprobe

Maybe our readahead tuning didn't take effect? I can check this using dynamic tracing of kernel functions. Starting with the above stack trace, I saw that __do_page_cache_readahead() has nr_to_read (number of pages to read) as an argument, which comes from the readahead setting. Using kprobe, an ftrace- and kprobes-based tool, to dynamically trace this function and argument:

  # ./kprobe -H 'p:do __do_page_cache_readahead nr_to_read=%cx'
  Tracing kprobe m. Ctrl-C to end.
  # tracer: nop
  #
  #   TASK-PID    CPU#    TIMESTAMP  FUNCTION
  #      | |       |          |         |
      java-8714  [000] 13445354.703793: do: (__do_page_cache_readahead+0x0/0x180) nr_to_read=200
      java-8716  [002] 13445354.819645: do: (__do_page_cache_readahead+0x0/0x180) nr_to_read=200
      java-8734  [001] 13445354.820965: do: (__do_page_cache_readahead+0x0/0x180) nr_to_read=200
      java-8709  [000] 13445354.825280: do: (__do_page_cache_readahead+0x0/0x180) nr_to_read=200
  [...]

I used -H to print the header, and p: to specify that we will create a probe on function entry, which we'll call "do" (that alias is optional). The rest of that line specifies the function and optional arguments. Without kernel debuginfo, I can't refer to the nr_to_read symbol, so I need to use registers instead. I guessed %cx: if that guess is right, then our tuning hasn't taken hold, since 0x200 is 512 pages, which at 4KB per page is the original 2048KB.
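
The kprobe tool itself is a thin wrapper around ftrace's kprobe_events file; the equivalent raw steps look roughly like this (p: defines an entry probe, r: a return probe, and a leading "-" removes one):

  # cd /sys/kernel/debug/tracing
  # echo 'p:do __do_page_cache_readahead nr_to_read=%cx' >> kprobe_events   # define the probe
  # echo 1 > events/kprobes/do/enable                                       # enable it
  # cat trace_pipe
  ^C
  # echo 0 > events/kprobes/do/enable
  # echo '-:do' >> kprobe_events                                            # remove the probe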

7. funcgraph

To be sure, I read the code to see how this value is passed around, and used funcgraph to illustrate it:

  # ./funcgraph -P filemap_fault | head -1000
   2)   java-8248    |               |  filemap_fault() {
   2)   java-8248    |   0.568 us    |    find_get_page();
   2)   java-8248    |               |    do_sync_mmap_readahead.isra.24() {
   2)   java-8248    |   0.160 us    |      max_sane_readahead();
   2)   java-8248    |               |      ra_submit() {
   2)   java-8248    |               |        __do_page_cache_readahead() {
   2)   java-8248    |               |          __page_cache_alloc() {
   2)   java-8248    |               |            alloc_pages_current() {
   2)   java-8248    |   0.228 us    |              interleave_nodes();
   2)   java-8248    |               |              alloc_page_interleave() {
   2)   java-8248    |               |                __alloc_pages_nodemask() {
   2)   java-8248    |   0.105 us    |                  next_zones_zonelist();
   2)   java-8248    |               |                  get_page_from_freelist() {
   2)   java-8248    |   0.093 us    |                    next_zones_zonelist();
   2)   java-8248    |   0.101 us    |                    zone_watermark_ok();
   2)   java-8248    |               |                    zone_statistics() {
   2)   java-8248    |   0.073 us    |                      __inc_zone_state();
   2)   java-8248    |   0.074 us    |                      __inc_zone_state();
   2)   java-8248    |   1.209 us    |                    }
   2)   java-8248    |   0.142 us    |                    prep_new_page();
   2)   java-8248    |   3.582 us    |                  }
   2)   java-8248    |   4.810 us    |                }
   2)   java-8248    |   0.094 us    |                inc_zone_page_state();
  [...]

funcgraph uses another ftrace feature: the function graph profiler. It has moderate overhead, since it traces all kernel functions, so I only use it for exploratory purposes like this one-off. The output shows the code-flow in the kernel, and even has time deltas in microseconds. It's the call to max_sane_readahead() that is interesting, as that fetches the readahead value it wants to use.
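
funcgraph drives the same function graph tracer as funcslower, just without the latency threshold; the essential raw steps look roughly like this (a sketch; the process column in the output above comes from the funcgraph-proc option, funcgraph's -P):

  # cd /sys/kernel/debug/tracing
  # echo filemap_fault > set_graph_function     # graph only this function and its children
  # echo 1 > options/funcgraph-proc             # show the process name and PID (-P)
  # echo function_graph > current_tracer        # switch from the default "nop" tracer
  # cat trace_pipe | head -1000
  # echo nop > current_tracer                   # restore the default tracer
  # echo 0 > options/funcgraph-proc
  # echo > set_graph_function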

8. kprobe (again)

This time I'll trace the return of the max_sane_readahead() function:

  # ./kprobe 'r:m max_sane_readahead $retval'
  Tracing kprobe m. Ctrl-C to end.
      java-8700  [000] 13445377.393895: m: (do_sync_mmap_readahead.isra.24+0x62/0x9c <- \
  			max_sane_readahead) arg1=200
      java-8723  [003] 13445377.396362: m: (do_sync_mmap_readahead.isra.24+0x62/0x9c <- \
  			max_sane_readahead) arg1=200
      java-8701  [001] 13445377.398216: m: (do_sync_mmap_readahead.isra.24+0x62/0x9c <- \
  			max_sane_readahead) arg1=200
      java-8738  [000] 13445377.399793: m: (do_sync_mmap_readahead.isra.24+0x62/0x9c <- \
  			max_sane_readahead) arg1=200
      java-8728  [000] 13445377.408529: m: (do_sync_mmap_readahead.isra.24+0x62/0x9c <- \
  			max_sane_readahead) arg1=200
  [...]

This is also 0x200 pages: 2048KB, and this time I used the $retval alias instead of guessing registers. So the tuning really did not take effect. Studying the kernel source, I saw that the readahead property was set by a function called file_ra_state_init(). Under what circumstances is that called, and how do I trigger it? ftrace/kprobes to the rescue again:

  # ./kprobe -s p:file_ra_state_init
  Tracing kprobe m. Ctrl-C to end.
            kprobe-20331 [002] 13454836.914913: file_ra_state_init: (file_ra_state_init+0x0/0x30)
            kprobe-20331 [002] 13454836.914918: <stack trace>
   => vfs_open
   => nameidata_to_filp
   => do_last
   => path_openat
   => do_filp_open
   => do_sys_open
   => sys_open
   => system_call_fastpath
            kprobe-20332 [007] 13454836.915191: file_ra_state_init: (file_ra_state_init+0x0/0x30)
            kprobe-20332 [007] 13454836.915194: <stack trace>
   => vfs_open
   => nameidata_to_filp
  [...]

This time I used -s to print stack traces, which showed that this often happens from the open() syscall. As I'd left Cassandra running while tuning readahead, it may not have reopened its files and run file_ra_state_init(). So I restarted Cassandra to see if the readahead tuning would then take effect, and re-measured:

  # ./kprobe 'r:m max_sane_readahead $retval'
  Tracing kprobe m. Ctrl-C to end.
      java-11918 [007] 13445663.126999: m: (ondemand_readahead+0x3b/0x230 <- \
   			max_sane_readahead) arg1=80
      java-11918 [007] 13445663.128329: m: (ondemand_readahead+0x3b/0x230 <- \
  			max_sane_readahead) arg1=80
      java-11918 [007] 13445663.129795: m: (ondemand_readahead+0x3b/0x230 <- \
  			max_sane_readahead) arg1=80
      java-11918 [007] 13445663.131164: m: (ondemand_readahead+0x3b/0x230 <- \
  			max_sane_readahead) arg1=80
  [...]

Success!

iostat showed a large drop in disk I/O, and the database latency measurements were much better. This was simply a readahead change, where the new Ubuntu instances defaulted to 2048KB. What had misled us earlier was that tuning it had not appeared to make a difference, as the setting wasn't taking effect.
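
For reference, the readahead setting itself can be inspected and adjusted per block device through sysfs or blockdev; a sketch, using /dev/xvdb as an example device name rather than the actual disk from this investigation:

  # cat /sys/block/xvdb/queue/read_ahead_kb         # current readahead in KB
  # echo 128 > /sys/block/xvdb/queue/read_ahead_kb  # set it to 128KB
  # blockdev --getra /dev/xvdb                      # same setting in 512-byte sectors (256 == 128KB)

As the kprobe tracing above showed, the per-file value is captured at open() time by file_ra_state_init(), so a long-running process may need to reopen its files (or be restarted) before a change takes effect.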

Another tunable we checked was the I/O scheduler, and whether changing it from "deadline" to "noop" was immediate. Using:

  ./funccount -i 1 'deadline*'
  ./funccount -i 1 'noop*'

gave us the answer: they showed the rate of the related kernel functions, and that the tuning was indeed immediate.
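
The scheduler switch itself is a sysfs write as well; a sketch, again using /dev/xvdb as an example device, with funccount running in another terminal to confirm that the deadline* call rates drop to zero:

  # cat /sys/block/xvdb/queue/scheduler          # available schedulers; the active one is in brackets
  # echo noop > /sys/block/xvdb/queue/scheduler  # switch schedulers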

ftrace and perf-tools

All the tools I used here are from my perf-tools collection, which are front-ends to ftrace and related tracers (kprobes, tracepoints, and the function profiler). I've described some of them as hacks, which they are, as they use creative workarounds for the lack of some in-kernel features. For example, iosnoop reads both issue and completion events in user space, and calculates the latencies there, instead of doing that more efficiently in kernel context.

These tools are for older Linux systems without kernel debuginfo, and show what Linux can do using only its built-in ftrace. There's even more to ftrace that I didn't show here: profiling function average latency, tracing wakeup events, function probe triggers, etc. There is also Steven Rostedt's own front-end to ftrace, a multi-tool called trace-cmd (covered on LWN in 2010), which can do more than my collection of smaller tools, and is also much easier to use than operating on the /sys files directly.
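
For example, the tracepoint tracing from the tpoint section could be reproduced with trace-cmd along these lines (a sketch; check the trace-cmd(1) man page for the options supported by your version):

  # trace-cmd record -e block:block_rq_insert             # record events to trace.dat until Ctrl-C
  ^C
  # trace-cmd report                                      # print the recorded events
  # trace-cmd record -p function_graph -g filemap_fault   # or: a function graph, like funcgraph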

To do even more, perf_events (which is also part of the Linux kernel source) with kernel debuginfo lets me examine complex data structures, and even trace kernel code by line number and watch local variables. What I still can't do, however, is perform complex in-kernel aggregations without adding an advanced tracer such as SystemTap, ktap, or the upcoming eBPF.

Future Work

Changes are on the way for the Linux kernel, as regular LWN readers likely know. The capabilities I need, to avoid using hacks, may be provided by eBPF, which may be merged as soon as Linux 3.18. If that happens, I'll be happy to create new versions of these tools making use of eBPF. They will likely look and feel the same, but their implementation will be much more efficient and reliable. I'll also be able to create many more tools without the fear of maintaining too many hacks.

Even if Linux does bring these capabilities in an upcoming release, though, it will be some time before I can really use them in production in my work environment, depending on how quickly we can, or want to, be on the latest kernel. So, while my hacks are temporary workarounds, they may be useful for some time to come.

Conclusion

We have a large number of Linux cloud instances to analyze at Netflix, and some interesting and advanced performance issues to solve. While we've been looking at using advanced tracers like SystemTap, I've also been studying what's there already, including the ftrace profilers. The scripts I showed in this article use ftrace, so they work on our existing instances as-is, without even installing kernel debuginfo.

In this particular example, a Cassandra database experienced a disk I/O inflation issue caused by an increased readahead setting. I traced disk I/O events and latencies, and how these were created in the kernel. This included examining stack traces, counting function-call rates, measuring slow function times, tracing call graphs, and dynamic tracing of function calls and returns, with their arguments and return values.

I did all of this using ftrace, which has been in the Linux kernel for years. I found the hidden light switches.

If you are curious about the inner workings of these ftrace tools, see "Secrets of the Ftrace function tracer" by Steven Rostedt, "Debugging the kernel using Ftrace" part 1 and part 2, and Documentation/trace/ftrace.txt in the kernel source. The most important lesson here isn't about my tools, but that this level of tracing is even possible on existing Linux kernels. And if eBPF is added, a lot more will be possible.





Ftrace: The hidden light switch

Posted Aug 17, 2014 19:45 UTC (Sun) by bernat (subscriber, #51658) [Link]

Great article!

Unfortunately, unlike Ubuntu kernels, Debian kernels come with most useful tracers disabled. This seems due to this bug:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=568025

I have opened a bug report to ask for them to be enabled again:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=758469

Combined with the fact that Debian does not ship debug symbols for all packages (something that I hope to address in the future), this makes Debian a bit less friendly than Ubuntu for all of these performance tools.

Ftrace: The hidden light switch

Posted Aug 25, 2014 9:11 UTC (Mon) by Shugyousha (subscriber, #93672) [Link]

A great article indeed.

There is a lot of tracing infrastructure in the Kernel and a lot of tools for tracing available. It would be nice if we could get an(other?) article comparing the different tools and their capabilities. Candidates for such a comparison would be SystemTap, ktap, eBPF, the Kernel perf tool and maybe the newer sysdig (http://www.sysdig.org/) for example.

Ftrace: The hidden light switch

Posted Sep 12, 2014 6:45 UTC (Fri) by bgregg (guest, #46639) [Link]

I'd like to write a comparative article at some point, although it is challenging to compare them all, since there's so many factors to take into account. And I'd like to do more research first, such as using SystemTap's non-debuginfo features more (which are practically undocumented, but should be an important factor for real world usage), spend more time with LTTng, try out the ftrace function triggers, try writing a kernel module for tracing based on samples/kprobes/*.c (has no one done this? I can find zero examples other than the kernel source), etc.

The best place for tracing system comparisons has, historically, been here on lwn.net, like the ktap vs eBPF article, and others by Jonathan Corbet. So, keep reading... :-)

Ftrace: The hidden light switch

Posted Mar 3, 2015 9:24 UTC (Tue) by ituralde (guest, #101279) [Link]

Is it possible to use perf-tools (or merely ftrace or perf_events) to profile functions of kernel modules? By modules I mean ones that are inserted into a stable kernel.
The question might be silly; I am quite a novice in this matter.


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds