
A new kernel polling interface


By Jonathan Corbet
January 9, 2018
Polling a set of file descriptors to see which ones can perform I/O without blocking is a useful thing to do — so useful that the kernel provides three different system calls (select(), poll(), and epoll_wait() — plus some variants) to perform it. But sometimes three is not enough; there is now a proposal circulating for a fourth kernel polling interface. As is usually the case, the motivation for this change is performance.

On January 4, Christoph Hellwig posted a new polling API based on the asynchronous I/O (AIO) mechanism. This may come as a surprise to some, since AIO is not the most loved of kernel interfaces and it tends not to get a lot of attention. AIO allows for the submission of I/O operations without waiting for their completion; that waiting can be done at some other time if need be. The kernel has had AIO support since the 2.5 days, but it has always been somewhat incomplete. Direct file I/O (the original use case) works well, as does network I/O. Many other types of I/O are not supported for asynchronous use, though; attempts to use the AIO interface with them will yield synchronous behavior. In a sense, polling is a natural addition to AIO; the whole point of polling is usually to avoid waiting for operations to complete.

The patches add a new command (IOCB_CMD_POLL) that can be passed in an I/O control block (IOCB) to io_submit() along with any of the usual POLL* flags describing the type of I/O that is desired — POLLIN for data available to read, for example. This command, like other AIO commands, will not (necessarily) complete before io_submit() returns. Instead, when the indicated file descriptor is ready for the requested type of I/O, a completion event will be queued. A subsequent call to io_getevents() (or the io_pgetevents() variant, added by the patch set, that blocks signals during the operation) will return that event, and the calling application will know that it can perform I/O on the indicated file descriptor. AIO poll operations always operate in the "one-shot" mode; once a poll notification has been generated, a new IOCB_CMD_POLL IOCB must be submitted for that file descriptor if further notifications are needed.
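
As a concrete (if hypothetical) illustration of that flow, the sketch below submits a one-shot poll for standard input and waits for the completion event. It is not taken from the patch posting: it assumes that the requested event mask is carried in the IOCB's aio_buf field and hard-codes a value for IOCB_CMD_POLL, both of which are details of the proposed patches that could still change. The raw syscall wrappers are only there to keep the example self-contained; as noted in the comments below, libaio already carries support for this interface.

    /*
     * Hedged sketch of AIO-based polling from user space; the aio_buf
     * usage and the IOCB_CMD_POLL value are assumptions taken from the
     * proposed patches.
     */
    #include <poll.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>
    #include <linux/aio_abi.h>

    #ifndef IOCB_CMD_POLL
    #define IOCB_CMD_POLL 5             /* value used by the patch set */
    #endif

    static long io_setup(unsigned nr, aio_context_t *ctx)
    {
        return syscall(__NR_io_setup, nr, ctx);
    }

    static long io_submit(aio_context_t ctx, long nr, struct iocb **iocbs)
    {
        return syscall(__NR_io_submit, ctx, nr, iocbs);
    }

    static long io_getevents(aio_context_t ctx, long min_nr, long nr,
                             struct io_event *events, struct timespec *timeout)
    {
        return syscall(__NR_io_getevents, ctx, min_nr, nr, events, timeout);
    }

    int main(void)
    {
        aio_context_t ctx = 0;
        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        struct io_event ev;

        if (io_setup(128, &ctx) < 0) {
            perror("io_setup");
            return 1;
        }

        /* One-shot request: notify me when stdin becomes readable. */
        memset(&cb, 0, sizeof(cb));
        cb.aio_lio_opcode = IOCB_CMD_POLL;
        cb.aio_fildes = STDIN_FILENO;
        cb.aio_buf = POLLIN;

        if (io_submit(ctx, 1, cbs) != 1) {
            perror("io_submit");
            return 1;
        }

        /* The completion event's res field carries the ready mask. */
        if (io_getevents(ctx, 1, 1, &ev, NULL) == 1)
            printf("fd %u ready, mask 0x%llx\n",
                   cb.aio_fildes, (unsigned long long)ev.res);
        return 0;
    }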

Thus far, this interface sounds more difficult to use than the existing poll system calls. There is a payoff, though, that comes in the form of the AIO ring buffer. This poorly documented aspect of the AIO subsystem maps a circular buffer into the calling process's address space. That process can then consume notification events directly from the buffer rather than calling io_getevents(). Multiple notifications can be consumed without the need to enter the kernel at all, and polling for multiple file descriptors can be re-established with a single io_submit() call. The result, Hellwig said in the patch posting, is an up-to-10% improvement in the performance of the Seastar I/O framework. More recently, he noted that the improvement grows to 16% on kernels with page-table isolation turned on.
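
For readers wondering what that ring looks like (a question that also comes up in the comments below), here is a rough sketch of user-space event consumption. The layout mirrors the kernel-internal struct aio_ring in fs/aio.c, and it assumes, as in current kernels, that the aio_context_t returned by io_setup() is the user-space address of the mapped ring; none of this is a documented ABI, so treat the whole thing as an assumption. It reuses the headers and the io_getevents() wrapper from the sketch above.

    /* Kernel-internal ring layout (fs/aio.c); not a stable ABI. */
    struct aio_ring {
        unsigned id;
        unsigned nr;                    /* number of io_event slots */
        unsigned head;                  /* consumed by user space */
        unsigned tail;                  /* produced by the kernel */
        unsigned magic;
        unsigned compat_features;
        unsigned incompat_features;
        unsigned header_length;
        struct io_event io_events[];
    };

    #define AIO_RING_MAGIC 0xa10a10a1

    /* Consume up to max completion events without entering the kernel,
     * falling back to io_getevents() if the mapping is not recognized.
     * A production version would need careful acquire/release ordering
     * on head and tail; the single barrier here is only indicative. */
    static int ring_getevents(aio_context_t ctx, struct io_event *out, int max)
    {
        struct aio_ring *ring = (struct aio_ring *)ctx;
        int n = 0;

        if (ring->magic != AIO_RING_MAGIC)
            return io_getevents(ctx, 0, max, out, NULL);

        while (n < max && ring->head != ring->tail) {
            out[n++] = ring->io_events[ring->head];
            __sync_synchronize();       /* copy the event before moving head */
            ring->head = (ring->head + 1) % ring->nr;
        }
        return n;
    }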

Internally to the kernel, any device driver (or other subsystem that exports a file_operations structure) can support the new poll interface, but some small changes will be required. It is not, however, necessary to support (or even know about) AIO in general. In current kernels, the polling system calls are all supported by the poll() method in struct file_operations:

    unsigned int (*poll) (struct file *file, struct poll_table_struct *table);

This function must perform two actions: setting up notifications for when the underlying file is ready for I/O, and returning the types of I/O that could be performed without blocking now. The first is done by adding one or more wait queues to the provided table; the driver will perform a wakeup call on one of those queues when the state of the device changes. The current readiness state is the return value from the poll() method itself.
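
As a purely illustrative example (the mydev structure and its helpers are hypothetical, not taken from any real driver), a conventional poll() method combines those two actions like this:

    static unsigned int mydev_poll(struct file *file,
                                   struct poll_table_struct *table)
    {
        struct mydev *dev = file->private_data;
        unsigned int mask = 0;

        /* Register the wait queues that will be woken when the
         * device's readiness state changes. */
        poll_wait(file, &dev->read_wait, table);
        poll_wait(file, &dev->write_wait, table);

        /* Report what could be done right now without blocking. */
        if (data_available(dev))
            mask |= POLLIN | POLLRDNORM;
        if (space_available(dev))
            mask |= POLLOUT | POLLWRNORM;
        return mask;
    }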

Supporting AIO-based polling requires splitting those two functions into separate file_operations methods. Thus, there are two new entries to that structure:

    struct wait_queue_head *(*get_poll_head)(struct file *file, int mask);
    int (*poll_mask) (struct file *file, int mask);

(The actual patches use the new typedef __poll_t for the mask, but that typedef isn't in the mainline kernel yet). The polling subsystem will call get_poll_head() to obtain a pointer to the wait queue that will be notified when the device's I/O readiness state changes; poll_mask() will be called to get the current readiness state. A driver that implements these two operations need not (and probably should not) retain its implementation of the older poll() interface.
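
Continuing the hypothetical driver above, a conversion to the proposed interface might look like the sketch below. It assumes the driver has been reworked to use a single wait queue for all readiness changes; whether the driver or the polling core applies the requested mask to the returned state is a detail of the patch set, so the sketch simply returns the full readiness state.

    static struct wait_queue_head *mydev_get_poll_head(struct file *file,
                                                       int mask)
    {
        struct mydev *dev = file->private_data;

        /* A single queue, woken on any change in readiness. */
        return &dev->wait;
    }

    static int mydev_poll_mask(struct file *file, int mask)
    {
        struct mydev *dev = file->private_data;
        int ready = 0;

        if (data_available(dev))
            ready |= POLLIN | POLLRDNORM;
        if (space_available(dev))
            ready |= POLLOUT | POLLWRNORM;
        return ready;
    }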

One potential limitation built into this API is that there can only be a single wait queue that receives notifications for a given file. The current interface, instead, allows multiple queues to be used, and a number of drivers take advantage of that fact to use, for example, different queues for read and write readiness. Contemporary wait queues offer enough flexibility that the use of multiple queues should not be necessary anymore. If a driver cannot be changed, Hellwig said, "the driver just won't support aio poll".

There have not been a lot of comments in response to the patch posting so far; many of the relevant developers have been preoccupied with other issues in the last week. It is hard to argue with a 10% performance improvement, though, so some form of this patch seems likely to get into the mainline sooner or later — interested parties can keep checking the mainline repository to see if it's there yet. Whether we'll see a fifth polling interface added in the future is anybody's guess, though.

Index entries for this article
Kernel: Asynchronous I/O
Kernel: poll()



A new kernel polling interface

Posted Jan 9, 2018 22:09 UTC (Tue) by pbonzini (subscriber, #60935) [Link]

At least QEMU is certainly interested in this! Being able to poll the ring buffer for socket events would be a useful addition indeed.

Similar interface for futex?

Posted Jan 9, 2018 23:40 UTC (Tue) by Nagarathnam (guest, #116887) [Link]

I guess even futex could use something similar. An additional flag passed along with FUTEX_WAIT could label it as an asynchronous wait and when the futex is available, the event could be directly queued on an epoll fd registered during FUTEX_WAIT.

Similar interface for futex?

Posted Jan 10, 2018 6:14 UTC (Wed) by helge.bahmann (subscriber, #56804) [Link]

Or, turn this around completely and funnel readiness notifications to futex... shameless plug: Extending futex for Kernel to User Notification

Similar interface for futex?

Posted Jan 12, 2018 0:14 UTC (Fri) by Nagarathnam (guest, #116887) [Link]

Interesting. Has the patch been sent out for review?

A new kernel polling interface

Posted Jan 10, 2018 0:06 UTC (Wed) by pj (subscriber, #4506) [Link]

Removing a kernel/userspace transition is quite a big win. Glad to see it, and I wonder where else it might be applicable.

A new kernel polling interface

Posted Jan 10, 2018 0:29 UTC (Wed) by quotemstr (subscriber, #45331) [Link]

AIO polling is fine, but wouldn't it be better still to just asynchronously perform the IO operation that you're going to perform anyway once the AIO poll indicates readiness?

netlink ring buffer

Posted Jan 10, 2018 1:20 UTC (Wed) by shemminger (subscriber, #5739) [Link]

There was a netlink mmap ring buffer in earlier kernel versions, but it was removed because of security issues. Using a ring buffer is hard for requests because of time-of-check/time-of-use issues.

netlink ring buffer

Posted Jan 10, 2018 19:32 UTC (Wed) by stefanha (subscriber, #55072) [Link]

It's worth auditing the fs/aio.c ring buffer but I don't think the ring buffer is a bad thing per se.

The Linux AIO ring buffer has been upstream since pre-git history. It is not a new attack surface and this patch series doesn't even appear to touch the ring buffer code.

Ring buffers are used across other security boundaries like cpu<->hardware and VM<->hypervisor so they can be implemented in a secure fashion. Just as with syscalls, it's important to copy in untrusted data before validating it. Ring buffer producers and consumers typically maintain their own state that is not accessible across the trust boundary. They publish their internal state (e.g. ring indices) to the ring and fetch the other side's state from the ring, but that is secure.

netlink ring buffer

Posted Jan 11, 2018 12:37 UTC (Thu) by rvolgers (subscriber, #63218) [Link]

It's worth noting that the AIO ring buffer has had significant security issues in the past. It's just that apparently this interface is important enough that it got fixed instead of removed.

A new kernel polling interface

Posted Jan 10, 2018 10:33 UTC (Wed) by mezcalero (subscriber, #45103) [Link]

Hmm, so what I always find puzzling about all those poll()-like interfaces: the more fds you start to handle, the more likely it is you need to *prioritize* handling some of them. And none of the kernel interfaces really helps you with that, which generally means userspace then has to add another complex layer in front of it, to order triggered event sources by their priority. glib does that, and so does systemd's sd-event. Yes, with epoll one can build a hierarchy of priorities by giving each priority class its own epoll fd and then nesting them, but this is complex, nasty and increases the number of syscalls one has to do in each iteration.

I think it would be quite good if kernel folks designing those interfaces would have a look at what userspace actually does with those APIs and then make things less awful to use, because quite frankly, all of select(), poll(), ppoll(), and epoll are just plain terrible, just to different degrees. Have a look at how glib or systemd's sd-event end up handling prioritization (or guaranteeing event ordering) or hooking up waitid() to event loops; it's terrible, the choices one has to make there. All the great optimizations that epoll supposedly permits, and this aio stuff will permit too, are so entirely useless if, in this iteration again, it all comes crashing down because this only works in synthetic, very specific test cases, and not for any of the generic event loops that are used in userspace IRL.

I couldn't care less about yet another API for all of this, even if it reduces the number of syscalls in niche cases even further, if we can't get the basic stuff done properly first. I'd much rather have a safe childfd() concept, and guaranteed event ordering/prioritization in epoll before anything else. Or just a usable inotify() or fanotify() would be great...

Lennart
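
As a rough sketch of the nested-epoll workaround described above (the two-class setup and handle_event() are hypothetical), each priority class gets its own epoll fd, nested inside a top-level one; note the extra epoll_wait() calls per iteration that make this approach unattractive:

    #include <sys/epoll.h>

    #define NCLASSES 2

    extern void handle_event(int fd);       /* hypothetical handler */

    void event_loop(int fd_for_class[NCLASSES])
    {
        int top = epoll_create1(0);
        int class_ep[NCLASSES];
        struct epoll_event ev, events[16];

        for (int c = 0; c < NCLASSES; c++) {
            class_ep[c] = epoll_create1(0);
            ev.events = EPOLLIN;
            ev.data.fd = fd_for_class[c];
            epoll_ctl(class_ep[c], EPOLL_CTL_ADD, fd_for_class[c], &ev);

            /* Nest the class epoll fd inside the top-level one. */
            ev.events = EPOLLIN;
            ev.data.u32 = c;
            epoll_ctl(top, EPOLL_CTL_ADD, class_ep[c], &ev);
        }

        for (;;) {
            if (epoll_wait(top, events, 16, -1) < 0)
                break;
            /* Drain the highest-priority class that is ready, then
             * re-check from the top so higher classes always win. */
            for (int c = 0; c < NCLASSES; c++) {
                int n = epoll_wait(class_ep[c], events, 16, 0);

                for (int i = 0; i < n; i++)
                    handle_event(events[i].data.fd);
                if (n > 0)
                    break;
            }
        }
    }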

A new kernel polling interface

Posted Jan 10, 2018 18:42 UTC (Wed) by zyga (subscriber, #81533) [Link]

Do you think the CLONE_FD patches that were circulating earlier have a chance of providing a childfd-like API?
For reference: https://lwn.net/Articles/636646/

A new kernel polling interface

Posted Jan 10, 2018 21:07 UTC (Wed) by flussence (subscriber, #85566) [Link]

Maybe the kernel could look to Windows for inspiration? People seem to tolerate the async API there. Would be a nice bonus if we get something that Wine can benefit from too...

A new kernel polling interface

Posted Jan 10, 2018 21:16 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

The Windows NT kernel is inherently asynchronous. All the major operations (barring aberrations like FastIO) are structured as filters or sinks for IRPs that can go through layers asynchronously.

Additionally, the userspace API in Windows (overlapped I/O) is decidedly better designed than epoll.

A new kernel polling interface

Posted Jan 18, 2018 9:42 UTC (Thu) by HelloWorld (guest, #56129) [Link]

What's wrong with inotify and fanotify?

A new kernel polling interface

Posted Feb 6, 2018 4:15 UTC (Tue) by fest3er (guest, #60379) [Link]

I was just wondering this myself. A form of inotify in which the kernel tells the user which FD is addressed and what its new state is: ready-to-read, ready-to-write, read-closed, write-closed, closed, read-no-network, write-no-network, etc. Call it fdnotify.

This could solve the problem of inotify-wait never exiting because it will never again write to its socket/pipe to a userspace program, and thus will never detect that the reader of the pipe has gone away; the only way to detect that the reader end of a pipe is gone is to write to the pipe. Proof? Hot-plug a drive. Run inotify-wait looking for that /dev node to be deleted, and pipe it to a shell script that waits for the drive to be unplugged. When you unplug the drive, inotify-wait tells the script which then continues its processing and exits. But the inotify-wait program sits there forever because the file it was watching for deletion has been deleted and can never be deleted again; and because it will never receive another notice of the file's deletion, it will never again write to the pipe and, thus, it won't detect that the shell script (the pipe's reader) is gone. It's a deficiency in Linux. (And no, if you re-connect the drive and unplug it again, the comatose inotify-wait wakes not, because the /dev node that instance of inotify-wait is watching no longer exists and will never again exist.)

Polling is always wasteful. Even when there're no other options. So have a thread that reads the fdnotify FD, a thread that reads the inotify FD, a thread that reads the eventloop pipe FD, a thread that handles timeouts. Have each feed the dispatcher. No more polling. Action occurs only when something happens. Data transfer on the fdnotify FD should be small: 32 bits for the FD and 32 bits for the reason.

A new kernel polling interface

Posted Jun 27, 2018 16:30 UTC (Wed) by ibukanov (subscriber, #3942) [Link]

To detect that the read end of the pipe was closed, poll/epoll the write end of the pipe. This will report POLLERR/EPOLLERR when that happens.
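
A minimal illustration of that suggestion (the pipe descriptor here is hypothetical): poll the write end with no requested events, and POLLERR shows up once the last reader has closed its end.

    #include <poll.h>

    /* Returns non-zero once no readers remain on the pipe. */
    int reader_has_gone(int pipe_write_fd)
    {
        struct pollfd pfd = { .fd = pipe_write_fd, .events = 0 };

        return poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLERR);
    }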

A new kernel polling interface

Posted Jan 19, 2018 19:17 UTC (Fri) by davmac (guest, #114522) [Link]

> I'd much rather have a ... guaranteed event ordering/priorization in epoll before anything else

I don't see a lot of benefit to moving prioritisation into the kernel. You have to pick a particular priority model (e.g. assign all event sources a fixed numerical priority), and if userspace needs a more complex model (weighting based on latency, or whatever) then you're pretty much back to square one anyway.

As it is, userspace has to maintain a queue of events, and the kernel just provides the events that go into the queue (this is assuming of course that you even need priority levels for different event sources, and that's not always the case). That's a pain, but you generally don't have to actually do it yourself - by which I mean, there are plenty of event loop libraries now which handle this for you.

The notion that the kernel itself needs to provide an API that is straightforward and generally usable from applications is flawed. It's much better to have flexibility than a rigid policy in the kernel, when any inherent complexity can always be hidden behind a library layer.

OTOH I completely agree it would be nice if there was a decent, reliable, non-signal way of watching process status via a file descriptor rather than the mess of listening for signals.

A new kernel polling interface

Posted Jan 19, 2018 20:19 UTC (Fri) by excors (subscriber, #95769) [Link]

> I don't see a lot of benefit to moving prioritisation into the kernel. You have to pick a particular priority model (eg. assign all event sources a fixed numerical priority) and if the userspace needs a more complex model (weighting based on latency, or whatever) then you're pretty much back to square one anyway.

Perhaps userspace could provide an eBPF program that implements a partial order over events, and the kernel can use that to do a topological sort.

A new kernel polling interface--eBPF

Posted Feb 23, 2018 0:25 UTC (Fri) by vomlehn (guest, #45588) [Link]

Implementing priorities with eBPF and a partial order is an interesting thought. The issue of prioritization is a policy issue and thus something that generally belongs in user space. If you can't actually implement it in user space, however, sending in a proxy in the form of eBPF seems safe and, one hopes, reasonably fast. Still, you may be handling a huge number of file descriptors, so a design that only re-evaluates priorities when necessary would matter. I've worked with companies that use a huge number of file descriptors in the past, but I mostly do embedded systems these days and would not want to speak on what their requirements actually are.

What about edge-triggered notification?

Posted Jan 10, 2018 11:10 UTC (Wed) by sasha (guest, #16070) [Link]

It is not clear if this new interface supports something like epoll edge-triggered notification. I.e. I know that this socket is readable, and I do not want to read the 1 byte it has in the receive queue, but I want to be notified when the socket receives more data. Is it possible with this new API?

What about edge-triggered notification?

Posted Jan 10, 2018 19:24 UTC (Wed) by pbonzini (subscriber, #60935) [Link]

It seems like it's level-triggered only.

What about edge-triggered notification?

Posted Jan 19, 2018 19:03 UTC (Fri) by davmac (guest, #114522) [Link]

> It seems like it's level-triggered only.

From the article:

> AIO poll operations always operate in the "one-shot" mode

While "one-shot" isn't precisely the same thing as edge-triggered, they can largely be used with similar effect. If you want edge triggering and you have one-shot, you can arm a level-triggered one-shot listener and it will fire either immediately on the next "up" edge. Your application is in control of the "down" edge (i.e. you read all the data from the socket until you receive EAGAIN) and if you re-arm after that point, you effectively get notified of the next "up" edge in the same way that edge-triggered notification would.

The main differences are that (a) you have to explicitly re-arm and (b) you won't get extra notifications if you happen to get two edges while processing (i.e. if you drain all data from the socket but more comes in before you do another read and notice that the buffer is empty). The (a) point is a down-side, but (b) is pretty much essential if you want to poll for events from multiple threads, since you can otherwise end up with more than one thread trying to service the same active connection.
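
A compact sketch of that re-arm pattern using epoll's EPOLLONESHOT flag (epfd, fd, and handle_data() are hypothetical):

    #include <errno.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    extern void handle_data(const char *buf, ssize_t len);  /* hypothetical */

    void service_fd(int epfd, int fd)
    {
        struct epoll_event ev = {
            .events = EPOLLIN | EPOLLONESHOT,
            .data.fd = fd,
        };
        char buf[4096];
        ssize_t n;

        /* Drain until EAGAIN: the application controls the "down" edge. */
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            handle_data(buf, n);

        /* Explicitly re-arm to hear about the next "up" edge. */
        if (n < 0 && errno == EAGAIN)
            epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
    }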

What about edge-triggered notification?

Posted Aug 27, 2018 23:50 UTC (Mon) by ncm (guest, #165) [Link]

This sounds inherently racy. With edge-triggered hardware interrupts, you can register your interest while interrupts are still blocked, and if something happens between finishing your work and unblocking interrupts, you get interrupted again.

What about edge-triggered notification?

Posted Aug 29, 2018 9:23 UTC (Wed) by farnz (subscriber, #17727) [Link]

You can set that up with level-triggered notifications; use a software mutex mechanism to stop things happening in parallel on one event source if necessary, then ask for a new notification immediately upon receiving one (or at a good point in your processing of incoming events), and you will be notified of new events that happen while you're processing the older ones.

This is basically how IRQ controllers that re-signal edge-triggered interrupts arriving while interrupts are blocked work - bear in mind that on some (older) IRQ controllers, an edge-triggered interrupt that came in while interrupts were still blocked would simply be lost by the hardware due to the race between unblocking and the interrupt arriving.

What about edge-triggered notification?

Posted Jan 10, 2018 19:28 UTC (Wed) by pbonzini (subscriber, #60935) [Link]

I admit I have never understood edge-triggered polling very well. You need ioctl(FIONREAD) to know how many bytes you have in the receive queue, so why not read it immediately? (Also, the kernel connection multiplexer can help avoid this too.)

What about edge-triggered notification?

Posted Jan 11, 2018 15:05 UTC (Thu) by kpfleming (subscriber, #23250) [Link]

Doing it this way leaves the buffering in the kernel; if you know that you can't do anything with the received data until at least 'x' bytes have arrived (packet headers, checksums, etc.) then someone has to buffer until that many bytes have arrived. Letting the kernel do it keeps the userspace program from being complicated with an extra layer of buffering.

What about edge-triggered notification?

Posted Jan 12, 2018 2:31 UTC (Fri) by wahern (subscriber, #37304) [Link]

That's what SO_RCVLOWAT is supposed to be for, but Linux doesn't honor that for polling :(

What about edge-triggered notification?

Posted Apr 3, 2018 5:49 UTC (Tue) by anmolsarma (guest, #123439) [Link]

SO_RCVLOWAT support was added to tcp_poll() 10 years ago (c7004482e8d). Am I missing something?

What about edge-triggered notification?

Posted Apr 4, 2018 1:22 UTC (Wed) by zlynx (guest, #2285) [Link]

According to "man 7 socket", poll and select pay no attention to the value of SO_RCVLOWAT. But read does. A blocking read will block until SO_RCVLOWAT bytes are available.

Now, I don't know what the kernel actually does since I haven't looked. But that's what the documentation says.
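
For reference, setting the low-water mark itself is just a setsockopt() call; whether poll() honors it is exactly what is being debated in this thread.

    #include <sys/socket.h>

    /* Ask the kernel to hold received data until at least 'bytes' bytes
     * have accumulated; a blocking read() will then wait for that much. */
    int set_rcv_low_watermark(int fd, int bytes)
    {
        return setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &bytes, sizeof(bytes));
    }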

What about edge-triggered notification?

Posted Jun 20, 2018 17:03 UTC (Wed) by nyrahul (guest, #119310) [Link]

The man page does not seem to be updated for the latest code. poll() does seem to honor SO_RCVLOWAT; i.e., you get a read event on the socket only on reaching the watermark level. Tried and tested.

A couple of notes

Posted Jan 10, 2018 16:33 UTC (Wed) by corbet (editor, #1) [Link]

There is a new version of the patch series out; it changes the file_operations prototypes a bit.

I forgot to mention in the article that early Red Hat kernels had this functionality with the same API, which means that the libaio library already has support for it.

A new kernel polling interface

Posted Jan 11, 2018 1:49 UTC (Thu) by xanni (subscriber, #361) [Link]

"interested parties can keep checking the mainline repository to see if it's there yet"... hmm, maybe we could implement some sort of buffer to reduce the cost of polling? Perhaps reading LWN for notifications... :)

A new kernel polling interface

Posted Feb 6, 2018 4:19 UTC (Tue) by fest3er (guest, #60379) [Link]

But that's still polling. Better to ask the repo to tell you when a new feature has been added in a particular area.

A new kernel polling interface

Posted Jan 12, 2018 9:58 UTC (Fri) by kkourt (subscriber, #48092) [Link]

> There is a payoff, though, that comes in the form of the AIO ring buffer. This poorly documented aspect of the AIO subsystem maps a circular buffer into the calling process's address space. That process can then consume notification events directly from the buffer rather than calling io_getevents().

Any pointers on how this works?

The relevant thing I found was this: http://git.infradead.org/users/hch/libaio.git/blob/refs/h..., i.e., a way to check if the ring buffer is empty. The code here: http://git.infradead.org/users/hch/libaio.git/blob/refs/h... uses it to avoid the syscall if there is nothing in the queue. But it will still enter the kernel to consume events.

A new kernel polling interface

Posted Jan 15, 2018 9:11 UTC (Mon) by stefanha (subscriber, #55072) [Link]

A new kernel polling interface

Posted Jan 17, 2018 6:40 UTC (Wed) by fuuuuuuc (guest, #120531) [Link]

Heh, keep reinventing this until at some point you give in and have something like kqueue. TL;DR: the main takeaway is that polling APIs on Linux suck; the BSDs are much better in this regard.

A new kernel polling interface

Posted Mar 9, 2018 10:46 UTC (Fri) by dcg (subscriber, #9198) [Link]

There was kevent, which was inspired by kqueue; I still don't understand why it didn't get merged.

A new kernel polling interface

Posted Jun 27, 2018 16:41 UTC (Wed) by ibukanov (subscriber, #3942) [Link]

This puzzles me as well. kqueue, which predates epoll, is just a better interface; why not just copy it?

A new kernel polling interface

Posted Oct 23, 2020 1:57 UTC (Fri) by sergeyn (guest, #142693) [Link]

How does one unblock a waiting thread? I assume this is what IO_CMD_NOOP is for, but I didn't find any examples of how to use it. Thanks.


Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds