
Kernel events without kevents

The long story of the kevent subsystem has appeared on this page a number of times. Kevents are designed to give applications a single system call which they can use to wait for any events of interest: I/O, timers, signals, and more. While quite a bit of work has been done on this code, its path into the kernel has been long. A number of developers are still unconvinced that the interface is needed, and, if it is, that the proposed kevent API (which would have to be maintained forever) is the right one. Now there is a competing approach which may prove easier for the community to accept.

Davide Libenzi is the creator of the epoll_wait() system call; it is a version of poll() which is intended to be scalable to large numbers of file descriptors. This API seems to be well regarded for what it does, but it is limited to waiting on file descriptors. Many of the things that kevents address are not associated with files, and so cannot be handled through the epoll interface.

Kevents fix that shortcoming with the creation of a new subsystem and user-space API. Davide has now shown up with a different strategy: make a way for applications to request delivery of events via a file descriptor. Consider, for example, the case of signals. Signals tend to be tricky for applications to handle; they are asynchronous events which are delivered to a special signal handler function, but that function is seriously limited in what it can do. In response, application developers have resorted to tricks like writing a byte to an internal pipe so that the signal can be handled in the main event loop.

Davide has proposed a new system call named signalfd() which can help developers avoid much of the hassle of working with signals:

    int signalfd(int ufd, const sigset_t *mask, size_t masksize);

If ufd is -1, this call will create (and return) a new file descriptor. The signals described in mask will be caught and returned to the process by way of that file descriptor. It is pollable, allowing signals to be handled in an event loop based on select(), poll() or epoll_wait(). When signals are available, they can be read from the descriptor as data; the signalfd_siginfo structure returned by read() has the signal number and all of the related information that comes with it.

If ufd is set to an existing signal file descriptor, the signalfd() call will replace that descriptor's mask with the new one. It is worth noting that reading from this file descriptor competes with normal signal delivery for queued signals; there is no way to predict whether a given signal will be delivered in the usual way or will be read from the file descriptor. This situation can be avoided by using sigprocmask() to block normal delivery of the signal(s) of interest.

There is a similar interface for timer events:

    int timerfd(int ufd, int clockid, int timertype, 
                const struct timespec *when);

Once again, ufd is -1 to create a new file descriptor, or an existing timer file descriptor which is to be modified. The clockid parameter describes which clock is wanted: CLOCK_MONOTONIC or CLOCK_REALTIME. The type of timer is described by timertype: TFD_TIMER_REL for a time relative to the current time, TFD_TIMER_ABS for an absolute time, or TFD_TIMER_SEQ for a repeating timer at a given interval. The when structure contains the requested expiration time.

Once again, this file descriptor can be polled. Reading from it yields an integer value saying how many times the timer has fired since the last time it was read.

Evgeniy Polyakov, the author of the kevent patches, has not been sitting still while these patches have gone around. His proposal is called eventfs; it is a special filesystem which offers the ability to bind events to file descriptors. The first version of the patch only handles signals, via a system call named (yes) signalfd():

    int signalfd(int signal, int flags);

This call creates a new file descriptor for the given signal (a separate file descriptor is required for each signal in this scheme). In the current code, if flags is nonzero, the signal will only be delivered through eventfs and will never go into the signal queue. The file descriptor is pollable, but there is no way to read any information from it. So any associated signal information is lost; multiple deliveries of the same signal between polls will also be lost.

One assumes that Evgeniy's patches could be improved over time, but Davide's version seems to be ahead in terms of features, coverage, and community review. Davide has also avoided the need to create a new filesystem to back the whole thing up. So if bets were being taken on which approach might make it into the kernel, Davide would seem to be in the lead at the moment.

There are certainly things to be said for this approach. It brings Linux toward a single interface for event delivery without the need for a new, complex API. It also reinforces the role of the file descriptor as a fundamental object for interaction with the kernel. On the other hand, the poll interfaces do not provide a way for applications to receive events without the need to call into the kernel - a feature which has been requested by some interested parties. There are also event types (asynchronous I/O completion, for example) which are not yet covered. So, if things do go this way, it would not be surprising to see patches trying to fill in those gaps in the near future.

Index entries for this article
Kernel: eventfs
Kernel: Kevent
Kernel: signalfd()
Kernel: timerfd()



Kernel events without kevents

Posted Mar 15, 2007 4:27 UTC (Thu) by felixfix (subscriber, #242) [Link]

It's been a while since I've written signal handling code, and I remember how clever I thought I was when I thought up the idea of writing single bytes to an internal pipe. Signals really suck and I never did like them.

Is it possible that if this events-via-fd idea catches on, signal handling syscalls could eventually be done away with and emulated by libc? I salivate at the idea of reliable simple signals without all that messy overhead, or at least with that messy overhead tucked away out of sight in libc.

Kernel events without kevents

Posted Mar 15, 2007 8:17 UTC (Thu) by Ross (guest, #4065) [Link]

You didn't work on Netscape 4.x did you? :) What did you do when the pipe became full?

Kernel events without kevents

Posted Mar 15, 2007 11:05 UTC (Thu) by pphaneuf (guest, #23480) [Link]

Netscape 4.x was a single-threaded, event-driven program, so you do like every other file descriptor, set it non-blocking, and ignore EAGAIN.

Was that a trick question, because if so, I don't get it.

Kernel events without kevents

Posted Mar 15, 2007 12:43 UTC (Thu) by nix (subscriber, #2304) [Link]

Netscape 4 had a habit of going into interminable stalls :)

Kernel events without kevents

Posted Mar 15, 2007 12:50 UTC (Thu) by pphaneuf (guest, #23480) [Link]

Ah, there are many ways in which to wedge oneself, but thankfully, I think the simplest ones were indeed avoided. ;-)

The way Netscape 4 would wedge itself the most often for me (because I would often strace it when it did, since I was working on a big event-driven single-threaded program as well, and I'm curious to boot!) was to be reading the same file descriptor over and over, getting zero, and not getting the hint, for some reason.

I suspect there was a bit of code that wanted to do just the ONE synchronous thing, and of course, it screwed up and blocked the whole program. Tsk tsk tsk...

Kernel events without kevents

Posted Mar 15, 2007 15:28 UTC (Thu) by felixfix (subscriber, #242) [Link]

You have a main loop somewhere using select to read various file descriptors. One of them is the single byte signal pipe. You read that single byte.

The point is to not have multiple bytes which might not be written contiguously.

Kernel events without kevents

Posted Mar 15, 2007 16:04 UTC (Thu) by pphaneuf (guest, #23480) [Link]

It behaves just like a real signal. You get at least one, but it's possible that you lose some due to overflow (the signal handler got EAGAIN and ignores it, giving similar behaviour). The reading end should read until EAGAIN, of course, to drain the pipe, before handling the signal (so that if another signal comes in, it is not missed). Of course, I also use something like a 1K buffer on the stack to "eat" the bytes, I don't read a single byte at a time.

If you really want to avoid this, you can have a bool set aside that you test before writing the byte, set to true when you did, and so on, but I prefer to let the kernel do all the book-keeping for me, I have plenty of other opportunities to screw up.

On a pipe, everything is contiguous. Unless you find a way to seek. ;-)

Kernel events without kevents

Posted Mar 15, 2007 16:17 UTC (Thu) by felixfix (subscriber, #242) [Link]

Multiple writers to one pipe can easily interleave. It's been a while now, but there is that message pipe - is it the AT&T substitute? - which only writes complete messages atomically or returns an error. But regular pipes have no guarantees, just like writing to any file. If your data crosses a block boundary, you have no guarantees, and maybe not even if you stay within block boundaries.

Kernel events without kevents

Posted Mar 15, 2007 16:38 UTC (Thu) by pphaneuf (guest, #23480) [Link]

Well, since I don't even look at the value that I read, it doesn't matter much, does it?

Depending on environmental constraints, I either use a pipe per signal, or the pipe is only to "wake up" the select() and I check some volatile bools that the signal handlers set to know which one it was.

Furthermore, since my "messages" are a single byte (I usually use 42 as a value), and that's the granularity of a pipe, they'd still be atomic. That's if I cared about their value, of course.

Kernel events without kevents

Posted Mar 15, 2007 9:19 UTC (Thu) by simlo (guest, #10866) [Link]

In windows you map sockets to events and use WaitForMultipleObjects() to implement select(). In Linux we are apparently going to map events to filedescriptors and use select() to implement WaitForMultipleObjects().

Kernel events without kevents

Posted Mar 15, 2007 11:09 UTC (Thu) by pphaneuf (guest, #23480) [Link]

The things called "objects" in Win32 are called "file descriptors" in POSIX, but don't let yourself be misled, they don't necessarily have anything to do with "files", only "most of the time".

So select() was *already* WaitForMultipleObjects(), it's just that we are missing the APIs to map some things to "objects", so we can't wait on them. Like futexes, for example.

Kernel events without kevents

Posted Mar 15, 2007 12:31 UTC (Thu) by k8to (guest, #15413) [Link]

Perhaps I am picking a nit, or perhaps I am deeply confused. I thought select() was a pretty painful way to get notification of activity, because the kernel side had to build up a big data blob describing what fds had activity, and then the user side had to grovel around in the big data blob to find out which fds had activity. Is WaitForMultipleObjects() this bad?

I should hope it is closer to epoll, where you get individual items indicating activity.

Kernel events without kevents

Posted Mar 15, 2007 12:46 UTC (Thu) by pphaneuf (guest, #23480) [Link]

WaitForMultipleObjects() resembles poll() a lot, except with a much more baroque return value handling, and an extra feature that could either be extremely useful or just weird, depending on who you ask (there's a bool parameter to "wait for all", where if true, everything has to be ready before returning).

select() isn't too bad on the size of the blob itself (three bits per file descriptor isn't "big", IMHO), but is quite inefficient when the numbers of the file descriptors themselves are all over the place and sparse (if you only wait on fd 1000, it has to scan the bitset for fds 0 to 999 for no reason). poll() is better for that aspect, but in exchange has a "big data blob", so handling a lot of fds is better with select(). WaitForMultipleObjects() uses a simple array of object handles, but doesn't tell you if the object is "readable", "writable" or anything like that, just that it "has been signaled", and it doesn't tell you *which* ones have been signaled, you have to check them all.

epoll is better than all of the above, IMHO. Now, the question is how does it fare compared to kevents (or BSD's kqueue)...

Kernel events without kevents

Posted Mar 15, 2007 12:57 UTC (Thu) by k8to (guest, #15413) [Link]

In the interim I went and looked up the MSDN WaitForMultipleObjects docs. Then I cried. I bet there's a way to find out which of the many objects were signalled during the call, but I can't find it anywhere in the page! Par for the course.

Regarding select, I was not commenting on the size of the object, but that there are N fields to check for N fds, and on the kernel side N fields to set for N fds; thus for large values of N this has unpleasant results. Also the code for handling it (on both sides) is more bother and less focus than a simple stream of events a la epoll. Hurrah for epoll.

The convolving of futexes to fds somehow rubs me the wrong way, but I suppose fds themselves are conceptually simple and inexpensive, and are probably implemented inexpensively.

Kernel events without kevents

Posted Mar 15, 2007 13:26 UTC (Thu) by pphaneuf (guest, #23480) [Link]

Note also how the whole "abandoned" deal limits the number of objects that can be watched to something like 127. Yeah, there is indeed subject for crying. :-)

fds aren't that expensive, no. And really, anything blocking is subject to wanting to put it into select() or epoll, and locking a mutex can be quite blocking. What you'd want is a kind of "try lock" primitive, that would either take it right there (if it's unlocked), or if it couldn't, will make an fd ready when it could take it.

Note that you can already implement a semaphore (and thus a mutex, which is the binary-semaphore special case, easily covered by a general semaphore) by using a pipe, as I "discovered" recently. The problem compared to futexes is that, if I am not mistaken, futexes only do a syscall in case of contention, where my trick does a syscall every time (it's quick, but there I go for full disclosure).

Kernel events without kevents

Posted Mar 16, 2007 20:45 UTC (Fri) by mikov (guest, #33179) [Link]

I am not sure what you mean about not knowing which object was signalled with WaitForMultipleObjects(). It is right there in the documentation:
http://msdn2.microsoft.com/en-us/library/ms687025.aspx

The return value WAIT_OBJECT_0+index indicates that object index was signalled.

In general there is no arguing that Win32's handling of async IO, threads, synchronization, etc. is very self-consistent and sadly still quite ahead of the situation in Linux. Win32 is at least 14 years old and has had all that from the beginning, but we still can't quite catch up ...

Kernel events without kevents

Posted Mar 16, 2007 23:18 UTC (Fri) by pphaneuf (guest, #23480) [Link]

Hmm, for some reason, I thought it was the number of signalled objects, like select() or poll()...

It's actually the index of the signalled object, as you say, and of the lowest one if there are more than one, which means it's even crummier than I thought (how bad can it be, with a limit of 64 objects?!?): you can easily starve the highest numbered objects like that.

While WaitForMultipleObjects() is pretty awful, it's true they have some other pretty nifty things, like the overlapped I/O and completion ports stuff. I also like what you can do with a WNDPROC on a hidden window, without the application knowing anything, as long as it pumps the message queue (which on Win32, you have to do anyway for your program to work). Too bad there's nothing like that on Unix.

Win32 didn't spring fully formed 14 years ago, by the way, I'd point at all the "Ex" and "Ex2" suffixes lying about as examples of added features, when it's not whole APIs.

Kernel events without kevents

Posted Mar 17, 2007 1:50 UTC (Sat) by mikov (guest, #33179) [Link]

Well, I have to respectfully disagree (let's be extra careful not to get into a silly Win32 vs POSIX or whatever flame :-)

WaitForMultipleObjects() is not ideal, of course, and has limitations (personally I have never struggled with them after quite a few years of system programming for Win32), but they also come with their known workarounds. However, it is nothing if not consistent and easy to use. It fits perfectly with the rest of the model - files, asynchronous (overlapped) operations, threads, etc. There is never a question of what is the correct way to implement something, and more importantly there is no need for ridiculous hacks like using pipes from signals, etc.

Usually people who haven't seriously used the Base (*) Win32 API tend to underestimate it (I am not implying that you are one of them), but some parts of it are quite good, IMHO. As I said, the most important quality in my mind is that it is really well thought out and everything fits together. There are no corner cases which do not work. By comparison, Linux or POSIX has a few non-obvious problems and solutions - it shows that it has grown evolutionarily.

(*) By the "base" Win32 API I mean pretty much only the IO, synchronization and threads. Those parts have remained consistent and stable for as long as I remember. The rest of the Win32 API (the GUI, COM, etc) is of course a complete nightmare.

About the Ex functions. AFAIK, they were not added later, they simply provide different functionality. For example, ReadFileEx() is not a newer, more powerful version of ReadFile() - it just operates with a different model. More importantly, you can't emulate ReadFile() using ReadFileEx()! (Well, I could be wrong about when ReadFileEx was added, since I haven't exactly tracked the API changes - it uses a fundamental concept though)

I agree with you that poll() and epoll() are more powerful/convenient than WaitForMultipleObjects(). However, they stand alone - you can't use them (yet) to wait for a signal, for the completion of a child process, for a semaphore (or has that changed already?), etc. That is the problem!

BTW, I also tend to disagree about the utility of the WNDPROC in a hidden window for purposes like you describe. The whole UI paradigm in Win32 is a remnant from Win16, and is more or less incompatible with the rest of the API - thus the need for "hacks" like MsgWaitForMultipleObjects().

In my Windows code I've always tried at all cost to avoid using window messages for anything besides "pure" UI - in my experience they make the code very fragile (and obviously non-portable). You also have the problem that UI operations performance can limit your background tasks. I am not aware of any contemporary Win32 non-GUI API that relies on messages. (Win16 sockets used to, but that has been deprecated in Win32)

Kernel events without kevents

Posted Mar 17, 2007 11:59 UTC (Sat) by pphaneuf (guest, #23480) [Link]

64 is a pretty low number of objects, I'd say, but it depends on the design. It seems more oriented to a "one thread per connection" design, where a single connection could be handling a few things (waiting for an answer from a database, but also watching the socket for disconnection, say). Also, the way it can easily lead to starvation seems like a major design problem with it. Careless ordering of the list of handles could lead to a stuck process! Odd, that.

I both agree and disagree with your reply. Win32 had advanced capabilities for a long time, such as completion ports (although I think they were severely crippled on the non-NT platforms, if I recall correctly). We still don't have many of those on Linux, and POSIX itself is so out of date, it's practically irrelevant. If you want to do something high-performance that works on multiple Unix platforms, you have to invent your own abstraction for the various high-performance APIs, because POSIX is just too pathetic. Sure, you can also make a POSIX version for complete portability, but you just know that this one isn't going to be the one for high-performance requirements.


It's kind of strange, comparing the two APIs. I feel that Win32 is a bit more integrated, yes, having had this stuff for a longer time (everything is an object with a handle, that can be given to WaitForMultipleObjects()), but somehow doesn't feel like that (reading from a socket with ReadFile() seems a bit odd).

The classic Unix API feels more tasteful, but seems to suffer from some rot. Newer additions, like threads, feel like crummy copies of other APIs, not fitting in well with the rest. For example, what you say about poll() not being able to wait on a number of things is really mostly linked to those new things (mostly related to threads) not having been made file descriptors in the first place! Remember that file descriptors are more or less the Unix equivalent of handles, despite having the name "file" in it (it's the "everything is a file" concept).

The signals are the exception (and thus, waiting for child processes), but that's by design. There are only two ways to affect a Unix process: synchronously (but not necessarily blocking) through a file descriptor, or asynchronously through a signal. And there are bridges to go from one to the other (SIGIO to make file descriptors asynchronous, and pipes to turn signals into synchronous events). The signal-and-pipe trick might sound hacky, but really, it's a matter of keeping the core simple, providing lightweight primitives that can be built upon to the same effect.

It helps to remember Windows NT's heritage from VMS, which had a number of distinct file types that were opened and accessed completely differently, including a B-tree file type. Think about it: on VMS, there was the equivalent of Berkeley DB in the kernel! Whereas on Unix, the philosophy is more that we'll give you the tools to write Berkeley DB, and you go from there. Hence my not being incredibly excited about signalfd(): it's nice, but not exceedingly so, since I could easily do without.

In my Windows code I've always tried at all cost to avoid using window messages for anything besides "pure" UI - in my experience they make the code very fragile (and obviously non-portable). You also have the problem that UI operations performance can limit your background tasks. I am not aware of any contemporary Win32 non-GUI API that relies on messages. (Win16 sockets used to, but that has been deprecated in Win32)

You mention UI operations limiting your "background" tasks. This is always true, for a single thread, as really there isn't one that's foreground and the other that's background, they're all equals, competing for execution on a thread. If your UI is giving you grief, the answer isn't necessarily to stop using messages, but to have another thread, so that there's more execution contexts to do the work. Ideally, it would all be so uniform that UI code could run on any thread as well, so things would just get done as quickly as possible, no matter what it is, nothing having the edge over the other.

Kernel events without kevents

Posted Mar 17, 2007 18:35 UTC (Sat) by mikov (guest, #33179) [Link]

64 is a pretty low number of objects, I'd say, but it depends on the design. It seems more oriented to a "one thread per connection" design, where a single connection could be handling a few things (waiting for an answer from a database, but also watching the socket for disconnection, say). Also, the way it can easily lead to starvation seems like a major design problem with it. Careless ordering of the list of handles could lead to a stuck process! Odd, that.

Well, I've never had to use WaitForMultipleObjects() with more than a few handles. I think that for a single-threaded IO server design (which admittedly I haven't done), I'd use the ReadFileEx family of functions, which register a completion callback (APC) - this eliminates the 64-handle problem. Then I'd use WaitForMultipleObjectsEx() if I have to let the callbacks run _and_ wait for semaphores and stuff.

Admittedly, since Win32 has always had well-integrated threads, the need to shoehorn everything into a single thread has never been very strong.

It's kind of strange, comparing the two APIs. I feel that Win32 is a bit more integrated, yes, having had this stuff for a longer time (everything is an object with a handle, that can be given to WaitForMultipleObjects()), but somehow doesn't feel like that (reading from a socket with ReadFile() seems a bit odd).

I guess it is force of habit. The other methods are also available - recv(),read(), etc. BTW, for some things the Win32 API can be a major PITA. For example serial communication - this is where you don't want overlapped operations and WaitForMultipleObjects - instead you really want select(). Alas, in Win32 select() works only on sockets ...

If your UI is giving you grief, the answer isn't necessarily to stop using messages, but to have another thread, so that there's more execution contexts to do the work. Ideally, it would all be so uniform that UI code could run on any thread as well, so things would just get done as quickly as possible, no matter what it is, nothing having the edge over the other.

But that's exactly it. In the main thread of a GUI application you need messages to drive the GUI. However, if you create another thread (let's say for IO), there is no point at all in using Windows messages there. None of the APIs generate them, so you'd have to write code to send them yourself, make a message loop to handle them, etc. What's the point? If you really needed some sort of message queue in your design, there is zero reason to use the Windows GUI one - you are much better off coding something custom.

Kernel events without kevents

Posted Mar 15, 2007 17:18 UTC (Thu) by bronson (subscriber, #4806) [Link]

...it's just that we are missing the APIs to map some things to "objects", so we can't wait on them. Like futexes, for example.

Er, doesn't futex(2) support FUTEX_FD? I'm going to use this in code I'm going to write next week. Someone please tell me now if this doesn't work!

Kernel events without kevents

Posted Mar 15, 2007 17:25 UTC (Thu) by pphaneuf (guest, #23480) [Link]

Sorry for the confusion, I was unaware of FUTEX_FD! That said, I didn't test it either, but it looks similar to what I have been wanting.

I suppose what I am wishing for is for this to be in the POSIX thread API itself, but that's asking for a great deal.

Kernel events without kevents

Posted Mar 16, 2007 15:53 UTC (Fri) by cventers (guest, #31465) [Link]

I think I remember reading something on LKML about FUTEX_FD being broken...

Kernel events without kevents

Posted Mar 16, 2007 15:58 UTC (Fri) by pphaneuf (guest, #23480) [Link]

Since it doesn't see any use in the mutex API that everyone uses, it's not surprising... Code that's not used tends to rot.

Too bad, I would like that. I'll stay with my "pipes as semaphores" trick in the meantime...

Kernel events without kevents

Posted Mar 16, 2007 16:47 UTC (Fri) by bronson (subscriber, #4806) [Link]

You're right. Apparently it's "unfixably racy" and will disappear in June 2007.

http://lkml.org/lkml/2006/10/31/391

And it looks like there's no alternative other than self-pipes. Arg!

what about wait?

Posted Mar 15, 2007 11:24 UTC (Thu) by mcatkins (guest, #4270) [Link]

I used a version of unix 20 years ago that had a /dev/wait special file, and
reading from it subsumed the wait system call. (one could also select, etc)

1) The wait system call should also be added to this scheme

2) Why do we need yet more system calls for any of this?
Why not just have special files in /dev?
[Neither approach allows one to do anything "interesting", like reading
another process' signal/timer/wait queue! Reading the special files
is/would be related to the reading process, like /dev/stdin, etc.]

Martin

Kernel events without kevents

Posted Mar 15, 2007 13:07 UTC (Thu) by pphaneuf (guest, #23480) [Link]

This is nothing short of fantastic! We're getting really close to my litmus test of being able to implement an asynchronous DNS resolver without threads that simply calls a callback when a request is done. I have been wishing for exactly this for a while now (note that I didn't ask for signals, because I know how to fake those myself, but now I get them for free! huzzah!).

All that's missing is a way to make it so that epoll_wait() automatically calls my callback when its event is tripped, just like Linus described back in 2000. :-)

There's still a bit of the "library problem", where a library still has to cooperate with the application that linked with it in order to have its events processed, but it's not too bad. With this and epoll, a library could create its own epoll fd, put all its things in there, return it to the application and tell it to call a certain function when it is readable. Even if epoll_wait() did automatically call back event handlers, an application would have to pass it an epoll fd to register its events with anyway.

On Windows, this is handled with a hidden window that has a WNDPROC, and it's one notch better, because the application doesn't have to know anything. It's like there was no epoll_create, and that the other epoll syscalls used a single list of events per process. But we're close enough that I think we're just about good here, at least for the next little while...

signals without signal stacks

Posted Mar 15, 2007 15:31 UTC (Thu) by pjones (subscriber, #31722) [Link]

It's better than that - with signalfd(), you can use e.g. malloc(), printf(), and backtrace() while handling a signal.

That's really a huge win. It'll also mean things like Xorg won't need to inject crap like VT_CHANGE handling into its event loop from a signal handler.

signals without signal stacks

Posted Mar 15, 2007 15:43 UTC (Thu) by pphaneuf (guest, #23480) [Link]

You could do malloc() and printf() with the "write a byte on a pipe in the signal handler" trick, but backtrace()? It won't be like doing it from a signal handler, it'll just give the stack trace of where you read() from the signal handler fd, no? That, again, would be just like the pipe trick.

Not that it's not neat, it saves a whole lot of coding, having to save things on the side so you can look at them later, but it basically operates the same way. I didn't focus on those very much, because handling signals in a library, through the pipe trick or not, is just a bad idea, IMHO (the application hooks the same signal, and then everything goes to hell in a handbasket).

But the timer is something that didn't really need coordination with an application (from the point of view of a library), but had to, because, well, that's the way it was. I had some idiotic tricks that worked, but were just disgusting (a thread started from my library to handle timers and write to a pipe to wake up the main loop, eww!). Well, not anymore!

signals without signal stacks

Posted Mar 15, 2007 16:11 UTC (Thu) by pjones (subscriber, #31722) [Link]

Well, it's not _always_ a bad idea. Consider the SIGPIPE vs db4 problem. If you have a library using both db4 and a network, you sometimes need to block signals - but you still want them to be raised to the caller. With signalfd(), you can do a fairly simple callback system sanely, which you really can't do as cleanly with the old-style signal/sigaction interfaces.

(yeah, arguably you can raise(), but you still have to have a doesn't-do-much signal handler deep in a library, which can get really ugly really fast)

Kernel events without kevents

Posted Mar 15, 2007 19:29 UTC (Thu) by mtaht (guest, #11087) [Link]

Um, er, epoll can call your callback with only a tiny bit of wrapping

see http://boston.conman.org/2007/03/08

Kernel events without kevents

Posted Mar 15, 2007 21:01 UTC (Thu) by bronson (subscriber, #4806) [Link]

http://svn.u32.net/io/trunk/ works pretty well too but I haven't gotten around to documenting it yet... if ever...

Kernel events without kevents

Posted Mar 16, 2007 7:03 UTC (Fri) by pphaneuf (guest, #23480) [Link]

Thanks, it seems interesting, I'll definitely have a look, since I'm in the process of making a similar edge-triggered wrapper, but with the added twist of being multithreaded (a limited number of threads, so that more events can be handled in a given amount of time, to exploit multicore systems and such while still being event-driven).

Kernel events without kevents

Posted Mar 16, 2007 11:55 UTC (Fri) by bronson (subscriber, #4806) [Link]

I agree, multicore is here to stay. That code lets me run one epoll poller per thread and one thread per core (plus a few maintenance threads). I haven't tried it under serious load yet so there may be a few small bugs left to wiggle out, and the poller selection is utterly hacked (it's a todo item), but it works for me so far.

Feel free to mail me at bronson at domain rinspin.com.

Kernel events without kevents

Posted Mar 16, 2007 6:59 UTC (Fri) by pphaneuf (guest, #23480) [Link]

Indeed, and I very much love epoll for that, but for that to work between a library and an application (with the library putting things into the epoll fd, and the application being the one calling epoll_wait()), they pretty much have to use the same tiny bit of wrapping.

After that, it can totally be done, but if you make your library for, say, libevent, and someone tries using it in a Qt program, it's a pain.

Kernel events without kevents

Posted Mar 16, 2007 2:59 UTC (Fri) by wahern (subscriber, #37304) [Link]

Huh? I've been using asynchronous DNS resolvers for years:

ADNS
C-Ares
UDNS

My core event loop is libevent, which handles callbacks for signals, timers and I/O readiness. I currently use C-Ares for sending and receiving raw DNS messages, with my lookup API in my async meta-API library libevnet (since C-Ares tries to mirror the useless gethostbyname interface). In libevnet you can ask for an MX+A record, and it will ultimately always get back A (and/or AAAA, if you specified) records suitable for sending mail. And this can be expanded upon, so that you can ask the library to do the smart thing:

s = socket_open(&socket_defaults);
socket_name_init(&n, "google.com", "smtp", LOOKUP_IN_MX|LOOKUP_IN_A|LOOKUP_IN_AAAA);
tv.tv_sec = 5;
socket_connect(s, &n, &my_callback, my_arg, &tv);

Kernel events without kevents

Posted Mar 16, 2007 10:49 UTC (Fri) by pphaneuf (guest, #23480) [Link]

I was using that as an example of a library that uses multiple file descriptors and has its own timeouts. And yes, it is already possible, my issue is that it's just so clunky.

For example, ADNS has two calls to fiddle with your select()/poll() parameters before and after, so if you use something else, you have to hack a bit to know its file descriptors. In particular, it doesn't match well with an API where you register the interests only once, which is the case of every single new API (because it's fundamentally more efficient than starting from scratch every time).

C-Ares cuts down on the hacking a bit, since it gives out the file descriptors it's interested in up-front, rather than making you go through an array of struct pollfd. But it's still oriented toward a "from scratch every time" API like select()/poll(), so you have to remember what it said last time, and tweak your interest set accordingly. Not my idea of fun and painless, but I've done it, and I've lived through it.

UDNS restricts itself to using a single file descriptor, in an attempt to make this integration easier, but this comes at the cost of not being able to do TCP queries (which are required when a response is too large, which is actually fairly common for MX queries of large sites). So, it arguably crippled itself functionally in order to do what I said, still leaving timeout management to deal with (but that part is easy, at least).

With timerfd, one can use epoll_create() in a library, return that to the application and tell them quite simply "when this fd is readable, call this function here", and that's it. The main application can use select(), poll(), or whatever it feels like using, it doesn't have to deal with anything in its timeout management, it's all reduced to a single bit of information: is this fd readable?

Xlib is also like that (through ConnectionNumber()), which makes it very easy to deal with, but that's a bit easier, since it doesn't have timeouts and really just has the one file descriptor to deal with. Hence my using asynchronous DNS as an example with multiple descriptors (if you don't punt on the TCP queries) and timeouts.

Kernel events without kevents

Posted Mar 24, 2007 1:20 UTC (Sat) by slamb (guest, #1070) [Link]

That's a problem of poor library interfaces, not poor kernel interfaces. And in the case of C-Ares, it's not even true. Look at ARES_OPT_SOCK_STATE_CB.

Kernel events without kevents

Posted Mar 24, 2007 1:11 UTC (Sat) by slamb (guest, #1070) [Link]

I'm not sure what you're asking for. What new kernel interface do you need for asynchronous, single-threaded DNS resolution? There are several such libraries already, and they work fine for me.

Kernel events without kevents

Posted Mar 15, 2007 13:45 UTC (Thu) by pphaneuf (guest, #23480) [Link]

On the other hand, the poll interfaces do not provide a way for applications to receive events without the need to call into the kernel - a feature which has been requested by some interested parties.

I keep being mystified by this, since there clearly needs to be some synchronization between userspace and the kernel, which seems to be done with kevent_commit(), which is... a system call, that, uh, calls into the kernel.

The generality of kevent with regard to AIO and other event sources could be interesting, and its potential to avoid copies by mmapping the ring buffer between userspace and the kernel is another possible advantage. While I happen not to be too convinced by either of these points, they at least have the merit of existing.

But unless I am nuts (which might be the case!), getting events from kevent requires calling into the kernel. Either someone explains to me how this is not the case, exactly, or people stop saying that.

Kernel events without kernel calls

Posted Mar 15, 2007 13:55 UTC (Thu) by corbet (editor, #1) [Link]

But unless I am nuts (which might be the case!), getting events from kevent requires calling into the kernel. Either someone explains to me how this is not the case, exactly, or people stop saying that.

That's what the whole user-space event ring mechanism is about. This article from December describes a recent version of the API.

Kernel events without kernel calls

Posted Mar 15, 2007 14:03 UTC (Thu) by pphaneuf (guest, #23480) [Link]

And, as per that linked article, when events are consumed from the ring buffer, kevent_commit() (which I presume is a system call, "calling into the kernel") has to be called "from time to time".

Some batching can be done, but this is very similar to the kind of batching that happens when calling epoll_wait() with a "maxevents" parameter bigger than one.

So, there's a system call at a similar frequency as epoll occurring. I'll grant you, there is some opportunity for less copying of the event structures, as I mentioned, but that's it.

Kernel events without kernel calls

Posted Mar 22, 2007 22:56 UTC (Thu) by hno (guest, #43549) [Link]

There is a significant difference between epoll and ring buffer batching.

With epoll you incur a syscall every single time you want to process events, even if there is only one event to process.

With a ring buffer you only need to notify the kernel when there is a risk that the kernel may run out of space in the buffer - for example, once per ringbuffersize/2 events processed.

It's true that when you get heavily CPU bound and are not able to keep up with the rate of events, the two converge to about the same. But as long as processing can keep up with the rate of events, the ring buffer always wins, i.e. in all situations except complete overload.

So in the worst case the ring buffer performs equal to the explicit call; on average it performs significantly better.

Kernel events without kevents

Posted Mar 15, 2007 16:45 UTC (Thu) by vmole (guest, #111) [Link]

It is worth noting that reading from this file descriptor competes with normal signal delivery for queued signals; there is no way to predict whether the signal will be delivered in the usual way or will be read from the file descriptor. This situation can be avoided by using sigprocmask() to block normal delivery of the signal(s) of interest.

Oh, good, another set of racy signal functions. If I call signalfd() before I call sigprocmask(), there's a period where signals can be delivered either way. If I call sigprocmask() first, am I guaranteed that the masked signals will be queued to the signal fd when I get around to creating it? This mess is why sigaction(2) was invented: to allow you to set both the signal of interest and the signals to mask in one operation. Is there any reason why you'd want signals delivered both ways? If not, then why doesn't signalfd() automatically mask the signals in its arguments?

Kernel events without kevents

Posted Mar 17, 2007 0:41 UTC (Sat) by giraffedata (guest, #1954) [Link]

If I call sigprocmask() first, am I guaranteed that the masked signals will be queued to the signal fd when I get around to creating it?

I haven't seen the implementation, but that would be the most natural way. I don't think signals are queued to a fd at all; I think read of the fd checks for and receives pending signals, in spite of any mask (so poll checks for pending signals). And assuming it's documented with those words, there's not much room for confusion.

This mess is why sigaction(2) was invented, to allow you to set both signal of interest and signals to mask in one operation.

How does it do that? I don't think sigaction modifies the signal mask.

And you don't need it to. Where would it be useful?

However, I agree that there is no practical application of signals arbitrarily being delivered traditionally or to an fd, so except that it's extra code, it would be sensible to have signalfd() automatically block the signal class.

Kernel events without kevents

Posted Mar 17, 2007 17:19 UTC (Sat) by vmole (guest, #111) [Link]

How does it do that? I don't think sigaction modifies the signal mask.

I misremembered, partially. The sa_mask member of the structure just blocks signals during delivery of the desired signal, not all the time. (Before sigaction(2), you had to try to call sigprocmask(2) from inside your signal handler to get this effect, which was completely unreliable.) This kind of blocking isn't necessary with signalfd because (duh!) the signals aren't delivered asynchronously.

I wrote the man page for that four years ago :-)

Posted Mar 15, 2007 22:11 UTC (Thu) by dank (guest, #1865) [Link]

http://marc.theaimsgroup.com/?l=linux-kernel&m=993560...

Once upon a time a hacker named Xman
wrote a library that used aio, and decided
to use sigtimedwait() to pick up completion
notifications. It worked well, and his I/O
was blazing fast (since was using a copy
of Linux that was patched to have good aio).
But when he tried to integrate his library
into a large application someone else had
written, woe! that application's use of signals
conflicted with his library. "Fsck!" said Xman.
At that moment a fairy appeared, and said
"Young man, watch your language, or I'm going to
have to turn you into a goon! I'm the good fairy Eunice.
Can I help you?" Xman explained his problem to Eunice,
who smiled and said "All you need is right here,
just type 'man 2 sigopen'". Xman did, and saw:


SIGOPEN(2) Linux Programmer's Manual SIGOPEN(2)

NAME
sigopen - open a signal as a file descriptor

SYNOPSIS
#include <signal.h>

int sigopen(int signum);

DESCRIPTION
The sigopen system call opens signal number signum as a file descriptor.
That signal is no longer delivered normally or available for pickup
with sigwait() et al. Instead, it must be picked up by calling
read() on the file descriptor returned by sigopen(); the buffer passed to
read() must have a size which is a multiple of sizeof(siginfo_t).
Multiple signals may be picked up with a single call to read().
When that file descriptor is closed, the signal is available once more
for traditional use.
A signal number cannot be opened more than once concurrently; sigopen()
thus provides a way to avoid signal usage clashes in large programs.

RETURN VALUE
sigopen returns the new file descriptor, or -1 on error (in which case, errno
is set appropriately).

ERRORS
EWOULDBLOCK signal is already open

NOTES
read() will block when reading from a file descriptor opened by sigopen()
until a signal is available unless fcntl(fd, F_SETFL, O_NONBLOCK) is called
to set it into nonblocking mode.

HISTORY
sigopen() first appeared in the 2.5.2 Linux kernel.

Linux July, 2001 1

When he finished reading, he knew just how to solve his
problem, and he lived happily ever after.

The End.

- Dan

Kernel events without kevents

Posted Jul 9, 2007 6:54 UTC (Mon) by jel (guest, #38548) [Link]

This all sounded good, up until the "if flags is non-zero" part. I really hope that's actually defined as "if flags is 1". Otherwise, it's losing ~4 billion possible meanings (more on 64-bit, I guess), any of which could be used for extended functionality in the future.


Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds