
KAISER: hiding the kernel from user space


By Jonathan Corbet
November 15, 2017
Since the beginning, Linux has mapped the kernel's memory into the address space of every running process. There are solid performance reasons for doing this, and the processor's memory-management unit can ordinarily be trusted to prevent user space from accessing that memory. More recently, though, some more subtle security issues related to this mapping have come to light, leading to the rapid development of a new patch set that ends this longstanding practice for the x86 architecture.

Some address-space history

On 32-bit systems, the address-space layout for a running process dedicated the bottom 3GB (0x00000000 to 0xbfffffff) for user-space use and the top 1GB (0xc0000000 to 0xffffffff) for the kernel. Each process saw its own memory in the bottom 3GB, while the kernel-space mapping was the same for all. On an x86_64 system, the user-space virtual address space goes from zero to 0x7fffffffffff (the bottom 47 bits), while kernel-space mappings are scattered in the range above 0xffff880000000000. While user space can, in some sense, see the address space reserved for the kernel, it has no actual access to that memory.
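The arithmetic behind these layouts can be checked quickly (a sketch using only the constants quoted above):

```python
# 32-bit split: user space gets 0x00000000-0xbfffffff, the kernel the top 1GB.
GB = 1 << 30

user32 = 0xbfffffff - 0x00000000 + 1
kernel32 = 0xffffffff - 0xc0000000 + 1
assert user32 == 3 * GB
assert kernel32 == 1 * GB

# x86-64: user space spans the bottom 47 bits of the address space.
user64 = 0x7fffffffffff + 1
assert user64 == 1 << 47
print(user64 // (1 << 40), "TiB")  # → 128 TiB of user-space virtual addresses
```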

This mapping scheme has caused problems in the past. On 32-bit systems, it limits the total size of a process's address space to 3GB, for example. The kernel-side problems are arguably worse, in that the kernel can only directly access a bit less than 1GB of physical memory; using more memory than that required the implementation of a complicated "high memory" mechanism. 32-bit systems were never going to be great at using large amounts of memory (for a 20th-century value of "large"), but keeping the kernel mapped into user space made things worse.

Nonetheless, this mechanism persists for a simple reason: getting rid of it would make the system run considerably slower. Keeping the kernel permanently mapped eliminates the need to flush the processor's translation lookaside buffer (TLB) when switching between user and kernel space, and it allows the TLB entries for kernel space to never be flushed. Flushing the TLB is an expensive operation for a couple of reasons: having to go to the page tables to repopulate the TLB hurts, but the act of performing the flush itself is slow enough that it can be the biggest part of the cost.

Back in 2003, Ingo Molnar implemented a different mechanism, where user space and kernel space each got a full 4GB address space and the processor would switch between them on every context switch. The "4G/4G" mechanism solved problems for some users and was shipped by some distributors, but the associated performance cost ensured that it never found its way into the mainline kernel. Nobody has seriously proposed separating the two address spaces since.

Rethinking the shared address space

On contemporary 64-bit systems, the shared address space does not constrain the amount of virtual memory that can be addressed as it used to, but there is another problem that is related to security. An important technique for hardening the system is kernel address-space layout randomization (KASLR), which randomizes the placement of the kernel in the virtual address space at boot time. By denying an attacker the knowledge of where the kernel lives in memory, KASLR makes many types of attack considerably more difficult. As long as the actual location of the kernel does not leak to user space, attackers will be left groping in the dark.

The problem is that this information leaks in many ways. Many of those leaks date back to simpler days when kernel addresses were not sensitive information; it even turns out that your editor introduced one such leak in 2003. Nobody was worried about exposing that information at that time. More recently, a concerted effort has been made to close off the direct leaks from the kernel, but none of that will be of much benefit if the hardware itself reveals the kernel's location. And that would appear to be exactly what is happening.

This paper from Daniel Gruss et al. [PDF] cites a number of hardware-based attacks on KASLR. They use techniques like exploiting timing differences in fault handling, observing the behavior of prefetch instructions, or forcing faults using the Intel TSX (transactional memory) instructions. There are rumors circulating that other such channels exist but have not yet been disclosed. In all of these cases, the processor responds differently to a memory access attempt depending on whether the target address is mapped in the page tables, regardless of whether the running process can actually access that location. These differences can be used to find where the kernel has been placed — without making the kernel aware that an attack is underway.

Fixing information leaks in the hardware is difficult and, in any case, deployed systems are likely to remain vulnerable. But there is a viable defense against these information leaks: making the kernel's page tables entirely inaccessible to user space. In other words, it would seem that the practice of mapping the kernel into user space needs to end in the interest of hardening the system.

KAISER

The paper linked above provided an implementation of separated address spaces for the x86-64 kernel; the authors called it "KAISER", which evidently stands for "kernel address isolation to have side-channels efficiently removed". This implementation was not suitable for inclusion into the mainline, but it was picked up and heavily modified by Dave Hansen. The resulting patch set (still called "KAISER") is in its third revision and seems likely to find its way upstream in a relatively short period of time.

Whereas current systems have a single set of page tables for each process, KAISER implements two. One set is essentially unchanged; it includes both kernel-space and user-space addresses, but it is only used when the system is running in kernel mode. The second "shadow" page table contains a copy of all of the user-space mappings, but leaves out the kernel side. Instead, there is a minimal set of kernel-space mappings that provides the information needed to handle system calls and interrupts, but no more. Copying the page tables may sound inefficient, but the copying only happens at the top level of the page-table hierarchy, so the bulk of that data is shared between the two copies.
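A toy model may help show why the copy is cheap (purely illustrative; real x86 page tables are four or five levels of hardware-defined structures, not Python dicts). Only the top-level table is duplicated; its entries point at the same lower-level tables, so the user-space mappings remain shared between the two copies:

```python
# Toy two-level "page table": a top-level dict whose values are shared
# lower-level tables (dicts mapping virtual page -> physical page).
user_lower = {0x1000: 0x80000}         # user-space mappings
kernel_lower = {0xffff0000: 0x100000}  # kernel-space mappings

# Full table: used only while running in kernel mode.
kernel_pgd = {"user": user_lower, "kernel": kernel_lower}

# Shadow table: copy only the top level, leaving out the kernel entries.
shadow_pgd = {k: v for k, v in kernel_pgd.items() if k != "kernel"}

# The user-space lower-level table is shared, not copied...
assert shadow_pgd["user"] is kernel_pgd["user"]
# ...so a new user-space mapping appears in both sets of tables at once.
user_lower[0x2000] = 0x90000
assert shadow_pgd["user"][0x2000] == 0x90000
# And the kernel's mappings are simply absent from the shadow copy.
assert "kernel" not in shadow_pgd
```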

Whenever a process is running in user mode, the shadow page tables will be active. The bulk of the kernel's address space will thus be completely hidden from the process, defeating the known hardware-based attacks. Whenever the system needs to switch to kernel mode, in response to a system call, an exception, or an interrupt, for example, a switch to the other page tables will be made. The code that manages the return to user space must then make the shadow page tables active again.

The defense provided by KAISER is not complete, in that a small amount of kernel information must still be present to manage the switch back to kernel mode. In the patch description, Hansen wrote:

The minimal kernel page tables try to map only what is needed to enter/exit the kernel such as the entry/exit functions, interrupt descriptors (IDT) and the kernel trampoline stacks. This minimal set of data can still reveal the kernel's ASLR base address. But, this minimal kernel data is all trusted, which makes it harder to exploit than data in the kernel direct map which contains loads of user-controlled data.

While the patch does not mention it, one could imagine that, if the presence of the remaining information turns out to give away the game, it could probably be located separately from the rest of the kernel at its own randomized address.

The performance concerns that drove the use of a single set of page tables have not gone away, of course. More recent processors offer some help, though, in the form of process-context identifiers (PCIDs). These identifiers tag entries in the TLB; lookups in the TLB will only succeed if the associated PCID matches that of the thread running in the processor at the time. Use of PCIDs eliminates the need to flush the TLB at context switches; that reduces the cost of switching page tables during system calls considerably. Happily, the kernel got support for PCIDs during the 4.14 development cycle.
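The effect of PCIDs can also be sketched with a toy TLB (again illustrative only; a real TLB is a hardware cache and the PCID is a 12-bit tag loaded through CR3): entries are tagged, lookups hit only under the matching tag, and so a page-table switch needs no flush.

```python
# Toy TLB: entries are keyed by (pcid, virtual page).
tlb = {}

def tlb_fill(pcid, vpage, ppage):
    tlb[(pcid, vpage)] = ppage

def tlb_lookup(pcid, vpage):
    # A lookup only hits if the tag matches the current PCID, so entries
    # for other address spaces survive a switch without being flushed.
    return tlb.get((pcid, vpage))

tlb_fill(pcid=1, vpage=0x1000, ppage=0x80000)  # filled while process 1 runs
tlb_fill(pcid=2, vpage=0x1000, ppage=0x90000)  # same vpage, other process

assert tlb_lookup(1, 0x1000) == 0x80000
assert tlb_lookup(2, 0x1000) == 0x90000  # no stale translation, no flush
```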

Even so, there will be a performance penalty to pay when KAISER is in use:

KAISER will affect performance for anything that does system calls or interrupts: everything. Just the new instructions (CR3 manipulation) add a few hundred cycles to a syscall or interrupt. Most workloads that we have run show single-digit regressions. 5% is a good round number for what is typical. The worst we have seen is a roughly 30% regression on a loopback networking test that did a ton of syscalls and context switches.

Not that long ago, a security-related patch with that kind of performance penalty would not have even been considered for mainline inclusion. Times have changed, though, and most developers have realized that a hardened kernel is no longer optional. Even so, there will be options to enable or disable KAISER, perhaps even at run time, for those who are unwilling to take the performance hit.

All told, KAISER has the look of a patch set that has been put onto the fast track. It emerged nearly fully formed and has immediately seen a lot of attention from a number of core kernel developers. Linus Torvalds is clearly in support of the idea, though he naturally has pointed out a number of things that, in his opinion, could be improved. Nobody has talked publicly about time frames for merging this code, but 4.15 might not be entirely out of the question.

Index entries for this article
Kernel: Memory management/User-space layout
Kernel: Security/Meltdown and Spectre
Security: Linux kernel



KAISER: hiding the kernel from user space

Posted Nov 15, 2017 5:54 UTC (Wed) by jreiser (subscriber, #11027) [Link]

> (CR3 manipulation) add a few hundred cycles to a syscall or interrupt

That's a couple L3 cache misses [CAS latency on SDRAM has been ~60ns for decades] which probably is tolerable on a syscall. But hundreds of cycles is horrible for an interrupt. [33MHz is a common bus clock, so just generating an interrupt already requires an average latency of ~15ns.] Some architectures have a special interrupt context (and/or separate small locked caches) exactly for this reason.
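The bracketed figures can be sanity-checked (a quick sketch; the 3GHz core clock used for the cycles-to-nanoseconds conversion is an assumption, not something the comment states):

```python
bus_hz = 33e6
period_ns = 1e9 / bus_hz       # one 33MHz bus-clock period: ~30.3ns
avg_sync_ns = period_ns / 2    # average wait to the next clock edge: ~15ns
assert 15 < avg_sync_ns < 16

core_hz = 3e9                  # assumed core frequency
few_hundred_cycles_ns = 300 / core_hz * 1e9  # 300 cycles -> 100ns
cas_ns = 60
# "a couple L3 cache misses": 300 cycles is indeed under 2 x ~60ns CAS latency
assert few_hundred_cycles_ns < 2 * cas_ns
print(f"{avg_sync_ns:.2f}ns average sync; 300 cycles = {few_hundred_cycles_ns:.0f}ns")
```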

KAISER: hiding the kernel from user space

Posted Nov 15, 2017 15:39 UTC (Wed) by hansendc (subscriber, #7363) [Link]

Interrupts already cost a small number of thousands of cycles, even for a quick one. The entry plus IRET costs alone probably eclipse the (new) CR3 manipulation cost. So, while this CR3 manipulation makes things worse, it does not fundamentally change the speed of an interrupt.

KAISER: hiding the kernel from user space

Posted Dec 9, 2018 15:58 UTC (Sun) by dembego3 (guest, #129120) [Link]

Thank you for caring for my safety:-)

KAISER: hiding the kernel from user space

Posted Nov 15, 2017 8:19 UTC (Wed) by marcH (subscriber, #57642) [Link]

> a number of hardware-based attacks on KASLR. They use techniques like exploiting timing differences in fault handling,

I find timing-based attacks fascinating.

In order to grow, computer performance has become less and less deterministic. On one hand, this makes real-time guarantees and performance prediction harder. On the other hand, it leaks more and more information about the system.

KAISER: hiding the kernel from user space

Posted Nov 15, 2017 14:19 UTC (Wed) by epa (subscriber, #39769) [Link]

If these timing-based attacks involve accessing a page in the kernel address space and getting some kind of memory protection fault, can't the kernel add a small random delay each time such a fault is hit before control returns to user space? The delay could even increase with subsequent faults, imposing a ceiling on how many faults the process can generate. That is, provided there's a way to do this while not imposing that same delay on processes that are using these faults for other things, like memory-mapped files.

KAISER: hiding the kernel from user space

Posted Nov 15, 2017 15:55 UTC (Wed) by matthias (subscriber, #94967) [Link]

This will not work, as today's CPUs provide ways to trigger a page fault without involving the kernel (e.g. the TSX instructions). These instructions simply fail if the memory is not mapped or not accessible, unlike ordinary memory accesses, which would invoke the kernel's page-fault handler.

I also did not know this before, but several of these attacks are described in the linked paper.

KAISER: hiding the kernel from user space

Posted Nov 27, 2017 15:46 UTC (Mon) by abufrejoval (subscriber, #100159) [Link]

I'd like to see CPUs turn to randomized instruction timings, both to implement power control and to throw a punch at timing-based attacks.
These days, when CPUs constantly vary their speeds, either to exploit every bit of thermal headroom they can find or to continually re-adjust toward an energy optimum for the workload, it seems almost stupid to try sticking to a constant speed.

If instead you set a randomization bias you can run CPUs at say 5GHz logical clock and then add random delays to hit say 3, 2 or 1 GHz on average depending on the workload. Every iteration of an otherwise pretty identical loop would wind up a couple of clocks different, throwing off snoop code without much of an impact elsewhere. Of course it shouldn't be one central clock overall, but essentially any clock domain could use its own randomization source and bias. I guess CPUs have vast numbers of clock synchronization gates these days anyway, so very little additional hardware should be required.

Stupid, genius or simply old news?

KAISER: hiding the kernel from user space

Posted Nov 27, 2017 16:55 UTC (Mon) by NAR (subscriber, #1313) [Link]

It might hinder optimization if running the same code under the same circumstances results in different execution speeds...

KAISER: hiding the kernel from user space

Posted Jan 5, 2018 8:26 UTC (Fri) by clbuttic (guest, #121058) [Link]

ATLAS (Automatically Tuned Linear Algebra Software) does a lengthy tuning step that expects consistent CPU performance.

KAISER: hiding the kernel from user space

Posted Jan 6, 2018 16:02 UTC (Sat) by nix (subscriber, #2304) [Link]

Of course, that tuning step already requires sysadmin intervention to turn off CPU frequency changes, ideally hyperthreading, etc: adding one more intervention isn't likely to be terribly difficult.

KAISER: hiding the kernel from user space

Posted Nov 27, 2017 17:43 UTC (Mon) by excors (subscriber, #95769) [Link]

What stops an attacker from just running their test thousands or millions of times, to average out the randomisation that you've added?

For example, the KAISER paper says the "double page fault attack" distinguishes page faults taking 12282 cycles for mapped pages and 12307 cycles for unmapped pages, i.e. a difference of 25 cycles. If I remember how maths works: You could add a random delay to the page fault handler (or randomly vary the CPU speed or whatever) so it has a mean and standard deviation of (M, S) for mapped pages and (M+25, S) for unmapped. If S > 25 (very roughly) then the attacker can measure a page fault but can't be sure whether it belongs to the first category or the second.

But if they repeat it 10,000 times (which only takes a few msecs) and average their measurements, they'd expect to get a number in the distribution (M, S/100) for mapped pages or (M+25, S/100) for unmapped. You'd have to make S > 2500 to stop them being able to distinguish those cases easily. At that point it's much more expensive than the KAISER defence, and it would still be useless against an attacker who can repeat the measurement a million times. And that's for measuring a relatively tiny difference of 25 cycles in an operation that takes ~12K cycles; it's harder to protect the TSX or prefetch attacks where the operation only takes ~300 cycles.

It seems much safer to ensure operations will always take a constant amount of time, rather than adding randomness and just hoping the statistics work in your favour.
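The averaging argument can be simulated (a sketch with made-up noise; the standard deviation of 100 cycles is an assumption for illustration, not a measured figure):

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

M, S, DELTA = 12282, 100, 25  # hypothetical mean, added noise, 25-cycle gap
N = 10_000

mapped   = [random.gauss(M, S) for _ in range(N)]
unmapped = [random.gauss(M + DELTA, S) for _ in range(N)]

# Single samples overlap heavily, since S >> DELTA. But the mean of N
# samples has standard deviation S/sqrt(N) = 1 cycle, so the 25-cycle
# gap stands out clearly once the attacker averages.
gap = statistics.mean(unmapped) - statistics.mean(mapped)
assert 20 < gap < 30
print(f"measured gap after averaging: {gap:.1f} cycles")
```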

KAISER: hiding the kernel from user space

Posted Nov 15, 2017 9:27 UTC (Wed) by sorokin (guest, #88478) [Link]

> Most workloads that we have run show single-digit regressions. 5% is a good round number for what is typical.

A single change may not affect performance significantly (although a 5% slowdown is too much for my taste). But multiple changes can stack up over time. In the past this has been seen both for performance-improving and for performance-regressing changes. A regression example is how compilers' performance degraded over time (although since GCC 6 the trend has reversed). An improvement example is sqlite.

KAISER: hiding the kernel from user space

Posted Nov 15, 2017 11:25 UTC (Wed) by ballombe (subscriber, #9523) [Link]

> Since the beginning, Linux has mapped the kernel's memory into the address space of every running process.

IIRC this is not true on SPARC, is it ?

KAISER: hiding the kernel from user space

Posted Nov 15, 2017 14:05 UTC (Wed) by arjan (subscriber, #36785) [Link]

I think you might mean s390

KAISER: hiding the kernel from user space

Posted Nov 15, 2017 14:56 UTC (Wed) by cborni (subscriber, #12949) [Link]

I think both SPARC (I am sure about Solaris, but not Linux) and s390 (here I am sure for Linux) have separate kernel and user address spaces.

KAISER: hiding the kernel from user space

Posted Nov 16, 2017 2:06 UTC (Thu) by jamesmorris (subscriber, #82698) [Link]

Correct for SPARC. The user and kernel address spaces are separate.

KAISER: hiding the kernel from user space

Posted Nov 16, 2017 2:10 UTC (Thu) by jamesmorris (subscriber, #82698) [Link]

Also, it's an architectural feature, so it's true on Linux.

KAISER: hiding the kernel from user space

Posted Nov 16, 2017 12:52 UTC (Thu) by ballombe (subscriber, #9523) [Link]

So maybe pushing this will motivate architecture designers to provide this feature, like it was done for virtualization.

KAISER: hiding the kernel from user space

Posted Nov 16, 2017 4:57 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Can KAISER be made optional at the process level? Perhaps through a cgroup?

I would definitely like to protect my browser and anything started by it, but I would like my gcc started from a terminal to run at full speed.

KAISER: hiding the kernel from user space

Posted Nov 16, 2017 20:40 UTC (Thu) by hansendc (subscriber, #7363) [Link]

Yes, this could be done, at least theoretically. But, the contexts where we have to decide to "do KAISER" or not are very tricky. We don't have a stack and don't have registers to clobber, so it's tricky to pull off.

You would essentially need to keep a bit of per-cpu data that was consulted very early in assembly at kernel entry. It would have to be updated at every context switch, probably from some flag in the task_struct. Again, doable, but far from trivial.

KAISER: hiding the kernel from user space

Posted Nov 19, 2017 6:16 UTC (Sun) by luto (subscriber, #39314) [Link]

It could be TIF_KAISER, no?

But this is definitely not a v1 feature.

KAISER: hiding the kernel from user space

Posted Nov 16, 2017 7:21 UTC (Thu) by alkbyby (subscriber, #61687) [Link]

Looks like something bad is coming. Such as maybe a mega-hole in hardware that can be mitigated by hiding kernel addresses.

Otherwise I cannot see why simply hiding kernel addresses better would suddenly become important enough to spend a massive amount of CPU on it.

KAISER: hiding the kernel from user space

Posted Nov 16, 2017 8:08 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Kernel address hiding is needed to protect the kernel in case there's a bug that allows code execution in the kernel mode.

But it looks like software-based hiding is ineffective by itself with the current model.

KAISER: hiding the kernel from user space

Posted Nov 16, 2017 9:29 UTC (Thu) by alkbyby (subscriber, #61687) [Link]

"in case" doesn't seem enough justification to pay such a massive price.

KAISER: hiding the kernel from user space

Posted Nov 16, 2017 10:41 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Sigh. Kernel root holes are being found every month or so. But in order to exploit them reliably you need to know the kernel memory layout. And most obvious software leaks of this information are now closed.

The problem is that hardware simply makes all software countermeasures irrelevant without something like KAISER.

KAISER: hiding the kernel from user space

Posted Nov 16, 2017 18:34 UTC (Thu) by ttelford (guest, #44176) [Link]

> Just the new instructions (CR3 manipulation) add a few hundred cycles to a syscall or interrupt

A few hundred cycles to a syscall or interrupt is vaguely similar to the basic IPC cost of the L4 microkernel. (200-300 cycles for amd64).

Kernels are not my area of expertise, so I have to ask: if a syscall is about to become as expensive as IPC on L4... would the (theoretical) performance of the respective kernels be similar after KAISER?

KAISER: hiding the kernel from user space

Posted Nov 16, 2017 20:52 UTC (Thu) by hansendc (subscriber, #7363) [Link]

> if a syscall is about to become as expensive as IPC on L4... would the (theoretical) performance of the respective kernels be similar after KAISER?

Most of the KAISER performance impact is purely from the cost of manipulating the hardware. L4 and other kernels would pay the same cost Linux would.

It's not fair to compare a non-hardened kernel to a hardened one, though. It's apples-to-oranges.

KAISER: hiding the kernel from user space

Posted Nov 17, 2017 17:45 UTC (Fri) by valarauca (guest, #109490) [Link]

Timing attacks are becoming the bane of x64.

AVX512 adds explicit flags to suppress memory errors on scatter/gather load/store vectorized instructions which will just add another method to exploit this. The ways of _accessing_ memory you can't access on x64 just continue to grow. I really don't see how AMD64 can fix this without breaking either the page table or the debug timers.

KAISER: hiding the kernel from user space

Posted Nov 18, 2017 0:09 UTC (Sat) by anton (subscriber, #25547) [Link]

I read some performance caveats about vmaskmovps (AVX, not sure if there is an SSE equivalent) that make me think that this instruction can be used for such purposes, too.

Concerning the article, hyperbole is the standard in security news, but "a hardened kernel is no longer optional" seems to be a little extreme even so. I very much hope that stuff like this will be optional.

A possibly less costly way to mitigate attacks that try to defeat KASLR might be to map additional inaccessible address space that would respond to the attacks just like real kernel memory.

wordage nuance

Posted Nov 30, 2017 1:32 UTC (Thu) by Garak (guest, #99377) [Link]

> Concerning the article, hyperbole is the standard in security news, but "a hardened kernel is no longer optional" seems to be a little extreme even so. I very much hope that stuff like this will be optional.

That was my reaction for a couple of seconds as well, until I read the next sentence. I agree that sentence is not the best way to describe things. I think it's important to highlight that the security-vs-performance tradeoff is a vast spectrum of subtle choices that *depend on the situation/deployment*. There are many different situations. Quite often a performance hit from enabling SELinux or whatever new hardening-with-five-percent-hit tactic is absolutely not worth it. Other times your computers are securing millions of dollars of cryptocurrency/etc. Most users should be taught about such nuance versus the "more secure equals always better" narrative. If something is useful to lots of people, sure, it should be available as an option. But leave it to the distributors and then the end users to figure out when and where various options should be tuned.

KAISER: hiding the kernel from user space

Posted Nov 22, 2017 10:40 UTC (Wed) by mlankhorst (subscriber, #52260) [Link]

Can't it be done for nearly free using SMAP + SMEP?

Put the kernel mapping at ring3, and make the remainder of the upper 64-bits mapped RWX at ring 1 or 2.

Try to exploit timing differences then from RING 0!

Or am I thinking too simple?

KAISER: hiding the kernel from user space

Posted Jan 3, 2018 16:55 UTC (Wed) by EdRowland (guest, #120787) [Link]

Couldn't you map a dummy page into the holes to prevent timing differences between populated memory that's unreadable at ring 3 and unpopulated memory that now references a dummy page?

KAISER: hiding the kernel from user space

Posted Jan 3, 2018 17:30 UTC (Wed) by excors (subscriber, #95769) [Link]

I guess the main problem with that idea is that page tables take 8 bytes of physical memory per 4KB of virtual address space. If you want to fill up the whole ~48-bit virtual address space with distinct PTEs, you'd need 512GB of page tables.

You could try to reduce the size by e.g. using a single dummy PTE table that's shared by all the higher-level tables, instead of keeping them distinct. But an attacker can likely measure the timing difference between a page walk that fetches the PTE from cache, vs one that fetches it from RAM. If you access address A, then address A+4096, and the second one is fast (i.e. the PTE is already in the cache), you know that's using the dummy PTE, so it's still leaking information about where the kernel is.
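The 512GB figure checks out arithmetically:

```python
# How much memory would page-table entries covering the whole
# 48-bit virtual address space take?
PAGE = 4096       # bytes of virtual address space mapped per entry
PTE = 8           # bytes per page-table entry
VSPACE = 1 << 48  # 48-bit virtual address space

entries = VSPACE // PAGE   # 2**36 entries needed
pte_bytes = entries * PTE  # 2**39 bytes of page tables
print(pte_bytes // (1 << 30), "GiB")  # → 512 GiB
```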

KAISER: hiding the kernel from user space

Posted Jan 6, 2018 0:26 UTC (Sat) by ridethewave (guest, #121115) [Link]

> I guess the main problem with that idea is that page tables take 8 bytes of physical memory per 4KB of virtual address space

Couldn't you just map each virtual address to the same physical address then?


Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds