This is an archive of the discontinued LLVM Phabricator instance.

Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", and is one of the two halves to Spectre..
ClosedPublic

Authored by chandlerc on Jan 4 2018, 1:14 AM.

Details

Summary

First, we need to explain the core of the vulnerability. Note that this
is a very incomplete description, please see the Project Zero blog post
for details:
https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html

The basis for branch target injection is to direct speculative execution
of the processor to some "gadget" of executable code by poisoning the
prediction of indirect branches with the address of that gadget. The
gadget in turn contains an operation that provides a side channel for
reading data. Most commonly, this will look like a load of secret data
followed by a branch on the loaded value and then a load of some
predictable cache line. The attacker then uses timing of the processor's
cache to determine which direction the branch took *in the speculative
execution*, and in turn what one bit of the loaded value was. Due to the
nature of these timing side channels and the branch predictor on Intel
processors, this allows an attacker to leak data only accessible to
a privileged domain (like the kernel) back into an unprivileged domain.
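
For concreteness, the gadget shape described above might look roughly like the following (an illustrative sketch only, not code from this patch; probe_array and the use of %rdi as the pointer to secret data are hypothetical):

  # speculatively executed at the poisoned indirect-branch target
  movzbl (%rdi), %eax             # load one byte of secret data
  testb  $1, %al                  # branch on one bit of the loaded value
  je     skip
  movl   probe_array(%rip), %ecx  # touch a predictable cache line the attacker can time
skip: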

The goal is simple: avoid generating code which contains an indirect
branch that could have its prediction poisoned by an attacker. In many
cases, the compiler can simply use directed conditional branches and
a small search tree. LLVM already has support for lowering switches in
this way and the first step of this patch is to disable jump-table
lowering of switches and introduce a pass to rewrite explicit indirectbr
sequences into a switch over integers.
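
As an illustration of that lowering, a switch over a small set of integer cases can be emitted as a tree of direct conditional branches rather than an indirect jump through a table (a rough sketch; the actual tree shape and label names are chosen by LLVM's switch lowering):

  # jump-table lowering (what -mretpoline avoids):  jmpq *.LJTI0_0(,%rax,8)
  # conditional-branch search tree over the switch value in %eax:
  cmpl  $2, %eax
  jb    case_0_or_1
  je    case_2
  cmpl  $3, %eax
  je    case_3
  jmp   default_case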

However, there is no fully general alternative to indirect calls. We
introduce a new construct we call a "retpoline" to implement indirect
calls in a non-speculatable way. It can be thought of loosely as
a trampoline for indirect calls which uses the RET instruction on x86.
Further, we arrange for a specific call->ret sequence which ensures the
processor predicts the return to go to a controlled, known location. The
retpoline then "smashes" the return address pushed onto the stack by the
call with the desired target of the original indirect call. The result
is a predicted return to the next instruction after a call (which can be
used to trap speculative execution within an infinite loop) and an
actual indirect branch to an arbitrary address.
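
To make the call->ret sequence concrete, the 64-bit thunk described here looks roughly like the following (a minimal sketch: the target is assumed to be in %r11, matching the __llvm_retpoline_r11 thunk that comes up later in this review; label names are illustrative, and the lfence in the capture loop reflects the AMD feedback near the end of this thread):

__llvm_retpoline_r11:
  callq  set_up_target       # push a return address the CPU will predict returning to
capture_speculation:
  pause                      # speculative execution of the predicted return is trapped here
  lfence                     # serialize dispatch (added later in this review for AMD)
  jmp    capture_speculation
set_up_target:
  movq   %r11, (%rsp)        # smash the pushed return address with the real target
  retq                       # architecturally branch to the target in %r11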

On 64-bit x86 ABIs, this is especially easy to do in the compiler by
using a guaranteed scratch register to pass the target into this device.
For 32-bit ABIs there isn't a guaranteed scratch register and so several
different retpoline variants are introduced to use a scratch register if
one is available in the calling convention and to otherwise use direct
stack push/pop sequences to pass the target address.

This "retpoline" mitigation is fully described in the following blog
post: https://support.google.com/faqs/answer/7625886

There is one other important source of indirect branches in x86 ELF
binaries: the PLT. These patches also include support for LLD to
generate PLT entries that perform a retpoline-style indirection.

The only other indirect branches remaining that we are aware of are from
precompiled runtimes (such as crt0.o and similar). The ones we have
found are not really attackable, and so we have not focused on them
here, but eventually these runtimes should also be replicated for
retpoline-ed configurations for completeness.

For kernels or other freestanding or fully static executables, the
compiler switch -mretpoline is sufficient to fully mitigate this
particular attack. For dynamic executables, you must compile *all*
libraries with -mretpoline and additionally link the dynamic
executable and all shared libraries with LLD and pass -z retpolineplt
(or use similar functionality from some other linker). We strongly
recommend also using -z now as non-lazy binding allows the
retpoline-mitigated PLT to be substantially smaller.

When manually applying transformations similar to -mretpoline to the
Linux kernel, we observed very small performance hits to applications
running typical workloads, and relatively minor hits (approximately 2%)
even for extremely syscall-heavy applications. This is largely due to
the small number of indirect branches that occur in performance
sensitive paths of the kernel.

When using these patches on statically linked applications, especially
C++ applications, you should expect to see a much more dramatic
performance hit. For microbenchmarks that are switch-, indirect-call-, or
virtual-call-heavy, we have seen overheads ranging from 10% to 50%.

However, real-world workloads exhibit substantially lower performance
impact. Notably, techniques such as PGO and ThinLTO dramatically reduce
the impact of hot indirect calls (by speculatively promoting them to
direct calls) and allow optimized search trees to be used to lower
switches. If you need to deploy these techniques in C++ applications, we
*strongly* recommend that you ensure all hot call targets are statically
linked (avoiding PLT indirection) and use both PGO and ThinLTO. Well-tuned
servers using all of these techniques saw 5% - 10% overhead from
the use of retpoline.

We will add detailed documentation covering these components in
subsequent patches, but wanted to make the core functionality available
as soon as possible. Happy for more code review, but we'd really like to
get these patches landed and backported ASAP for obvious reasons. We're
planning to backport this to both 6.0 and 5.0 release streams and get
a 5.0 release with just this cherry picked ASAP for distros and vendors.

This patch is the work of a number of people over the past month: Eric, Reid,
Rui, and myself. I'm mailing it out as a single commit due to the time
sensitive nature of landing this and the need to backport it. Huge thanks to
everyone who helped out here, and everyone at Intel who helped out in
discussions about how to craft this. Also, credit goes to Paul Turner (at
Google, but not an LLVM contributor) for much of the underlying retpoline
design.

Diff Detail

Repository
rLLD LLVM Linker

Event Timeline

chandlerc created this revision.Jan 4 2018, 1:14 AM
echristo accepted this revision.Jan 4 2018, 1:19 AM

LGTM of course.

-eric

This revision is now accepted and ready to land.Jan 4 2018, 1:19 AM
smeenai added a subscriber: smeenai.Jan 4 2018, 1:21 AM

Adding a bunch of subscribers who likely will care about this, and our release managers.

emaste added a subscriber: dim.Jan 4 2018, 4:42 AM

Does this CVE affect all previous versions of clang/llvm?

The Project Zero blog post mentions that this issue affects ARM CPUs; does clang/llvm need a similar fix for ARM?

Is this fix important enough to do a special 5.0.2 release for it?

Also, does this mitigation require the use of lld or is there a way to make this work with gnu ld?

Comment for Rui about the 32-bit PLT sequence...

lld/ELF/Arch/X86.cpp
491 ↗(On Diff #128605)

Does it make sense to use something like the pushl sequence Reid came up with here? In the non-PLT case it looks like:

addl $4, %esp
pushl 4(%esp)
pushl 4(%esp)
popl 8(%esp)
popl (%esp)

So it would potentially need to be done a touch differently to work here, but maybe something like that rather than xchg?

Even if the alternative is a lot more instructions, the xchg instruction is a locked instruction on x86 and so this will actually create synchronization overhead on the cache line of the top of the stack.

royger added a subscriber: royger.Jan 4 2018, 9:55 AM

AFAICT this change doesn't allow using different indirect thunks; gcc allows such usage with the -mindirect-branch=thunk-extern/-mindirect-branch-register options. The following patches, for example, allow Xen to choose the optimal thunk depending on the processor:

https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00112.html
https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00118.html

majnemer added inline comments.
clang/lib/Basic/Targets/X86.cpp
1306 ↗(On Diff #128605)

Why is this phrased as a target feature? It just seems weird as it is not a hardware capability (all X86 hardware should support the retpoline instruction sequence unless I am mistaken).

Also, it seems that other hardware platforms beyond Intel have problems with speculation of this nature. IMO, this is more similar to things like the relocation-model and the code-model...

llvm/lib/CodeGen/IndirectBrExpandPass.cpp
97–99 ↗(On Diff #128605)

BBToIndex maps to intptr_t but BBIndex is an int; shouldn't they agree?

161 ↗(On Diff #128605)

I would think this should also agree with BBToIndex.

llvm/lib/Target/X86/X86ISelLowering.cpp
27082–27083 ↗(On Diff #128605)

I'd remove this empty default to make getRetpolineSymForReg more similar to getOpcodeForRetpoline.

chandlerc updated this revision to Diff 128629.Jan 4 2018, 10:17 AM
chandlerc marked an inline comment as done.

Fix suggested in code review and a couple of indentation fixes via clang-format
that I missed previously.

rnk added inline comments.Jan 4 2018, 10:21 AM
lld/ELF/Arch/X86.cpp
491 ↗(On Diff #128605)

This is a real concern, I checked the Intel manual, and it says:

Exchanges the contents of the destination (first) and source (second) operands. The operands can be two general purpose
registers or a register and a memory location. If a memory operand is referenced, the processor’s locking
protocol is automatically implemented for the duration of the exchange operation, regardless of the presence or
absence of the LOCK prefix or of the value of the IOPL.

One question is, do we want to try to avoid PUSH/POP MEM instructions? LLVM has x86 subtarget features that suggest that these instructions are slow on some CPU models.

To avoid them completely, we could use this code sequence:

movl %ecx, (%esp) # save ECX over useless RA
movl 4(%esp), %ecx # load original EAX to ECX
movl %eax, 4(%esp) # store callee over saved EAX
movl %ecx, %eax # restore EAX
popl %ecx # restore ECX
retl # return to callee

When it comes down to it, this just implements xchg with a scratch register.

On reflection, I think the code sequence above is the best we can do. The PUSH sequence you describe is 8 memory operations vs 4 if we use ECX as scratch.

Thanks for the review!

clang/lib/Basic/Targets/X86.cpp
1306 ↗(On Diff #128605)

All x86-64 hardware supports SSE and SSE2, but we can still turn them on and off?

I think it makes some sense as it *only* impacts the target's code generation, not anything produced by Clang etc, and this seems to be a way to pass that down.

And while other architectures may want to introduce similar mitigations for Spectre, the retpoline construct itself is pretty x86-specific. My suspicion is that other architectures will potentially re-use some of the LLVM infrastructure like removing indirectbr, but want to handle the hard cases with a target-specific technique. So something like relocation-model and code-model seems weird to me as this is (at least so far) a fundamentally x86 thing.

llvm/lib/CodeGen/IndirectBrExpandPass.cpp
97–99 ↗(On Diff #128605)

I just did this out of habit to avoid padding in the map's pair object. The size of this isn't really significant unless we have over 2 billion basic blocks with their address taken....

Still, if this bothers folks I can change it either direction.

161 ↗(On Diff #128605)

Or should BBToIndex just use an 'int'?

llvm/lib/Target/X86/X86ISelLowering.cpp
27082–27083 ↗(On Diff #128605)

Good call, done.

majnemer added inline comments.Jan 4 2018, 10:35 AM
llvm/lib/CodeGen/IndirectBrExpandPass.cpp
161 ↗(On Diff #128605)

Either way works for me.

Pardon my unsolicited comment, but it seems to me that using a "retpoline" will have two unintended negative side-effects:

  1. It will ~neuter Control Flow Integrity by providing a "universal" gadget that pulls a call target off the stack and is allowed to call anything
  2. It will break performance-counter-based ROP detection

Pardon my unsolicited comment, but it seems to me that using a "retpoline" will have two unintended negative side-effects:

  1. It will ~neuter Control Flow Integrity by providing a "universal" gadget that pulls a call target off the stack and is allowed to call anything

I don't really follow...

Generally, CFI is in the business of preventing arbitrary control flow transfers. So if it is working, you can't jump to a gadget even if you have one.

And the retpoline is no more or less of a gadget than you had before. Before you had something equivalent to jmpq (%r11), now you have a retpoline that does the same thing. Both are viable gadgets, and if you can steer control flow to them they will transfer to an arbitrary location.

Also, I should note that LLVM has CFI mitigations, and they should play nice with this. It will check the target *before* calling the retpoline, just as it would check the target before doing the actual indirect call. We're even looking at ways to use CFI to make the overhead of retpoline much lower. While CFI introduces its own overheads, the overhead from CFI -> CFI+retpoline might be much lower than normal -> normal+retpoline. And you'll get CFI. =]

  1. It will break performance-counter-based ROP detection

Almost certainly. But what else can we do?

Looks like this uses a different command-line argument than the preliminary patchset for GCC, which is rather unfortunate.
http://git.infradead.org/users/dwmw2/gcc-retpoline.git/shortlog/refs/heads/gcc-7_2_0-retpoline-20171219
It'd be awesome if both compilers would somehow end up using the same command-line spelling. Maybe GCC would like to switch to "-mretpoline"? Have we been talking to them?

However, I do suspect the upstream kernel folks will need the equivalent of "-mindirect-branch=thunk-extern" mode (where the compiler doesn't actually emit the thunks, only the calls to them). This lets them define their own thunks which have "alternatives" annotations in them to allow them to be runtime patched out. (e.g. as in https://lkml.org/lkml/2018/1/4/419). That should be easy to support -- just don't do the thunk-emission -- but it does then mean the need to standardize on the names and semantics of the required thunks. And it would be good if the same function-names were used as GCC. Fine to do as a followup patch, but maybe at least have an idea what the command-line UI will be for it.

Looks like this uses a different command-line argument than the preliminary patchset for GCC, which is rather unfortunate.
http://git.infradead.org/users/dwmw2/gcc-retpoline.git/shortlog/refs/heads/gcc-7_2_0-retpoline-20171219
It'd be awesome if both compilers would somehow end up using the same command-line spelling. Maybe GCC would like to switch to "-mretpoline"? Have we been talking to them?

We have talked to them, but I don't know what the rationale is behind their flag names and there isn't an actual discussion about them that we can comment on. I *strongly* prefer having a single simple flag such as -mretpoline.

However, I do suspect the upstream kernel folks will need the equivalent of "-mindirect-branch=thunk-extern" mode (where the compiler doesn't actually emit the thunks, only the calls to them). This lets them define their own thunks which have "alternatives" annotations in them to allow them to be runtime patched out. (e.g. as in https://lkml.org/lkml/2018/1/4/419).

If the upstream kernel folks want it, and can spell out how it should work, no problem. We've worked on this set of functionality in close coordination with Paul Turner, and this wasn't at the top of the queue of things for us to do, but we can add it whenever/however it is needed. That said....

That should be easy to support -- just don't do the thunk-emission -- but it does then mean the need to standardize on the names and semantics of the required thunks.

I don't think this is true. We need to document LLVM's semantics and the thunks required. If they match GCC's, cool. If not and a user provides a flag to request an external thunk, then they have to give LLVM one that matches LLVM's semantics. Since they control the name, they can even have thunks with each compiler's semantics and different names. While in practice I expect the semantics to match, I don't think we should be in the business of forcing this into the ABI. We already have waaaay too much in the ABI. If we come up with a better way to do retpoline in the future, we should rapidly switch to that without needing to coordinate about names and semantics across compilers.

And it would be good if the same function-names were used as GCC. Fine to do as a followup patch, but maybe at least have an idea what the command-line UI will be for it.

I strongly disagree here. Again, this should *not* be part of the ABI and it should not be something that we have to engineer to match exactly with other compilers. We should even be free to add __llvm_retpoline2_foo() thunks in the future with new semantics with zero compatibility concerns.

Is there any plan (or willingness?) to backport this further, to 4.0.0 and even 3.4.1?

Is there any plan (or willingness?) to backport this further, to 4.0.0 and even 3.4.1?

I don't really have plans for it, but it should be relatively straightforward to backport at least some of it across a couple of years' worth of llvm releases.

sanjoy added inline comments.Jan 4 2018, 11:51 AM
llvm/lib/CodeGen/IndirectBrExpandPass.cpp
114 ↗(On Diff #128629)

BBIndex needs to start from 1 I think since "no label is equal to the null pointer".

134 ↗(On Diff #128629)

Do we care about inline assembly here? The langref says "Finally, some targets may provide defined semantics when using the value as the operand to an inline assembly, but that is target specific."

There are some references to X86::CALL64r/X86::CALL64m in X86FrameLowering.cpp and X86MCInstLower.cpp which look like they could be relevant, but aren't addressed by this patch.

When a function called using a retpoline returns, will the ret be predicted correctly?

llvm/lib/CodeGen/IndirectBrExpandPass.cpp
113 ↗(On Diff #128629)

blockaddresses are uniqued, so no block should ever have more than one blockaddress user. So this should probably be an assertion.

That should be easy to support -- just don't do the thunk-emission -- but it does then mean the need to standardize on the names and semantics of the required thunks.

I don't think this is true. We need to document LLVM's semantics and the thunks required. If they match GCC's, cool. If not and a user provides a flag to request an external thunk, then they have to give LLVM one that matches LLVM's semantics. Since they control the name, they can even have thunks with each compiler's semantics and different names. While in practice I expect the semantics to match, I don't think we should be in the business of forcing this into the ABI. We already have waaaay too much in the ABI. If we come up with a better way to do retpoline in the future, we should rapidly switch to that without needing to coordinate about names and semantics across compilers.

Right, I shouldn't have said "standardize", I did actually mean "document".

And it would be good if the same function-names were used as GCC. Fine to do as a followup patch, but maybe at least have an idea what the command-line UI will be for it.

I strongly disagree here. Again, this should *not* be part of the ABI and it should not be something that we have to engineer to match exactly with other compilers.

No, I agree with you, it does not _need_ to match. However, if the semantics are the same as GCC's thunks (which they do appear to be), it would be _good_ to coordinate with each other, simply in order to make users' lives easier. Which is the same reason to coordinate the command-line UI. Neither is required, but both would be good.

We should even be free to add __llvm_retpoline2_foo() thunks in the future with new semantics with zero compatibility concerns.

It wouldn't have _zero_ concerns due to externally-provided thunks -- they'll need to provide new ones to use the new compiler version. But probably that's okay.

ruiu added inline comments.Jan 4 2018, 1:02 PM
lld/ELF/Arch/X86.cpp
491 ↗(On Diff #128605)

I didn't know that xchg is so slow. There's no reason not to use push/pop instructions to swap a word at the stack top and a register. Since this is a PLT header (and not a PLT entry), the size of the instructions doesn't really matter. Do you want me to update my patch?

lld/ELF/Arch/X86_64.cpp
517–525 ↗(On Diff #128629)

Chandler,

I also noticed we can improve instructions here. We can use the following instructions instead so that the jump target used to lazy-resolve the PLT is aligned to 16 bytes. I can make a change now if you want.

  mov foo@GOTPLT(%rip), %r11
  callq next
loop: pause
  jmp plt+32; .align 16
  pushq <relocation index> // lazy-resolve a PLT entry
  jmpq plt[0]
llvm/lib/Target/X86/X86InstrCompiler.td
1160 ↗(On Diff #128629)

Hi Chandler,
Do you really want to use "NotUseRetpoline" here? It will then match RETPOLINE_TCRETURN32 even when it is not an indirect branch.
I guess the following test case will crash llc.

target triple = "i386-unknown-linux-gnu"

define void @FOO() {
  ret void
}

define void @FOO1() {
entry:
  tail call void @FOO()
  ret void
}
llvm/lib/Target/X86/X86RetpolineThunks.cpp
121 ↗(On Diff #128629)

You will create the thunk function even if it is not necessary? For example, for "tail call void @FOO()"?

chandlerc marked an inline comment as done.Jan 4 2018, 2:16 PM

That should be easy to support -- just don't do the thunk-emission -- but it does then mean the need to standardize on the names and semantics of the required thunks.

I don't think this is true. We need to document LLVM's semantics and the thunks required. If they match GCC's, cool. If not and a user provides a flag to request an external thunk, then they have to give LLVM one that matches LLVM's semantics. Since they control the name, they can even have thunks with each compiler's semantics and different names. While in practice I expect the semantics to match, I don't think we should be in the business of forcing this into the ABI. We already have waaaay too much in the ABI. If we come up with a better way to do retpoline in the future, we should rapidly switch to that without needing to coordinate about names and semantics across compilers.

Right, I shouldn't have said "standardize", I did actually mean "document".

And it would be good if the same function-names were used as GCC. Fine to do as a followup patch, but maybe at least have an idea what the command-line UI will be for it.

I strongly disagree here. Again, this should *not* be part of the ABI and it should not be something that we have to engineer to match exactly with other compilers.

No, I agree with you, it does not _need_ to match. However, if the semantics are the same as GCC's thunks (which they do appear to be), it would be _good_ to coordinate with each other, simply in order to make users' lives easier. Which is the same reason to coordinate the command-line UI. Neither is required, but both would be good.

We should even be free to add __llvm_retpoline2_foo() thunks in the future with new semantics with zero compatibility concerns.

It wouldn't have _zero_ concerns due to externally-provided thunks -- they'll need to provide new ones to use the new compiler version. But probably that's okay.

Sure, sounds like we're generally in agreement here. I'll take a stab at supporting external thunks and skimming emission, shouldn't be too hard.

lld/ELF/Arch/X86.cpp
491 ↗(On Diff #128605)

Yeah, if you can email me a delta on top of this, I'll fold it in. Thanks!

(Suggesting that in part to make backports a bit easier...)

lld/ELF/Arch/X86_64.cpp
517–525 ↗(On Diff #128629)

Will this break the 16-byte property when combined with -z now? That seems more relevant. Given that lazy binding is intended to only run once, I don't think it's worth burning much space to speed it up.

llvm/lib/CodeGen/IndirectBrExpandPass.cpp
113 ↗(On Diff #128629)

I just didn't want to hard code that assumption, but I can if you prefer.

134 ↗(On Diff #128629)

I mean, yes, but also no. ;]

It would be nice to maybe preserve inline asm uses of blockaddr and not any others. And then force people to not rinse their blockaddr usage through inline asm and mix that with -mretpoline. That would allow the common usage I'm aware of to remain (showing label addresses in crash dumps in things like kernels) and really allow almost any usage wholly contained within inline asm to continue working perfectly.

But it seemed reasonable for a follow-up. That said, maybe it's not too complex to add now...

llvm/lib/Target/X86/X86InstrCompiler.td
1160 ↗(On Diff #128629)

Reid, could you take a look?

llvm/lib/Target/X86/X86RetpolineThunks.cpp
121 ↗(On Diff #128629)

There is a FIXME above about being more careful when creating these.

sanjoy added inline comments.Jan 4 2018, 2:38 PM
llvm/lib/CodeGen/IndirectBrExpandPass.cpp
134 ↗(On Diff #128629)

What do you think about report_fatal_erroring here if you encounter an inline assembly user? That seems somewhat more friendly than silently "miscompiling" (in quotes) inline assembly.

rnk added inline comments.Jan 4 2018, 2:50 PM
llvm/lib/Target/X86/X86InstrCompiler.td
1160 ↗(On Diff #128629)

Here's a small diff on top of this one that fixes and tests it: https://reviews.llvm.org/P8056

efriedma added inline comments.Jan 4 2018, 3:04 PM
llvm/lib/CodeGen/IndirectBrExpandPass.cpp
113 ↗(On Diff #128629)

If we violate that assumption, something has gone very wrong (either we've created a blockaddress in the wrong context, or we leaked a blockaddress from the context, or we have a blockaddress with an invalid block+function pair).

Although, on a related note, you might want to check Constant::isConstantUsed(), so we don't generate indexes for blockaddresses which aren't actually referenced anywhere.

dexonsmith added inline comments.Jan 4 2018, 3:33 PM
llvm/lib/CodeGen/IndirectBrExpandPass.cpp
113 ↗(On Diff #128629)

FWIW, I agree with Eli that it's fundamentally bad if constants haven't been uniqued properly.

davidxl added a subscriber: davidxl.Jan 4 2018, 4:02 PM
davidxl added inline comments.
clang/lib/Basic/Targets/X86.cpp
1306 ↗(On Diff #128605)

This can go either way.

Fundamentally, the option is used to 'disable' the indirect branch predictor, so it sounds target independent. However, the mechanism used here, 'retpoline', is very x86-specific -- it forces the HW to use the call/ret predictor so that the predicted target is 'trapped' in a loop.

chandlerc updated this revision to Diff 128675.Jan 4 2018, 4:22 PM
chandlerc marked 3 inline comments as done.

Fix several bugs caught in code review, as well as implementing suggestions.

I'm looking at adding support for external thunks next.

chandlerc updated this revision to Diff 128678.Jan 4 2018, 4:30 PM
chandlerc marked 6 inline comments as done.

Add another minor suggestion that I forgot previously.

ruiu added a comment.Jan 4 2018, 4:31 PM

Chandler,

Please apply https://reviews.llvm.org/D41744 to this patch. It includes the following changes:

  1. xchg is replaced with mov/pop instructions
  2. the x86-64 lazy PLT relocation target is now aligned to 16 bytes
  3. the x86-64 PLT header for lazy PLT resolution is shrunk from 48 bytes to 32 bytes (which became possible by utilizing the space made by (2))
chandlerc added inline comments.Jan 4 2018, 4:32 PM
llvm/lib/CodeGen/IndirectBrExpandPass.cpp
134 ↗(On Diff #128629)

I actually think that'd be much less friendly.

The primary use case I'm aware of is for diagnostics, and I'd much rather have low-quality diagnostics in some random subsystem than be unable to use retpolines...

I've added a FIXME about this but I'm pretty nervous about breaking round-trip-through-inline-asm to repair a quality loss in diagnostics. But we can revisit this if needed.

chandlerc added inline comments.Jan 4 2018, 4:32 PM
llvm/lib/CodeGen/IndirectBrExpandPass.cpp
113 ↗(On Diff #128629)

Yeah, I'm convinced. It actually makes the code *way* simpler. I've implemented all of the suggestions here.

llvm/lib/Target/X86/X86InstrCompiler.td
1160 ↗(On Diff #128629)

Awesome, applied!

ab accepted this revision.Jan 4 2018, 5:00 PM
ab added a subscriber: ab.

Clang/LLVM pieces LGTM. Thanks for doing this, people.

llvm/lib/CodeGen/IndirectBrExpandPass.cpp
56 ↗(On Diff #128629)

Nit: Unnecessary 'private'?

llvm/lib/Target/X86/X86ISelLowering.cpp
27120 ↗(On Diff #128629)

I wonder: should this also check the CSR regmask? There are a couple CCs that do preserve R11, but it probably doesn't matter for -mretpoline users. E.g.,

define void @t(void ()* %f) {
  call cc 17 void %f()
  ret void
}

Worth an assert?

llvm/lib/Target/X86/X86RetpolineThunks.cpp
119 ↗(On Diff #128629)

'retq' at the end

171 ↗(On Diff #128629)

Nit: newlines between functions

183 ↗(On Diff #128629)

'noinline' makes sense, but I'm curious: is it necessary?

llvm/lib/Target/X86/X86Subtarget.h
344 ↗(On Diff #128629)

Nit: "Use retpoline to avoid ..."

MaskRay added a subscriber: MaskRay.Jan 4 2018, 5:58 PM
chandlerc updated this revision to Diff 128698.Jan 4 2018, 6:06 PM

Add support for externally provided thunks. This is an independent feature;
when combined with the overall retpoline feature it suppresses the thunk
emission and rotates the names to be distinct names that an external build
system for the kernel (for example) can provide.

I've added some minimal documentation about the semantic requirements of these
thunks to the commit log, although it is fairly obvious. More comprehensive
documentation will be part of the large follow-up effort around docs.

Also adds Rui's work to provide more efficient PLT thunks.

Also addresses remaining feedback from Ahmed.

In D41723#967861, @ruiu wrote:

Chandler,

Please apply https://reviews.llvm.org/D41744 to this patch. It includes the following changes:

  1. xchg is replaced with mov/pop instructions
  2. the x86-64 lazy PLT relocation target is now aligned to 16 bytes
  3. the x86-64 PLT header for lazy PLT resolution is shrunk from 48 bytes to 32 bytes (which became possible by utilizing the space made by (2))

Done, and awesome!

Reid, do you want me to adjust the code for the 32-bit _push thunk?

chandlerc updated this revision to Diff 128700.Jan 4 2018, 6:12 PM
chandlerc marked 16 inline comments as done.

Add the new test file for external thunks.

chandlerc marked an inline comment as done.Jan 4 2018, 6:13 PM

All comments addressed now I think, thanks everyone! See some thoughts about a few of these below though.

llvm/lib/Target/X86/X86ISelLowering.cpp
27120 ↗(On Diff #128629)

We should already have the assert below? If we get through this and have no available reg, we assert that we're in 32-bit.

That said, this really is a case that will come up. We shouldn't silently miscompile the code. I've changed both cases to report_fatal_error because these are fundamental incompatibilities with retpoline.

llvm/lib/Target/X86/X86RetpolineThunks.cpp
183 ↗(On Diff #128629)

Nope, not necessary. I've removed it.

Actually, I don't think any of these are necessary because we hand build the MI and we are the last pass before the asm printer. But I've left the others for now. They seem at worst harmless.

I had one comment I think got missed: there are some references to X86::CALL64r/X86::CALL64m in X86FrameLowering.cpp and X86MCInstLower.cpp which look like they could be relevant, but aren't addressed by this patch.

ruiu added a comment.Jan 4 2018, 9:43 PM

Chandler, please apply https://reviews.llvm.org/P8058 to address latest review comments.

eax added a subscriber: eax.Jan 5 2018, 12:30 AM
chandlerc updated this revision to Diff 128715.Jan 5 2018, 1:51 AM
chandlerc marked an inline comment as done.

Rebase and fold in the latest batch of changes from Rui based on the review
from Rafael.

royger added a comment.Jan 5 2018, 1:58 AM

Add support for externally provided thunks. This is an independent feature;
when combined with the overall retpoline feature it suppresses the thunk
emission and rotates the names to be distinct names that an external build
system for the kernel (for example) can provide.

I've added some minimal documentation about the semantic requirements of these
thunks to the commit log, although it is fairly obvious. More comprehensive
documentation will be part of the large follow-up effort around docs.

Thanks! I'm however not seeing the updated commit message that contains the usage documentation of the new option in the differential revision.

AFAICT from the code, the new option is going to be named "mretpoline_external_thunk"; could we have a more generic name that could be used by all options, like:

-mindirect-thunk={retpoline,external}

This should also allow clang to implement new techniques as they become available, without having to add new options for each one of them.

Add support for externally provided thunks. This is an independent feature;
when combined with the overall retpoline feature it suppresses the thunk
emission and rotates the names to be distinct names that an external build
system for the kernel (for example) can provide.

I've added some minimal documentation about the semantic requirements of these
thunks to the commit log, although it is fairly obvious. More comprehensive
documentation will be part of the large follow-up effort around docs.

Thanks! I'm however not seeing the updated commit message that contains the usage documentation of the new option in the differential revision.

Yeah, not sure why phab doesn't update that -- it will be updated when it lands.

AFAICT from the code, the new option is going to be named "mretpoline_external_thunk"; could we have a more generic name that could be used by all options, like:

-mindirect-thunk={retpoline,external}

This should also allow clang to implement new techniques as they become available, without having to add new options for each one of them.

Adding options is actually cheaper than adding option parsing like this...

I'm hesitant to name the option in the external case something completely generic, because it isn't as generic as it seems IMO. We hide indirect calls and indirect branches, but we *don't* hide returns which are technically still indirects... The reason is that we are assuming that rewriting arbitrary indirects into a return is an effective mitigation. If it weren't, we'd also need to rewrite returns. This is a large component of the discussion in the linked detailed article around RSB filling -- we have to use some *other* mechanisms to defend ret, and then we rely on them being "safe".

As a consequence to all of this, this flag to clang shouldn't be used for *arbitrary* mitigations. It really should be used for mitigations stemming from return-based trampolining of indirect branches.

If we want truly generic thunk support in the future, we should add that under a separate flag name, and it should likely apply to returns as well as other forms of indirects. I know there has been some discussion about using lfence to protect indirect branch and call on AMD processors, but I'm still extremely dubious about this being an advisable mitigation. I have *significantly* more confidence in the retpoline technique being a truly effective mitigation against the attacks, and the performance we observe in the wild really is completely acceptable (for the Linux kernel, hard to even measure outside of contrived micro-benchmarks).

chandlerc updated this revision to Diff 128721.Jan 5 2018, 2:46 AM

Add explicit checks for various lowerings that would need direct retpoline
support that are not yet implemented. These are constructs that don't show up
in typical C, C++, and other static languages today: state points, patch
points, stack probing and morestack.

I had one comment I think got missed: there are some references to X86::CALL64r/X86::CALL64m in X86FrameLowering.cpp and X86MCInstLower.cpp which look like they could be relevant, but aren't addressed by this patch.

Sorry for failing to respond previously...

I've audited all of these. They should have already triggered a hard failure if we made it through ISel though. I've added explicit and more user-friendly fatal errors if we hit these cases.

The only really interesting cases are statepoints and patchpoints. The other cases are only relevant when in the large code model on x86-64 which is pretty much riddled with bugs already. =/

For AMD processors we may be able to handle indirect jumps via a simpler lfence mechanism. Indirect calls may still require retpoline. If this turns out to be the right solution for AMD processors we may need to put some code in to support this.

Thanks! I'm however not seeing the updated commit message that contains the usage documentation of the new option in the differential revision.

Yeah, not sure why phab doesn't update that -- it will be updated when it lands.

Phabricator keeps the summary field of a revision independent of the commit message (beyond the initial upload). You can use arc diff --verbatim to force the summary to be re-copied from the commit message. However, running that will also update the Reviewers and Subscribers fields with their values in the commit message, which would probably drop a lot of subscribers in this case, so I wouldn't recommend it. You can always Edit Revision and update the summary manually if you really want.

The only really interesting cases are statepoints and patchpoints. The other cases are only relevant when in the large code model on x86-64 which is pretty much riddled with bugs already. =/

To be explicit, you can ignore statepoints and patchpoints for the moment. As the only major user of this functionality, we'll follow up with patches if needed. We're still in the process of assessing actual risk in our environment, but at the moment it looks like we likely won't need this functionality. (Obviously, subject to change as we learn more.)

Any more comments? I'd love to land this and start the backporting and merging process. =D

FWIW, I've built and linked the test suite with this in various modes, both 64-bit and 32-bit, and have no functional failures. I've not done any specific performance measurements using the LLVM test suite, but you can see our initial (very rough) performance data in the OP.

For AMD processors we may be able to handle indirect jumps via a simpler lfence mechanism. Indirect calls may still require retpoline. If this turns out to be the right solution for AMD processors we may need to put some code in to support this.

Yeah, if it ends up that we want non-retpoline mitigations for AMD we can and should add them. One hope I have is that this patch is at least generically *sufficient* (when paired with correct RSB filling) even if it is suboptimal in some cases and we end up adding more precise tools later.

(Er sorry, have two updates from Rafael I need to actually include, will be done momentarily...)

chandlerc updated this revision to Diff 128834.Jan 5 2018, 5:45 PM

Teach the thunk emission to put them in comdats and enhance tests to verify
this.

Also add test coverage for nonlazybind calls which on 64-bit architectures
require retpoline there despite no user written indirect call. This already
worked, but Rafael rightly pointed out we should test it.

For AMD processors we may be able to handle indirect jumps via a simpler lfence mechanism. Indirect calls may still require retpoline. If this turns out to be the right solution for AMD processors we may need to put some code in to support this.

Yeah, if it ends up that we want non-retpoline mitigations for AMD we can and should add them. One hope I have is that this patch is at least generically *sufficient* (when paired with correct RSB filling) even if it is suboptimal in some cases and we end up adding more precise tools later.

Just to say that at Sony we're still doing our investigation and might be interested in lfence. But who knows, we might just zap the predictor on syscalls and context switches; for environments that have mostly a few long-running processes with comparatively few syscalls it might be net cheaper than making every indirection more expensive.

For AMD processors we may be able to handle indirect jumps via a simpler lfence mechanism. Indirect calls may still require retpoline. If this turns out to be the right solution for AMD processors we may need to put some code in to support this.

Yeah, if it ends up that we want non-retpoline mitigations for AMD we can and should add them. One hope I have is that this patch is at least generically *sufficient* (when paired with correct RSB filling) even if it is suboptimal in some cases and we end up adding more precise tools later.

Just to say that at Sony we're still doing our investigation and might be interested in lfence. But who knows, we might just zap the predictor on syscalls and context switches; for environments that have mostly a few long-running processes with comparatively few syscalls it might be net cheaper than making every indirection more expensive.

But retpoline doesn't make every indirection more expensive any more or less than zapping the predictor... You only build the code running in the privileged domain with retpoline, not all of the code, and they both accomplish very similar things.

The performance difference we see between something like retpoline and disabling the predictor on context switches is very significant (retpoline is much, much cheaper).

A good way to think about the cost of these things is this. The cost of retpoline we have observed on the kernel:

  1. the cost of executing the system call with "broken" indirect branch prediction (IE, reliably mispredicted), plus
  2. the cost of few extra instructions (very, very few cycles)

Both of these are very effectively mitigated by efforts to remove hot indirect branches from the system call code in the kernel. Because of the nature of most kernels, this tends to be pretty easy and desirable for performance anyways.

By comparison, the cost of toggling off the predictor is:

  1. the exact same cost as #1 above, plus
  2. the cost of toggling the MSR on every context switch

This second cost, very notably, cannot be meaningfully mitigated by PGO, or hand-implemented hot-path specializations without an indirect branch. And our measurements on Intel hardware at least show that this cost of toggling is actually the dominant cost by a very large margin.

So, you should absolutely measure the impact of the AMD solutions you have on your AMD hardware as it may be very significantly different. But I wanted to set the expectation correctly based on what limited experience we have so far (sadly only on Intel hardware).

But retpoline doesn't make every indirection more expensive any more or less than zapping the predictor... You only build the code running in the privileged domain with retpoline, not all of the code, and they both accomplish very similar things.

I do understand that it applies only to privileged code.

The performance difference we see between something like retpoline and disabling the predictor on context switches is very significant (retpoline is much, much cheaper).

I expect you are measuring this in a normal timesharing environment, which is not what we have (more below).

A good way to think about the cost of these things is this. The cost of retpoline we have observed on the kernel:

  1. the cost of executing the system call with "broken" indirect branch prediction (IE, reliably mispredicted), plus
  2. the cost of few extra instructions (very, very few cycles)

Both of these are very effectively mitigated by efforts to remove hot indirect branches from the system call code in the kernel. Because of the nature of most kernels, this tends to be pretty easy and desirable for performance anyways.

Right.

By comparison, the cost of toggling off the predictor is:

  1. the exact same cost as #1 above, plus
  2. the cost of toggling the MSR on every context switch

This second cost, very notably, cannot be meaningfully mitigated by PGO, or hand-implemented hot-path specializations without an indirect branch. And our measurements on Intel hardware at least show that this cost of toggling is actually the dominant cost by a very large margin.

As I said above, I expect you are measuring this on a normal timesharing system. A game console is not a normal timesharing environment. We reserve a core to the system and the game gets the rest of them. My understanding is that the game is one process and it essentially doesn't get process-context-switched (this understanding could be very flawed) although it probably does get thread-context-switched (which should still be cheap as there is no protection-domain transition involved). If there basically aren't any process context switches while running a game (which is when performance matters most) then moving even a small in-process execution cost to a larger context-switch cost is a good thing, not a bad thing. We have hard-real-time constraints to worry about.

So, you should absolutely measure the impact of the AMD solutions you have on your AMD hardware as it may be very significantly different. But I wanted to set the expectation correctly based on what limited experience we have so far (sadly only on Intel hardware).

I appreciate all the work you've done on this, and your sharing of your findings. Unfortunately on our side we are playing catch-up as we were unaware of the problem before the public announcements and the publication of this patch. We're definitely doing research and our own measurements to understand what is our best way forward.

(If it wasn't obvious, I'm not standing in the way of the patch, just noting the AMD thing which clearly can be a follow-up if it seems to be a better choice there.)

MatzeB added a subscriber: MatzeB.Jan 8 2018, 1:20 PM
MatzeB added inline comments.
llvm/lib/Target/X86/X86RetpolineThunks.cpp
82–83 ↗(On Diff #128834)

We need this pass for "correctness" and should never skip it I think.

lattera added a subscriber: lattera.Jan 9 2018, 6:23 AM

FYI: I've imported this patch into a feature branch in HardenedBSD's playground repo. All of HardenedBSD (world + kernel) is compiled with it. I've been running it on my laptop for the past couple days without a single issue. I've enabled it locally on my box for a few applications, notably Firefox and ioquake3. Both work without issue. I plan to start an experimental package build tomorrow with it applied to the entire ports tree (around 29,400 packages).

chandlerc updated this revision to Diff 129849.Jan 15 2018, 5:50 AM

Re-implement the indirectbr removal pass based on Rafael's suggestion. It now
creates a single switch block and threads all the indirectbr's in the function
through that block. This should give substantially smaller code especially in
the case of using retpoline where the switch block expands to a search tree
that we probably don't want to duplicate 100s of times.

chandlerc updated this revision to Diff 129850.Jan 15 2018, 5:55 AM
chandlerc marked an inline comment as done.

Couple of other minor fixes.

Ok, I think this is all done. Rafael, I think I've implemented your suggestion as well and it still passes all my tests (including the test-suite) and a bunch of internal code I have.

Any last comments?

llvm/lib/Target/X86/X86RetpolineThunks.cpp
82–83 ↗(On Diff #128834)

Agreed and done.

jyknight added a comment.EditedJan 15 2018, 6:45 PM

Per kernel https://marc.info/?l=linux-kernel&m=151580566622935&w=2 and gcc https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01059.html, it seems AMD needs there to be an lfence in the speculation trap (and the pause is not useful for them, but does no harm). There seems to be some speculation (but no confirmation yet?) that pause *is* necessary vs lfence on intel. So in order to work generically, they seem to be suggesting using both instructions:

loop:
  pause
  lfence
  jmp loop

Some more links
https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01209.html
and final patch:
https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b

Per kernel https://marc.info/?l=linux-kernel&m=151580566622935&w=2 and gcc https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01059.html, it seems AMD needs there to be an lfence in the speculation trap (and the pause is not useful for them, but does no harm). There seems to be some speculation (but no confirmation yet?) that pause *is* necessary vs lfence on intel. So in order to work generically, they seem to be suggesting using both instructions:

loop:
  pause
  lfence
  jmp loop

Some more links
https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01209.html
and final patch:
https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b

Yes for AMD, we require "lfence" instruction after the "pause" in the "retpoline" loop filler. This solution has already been accepted in GCC and Linux kernel.
Can you please do the same in LLVM as well?

Per kernel https://marc.info/?l=linux-kernel&m=151580566622935&w=2 and gcc https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01059.html, it seems AMD needs there to be an lfence in the speculation trap (and the pause is not useful for them, but does no harm). There seems to be some speculation (but no confirmation yet?) that pause *is* necessary vs lfence on intel. So in order to work generically, they seem to be suggesting using both instructions:

loop:
  pause
  lfence
  jmp loop

Some more links
https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01209.html
and final patch:
https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b

Thanks for digging all of this up, but I have to say that it would be really awesome if folks from AMD would actually comment on this thread and/or patch rather than us relaying things 2nd and 3rd hand....

I'll look at implementing this, but I'm not super thrilled to change so much of the code at this point. The code as-is is secure, and merely power-inefficient on AMD chips. I'd like to fix that, but if it creates problems in testing, I'm inclined to wait for AMD to actually join the discussion.

Per kernel https://marc.info/?l=linux-kernel&m=151580566622935&w=2 and gcc https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01059.html, it seems AMD needs there to be an lfence in the speculation trap (and the pause is not useful for them, but does no harm). There seems to be some speculation (but no confirmation yet?) that pause *is* necessary vs lfence on intel. So in order to work generically, they seem to be suggesting using both instructions:

loop:
  pause
  lfence
  jmp loop

Some more links
https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01209.html
and final patch:
https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b

Yes for AMD, we require "lfence" instruction after the "pause" in the "retpoline" loop filler. This solution has already been accepted in GCC and Linux kernel.
Can you please do the same in LLVM as well?

Ahh, I see my email crossed yours, sorry.

Have you tested adding 'lfence' and this patch on any AMD platforms? Do you have any results? Can you confirm that these patches are actually working?

I'll look at implementing this, but I'm not super thrilled to change so much of the code at this point.

It seems potentially reasonable to me to commit as-is, and do that as a follow-on patch.

I still think the interface should be fixed, so that later on when lfence (or other thunks) is added clang doesn't end up with:

-mretpoline_external_thunk and -mlfence_external_thunk

Which doesn't make sense. IMHO the interface should be something like:

-mindirect-thunk={retpoline,lfence,external,...}

Or similar, ie: a single option that allows the user to select the thunk to use.

gcc is using such an interface:

https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01041.html

anujm1 added a subscriber: anujm1.Jan 16 2018, 5:35 PM

Per kernel https://marc.info/?l=linux-kernel&m=151580566622935&w=2 and gcc https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01059.html, it seems AMD needs there to be an lfence in the speculation trap (and the pause is not useful for them, but does no harm). There seems to be some speculation (but no confirmation yet?) that pause *is* necessary vs lfence on intel. So in order to work generically, they seem to be suggesting using both instructions:

loop:
  pause
  lfence
  jmp loop

Some more links
https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01209.html
and final patch:
https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b

Yes for AMD, we require "lfence" instruction after the "pause" in the "retpoline" loop filler. This solution has already been accepted in GCC and Linux kernel.
Can you please do the same in LLVM as well?

Ahh, I see my email crossed yours, sorry.

Have you tested adding 'lfence' and this patch on any AMD platforms? Do you have any results? Can you confirm that these patches are actually working?

Given the lack of a test case for this issue, we just tested SPEC2k17 with the GCC 'retpoline' patch ('pause' vs 'pause+lfence') on AMD Zen.
There is no overhead from adding 'lfence' after 'pause'.

Also for AMD we need 'lfence' as it is a dispatch serializing instruction.
Please refer: https://www.spinics.net/lists/kernel/msg2697621.html.

chandlerc updated this revision to Diff 130532.Jan 18 2018, 6:14 PM

Add the lfence instruction to all of the speculation capture loops, both
those generated by LLVM and those generated by LLD for the PLT.

This should ensure that both Intel and AMD processors stop speculating these
loops rather than continuing to consume resources (and power) uselessly.

Ping! Would really like to get this landed, but as it has changed somewhat, looking for a final round of review....

MatzeB accepted this revision.Jan 19 2018, 5:11 PM

LGTM, I'm all for getting this into the LLVM tree, we can fine tune later as necessary.

This is the first widely used machine module pass. Adding a module pass into the pipeline means we will have all the MI in memory at the same time rather than creating MI for 1 function at a time and freeing it after emitting each function. So it would be good to keep an eye on the compiler's memory consumption... Maybe we can find a way to refactor the codegen pipeline later so that we can go back to 1 function at a time and still have a way to emit some new code afterwards...

llvm/include/llvm/CodeGen/TargetPassConfig.h
412–413 ↗(On Diff #130532)

Even though this documentation is mostly copied, how about changing it to:

Targets may add passes immediately before machine code is emitted in this callback.
This is called later than addPreEmitPass().
414 ↗(On Diff #130532)

I find the name addEmitPass() misleading as it isn't adding the assembly emission passes. The best bad name I can think of right now is addPreEmit2(), in the spirit of the existing addPreSched2() callback...

llvm/lib/Target/X86/X86RetpolineThunks.cpp
83–84 ↗(On Diff #130532)

There shouldn't be any way to use any Target/X86 without also having a TargetPassConfig in the pipeline. An assert should be enough.

chandlerc marked 3 inline comments as done.Jan 22 2018, 1:29 AM

Thanks Matthias and Eric!

After a discussion on IRC, the general conclusion I came to is: let's try this as-is with a clean machine module pass. If we get reports of issues, it should be pretty straightforward to hack it up to use machine function passes instead, but it seems quite a bit cleaner as-is, and everyone seems very hopeful that this will not be an issue in practice. (And none of the testing we've done so far indicates any issues with memory consumption.)

Seems like this has pretty good consensus now (Matthias, Rafael, several others). I'm planning to land this first thing tomorrow morning (west coast US time) and then will start working on backports to release branches.

Thanks again everyone, especially those who helped author large parts of this. Will of course keep an eye out for follow-ups or other issues.

llvm/include/llvm/CodeGen/TargetPassConfig.h
414 ↗(On Diff #130532)

I suspect the right thing to do is to change the name of addPreEmitPass to something more representative of reality and document that correctly. Then we can use the obvious name of addPreEmitPass.

But I'd prefer to do the name shuffling of *existing* APIs in a follow-up patch. I've used addPreEmitPass2 here even though I really dislike that name, and left a FIXME.

This revision was automatically updated to reflect the committed changes.
chandlerc marked an inline comment as done.

@chandlerc is MacOS support omitted intentionally (ie. not implemented yet)? When I try using the -mretpoline flag on High Sierra I get the following:

fatal error: error in backend: MachO doesn't support COMDATs, '__llvm_retpoline_r11' cannot be lowered.

@chandlerc is MacOS support omitted intentionally (ie. not implemented yet)?

Not at all, I just don't have a Mac system to test with...

When I try using the -mretpoline flag on High Sierra I get the following:

fatal error: error in backend: MachO doesn't support COMDATs, '__llvm_retpoline_r11' cannot be lowered.

This is just that I unconditionally used comdats and that doesn't work on MachO. We should use some other lowering strategy to ensure the thunks are merged by the Mac linker. I'm not an expert there, so would defer to others like Ahmed. Happy to review a patch fixing it though.

GGanesh removed a subscriber: GGanesh.
GGanesh added a subscriber: GGanesh.