Amazing CPU talk for a weekend


ymo

Sep 11, 2015, 9:40:35 PM
to mechanical-sympathy

Justin Bailey

Sep 16, 2015, 1:24:43 PM
to mechanical-sympathy
Thanks for sharing - that was a really cool presentation. 

For those looking for more info, the presentation is titled: Dick Sites - "Data Center Computers: Modern Challenges in CPU Design".


On Friday, September 11, 2015 at 6:40:35 PM UTC-7, ymo wrote:

ymo

Sep 16, 2015, 2:37:14 PM
to mechanical-sympathy
One of my pet peeves with current CPUs is that if you have a cache miss you can't do anything about it. You just have to wait and waste cycles. Would it not be good if you could "check" a flag to know when the memory *is* available, and if not, continue doing more processing? What do you guys think?

On Friday, September 11, 2015 at 9:40:35 PM UTC-4, ymo wrote:

Vitaly Davidovich

Sep 16, 2015, 2:41:48 PM
to mechanical-sympathy

Hardware with OoO execution already does this for you - you only fully stall if you have data dependence on the load.

sent from my phone


Martin Thompson

Sep 16, 2015, 2:51:27 PM
to mechanical-sympathy
It's a nice presentation and reminds me of many conversations I've had with Gil over the years.

I find it interesting that Richard brought up the point that we cannot have a cache line, that will be totally overwritten, without first fetching it from main memory. A classic example is that we constantly overwrite data in the young generation of managed runtimes, yet we don't care what was there before. Gil mentioned to me that Vega had an instruction to provide a zeroed cache line that did not go to memory, and it gave a significant throughput gain. Maybe Gil can expand further on this. I wonder why we cannot have nice things? :-)

The points on copying data really chime for me based on what I see day-to-day.

Martin...

On Saturday, 12 September 2015 02:40:35 UTC+1, ymo wrote:

Todd Lipcon

Sep 16, 2015, 3:00:57 PM
to mechanica...@googlegroups.com
On Wed, Sep 16, 2015 at 11:51 AM, Martin Thompson <mjp...@gmail.com> wrote:
It's a nice presentation and reminds me of many conversations I've had with Gil over the years.

I find it interesting that Richard brought up the point that we cannot have a cache line, that will be totally overwritten, without first fetching it from main memory. A classic example is that we constantly overwrite data in the young generation of managed runtimes, yet we don't care what was there before. Gil mentioned to me that Vega had an instruction to provide a zeroed cache line that did not go to memory, and it gave a significant throughput gain. Maybe Gil can expand further on this. I wonder why we cannot have nice things? :-)

I was pretty sure that on modern Intel CPUs, if you do non-temporal (streaming) stores to aligned addresses to write out a full cache-line, it doesn't generate a read prior to the writes. They just fill the write-combining buffer, and once that buffer's full, it's written back to memory without a read first. For example: http://hurricane-eyeent.blogspot.com/2012/05/fastest-way-to-zero-out-memory-stream.html

Am I mistaken?

-Todd
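As a concrete illustration, here is a minimal C sketch of the streaming-store zeroing pattern Todd describes (my illustration, not from the linked post; it assumes a 16-byte-aligned buffer whose length is a multiple of 64 bytes, and the function name is made up):

```c
#include <immintrin.h>  /* _mm_stream_si128, _mm_sfence */
#include <stddef.h>

/* Zero `len` bytes at `p` with non-temporal stores. `p` must be 16-byte
 * aligned and `len` a multiple of 64, so each 64-byte cache line is fully
 * covered by four streaming stores and the write-combining buffer can be
 * flushed to memory without a read-for-ownership of the line first. */
static void zero_nt(void *p, size_t len) {
    const __m128i zero = _mm_setzero_si128();
    char *c = (char *)p;
    for (size_t i = 0; i < len; i += 64) {
        _mm_stream_si128((__m128i *)(c + i),      zero);
        _mm_stream_si128((__m128i *)(c + i + 16), zero);
        _mm_stream_si128((__m128i *)(c + i + 32), zero);
        _mm_stream_si128((__m128i *)(c + i + 48), zero);
    }
    _mm_sfence();  /* NT stores are weakly ordered; see Matt's point below */
}
```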

Martin Thompson

Sep 16, 2015, 3:13:22 PM
to mechanical-sympathy, to...@lipcon.org

I find it interesting that Richard brought up the point that we cannot have a cache line, that will be totally overwritten, without first fetching it from main memory. A classic example is that we constantly overwrite data in the young generation of managed runtimes, yet we don't care what was there before. Gil mentioned to me that Vega had an instruction to provide a zeroed cache line that did not go to memory, and it gave a significant throughput gain. Maybe Gil can expand further on this. I wonder why we cannot have nice things? :-)

I was pretty sure that on modern Intel CPUs, if you do non-temporal (streaming) stores to aligned addresses to write out a full cache-line, it doesn't generate a read prior to the writes. They just fill the write-combining buffer, and once that buffer's full, it's written back to memory without a read first. For example: http://hurricane-eyeent.blogspot.com/2012/05/fastest-way-to-zero-out-memory-stream.html

Am I mistaken?

Interesting that it needs to be written back to memory, even if not read. If the cache is the source of truth then it should only need to write back on eviction. This feels like DCA vs DDIO for NIC transfers. We want the equivalent of DDIO for this, I think.

I wonder if fencing the memory so it can be ordered with normal write-back memory is an issue. The x86 fences are more expensive than LOCK-prefixed instructions. Maybe someone with JVM experience can opine?

Matt Godbolt

Sep 16, 2015, 3:13:39 PM
to mechanica...@googlegroups.com
On Wed, Sep 16, 2015 at 2:00 PM, Todd Lipcon <to...@lipcon.org> wrote:

I was pretty sure that on modern Intel CPUs, if you do non-temporal (streaming) stores to aligned addresses to write out a full cache-line, it doesn't generate a read prior to the writes. They just fill the write-combining buffer, and once that buffer's full, it's written back to memory without a read first. For example: http://hurricane-eyeent.blogspot.com/2012/05/fastest-way-to-zero-out-memory-stream.html

Am I mistaken?

That's how I understand it to work too (also if you store to uncached write-combined memory using normal stores, though that's not a common use case...).  One thing to remember, though, if you use non-temporal stores is that they do not have the same total-store-order guarantee that normal stores do: you'll need an explicit fence for synchronization. That makes this pretty expensive in the general case.

-matt
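A hedged sketch of the ordering point Matt raises: an NT store followed by a flag that another thread will read needs an explicit sfence in between (the names and publish pattern here are illustrative, not from the thread):

```c
#include <immintrin.h>
#include <stdatomic.h>

/* Write one 16-byte chunk with a non-temporal store, then publish it via a
 * flag. Without the sfence the NT store may still sit in a write-combining
 * buffer when the flag becomes visible to another thread. */
static void publish(__m128i *dst, __m128i value, atomic_int *ready) {
    _mm_stream_si128(dst, value);                       /* weakly ordered   */
    _mm_sfence();                                       /* drain WC buffers */
    atomic_store_explicit(ready, 1, memory_order_release);
}
```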

Vitaly Davidovich

Sep 16, 2015, 3:19:03 PM
to mechanical-sympathy

You're not mistaken - NT stores bypass the cache.  However, they're not coherent and need fences if those stores are followed by a regular store and need to be visible to other CPUs.  Also, in the case of zeroed memory, you actually will need to write those cache lines when the constructor executes (unless the constructor does no writes, which isn't interesting); HotSpot doesn't use this strategy though; instead it issues prefetch hints as part of allocation (on x64 at least).

There was a paper on different zeroing strategies for the jvm: http://users.cecs.anu.edu.au/~steveb/downloads/pdf/zero-oopsla-2011.pdf

sent from my phone
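For a flavor of what prefetch-on-allocation can look like, here is an illustrative bump-pointer sketch in C using the GCC/Clang builtin (this is not HotSpot's code; the 256-byte prefetch distance and the names are arbitrary assumptions):

```c
#include <stddef.h>

static char *bump_ptr;    /* next free byte in the thread-local region */
static char *region_end;  /* end of the region; refill logic omitted   */

/* Hand out `size` bytes and hint that memory a little ahead of the bump
 * pointer will be written soon, so those lines are hopefully already in
 * cache (in exclusive state) when the object's fields are initialized. */
static void *alloc(size_t size) {
    if ((size_t)(region_end - bump_ptr) < size)
        return NULL;                           /* caller refills; omitted */
    char *obj = bump_ptr;
    bump_ptr += size;
    __builtin_prefetch(bump_ptr + 256, 1, 3);  /* rw=1: prefetch for write */
    return obj;
}
```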

Vitaly Davidovich

Sep 16, 2015, 3:53:19 PM
to mechanical-sympathy

The cache isn't the source of truth; after all, some memory may not be in any CPU's cache.  If NT stores conditionally wrote to the cache depending on cache line presence, you could end up with some data in cache and some in memory, each with a different coherence model.  I think this could get quite hairy quickly.

sent from my phone

Martin Thompson

Sep 16, 2015, 5:51:35 PM
to mechanica...@googlegroups.com
In the case of write-back memory, when data is both in the cache and in memory, the cache is the source of truth. If it is just in memory then there is no conflict.

Non-write-back memory is not the common case. For example, everything you would do from a language such as Java is write-back memory. The other memory types do exist but only confuse the subject for most people.

ymo

Sep 16, 2015, 6:06:04 PM
to mechanical-sympathy
Hi Vitaly.

I need to grok this more. What do you mean by OoO (out-of-order) execution already doing this for you?

Let's say that I need to add 1 to each byte of 16 contiguous bytes of memory and store the result in a separate location that is also contiguous. Let's say that I am doing it one byte at a time and that I am not using AVX, to keep things simple. Since the CPU and cache are way faster than RAM, are you saying that the prefetcher will always feed your CPU with enough data that it will never stall? Is this guaranteed?
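For reference, the loop being described might look like this minimal C sketch (the array names are mine):

```c
#include <stddef.h>
#include <stdint.h>

/* Add 1 to each of `n` contiguous source bytes, writing the results to a
 * separate contiguous destination, one byte at a time (no AVX). For the
 * question above, n would be 16. */
void add_one(const uint8_t *src, uint8_t *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(src[i] + 1);
}
```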

Vitaly Davidovich

Sep 16, 2015, 6:10:38 PM
to mechanical-sympathy

So how would you implement this for NT stores to writeback memory? We don't want to involve caches at all since the point of NT is to not pollute them.

sent from my phone

Vitaly Davidovich

Sep 16, 2015, 11:01:50 PM
to mechanical-sympathy
What do you mean by OoO (out-of-order) execution already doing this for you?

OoO execution, as a general concept, is trying to keep the CPU execution units busy despite various hazards, cache misses being one of them (the most expensive though).  If your instruction stream issues a load that misses in cache, any subsequent instructions not data dependent on that load can continue executing.  Of course the number of such instructions will vary on a case by case basis, and it's still possible to stall once execution cannot proceed at all until the load retires.  Given that loads from memory can take 100+ cycles, this is why memory access patterns are extremely important.

Let's say that I need to add 1 to each byte of 16 contiguous bytes of memory and store the result in a separate location that is also contiguous. Let's say that I am doing it one byte at a time and that I am not using AVX, to keep things simple. Since the CPU and cache are way faster than RAM, are you saying that the prefetcher will always feed your CPU with enough data that it will never stall? Is this guaranteed?

It's unclear whether you're talking about doing this over a large block of memory, or just some random 16 bytes of memory.  If it's the latter, then the CPU will issue a load for a 64 byte (safe assumption on most CPUs) cacheline that covers the 16 bytes.  If you're lucky, this falls all on 1 line; otherwise, you'll need 2 lines (if the memory spans two lines) -- let's assume 1 line for this example.  Once that line is brought in, you'll get all 16 byte ops for free in terms of memory.  In addition, the address of the destination memory is not dependent on the source, which means the CPU can issue the load for the destination memory ahead of time, possibly resulting in no cache miss when you go to store the updated value.  Also keep in mind that the loads for source and destination can issue before you even arrive at these instructions (i.e. you have instructions prior to this and the CPU OoO window reaches all the way here), and if those instructions have sufficient latency, the required cachelines may already be present by the time you are ready to operate on them.

If you're talking about iterating over 16 byte chunks of a much bigger contiguous memory range, then yes, the hardware prefetcher can/will assist here if it's able to determine the access pattern (for linear walks over memory with constant stride this should be always the case).  Once the prefetcher kicks in, you're sucking in memory at bandwidth rate, which is on the order of several tens of GB/s.  Let's say it's 64GB/s (Haswell E5-2699 v3 is spec'd at 68GB/s).  At 3GHz frequency, this delivers ~23 bytes/cycle to the cache (inter-cache delivery is a bit larger, I believe 64 bytes for loads and 32 bytes for stores per cycle).  Now you may not hit theoretical/spec peaks due to bottlenecks/contention elsewhere, but you should get pretty close.  As for guarantees on not starving the CPU, the only guarantees are death and taxes :).  Jokes aside, linear walks through memory with hardware prefetch should be pretty good with hiding latency entirely, especially if your processing of this data is non-negligible, but YMMV of course.

As I mentioned earlier, the above illustrates why spatial locality and streaming friendly layout are paramount to avoiding cache miss related stalls.  You want to (a) maximize use of a fetched cacheline (i.e. use as much, ideally all, data on it) and (b) stream through it with hardware prefetch.
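A small C illustration of point (a), maximizing use of each fetched line (types and field names are made up): a pass that only needs `x` touches every byte of each line with a structure-of-arrays layout, but only 4 of every 16 bytes with an array-of-structures layout.

```c
#include <stddef.h>

struct particle_aos  { float x, y, z, w; };       /* array of structures */
struct particles_soa { float *x, *y, *z, *w; };   /* structure of arrays */

/* AoS: each 64-byte line holds 4 particles, but only their x fields are
 * used here, so 3/4 of every fetched line is wasted. */
float sum_x_aos(const struct particle_aos *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += p[i].x;
    return s;
}

/* SoA: the x values are contiguous, so every byte of every fetched line is
 * used and the hardware prefetcher sees a simple linear stride. */
float sum_x_soa(const struct particles_soa *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += p->x[i];
    return s;
}
```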


Gil Tene

Sep 17, 2015, 3:40:58 AM
to mechanical-sympathy, to...@lipcon.org
Non-temporal stores are a nice side-track conversation, and interesting for IO, but they tend to be irrelevant for most use cases (malloc, objects, structs) that the program actually uses with temporal memory behavior and expectations of cache coherency and normal ordering rules.

It is my impression that modern x86 cores are actually smart enough to [under the hood] avoid reading in a cache line when it is certain that it is going to be completely overwritten. This is certainly possible for the CPU to deduce during REP STOSB or REP MOVS sequences, and with AVX512 available in Intel's Skylake (and Knights Landing) CPUs, a write is by definition known to overwrite an entire cache line (512 bits is one 64-byte line). Prefetch with intent to overwrite is not a new thing either. Alpha had this (it was called "Prefetch Write Hint 64"), and variants of SPARC, Power, and MIPS have had possible semi-equivalents too, I believe. All of these variants can be used well in memset and memcpy variants, and in semi-obvious ways.
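For concreteness, a full-cache-line store with AVX512 might look like the following sketch (an illustration assuming AVX-512F hardware and a 64-byte-aligned, 64-byte-multiple buffer; whether a given core actually elides the memory read is exactly the microarchitectural question being discussed):

```c
#include <immintrin.h>
#include <stddef.h>

/* Each 64-byte aligned zmm store covers exactly one cache line, so the
 * hardware can in principle know the whole line is being overwritten.
 * Compile with -mavx512f. */
static void zero_lines_avx512(void *p, size_t len) {
    const __m512i zero = _mm512_setzero_si512();
    char *c = (char *)p;                        /* must be 64-byte aligned */
    for (size_t i = 0; i < len; i += 64)
        _mm512_store_si512((void *)(c + i), zero);
}
```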

But a cache-line-wide destructive prefetch, full-line-zeroing, or full-line-store (e.g. AVX512) alone do not address the main problem we were looking to solve. The problem has to do with how these operations relate to other memory operations and fences.

What we aimed for was avoiding reading in cache lines that we knew would be stomped on by newgen allocation, and to be willing to do that, ordering needed to be carefully handled. Because if you don't carefully address it, you are left with a choice: either avoid reading (saving bandwidth) but stall much more often, or avoid stalling (keeping your speed) and read much more often (wasting bandwidth). The reason the choice is such is that in order to avoid reading in about-to-be-overwritten cache lines for allocation sites without stalling, you need to do so in a far-looking prefetching pattern: the line needs to be in your cache, in exclusive mode, before you need to stall for the first actual value you place in it during allocation/initialization of whatever object you put there. This point in time can be pushed past the actual initializing store (that's what store buffers are for), but will generally not go much past the full memory fence, LOCK instruction, volatile write, or whatever else you end up doing for ordering in the course of running your program following the initialization stores.

The problem (which we solved in Vega with our CLZ instruction semantics) is that since your cache-line-destructive stores will need to be made far ahead of the initializing stores, and those cache-line-destructive operations are generally considered store operations (not benign prefetches), any following fence that waits for in-flight stores to complete will stall the CPU until the preceding cache-line-destructive store completes. And since each and every one of those cache-line-destructive stores tends to be a full cache miss, this generally means they take 100s of cycles to complete (even though no memory is read, some sort of cross-socket latency for coherency to feel good about having exclusive ownership will be involved).
 
To solve this problem, we decoupled CLZ stores from all other stores, into a separate ordering domain, with its own fences (I know this sounds a bit like non-temporal stores, but it's not).  Normal memory fences (used to establish StoreStore, StoreLoad, LoadLoad, and LoadStore orders) did not stall for pending CLZs, and CLZ fences did not stall for normal memory fences. This allows in-flight CLZs to stay in flight across program ordering operations (locks, volatile writes, etc.), thereby making them practical to use as a perfect allocation prefetch pipeline that doesn't stall execution and doesn't read any of the newgen memory in. The "trick" of having cache lines cross from in-flight-CLZ to actually-used-by-initialized-object (which has to obey normal memory fences) is that within a core, data dependency between CLZs and normal stores is honored and ordered. As allocation sites touch each cache line (they store a 0 to at least one byte of each line even though the CLZ had already zeroed it), those individual potentially-still-in-flight-with-CLZ cache lines (as opposed to all in-flight CLZ lines) become ordered by future memory fences. This establishes the needed order without requiring the allocation pipeline to be drained by each memory fence. CLZ fences were still required, but only on context switches (the OS took care of that).

The bottom-line effect of this solution (CLZ with separate domain fencing) was immense (to the point of being hard to believe at first). We would regularly monitor memory bandwidth on Vega hardware, and an almost universal quality we noticed is that Vega machines wrote more to memory than they read from it. On a sustained basis. And on pretty much all Java applications we looked at. Think about that. That's "strange" to see. Virtually all programs on all machines tend to do the opposite: read in more memory than they write. The reasons are intuitive, but they mostly have to do with the fact that even when we allocate new things (in malloc, in Java newgen on HotSpot, etc.), we tend to read them in first before we write to them, making read bandwidth >= write bandwidth an inherent relationship. Vega broke that rule, and in a big way.

The overall effect of the hardware semantics provided by CLZ with separate-domain fencing, combined with a JVM that used that for pretty much all Java object allocation, was a 1/3rd reduction in overall memory bandwidth per work unit (it was easy to compare by turning the feature off, and the savings varied, but were generally in that range).

The bottom line is: This one feature (or anything else that would match it) is the equivalent of adding two more memory channels to our current 4-channel Xeons, but without any of the silicon real estate, pins, wires, or power associated with actually adding those channels.

I keep looking for ways to leverage x86 stuff to achieve the same, and I keep measuring behavior of various new prefetch capabilities, but haven't yet found a combo to match... 

Martin Thompson

Sep 17, 2015, 4:06:47 AM
to mechanica...@googlegroups.com
The point is that we want these lines available in the cache as write-back memory. Non-temporal means instructing the hardware that the memory about to be written will not be read again soon. The non-temporal bit :-) This is useful for IO or writing to a graphics card where we actively want to bypass the cache. Most short lived data should never escape the cache and be written back to main memory if eviction pressure does not intervene.

I think Gil's response is way more informed than anything I could offer on the subject.

ymo

Sep 17, 2015, 12:33:02 PM
to mechanical-sympathy


On Wednesday, September 16, 2015 at 11:01:50 PM UTC-4, Vitaly Davidovich wrote:
What do you mean by OoO (out-of-order) execution already doing this for you?

OoO execution, as a general concept, is trying to keep the CPU execution units busy despite various hazards, cache misses being one of them (the most expensive though).  If your instruction stream issues a load that misses in cache, any subsequent instructions not data dependent on that load can continue executing.  Of course the number of such instructions will vary on a case by case basis, and it's still possible to stall once execution cannot proceed at all until the load retires.  Given that loads from memory can take 100+ cycles, this is why memory access patterns are extremely important.


OK, got this. I am wondering what happens when you have tight loops like in my case. I have very tight loops going over a bunch of SoAs (structures of arrays), and when I am done with one I just jump to the next one. I am sure in this case my only help is the prefetcher. I don't think I can count on OoO inside a tight loop, can I?


Let's say that I need to add 1 to each byte of 16 contiguous bytes of memory and store the result in a separate location that is also contiguous. Let's say that I am doing it one byte at a time and that I am not using AVX, to keep things simple. Since the CPU and cache are way faster than RAM, are you saying that the prefetcher will always feed your CPU with enough data that it will never stall? Is this guaranteed?

It's unclear whether you're talking about doing this over a large block of memory, or just some random 16 bytes of memory.  If it's the latter, then the CPU will issue a load for a 64 byte (safe assumption on most CPUs) cacheline that covers the 16 bytes.  If you're lucky, this falls all on 1 line; otherwise, you'll need 2 lines (if the memory spans two lines) -- let's assume 1 line for this example.  Once that line is brought in, you'll get all 16 byte ops for free in terms of memory.  In addition, the address of the destination memory is not dependent on the source, which means the CPU can issue the load for the destination memory ahead of time, possibly resulting in no cache miss when you go to store the updated value.  Also keep in mind that the loads for source and destination can issue before you even arrive at these instructions (i.e. you have instructions prior to this and the CPU OoO window reaches all the way here), and if those instructions have sufficient latency, the required cachelines may already be present by the time you are ready to operate on them.


I am talking about non-random memory access. My memory access is "linear walks over memory with constant stride" over a bunch of consecutive SoAs. Once I get out of a tight loop over one array I go into another tight loop over another array. However, I find the way you explained this quite interesting ))) My question again: can OoO see beyond what's happening in a tight loop? Is it smart enough to look far beyond the end of the loop so that "the CPU can issue the load for the destination memory ahead of time"? Meaning, if I have a read after the end of my tight loop (going into my next loop), can it speculate on that and do the prefetch before I get out of the loop, for both the read and the write destination? I know I could "help it" by playing with the next chunk's memory read/write destinations before I get out of the current loop, but without that I think the hardware prefetcher is just too dumb to predict this. Ergo, the first time I access any memory in my next loop I will potentially stall (((

 
If you're talking about iterating over 16 byte chunks of a much bigger contiguous memory range, then yes, the hardware prefetcher can/will assist here if it's able to determine the access pattern (for linear walks over memory with constant stride this should be always the case).  Once the prefetcher kicks in, you're sucking in memory at bandwidth rate, which is on the order of several tens of GB/s.  Let's say it's 64GB/s (Haswell E5-2699 v3 is spec'd at 68GB/s).  At 3GHz frequency, this delivers ~23 bytes/cycle to the cache (inter-cache delivery is a bit larger, I believe 64 bytes for loads and 32 bytes for stores per cycle).  Now you may not hit theoretical/spec peaks due to bottlenecks/contention elsewhere, but you should get pretty close.  As for guarantees on not starving the CPU, the only guarantees are death and taxes :).  Jokes aside, linear walks through memory with hardware prefetch should be pretty good with hiding latency entirely, especially if your processing of this data is non-negligible, but YMMV of course.

OK, now you are assuming that there are no "jerky neighbor" cores, as the video calls them, thrashing shared caches. Since there is no compartmentalized memory, you are likely to never get this bandwidth because of your neighbors. So if all my cores are busy striding over memory in a *very* fast fashion, memory latency will kick in and the CPU will stall???

 


Vitaly Davidovich

Sep 17, 2015, 1:00:48 PM
to mechanical-sympathy
OK, got this. I am wondering what happens when you have tight loops like in my case. I have very tight loops going over a bunch of SoAs (structures of arrays), and when I am done with one I just jump to the next one. I am sure in this case my only help is the prefetcher. I don't think I can count on OoO inside a tight loop, can I?

Just a quick note here; OoO is a general term, and there are many CPU features that attempt to assist in it, prefetch just being one of them (other examples would be register renaming, multiple execution ports, reorder buffers, etc).  So saying you cannot count on OoO but only prefetch doesn't make too much sense.

My question again: can OoO see beyond what's happening in a tight loop? Is it smart enough to look far beyond the end of the loop so that "the CPU can issue the load for the destination memory ahead of time"? Meaning, if I have a read after the end of my tight loop (going into my next loop), can it speculate on that and do the prefetch before I get out of the loop, for both the read and the write destination? I know I could "help it" by playing with the next chunk's memory read/write destinations before I get out of the current loop, but without that I think the hardware prefetcher is just too dumb to predict this. Ergo, the first time I access any memory in my next loop I will potentially stall (((

Yes, it can start speculatively executing subsequent iterations provided there are no stalls/hazards in between.  Tight loops will also typically get aggressively unrolled by compilers to further maximize OoO opportunities (e.g. use more ISA registers, break dependency chains, etc).  Before you "help it by playing with next chunk of memory", you should profile the code in question with PMU counters and confirm that you're actually helping.  Generally speaking, hardware prefetch is getting better and better, and there's a chance that doing software prefetch hints can actually degrade performance.  If you're streaming over SoA's internal arrays, you're likely getting good hardware prefetch characteristics.

OK, now you are assuming that there are no "jerky neighbor" cores, as the video calls them, thrashing shared caches. Since there is no compartmentalized memory, you are likely to never get this bandwidth because of your neighbors. So if all my cores are busy striding over memory in a *very* fast fashion, memory latency will kick in and the CPU will stall???

Yes, this is why I said "you may not hit theoretical/spec peaks due to other bottlenecks/contention".  Noisy neighbors is one possible issue, but they'll be an issue no matter what -- this is but one reason people isolate critical workloads to specific socket (or cores) and try to shield noise away from there (e.g. IRQ steering).  At the end of the day, this is just theory and will differ across CPU models, concrete workloads, etc.  However, I suspect you'll find linear walks over memory that's fully utilized by your code (i.e. full cacheline is used, no waste) to run fairly efficiently on modern CPUs.


Matt Godbolt

Sep 17, 2015, 1:54:28 PM
to mechanica...@googlegroups.com
I very much agree with Vitaly here, but wanted to add a little extra commentary on OoO on x86es:

On Thu, Sep 17, 2015 at 12:00 PM, Vitaly Davidovich <vit...@gmail.com> wrote:
Yes, it can start speculatively executing subsequent iterations provided there are no stalls/hazards in between.  Tight loops will also typically get aggressively unrolled by compilers to further maximize OoO opportunities (e.g. use more ISA registers, break dependency chains, etc).

Somewhat off-topic, but on modern x86es unrolling is no longer a panacea (it's best to stay in the µop cache or loop cache), and using more ISA registers is unnecessary due to the renamer.  Unrolling to break dependency chains is still done though (and I'm always surprised how well compilers do at this, e.g. https://goo.gl/AaUtaA cunningly unrolls to break dependencies between each iteration dividing by 100).

Vitaly Davidovich

Sep 17, 2015, 2:03:46 PM
to mechanical-sympathy
Somewhat off-topic, but on modern x86es unrolling is no longer a panacea (it's best to stay in the µop cache or loop cache), and using more ISA registers is unnecessary due to the renamer.

Yeah, I didn't want to get further bogged down in the unrolling aspect being a double-edged sword, in particular on archs that have a uop cache, but thanks for bringing it up.  As for using more ISA registers, this is still useful to break dependency chains and allow parallel execution across multiple ports -- an example would be breaking associative operations (e.g. integer addition) up instead of accumulating all into 1 register and extending the depchain.  Of course the big win with compiler unrolling is loop vectorization (when autovectorizable), but that's a separate topic.
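A small C example of the accumulator splitting Vitaly describes (illustrative only; an optimizing compiler will often do this on its own when unrolling):

```c
#include <stddef.h>
#include <stdint.h>

/* Sum with four independent accumulators: the adds into s0..s3 do not
 * depend on each other, so they can issue in parallel on multiple ports
 * instead of forming one long serial dependency chain. */
uint64_t sum4(const uint64_t *a, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];                 /* remainder */
    return s0 + s1 + s2 + s3;
}
```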


Vitaly Davidovich

Sep 17, 2015, 2:21:48 PM
to mechanical-sympathy
It is my impression that modern x86 cores are actually smart enough to [under the hood] avoid reading in a cache line when it is certain that it is going to be completely overwritten.

Interesting -- any more info/links for this? I know the cores will try avoiding additional RFO traffic by bringing the line into exclusive mode upfront, but haven't heard about avoiding reading the line entirely.  I'm also not sure it wants to do that unless it knows the line request is non-temporal in nature.

Vitaly Davidovich

Sep 17, 2015, 2:41:56 PM
to mechanical-sympathy
The point is that we want these lines available in the cache as write-back memory. Non-temporal means instructing the hardware that the memory about to be written will not be read again soon. The non-temporal bit :-) This is useful for IO or writing to a graphics card where we actively want to bypass the cache. Most short lived data should never escape the cache and be written back to main memory if eviction pressure does not intervene.

Ok, sure but you're describing temporal stores whereas I was referring to NT stores :).  To circle back to the topic, a memory load may not actually go to memory if another CPU has the requested line -- in that case, the latency is much lower than going out to DRAM.  In other words, the CPU already tries to avoid going to memory as much as possible :).  You'll go to memory only if it's a completely cold line (i.e. no CPU has it at all).

This issue is really a managed runtime problem that requires incessant zeroing of memory, particularly if everything is heap-based like Java.  So, one should stop allocating so much and cycling through memory in the process! :) I don't know if Intel or other vendors have sufficient interest to introduce instructions to assist these runtimes; clearly Azul did, but they're selling Java.


Gil Tene

Sep 18, 2015, 3:03:31 AM
to mechanical-sympathy


On Thursday, September 17, 2015 at 11:21:48 AM UTC-7, Vitaly Davidovich wrote:
It is my impression that modern x86 cores are actually smart enough to [under the hood] avoid reading in a cache line when it is certain that it is going to be completely overwritten.

Interesting -- any more info/links for this? I know the cores will try avoiding additional RFO traffic by bringing the line into exclusive mode upfront, but haven't heard about avoiding reading the line entirely. I'm also not sure it wants to do that unless it knows the line request is non-temporal in nature.

There is a difference between not reading from memory and being non-temporal. When a write operation is known to be overwriting an entire cache line, temporal, fully coherent storage can properly be maintained by simply bringing the cache line into L1 in exclusive mode without any read from memory (or from other caches). With AVX512, each and every store of an AVX register to memory will have this quality. Watching perf counters on a streaming write of AVX512 registers to memory on a Skylake processor should tell us a lot about whether or not memory reads are being avoided...
 
