Hardware with OoO execution already does this for you - you only fully stall if you have a data dependence on the load.
It's a nice presentation and reminds me of many conversations I've had with Gil over the years.

I find it interesting that Richard brought up the point that we cannot get a cache line that will be totally overwritten without first fetching it from main memory. A classic example is that we constantly overwrite data in the young generation of managed runtimes, yet we don't care what was there before. Gil mentioned to me that Vega had an instruction to provide a zeroed cache line that did not go to memory, and it gave a significant throughput gain. Maybe Gil can expand further on this. I wonder why we cannot have nice things? :-)
I was pretty sure that on modern Intel CPUs, if you do non-temporal (streaming) stores to aligned addresses to write out a full cache line, it doesn't generate a read prior to the writes. They just fill the write-combining buffer, and once that buffer's full, it's written back to memory without a read first. For example: http://hurricane-eyeent.blogspot.com/2012/05/fastest-way-to-zero-out-memory-stream.html

Am I mistaken?
You're not mistaken - NT stores bypass the cache. However, they're weakly ordered and need fences if those stores are followed by regular stores and must become visible to other CPUs in order. Also, in the case of zeroed memory, you will actually need to write those cache lines anyway when the constructor executes (unless the constructor does no writes, which isn't interesting). HotSpot doesn't use this strategy, though; instead it issues prefetch hints as part of allocation (on x64 at least).
There was a paper on different zeroing strategies for the jvm: http://users.cecs.anu.edu.au/~steveb/downloads/pdf/zero-oopsla-2011.pdf
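For concreteness, here's a minimal sketch of the streaming-store zeroing under discussion, using SSE2 intrinsics. The 16-byte alignment and multiple-of-64 size requirements are my assumptions, not from the thread:

#include <emmintrin.h>  /* SSE2: _mm_stream_si128; pulls in _mm_sfence */
#include <stddef.h>

/* Zero a buffer with non-temporal stores. Assumes 'buf' is 16-byte
   aligned and 'bytes' is a multiple of 64. Full-cacheline NT stores
   fill a write-combining buffer, which is flushed to memory without
   a read-for-ownership first. */
static void zero_nt(void *buf, size_t bytes)
{
    __m128i zero = _mm_setzero_si128();
    char *p = (char *)buf;
    for (size_t i = 0; i < bytes; i += 64) {
        _mm_stream_si128((__m128i *)(p + i),      zero);
        _mm_stream_si128((__m128i *)(p + i + 16), zero);
        _mm_stream_si128((__m128i *)(p + i + 32), zero);
        _mm_stream_si128((__m128i *)(p + i + 48), zero);
    }
    /* NT stores are weakly ordered: fence before any regular stores
       that other CPUs must observe after the zeroing. */
    _mm_sfence();
}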
The cache isn't the source of truth; after all, some memory may not be in any CPU's cache. If NT stores conditionally wrote to the cache depending on cache-line presence, you could end up with some data in cache and some in memory, each with a different coherence model. I think this could get quite hairy quickly.
So how would you implement this for NT stores to write-back memory? We don't want to involve the caches at all, since the point of NT is to not pollute them.
Hi Vitaly. I need to grok this more. What do you mean by OoO (out of order) execution already doing this for you?

Let's say I need to add 1 to each byte of 16 contiguous bytes of memory and store the result in a separate location that is also contiguous. Let's say I'm doing it one byte at a time and not using AVX, to keep things simple. Since the CPU and cache are way faster than RAM, are you saying that the prefetch will always feed your CPU with enough data that it will never stall? Is this guaranteed?
OoO execution, as a general concept, is trying to keep the CPU execution units busy despite various hazards, cache misses being one of them (though the most expensive). If your instruction stream issues a load that misses in cache, any subsequent instructions not data-dependent on that load can continue executing. Of course, the number of such instructions will vary case by case, and it's still possible to stall once execution cannot proceed at all until the load retires. Given that loads from memory can take 100+ cycles, this is why memory access patterns are extremely important.
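A toy illustration of the data-dependence point (the function and names are made up):

/* The load of a[i] may miss in cache, but the update of 'x' has no
   data dependence on it, so an OoO core can execute it while the
   miss is outstanding; 'sum += v' has to wait for the data. */
int walk(const int *a, int n)
{
    int sum = 0, x = 1;
    for (int i = 0; i < n; i++) {
        int v = a[i];   /* load: 100+ cycles on a miss to DRAM  */
        x = x * 3 + 7;  /* independent: executes under the miss */
        sum += v;       /* dependent: waits for the loaded data */
    }
    return sum ^ x;     /* keep 'x' live so it isn't optimized away */
}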
As for the 16-byte example: it's unclear whether you're talking about doing this over a large block of memory, or just some random 16 bytes. If it's the latter, then the CPU will issue a load for a 64-byte cache line (a safe assumption on most CPUs) that covers the 16 bytes. If you're lucky, they all fall on one line; otherwise you'll need two lines (if the memory spans two) -- let's assume one line for this example. Once that line is brought in, you'll get all 16 byte operations for free in terms of memory. In addition, the address of the destination memory is not dependent on the source, which means the CPU can issue the load for the destination line ahead of time, possibly resulting in no cache miss when you go to store the updated value. Also keep in mind that the loads for source and destination can issue before you even arrive at these instructions (i.e. you have instructions prior to this and the CPU's OoO window reaches all the way here), and if those instructions have sufficient latency, the required cache lines may already be present by the time you are ready to operate on them.
If you're talking about iterating over 16-byte chunks of a much bigger contiguous memory range, then yes, the hardware prefetcher can/will assist here if it's able to determine the access pattern (for linear walks over memory with constant stride this should always be the case). Once the prefetcher kicks in, you're pulling in memory at bandwidth rate, which is on the order of several tens of GB/s. Let's say it's 68GB/s (what the Haswell E5-2699 v3 is spec'd at). At a 3GHz frequency, that delivers ~23 bytes/cycle to the cache (inter-cache delivery is a bit higher, I believe 64 bytes for loads and 32 bytes for stores per cycle). Now, you may not hit theoretical/spec peaks due to bottlenecks/contention elsewhere, but you should get pretty close. As for guarantees on not starving the CPU, the only guarantees are death and taxes :). Jokes aside, linear walks through memory with hardware prefetch should be pretty good at hiding latency entirely, especially if your processing of this data is non-negligible, but YMMV of course.
As I mentioned earlier, the above illustrates why spatial locality and a streaming-friendly layout are paramount to avoiding cache-miss-related stalls. You want to (a) maximize use of each fetched cache line (i.e. use as much of the data on it as possible, ideally all of it) and (b) stream through memory so the hardware prefetcher can do its job.
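For reference, the byte-increment example from the question, written out as exactly this kind of linear walk (a sketch, not tuned code):

#include <stddef.h>

/* Read each source byte, add 1, store to a separate contiguous
   destination. Both streams are linear with constant stride --
   precisely the pattern the hardware prefetcher detects -- and one
   64-byte line covers 64 iterations. */
void add_one(const unsigned char *src, unsigned char *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (unsigned char)(src[i] + 1);
}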
On Sep 16, 2015 2:37 PM, "ymo" <ymol...@gmail.com> wrote:
One of my pet peeves with current CPUs is that if you have a cache miss you can't do anything about it. You just have to wait and waste cycles. Would it not be good if you could "check" a flag to know when the memory *is* available and, if not, continue to do more processing? What do you guys think?
OK, got this. I am wondering what happens when you have tight loops like in my case. I have very tight loops going over a bunch of SoA (structure of arrays), and when I am done with one I just jump to the next one. I am sure that in this case my only help is the prefetch. I don't think I can count on the OoO inside a tight loop, can I?

My question again: can OoO see beyond what's happening in a tight loop? Is it smart enough to look far beyond the end of the loop, so that "the CPU can issue the load for the destination memory ahead of time"? Meaning, if I have a read after the end of my tight loop (going into my next loop), can it speculate on that and do the prefetch before I get out of the loop, for both the read and the write destination? I know I could "help it" by touching the next chunk's read/write memory before I get out of the current loop (see the sketch below), but without that I think the hardware prefetcher is just too dumb to predict this. Ergo, the first time I access any memory in my next loop I will potentially stall... :(((
OK, now you are assuming that there are no "jerky" neighbor cores, as the video calls them, thrashing the shared caches. Since memory bandwidth is not partitioned per core, you are likely to never get this bandwidth because of your neighbors. So if all my cores are busy striding over memory in a *very* fast fashion, the memory latency will kick in and the CPU will stall???
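Regarding the "help it" idea above, a minimal sketch using SSE's software prefetch hint. CHUNK, the one-chunk-ahead distance, and the byte-increment body are made-up stand-ins:

#include <xmmintrin.h>  /* _mm_prefetch */
#include <stddef.h>

#define CHUNK 4096  /* hypothetical chunk size */

/* Before finishing the current chunk, hint the first line of the
   next one so it is (hopefully) resident when the loop gets there.
   A real version would prefetch several lines ahead, tuned
   empirically. */
void process_chunks(char *base, size_t nchunks)
{
    for (size_t c = 0; c < nchunks; c++) {
        char *cur = base + c * CHUNK;
        if (c + 1 < nchunks)
            _mm_prefetch(cur + CHUNK, _MM_HINT_T0);
        for (size_t i = 0; i < CHUNK; i++)
            cur[i] += 1;  /* stand-in for the real per-chunk work */
    }
}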
Yes, it can start speculatively executing subsequent iterations provided there are no stalls/hazards in between. Tight loops will also typically get aggressively unrolled by compilers to further maximize OoO opportunities (e.g. use more ISA registers, break dependency chains, etc).
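A sketch of the accumulator-splitting flavor of that unrolling (my own illustration, not actual compiler output):

#include <stddef.h>

/* Four accumulators break the single loop-carried dependency chain
   of a naive sum, letting the core keep more loads and adds in
   flight at once. Assumes n is a multiple of 4 for brevity. */
long sum4(const long *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}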
Somewhat off-topic, but on modern x86es unrolling is no longer a panacea (it's best to stay in the µop cache or loop cache), and using more ISA registers is unnecessary due to the renamer.
It is my impression that modern x86 cores are actually smart enough to [under the hood] avoid reading in a cache line when it is certain that it is going to be completely overwritten.
The point is that we want these lines available in the cache as write-back memory. Non-temporal means instructing the hardware that the memory about to be written will not be read again soon. The non-temporal bit :-) This is useful for IO, or for writing to a graphics card, where we actively want to bypass the cache. Most short-lived data should never escape the cache and be written back to main memory, provided eviction pressure does not intervene.
Interesting -- any more info/links on this? I know the cores will try to avoid additional RFO traffic by bringing the line into exclusive mode upfront, but I haven't heard about avoiding reading the line entirely. I'm also not sure it would want to do that unless it knows the line request is non-temporal in nature.