y-cruncher - A Multi-Threaded Pi-Program
From a high-school project that went a little too far...
By Alexander J. Yee
(Last updated: March 13, 2024)
The first scalable multi-threaded Pi-benchmark for multi-core systems...
How fast can your computer compute Pi?
y-cruncher is a program that can compute Pi and other constants to trillions of digits.
It is the first of its kind that is multi-threaded and scalable to multi-core systems. Ever since its launch in 2009, it has become a common benchmarking and stress-testing application for overclockers and hardware enthusiasts.
y-cruncher has been used to set several world records for the most digits of Pi ever computed.
Current Release:
Windows: Version 0.8.4 Build 9538 (Released: February 22, 2024)
Linux : Version 0.8.4 Build 9538 (Released: February 22, 2024)
Official Mersenneforum Subforum.
Official HWBOT forum thread.
Limping to a new Pi Record of 105 Trillion Digits: (March 14, 2024)
If the bug that was fixed in v0.8.3.9533 made you suspect something was happening, you were right.
Jordan Ranous from StorageReview has followed up last year's speedrun of Pi with a record this time: 105 trillion digits of Pi.
Originally, this was supposed to be another speedrun with slightly more digits to set the record. But it turned into anything but that as it ran into a multitude of problems. Unlike the majority of past Pi records done with y-cruncher, this one did not go smoothly. And for the first time in over 10 years, a Pi record attempt required developer intervention to complete, so I was very much involved in this one.
But in the end, a record is still a record. It doesn't need to be pretty.
Computation:
Decimal Digits: 105,000,000,000,000
Total Time: 75 days (December 14, 2023 - February 27, 2024)
CPU: 2 x AMD Epyc 9754 Bergamo (256 cores, SMT off)
Memory: 1.5 TB DDR5
Storage (Swap): 30 x Solidigm P5316
Storage (Digit Output): 6 x Solidigm P5316
OS: Windows Server 2022 (21H2)
Software: y-cruncher v0.8.3.9532 (with developer fixes)
Validation File: Validation File
Verification (x2):
Total Time: 4 days (January 2, 2024 - January 6, 2024)
CPU: Intel Core i9 7900X @ 3.6 GHz (AVX512)
Memory: 128 GB DDR4
OS: Windows 10 (22H2)
Software: y-cruncher v0.8.4 (early development build)
Screenshot: Screenshot
The Original Plan:
As stated, the goal was to repeat last year's 100 trillion digit speedrun, but with better hardware and better software:
| Last Year's 100 Trillion Digit Speedrun | This 105 Trillion Digit Record Run |
CPU: | 2 x AMD Epyc 9654 (Genoa), 192 Cores | 2 x AMD Epyc 9754 (Bergamo), 256 Cores |
Memory: | 1.5 TB DDR5 | 1.5 TB DDR5 |
Storage: | 19 x Solidigm P5316 | 36 x Solidigm P5316 |
Software: | y-cruncher v0.7.10.9513 | y-cruncher v0.8.3.9532 + fixes |
The main improvements were the larger number of cores, greater storage bandwidth, and all the recent improvements to y-cruncher including the SSD ones.
This was supposed to only take a month, but as mentioned, things did not go as planned and it ended up dragging out for more than 2 months - longer than last year's computation.
Below Expected Performance:
A couple weeks into the computation, the first thing that began to go awry was the performance.
Poor Sustained I/O Bandwidth (blue = read, purple = write)
Non-direct Attached Storage:
The first problem was the indirect attached storage. In last year's computation, all 19 SSDs were direct attached to the motherboard. In this computation, 30 SSDs were used for swap of which 6 were direct attached while the remaining 24 were external and connected via a pair of PCIe connections. It is these PCIe connections that became the bottleneck.
The result is that 6 of the SSDs were very fast while the remaining 24 were slow. Thus on every burst of I/O, the bandwidth would start fast. Once the 6 fast SSDs finished, the remaining 24 were still running at a much lower speed.
Write bandwidth was also poor due to QLC SSDs having inconsistent performance depending on the state of their SLC cache. So on every burst of writes, there would always be 1 or 2 SSDs which take much longer to finish than the rest (i.e. "stragglers"). And the nature of RAID-0 is that performance is bottlenecked by the slowest drive.
It's unclear if this is a regression from last year's run, but it certainly didn't show up during the pre-run I/O benchmarks when the SSDs were mostly empty and had a clean cache state.
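The straggler effect described above can be put into numbers with a small sketch. The drive speeds and burst size below are hypothetical placeholders chosen for easy arithmetic, not measurements from this run, and the function names are made up for illustration. The model assumes the burst is striped evenly, so the burst finishes only when the slowest drive does.

```python
def raid0_burst_time(burst_gb, drive_gbps):
    """Seconds for a RAID-0 burst striped evenly across all drives.

    Every drive receives an equal share, and the burst is only done
    once the slowest drive has finished its share.
    """
    share_gb = burst_gb / len(drive_gbps)
    return max(share_gb / bandwidth for bandwidth in drive_gbps)


def effective_bandwidth(burst_gb, drive_gbps):
    """Array-level throughput actually delivered for one burst (GB/s)."""
    return burst_gb / raid0_burst_time(burst_gb, drive_gbps)


# Hypothetical speeds (assumptions, not measured values):
# 6 direct-attached drives at 6 GB/s, 24 external drives at 1 GB/s each.
drives = [6.0] * 6 + [1.0] * 24

print(raid0_burst_time(300.0, drives))     # 10.0 seconds
print(effective_bandwidth(300.0, drives))  # 30.0 GB/s delivered
print(sum(drives))                         # 60.0 GB/s nominal aggregate
```

In this toy setup the array delivers half of its nominal aggregate bandwidth, because the 6 fast drives sit idle for most of each burst.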
Amdahl's Law and Zen4 Hazard:
The second problem was the Amdahl's Law and Zen4 Hazard problem. While we knew this would be a problem before starting the run, the fix for it was not going to be ready for months. But since this problem has already existed for years, it isn't considered a regression from last year's 100 trillion digit speedrun. But it did contribute to missing the original 1 month target.
As of today, we're unable to quantify how much it slowed down the computation. But empirically speaking, whenever Jordan checked on the computation, it was showing 0% CPU+disk utilization more often than not. And unlike Emma's records with Google a few years back, there was no logging of CPU and disk utilization. So we don't have the data to analyze.
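For readers unfamiliar with the term, Amdahl's Law bounds the speedup on n workers by 1 / ((1 - p) + p / n), where p is the fraction of the work that runs in parallel. The sketch below only illustrates that formula. The parallel fractions are made-up values, not measurements of y-cruncher, and `amdahl_speedup` is a name chosen for this example.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Illustrative parallel fractions (assumptions) on a 256-core machine.
for p in (0.90, 0.99, 0.999):
    print(f"p = {p}: at most {amdahl_speedup(p, 256):.1f}x on 256 cores")
```

Even at 99% parallel, the bound on 256 cores is only about 72x, which is why small serial sections become more costly as the core count grows.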
Suboptimal Algorithm Selection:
The last performance issue showed up late in the computation and is a bug in y-cruncher. After I was given remote access to the machine and was able to monitor the computation more closely, I noticed that it was often choosing very poor algorithm parameters. The result was a noticeable slow-down on some of the larger operations. While it's also difficult to quantify its effect on the whole computation, it definitely slowed down the final division and square root steps by around 50%.
This bug was then fixed for v0.8.4. But I decided not to apply it to this computation since it was nearing completion and it was safer to not touch it any more than was absolutely necessary.
In the end, it's difficult to know how much the computation was slowed down by all these issues. At least not without rerunning it with all the software fixes and hardware adjustments in place. But of course the problems didn't stop with just poor performance...
y-cruncher crashes with no error message.
Late Computation Crash:
Near the end of the computation, the program mysteriously crashed without an error message. When it was restarted from the last save point, it crashed again at the same place a day later. So the alarm bells started going off because it's clear at this point that there's an issue with the program.
To make matters worse:
The vast majority of subtle hardware errors do not crash the program. Instead they trigger redundancy check failures which always get printed on the screen. But there was nothing. The application process literally just died.
But there was one thing I remembered: In Windows, crash handlers do not work if the crash is from heap corruption. Thus if y-cruncher crashed due to heap corruption, the crash handler that creates its own minidump does not run. Furthermore, early in January, I fixed a bug for v0.8.4 that could cause y-cruncher to crash if an error occurred at the right time.
When y-cruncher encounters an error, it throws an exception which is supposed to propagate up the execution stack until it is caught at a higher level and printed out. What was happening instead was an async object lifetime bug. The exception causes a premature stack unwind on the thread that owns an object. So the object gets destructed while other threads are still using it. (yeah, typical C++ bs) Thus the use-after-free would corrupt the heap and crash the program - bypassing the crash handler and terminating the program before it has a chance to print the error message.
Cross-checking the exact point of the crash confirms that it was in the right place where this could happen. So that probably explains the silent crash.
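One common way to avoid this class of lifetime bug is to make sure every worker has finished before the owning scope releases the shared object, even when an exception is unwinding. The sketch below is only a structural analog in Python, which cannot reproduce the heap corruption itself. `ScratchBuffer`, `compute`, and the release flag are hypothetical names for illustration and are not taken from y-cruncher.

```python
import threading


class ScratchBuffer:
    """Stand-in for an object owned by one thread and used by workers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._released = False
        self.writes = 0

    def write(self):
        with self._lock:
            if self._released:
                # In C++ this would be a silent use-after-free.
                raise RuntimeError("buffer used after release")
            self.writes += 1

    def release(self):
        with self._lock:
            self._released = True


def compute(fail, worker_errors, num_workers=4, writes_per_worker=1000):
    """Run workers against a buffer owned by this function's scope."""
    buf = ScratchBuffer()

    def worker():
        try:
            for _ in range(writes_per_worker):
                buf.write()
        except RuntimeError as exc:
            worker_errors.append(exc)

    started = []
    try:
        for _ in range(num_workers):
            thread = threading.Thread(target=worker)
            thread.start()
            started.append(thread)
        if fail:
            # Simulated error on the owning thread while workers run.
            raise ValueError("Coefficient is too large")
    finally:
        # Join first, release second: the owner never tears down the
        # buffer while a worker can still touch it.
        for thread in started:
            thread.join()
        buf.release()
    return buf.writes
```

Because the join happens in the `finally` block, the original error still propagates to the caller and can be printed, and no worker ever sees a released buffer.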
But what was the original exception? Obviously I wouldn't know until I backported the fix from v0.8.4 to v0.8.3 and reran the computation to the failure spot and (hopefully) have an error message to see this time.
(By this point, I had been given full remote access to the machine to do whatever I needed with the machine and the (incomplete) computation. So in addition to the patched binary, the program was now running under a debugger. Because yes, we're going to compute 105 trillion digits of Pi on Windows through the Visual Studio IDE in case of further shenanigans.)
But based on the last printed status line, I had a rough idea of where the program was during the error. Thus I kind of already knew what the error message would be. (The dreaded "Coefficient is too large" error.) So I wasn't going to wait a day for the program to hit the error again and troll me with the same message that it trolls overclockers with bad ram. Thus I started looking into the code near the crash and the likely sources of error. While I wasn't expecting to find the actual error, I was at least looking for places to add debugging code.
Then after going through many spots, mentally injecting errors, and simulating how the program would fail, I found one spot that matched what was happening. It was deep inside the AVX512 codepath of the N63 multiply algorithm.
Without getting into the details of the code and the algorithm that it implemented (since that would require a multi-page blog to cover), there was some floating-point arithmetic that was incorrectly written in a way that led to loss of precision due to destructive cancellation. It was so bad that I still don't understand how it ever worked at all. Yet the code ran correctly in my manual tests, passed weeks and weeks of unit and integration tests, and even passed a bunch of Jordan's recent large computations of other constants. And as luck would have it, it only blew up on a 40 trillion digit multiply while attempting a Pi record.
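To give a feel for this kind of precision loss, here is a minimal generic sketch of cancellation in double precision. It is not the N63 code, and the function names are made up for illustration. Two algebraically identical expressions give different answers because the small term is rounded away before the large terms cancel.

```python
from fractions import Fraction


def naive(a, b, c):
    """Evaluate (a + b) - c left to right in double precision."""
    return (a + b) - c


def reordered(a, b, c):
    """Same value on paper, but cancel the large terms first."""
    return (a - c) + b


def exact(a, b, c):
    """Reference result using exact rational arithmetic."""
    return Fraction(a) + Fraction(b) - Fraction(c)


big = 2.0 ** 53  # above this, doubles can no longer represent every integer

print(naive(big, 1.0, big))      # 0.0 (the 1.0 was rounded away)
print(reordered(big, 1.0, big))  # 1.0
print(exact(big, 1.0, big))      # 1
```

The naive version loses the entire result, while nothing about the inputs looks suspicious. Whether an expression is safe depends on the magnitudes it is fed, which is why a change in preconditions can break code that was previously correct.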
Fast forward 2 more days (with further delays caused by the system getting knocked offline by someone physically bumping into it), the program (with a fix implemented) successfully ran through the failure spot. From there, the rest of the computation finished without further incidents and the digits verified correct.
So this bug I found by visual inspection indeed was the bug that was causing all of this. Had this not been the case, things could've gone really bad. 24 hour repro time. $100k+ of hardware being tied up. Half a million lines of code.
Post Mortem:
So what happened? How did I screw this up?
Well... Move fast and break things... and get very unlucky.
Last year, ~70% of the code relevant to Pi was rewritten. And it all happened in a relatively short amount of time (140,000 lines of code over 9 months).
Normally, development is slow and I spend a lot of time validating what I do before moving on. Then there is the implicit (and outsourced) testing of just letting the program soak in the public for a long time while making incremental improvements. Each time someone submits a large run (or a Pi record) it would basically serve as extra validation of what has been done so far.
None of this happened last year as basically the entire program was replaced all at once. And because the task was so big, I started getting impatient to see results. So that meant cutting corners with some of the manual validation that normally happens during incremental development. In other words, the rewrites of v0.8.x lacked both the internal and public testing that y-cruncher normally goes through and became too reliant on internal automated testing.
In the end, quite a few bugs slipped through - 3 of which showed up in this computation:
The N63 multiply bug is the rather unfortunate one as the relevant code was copy/pasted from elsewhere in the program where it had already been proven correct. But in my haste to get things done, I had not realized that the preconditions had changed in a way that invalidated both the proof and the code itself.
As luck would have it, the code only breaks on sizes far beyond the scope of the automated tests and far beyond what my own hardware is even capable of. So once the bug was in and I had already moved on to other things, it had no chance of being caught. So someone would've had to find it for me. And that ended up being this 105 trillion digit Pi record attempt.
All 3 bugs were part of the newly rewritten code. The Amdahl's Law/Zen4 Hazard predated the rewrite and is not considered a bug as the code was behaving as originally intended. All of these (including the Amdahl's Law/Zen4 Hazard) have been fixed in the latest release (v0.8.4). So hopefully stability will improve going forward.
You could argue that "move fast and break things" is the right way to approach things that don't involve human safety. And I would agree - except that I don't have the resources to clean up the resulting mess when things break at such scales. It only worked out this time because Jordan, StorageReview, and the relevant hardware sponsors had the patience to let me use their hardware to sort everything out.
y-cruncher has been used to set a number of world record sized computations.
Blue: Current World Record
Green: Former World Record
Red: Unverified computation. Does not qualify as a world record until verified using an alternate formula.
Date Announced: | Date Completed: | Source: | Who: | Constant: | Decimal Digits: | Time: | Computer: |
March 14, 2024 | February 27, 2024 | Source | Jordan Ranous, Kevin O’Brien, Brian Beeler (StorageReview) | Pi | 105,000,000,000,000 | | 2 x AMD Epyc 9754 |
February 13, 2024 | February 12, 2024 | | Jordan Ranous | Log(2) | 3,000,000,000,000 | | 2 x Intel Xeon Platinum 8460H 512 GB |
January 17, 2024 | January 10, 2023 | | Mamdouh Barakat | Zeta(5) | 250,000,000,000 | Not Verified | Intel Xeon Gold 5412U 125 GB |
January 17, 2024 | December 12, 2023 | | Jordan Ranous | Gamma(1/4) | 1,000,000,000,000 | | 2 x Intel Xeon Platinum 8450H 512 GB |
December 26, 2023 | December 24, 2023 | | Jordan Ranous | e | 35,000,000,000,000 | | 2 x Intel Xeon Platinum 8460H |
December 26, 2023 | December 25, 2023 | | Jordan Ranous | Square Root of 2 | 20,000,000,000,000 | | Intel Xeon Platinum 8450H 512 GB Intel Xeon Platinum 8460H 512 GB |
December 26, 2023 | December 22, 2023 | | Andrew Sun | Zeta(3) - Apery's Constant | 2,020,569,031,595 | Compute: 5.61 days | Intel Xeon Platinum 8347C 505 GB Intel Xeon Platinum 8347C 507 GB |
December 18, 2023 | December 15, 2023 | | Jordan Ranous | Gamma(1/3) | 1,000,000,000,000 | | 2 x Intel Xeon Platinum 8450H 512 GB |
December 18, 2023 | December 11, 2023 | | Jordan Ranous | Zeta(5) | 201,000,001,300 | | 2 x AMD EPYC 9754 1.5 TB |
December 2, 2023 | November 27, 2023 | | Jordan Ranous | Golden Ratio | 20,000,000,000,000 | | AMD Epyc 9654 - 1.5 TB Intel Xeon Platinum 8450H |
September 9, 2023 | September 7, 2023 | | Andrew Sun | Euler-Mascheroni Constant | 1,337,000,000,000 | | Intel Xeon Platinum 83470C 400 GB |
July 17, 2022 | July 15, 2022 | | Seungmin Kim | Lemniscate | 1,200,000,000,100 | | 2 x Intel Xeon Gold 6140 |
June 8, 2022 | March 21, 2022 | | Emma Haruka Iwao | Pi | 100,000,000,000,000 | | 128 vCPU Intel Ice Lake (GCP) |
March 14, 2022 | March 9, 2022 | | Seungmin Kim | Catalan's Constant | 1,200,000,000,100 | Compute: 48.6 days | 2 x Intel Xeon Gold 6140 |
August 17, 2021 | August 14, 2021 | Source | UAS Grisons | Pi | 62,831,853,071,796 | Compute: 108 days, Verify: 34.4 hours | AMD Epyc 7542 1 TB 34 + 4 Hard Drives |
September 13, 2020 | September 6, 2020 | | Seungmin Kim | Log(10) | 1,200,000,000,100 | | 2 x Intel Xeon E5-2699 v3 756 GB 2 x Intel Xeon Gold 5220 754 GB |
January 29, 2020 | January 29, 2020 | Blog | Timothy Mullican | Pi | 50,000,000,000,000 | | 4 x Intel Xeon E7-4880 v2 315 GB 48 Hard Drives |
March 14, 2019 | January 21, 2019 | Blogs | Emma Haruka Iwao | Pi | 31,415,926,535,897 | Compute: 121 days | 2 x Undisclosed Intel Xeon > 1.40 TB DDR4 > 240 TB SSD |
November 15, 2016 | November 11, 2016 | Blog, Sponsor | Peter Trueb | Pi | 22,459,157,718,361 | Compute: 105 days | 4 x Xeon E7-8890 v3 1.25 TB DDR4 20 x 6 TB 7200 RPM Seagate |
October 8, 2014 | October 7, 2014 | | Sandon Van Ness (houkouonchi) | Pi | 13,300,000,000,000 | | 2 x Xeon E5-4650L 192 GB DDR3 @ 1333 MHz 24 x 4 TB + 30 x 3 TB |
December 28, 2013 | December 28, 2013 | Source | Shigeru Kondo | Pi | 12,100,000,000,050 | | 2 x Xeon E5-2690 128 GB DDR3 @ 1600 MHz 24 x 3 TB |
See the complete list including other notably large computations. If you want to set a record yourself, the rules are in that link.
The main computational features of y-cruncher are:
Latest Releases: (February 22, 2024)
Downloading any of these files constitutes acceptance of the license agreement.
OS | Download Link | Size |
Windows | | 35.0 MB |
Linux (Static) | | 26.7 MB |
Linux (Dynamic) | | 19.0 MB |
Downloads can also be found on GitHub. Use this if you prefer HTTPS.
The Linux version comes in both statically and dynamically linked versions. The static version should work on most Linux distributions, but lacks TBB and NUMA binding. The dynamic version supports all features, but is less portable due to the DLL dependency hell.
HWBOT submission is back with this release. So I expect the leaderboards to be rewritten soon.
System Requirements:
Windows:
- Windows 7 or later.
- The HWBOT submitter requires the Java 8 Runtime.
Linux:
- 64-bit Linux is required. There is no support for 32-bit.
- The dynamic version has been tested on Ubuntu 22.04.
All Systems:
- An x86 or x64 processor.
Very old systems that don't meet these requirements may be able to run older versions of y-cruncher. Support goes all the way back to even before Windows XP.
Version History:
Other Downloads (for C++ programmers):
Advanced Documentation:
Comparison Chart: (Last updated: July 11, 2023)
Computations of Pi to various sizes. All times in seconds. All computations done entirely in ram.
The timings include the time needed to convert the digits to decimal representation, but not the time needed to write out the digits to disk.
Blue: Benchmarks are up-to-date with the latest version of y-cruncher.
Green: Benchmarks were done with an old version of y-cruncher that is comparable in performance with the current release.
Red: Benchmarks are significantly out-of-date due to being run with an old version of y-cruncher that is no longer comparable with the current release.
Purple: Benchmarks are from unreleased internal builds that are not speed comparable with the current release.
Laptops + Low-Power:
Processor(s): | Core i7 6820HK | Core i7 11800H | Core i7 11800H |
Generation: | Intel Skylake | Intel Tiger Lake | Intel Tiger Lake |
Cores/Threads: | 4/8 | 8/16 | 8/16 |
Processor Speed: | 3.2 GHz (stock) | ~2.5 GHz (45W PL) | ~3.0 GHz (60W PL) |
Memory: | 64 GB @ 2133 MT/s | 64 GB @ 3200 MT/s | 64 GB @ 3200 MT/s |
Version: | v0.8.1 (14-BDW) | v0.8.1 (18-CNL) | v0.8.1 (18-CNL) |
Instruction Set: | x64 AVX2 + ADX | x64 AVX512-VBMI | x64 AVX512-VBMI |
25,000,000 | 1.500 | 0.655 | 0.530 |
50,000,000 | 3.307 | 1.406 | 1.125 |
100,000,000 | 7.238 | 3.005 | 2.447 |
250,000,000 | 20.596 | 8.576 | 6.855 |
500,000,000 | 45.967 | 19.747 | 15.356 |
1,000,000,000 | 102.885 | 42.727 | 34.308 |
2,500,000,000 | 290.824 | 123.523 | 96.918 |
5,000,000,000 | 640.506 | 247.705 | 218.782 |
10,000,000,000 | 1,391.204 | 526.212 | 480.197 |
Credit: |
Processor(s): | Core i3 8121U | Core i7 11800H | ||||
Generation: | Intel Cannon Lake | Intel Tiger Lake | ||||
Cores/Threads: | 2/4 | 8/16 | ||||
Processor Speed: | ~2.5 - 3.2 GHz (stock) | ~2.5 - 2.8 GHz (45W PL) | ||||
Memory: | 8 GB @ 2400 MT/s | 64 GB @ 3200 MT/s | ||||
Version: | v0.8.1 (14-BDW) | v0.8.1 (17-SKX) | v0.8.1 (18-CNL) | v0.8.1 (14-BDW) | v0.8.1 (17-SKX) | v0.8.1 (18-CNL) |
Instruction Set: | x64 AVX2 + ADX | x64 AVX512-DQ | x64 AVX512-VBMI | x64 AVX2 + ADX | x64 AVX512-DQ | x64 AVX512-VBMI |
25,000,000 | 2.857 | 2.467 | 1.988 | 0.907 | 0.853 | 0.655 |
50,000,000 | 6.446 | 5.501 | 4.392 | 2.075 | 1.862 | 1.406 |
100,000,000 | 14.335 | 12.257 | 9.490 | 4.176 | 3.749 | 3.005 |
250,000,000 | 42.566 | 36.204 | 27.137 | 12.014 | 10.705 | 8.576 |
500,000,000 | 99.040 | 85.443 | 64.359 | 28.805 | 24.123 | 19.747 |
1,000,000,000 | 228.863 | 198.405 | 151.605 | 63.898 | 55.264 | 42.727 |
2,500,000,000 | 187.882 | 148.423 | 123.523 | |||
5,000,000,000 | 375.130 | 327.776 | 247.705 | |||
10,000,000,000 | 794.573 | 709.606 | 526.212 | |||
Credit: |
Mainstream Desktops:
Processor(s): | Ryzen 5 7600 | Core i7 11700K | Ryzen 9 3950X | Ryzen 9 5950X | Intel Core i9 13900KS | Ryzen 9 7950X | |
Generation: | AMD Zen 4 | Intel Rocket Lake | AMD Zen 2 | AMD Zen 3 | Intel Raptor Lake | AMD Zen 4 | |
Cores/Threads: | 6/12 | 8/16 | 16/32 | 16/32 | 24/32 | 16/32 | |
Processor Speed: | stock | stock | stock | 5.7/4.5 GHz | stock | ||
Memory: | 32 GB | 32 GB - 3200 MT/s | 128 GB - 2666 MT/s | 64 GB - 3200 MT/s | 96 GB - 8000 MT/s | 128 GB - 4400 MT/s | 128 GB - 5200 MT/s |
Program Version: | v0.8.1 (22-ZN4) | v0.8.1 (18-CNL) | v0.8.1 (19-ZN2) | v0.8.1 (19-ZN2) | v0.8.1 (14-BDW) | v0.8.1 (22-ZN4) | |
Instruction Set: | x64 AVX512-GFNI | x64 AVX512-VBMI | x64 AVX2 + ADX | x64 AVX2 + ADX | x64 AVX2 + ADX | x64 AVX512-GFNI | |
25,000,000 | 0.439 | 0.501 | 0.588 | 0.490 | 0.241 | 0.312 | 0.307 |
50,000,000 | 1.114 | 1.257 | 1.090 | 0.525 | 0.679 | 0.654 | |
100,000,000 | 2.223 | 2.685 | 2.345 | 1.132 | 1.517 | 1.410 | |
250,000,000 | 6.220 | 7.251 | 6.371 | 3.185 | 4.157 | 3.820 | |
500,000,000 | 13.378 | 13.573 | 15.556 | 13.395 | 7.065 | 8.883 | 8.062 |
1,000,000,000 | 29.497 | 30.415 | 33.925 | 29.301 | 15.901 | 18.542 | 17.039 |
2,500,000,000 | 83.421 | 86.119 | 96.695 | 82.204 | 44.888 | 50.743 | 46.467 |
5,000,000,000 | 181.647 | 193.718 | 215.333 | 181.355 | 99.566 | 110.379 | 101.345 |
10,000,000,000 | 473.958 | 399.012 | 241.162 | 220.522 | |||
25,000,000,000 | 1,361.732 | 680.344 | 623.493 | ||||
Credit: | Joel Rufin | Oliver Kruse | | Oliver Kruse | 曾 铮 |
Processor(s): | Core i7 920 | FX-8350 | Core i7 4770K | Ryzen 7 1800X | Ryzen 7 3800X |
Generation: | Intel Nehalem | AMD Piledriver | Intel Haswell | AMD Zen 1 | AMD Zen 2 |
Cores/Threads: | 4/8 | 8/8 | 4/8 | 8/16 | 8/16 |
Processor Speed: | 3.5 GHz | stock | 4.0 GHz | stock | stock |
Memory: | 12 GB - 1333 MT/s | 32 GB - 1600 MT/s | 32 GB - 2133 MT/s | 64 GB - 2866 MT/s | 32 GB - 3600 MT/s |
Program Version: | v0.8.1 (08-NHM) | v0.8.1 (11-BD1) | v0.8.1 (13-HSW) | v0.8.1 (17-ZN1) | v0.8.1 (19-ZN2) |
Instruction Set: | x64 SSE4.1 | x64 FMA4 | x64 AVX2 | x64 AVX2 + ADX | x64 AVX2 + ADX |
25,000,000 | 7.032 | 3.677 | 1.546 | 1.150 | 0.654 |
50,000,000 | 17.174 | 7.703 | 3.259 | 2.527 | 1.415 |
100,000,000 | 36.164 | 16.576 | 6.987 | 5.555 | 3.028 |
250,000,000 | 105.789 | 46.597 | 19.588 | 15.760 | 8.404 |
500,000,000 | 236.096 | 103.165 | 43.197 | 34.659 | 18.440 |
1,000,000,000 | 531.676 | 230.780 | 96.845 | 78.690 | 41.097 |
2,500,000,000 | 669.594 | 274.336 | 220.278 | 117.788 | |
5,000,000,000 | 1,460.714 | 606.605 | 493.388 | 266.719 | |
10,000,000,000 | 1,078.187 | ||||
25,000,000,000 | |||||
Credit: | Oliver Kruse |
High-End Desktops:
Processor(s): | Core i7 5960X | Threadripper 1950X | Core i9 7900X | Core i9 7940X | Threadripper 3990X | Xeon W7-2495X | Xeon W9-3475X |
Generation: | Intel Haswell | AMD Zen 1 | Intel Skylake X | Intel Skylake X | AMD Zen 2 | Intel Sapphire Rapids | Intel Sapphire Rapids |
Cores/Threads: | 8/16 | 16/32 | 10/20 | 14/28 | 64/128 | 24/48 | 36/72 |
Processor Speed: | 4.0 GHz | stock | ~3.6 GHz (200W PL) | 3.6 GHz (AVX512) | 2.9 GHz | 4.1-4.9 GHz | 4.2-4.9 GHz |
Memory: | 64 GB - 2400 MT/s | 64 GB - 2800 MT/s | 128 GB - 3000 MT/s | 128 GB - 3466 MT/s | ~141 GB - 2666 MT/s | 64 GB - 6400 MT/s | 128 GB - 6400 MT/s |
Program Version: | v0.8.1 (13-HSW) | v0.8.1 (17-ZN1) | v0.8.1 (17-SKX) | v0.8.1 (17-SKX) | v0.8.1 (19-ZN2) | v0.8.1 (18-CNL) | v0.8.3 (18-CNL) |
Instruction Set: | x64 AVX2 | x64 AVX2 + ADX | x64 AVX512-DQ | x64 AVX512-DQ | x64 AVX2 + ADX | x64 AVX512-VBMI | x64 AVX512-VBMI |
25,000,000 | 0.807 | 0.756 | 0.522 | 0.404 | 0.584 | 0.170 | 0.201 |
50,000,000 | 1.743 | 1.579 | 1.028 | 0.721 | 1.181 | 0.340 | 0.321 |
100,000,000 | 3.647 | 3.273 | 2.048 | 1.451 | 2.409 | 0.726 | 0.586 |
250,000,000 | 10.088 | 8.990 | 5.752 | 4.056 | 5.724 | 2.068 | 1.413 |
500,000,000 | 22.075 | 19.604 | 12.830 | 9.017 | 10.881 | 4.588 | 2.627 |
1,000,000,000 | 49.232 | 43.014 | 28.906 | 20.518 | 21.496 | 10.190 | 5.924 |
2,500,000,000 | 139.404 | 121.645 | 82.764 | 60.636 | 58.009 | 28.881 | 16.345 |
5,000,000,000 | 311.388 | 271.983 | 186.233 | 137.906 | 126.513 | 64.158 | 36.139 |
10,000,000,000 | 669.736 | 613.450 | 401.820 | 302.121 | 274.050 | 124.826 | 78.816 |
25,000,000,000 | 1,125.775 | 843.498 | 768.212 | 225.482 | |||
Credit: | Oliver Kruse | Paul Underwood | 曾 铮 |
Multi-Processor Workstation/Servers:
Due to high core count and the effect of NUMA (Non-Uniform Memory Access), performance on multi-processor systems is extremely sensitive to various settings. Therefore, these benchmarks may not be entirely representative of what the hardware is capable of.
Processor(s): | Xeon Platinum 8375C (AWS x2iedn.32xlarge) | Xeon Platinum 8488C (AWS m7i.48xlarge) | Epyc 9R14 (AWS m7a.48xlarge) | Epyc 9R14 (AWS hpc7a.96xlarge) | Epyc 9754 | |
Generation: | Intel Sapphire Rapids | Intel Sapphire Rapids | AMD Genoa | AMD Bergamo | ||
Cores/Threads: | 64/128 | 96/192 | 192/192 | 128/256 | 128/128 | |
Processor Speed: | 2.9 GHz | 2.4 GHz | 2.6 GHz | 2.25 - 3.1 GHz | ||
Memory: | 4 TB | 744 GB | 740 GB | 768 GB - 4800 MT/s | ||
Program Version: | v0.8.1 (18-CNL) | v0.8.1 (18-CNL) | v0.8.1 (22-ZN4) | v0.8.1 (22-ZN4) | ||
Instruction Set: | x64 AVX512-VBMI | x64 AVX512-VBMI | x64 AVX512-GFNI | x64 AVX512-GFNI | ||
25,000,000 | 0.250 | 0.163 | 0.216 | 0.213 | 0.245 | 0.229 |
50,000,000 | 0.454 | 0.289 | 0.285 | 0.279 | 0.350 | 0.433 |
100,000,000 | 0.844 | 0.531 | 0.642 | 0.635 | 0.853 | 0.876 |
250,000,000 | 1.976 | 1.288 | 1.776 | 1.716 | 2.224 | 2.133 |
500,000,000 | 3.794 | 2.499 | 3.728 | 3.621 | 4.186 | 3.850 |
1,000,000,000 | 7.650 | 5.149 | 6.547 | 6.265 | 7.063 | 6.495 |
2,500,000,000 | 20.425 | 13.633 | 13.554 | 12.500 | 15.338 | 14.477 |
5,000,000,000 | 45.675 | 29.655 | 25.334 | 22.377 | 29.072 | 28.133 |
10,000,000,000 | 101.468 | 64.026 | 51.134 | 44.059 | 58.797 | 59.007 |
25,000,000,000 | 297.622 | 182.920 | 140.286 | 120.282 | 156.797 | 164.281 |
50,000,000,000 | 678.016 | 410.842 | 321.970 | 275.297 | 350.391 | 368.548 |
100,000,000,000 | 1,549.991 | 943.182 | 771.266 | 672.558 | 829.957 | 853.717 |
250,000,000,000 | 4,488.317 | |||||
500,000,000,000 | 9,685.971 | |||||
Credit: | Greg Hogan | Tim Wesley |
Processor(s): | Xeon Platinum 8124M | Xeon Gold 6148 | Xeon Platinum 8175M | Xeon Platinum 8275CL | Epyc 7742 | Epyc 7B12 | Epyc 7742 |
Generation: | Intel Skylake Purley | Intel Skylake Purley | Intel Skylake Purley | Intel Cascade Lake | AMD Rome | AMD Rome | AMD Rome |
Sockets/Cores/Threads: | 2/36/72 | 2/40/40 | 2/48/96 | 2/48/96 | 2/128/256 | 2/112/224 | 2/128/256 |
Processor Speed: | 3.0 GHz | 2.4 GHz | 2.5 GHz | 3.0 GHz | 2.25 GHz | 2.25 GHz | |
Memory: | 137 GB - ?? | 188 GB - ?? | ~756 GB - ?? | 192 GB | ~504 GB | ~882 GB | 2 TB |
Program Version: | v0.7.5 (17-SKX) | v0.7.6 (17-SKX) | v0.7.6 (17-SKX) | v0.7.8 (17-SKX) | v0.7.7 (17-ZN1) | v0.7.8 (19-ZN2) | v0.7.8 (19-ZN2) |
Instruction Set: | x64 AVX512-DQ | x64 AVX512-DQ | x64 AVX512-DQ | x64 AVX512-DQ | x64 AVX2 + ADX | x64 AVX2 + ADX | x64 AVX2 + ADX |
25,000,000 | 0.540 | 0.329 | 0.294 | 0.283 | 0.534 | 0.439 | 0.513 |
50,000,000 | 0.981 | 0.683 | 0.617 | 0.544 | 1.027 | 0.838 | 0.920 |
100,000,000 | 1.905 | 1.456 | 1.305 | 1.169 | 2.298 | 1.796 | 1.887 |
250,000,000 | 5.085 | 3.737 | 3.591 | 3.125 | 5.854 | 4.509 | 4.650 |
500,000,000 | 10.372 | 7.750 | 7.293 | 6.309 | 10.502 | 8.196 | 8.066 |
1,000,000,000 | 21.217 | 16.550 | 15.041 | 13.042 | 17.836 | 14.252 | 13.246 |
2,500,000,000 | 55.701 | 45.693 | 39.329 | 34.028 | 35.485 | 30.592 | 27.011 |
5,000,000,000 | 118.151 | 99.078 | 83.601 | 71.777 | 62.432 | 58.405 | 49.940 |
10,000,000,000 | 247.928 | 212.984 | 176.695 | 153.169 | 115.543 | 116.900 | 98.156 |
25,000,000,000 | 599.653 | 491.988 | 425.442 | 307.995 | 314.907 | 258.081 | |
50,000,000,000 | 1,081.181 | 690.662 | 741.633 | 598.716 | |||
100,000,000,000 | 1,715.123 | 1,370.714 | |||||
250,000,000,000 | 3,872.397 | ||||||
Credit: | Jacob Coleman | Oliver Kruse | newalex | Xinyu Miao | Carsten Spille | Greg Hogan | Song Pengei |
Processor(s): | Xeon E5-2683 v3 | Xeon E7-8880 v3 | Xeon E5-2687W v4 | Xeon E5-2686 v4 | Xeon E5-2696 v4 | Epyc 7601 | Xeon Gold 6130F |
Generation: | Intel Haswell | Intel Haswell | Intel Broadwell | Intel Broadwell | Intel Broadwell | AMD Naples | Intel Skylake Purley |
Sockets/Cores/Threads: | 2/28/56 | 4/64/128 | 2/24/48 | 2/36/72 | 2/44/88 | 2/64/128 | 2/32/64 |
Processor Speed: | 2.03 GHz | 2.3 GHz | 3.0 GHz | 2.3 GHz | 2.2 GHz | 2.2 GHz | 2.1 GHz |
Memory: | 128 GB - ??? | 2 TB - ??? | 64 GB | 504 GB - ??? | 768 GB - ??? | 256 GB - ?? | 256 GB - ?? |
Program Version: | v0.6.9 (13-HSW) | v0.7.1 (13-HSW) | v0.7.6 (14-BDW) | v0.7.7 (14-BDW) | v0.7.1 (14-BDW) | v0.7.3 (17-ZN1) | v0.7.3 (17-SKX) |
Instruction Set: | x64 AVX2 | x64 AVX2 | x64 AVX2 + ADX | x64 AVX2 + ADX | x64 AVX2 + ADX | x64 AVX2 + ADX | x64 AVX512-DQ |
25,000,000 | 0.907 | 1.176 | 0.490 | 0.494 | 0.715 | 2.459 | 1.150 |
50,000,000 | 1.745 | 2.321 | 1.072 | 0.982 | 1.344 | 4.347 | 1.883 |
100,000,000 | 3.317 | 4.217 | 2.303 | 2.193 | 2.673 | 6.996 | 3.341 |
250,000,000 | 8.339 | 8.781 | 6.196 | 6.044 | 6.853 | 14.258 | 7.731 |
500,000,000 | 17.708 | 15.879 | 13.046 | 12.582 | 14.538 | 24.930 | 15.346 |
1,000,000,000 | 37.311 | 32.078 | 27.763 | 26.852 | 31.260 | 47.837 | 31.301 |
2,500,000,000 | 102.131 | 78.251 | 76.202 | 73.596 | 84.271 | 111.139 | 82.871 |
5,000,000,000 | 218.917 | 164.157 | 165.046 | 160.094 | 192.889 | 228.252 | 179.488 |
10,000,000,000 | 471.802 | 346.307 | 356.487 | 346.305 | 417.322 | 482.777 | 387.530 |
25,000,000,000 | 1,511.852 | 957.966 | 1,006.131 | 980.784 | 1,186.881 | 1,184.144 | 1,063.850 |
50,000,000,000 | 2,096.169 | 2,202.558 | 2,156.854 | 2,601.476 | |||
100,000,000,000 | 4,442.742 | 6,037.704 | |||||
250,000,000,000 | 17,428.450 | ||||||
Credit: | Shigeru Kondo | Jacob Coleman | Cameron Giesbrecht | newalex | "yoyo" | Dave Graham |
The full chart of rankings for each size can be found here:
These fastest times may include unreleased betas.
Got a faster time? Let me know: a-yee@u.northwestern.edu
Note that I usually do not respond to these emails. I simply put them into the charts which I update periodically (typically within 2 weeks).
Decimal Digits of Pi - Times in Seconds
Core i9 7940X @ 3.7 GHz AVX512
Memory Frequency: | 2666 MT/s | 3466 MT/s |
25,000,000 | 0.839 | 0.758 |
50,000,000 | 1.424 | 1.338 |
100,000,000 | 2.701 | 2.425 |
250,000,000 | 6.489 | 5.877 |
500,000,000 | 13.307 | 11.917 |
1,000,000,000 | 27.913 | 24.915 |
2,500,000,000 | 76.837 | 68.322 |
5,000,000,000 | 168.058 | 148.737 |
10,000,000,000 | 365.047 | 322.115 |
25,000,000,000 | 1,037.527 | 916.039 |
High core count Skylake X processors are known to be heavily bottlenecked by memory bandwidth.
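As a rough illustration of that sensitivity, the sketch below computes the speedup from the two memory settings in the table above. The timings are copied from that table. Reading the whole gap as a memory-bandwidth effect is an assumption, since nothing else about the two runs is documented here, and `speedups` is a helper name invented for this example.

```python
# Seconds per computation on the Core i9 7940X, copied from the table:
# digits -> (time at 2666 MT/s, time at 3466 MT/s)
timings = {
    25_000_000: (0.839, 0.758),
    50_000_000: (1.424, 1.338),
    100_000_000: (2.701, 2.425),
    250_000_000: (6.489, 5.877),
    500_000_000: (13.307, 11.917),
    1_000_000_000: (27.913, 24.915),
    2_500_000_000: (76.837, 68.322),
    5_000_000_000: (168.058, 148.737),
    10_000_000_000: (365.047, 322.115),
    25_000_000_000: (1037.527, 916.039),
}


def speedups(table):
    """Ratio of slow-memory time to fast-memory time for each size."""
    return {digits: slow / fast for digits, (slow, fast) in table.items()}


frequency_ratio = 3466 / 2666

for digits, ratio in speedups(timings).items():
    print(f"{digits:>14,} digits: {ratio:.3f}x faster at 3466 MT/s")
print(f"memory frequency ratio: {frequency_ratio:.3f}x")
```

A 30% increase in memory frequency buys roughly 6% to 13% in total run time across these sizes, with the larger computations benefiting the most.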
Memory Bandwidth:
Because of the memory-intensive nature of computing Pi and other constants, y-cruncher needs a lot of memory bandwidth to perform well. In fact, the program has been noticeably memory bound on nearly all high-end desktops since 2012 as well as the majority of multi-socket systems since at least 2006.
Recommendations:
Don't be surprised if y-cruncher exposes instabilities that other applications and stress-tests do not. y-cruncher is unusual in that it simultaneously places a heavy load on both the CPU and the entire memory subsystem.
Parallel Performance:
y-cruncher has a lot of settings for tuning parallel performance. By default, it makes a best effort to analyze the hardware and pick the best settings. But because of the virtually unlimited combinations of processor topologies, it's difficult for y-cruncher to optimally pick the best settings for everything. So sometimes the best performance can only be achieved with manual settings.
*These are advanced settings that cannot be changed if you're using the benchmark option in the console UI. To change them, you will need to either run benchmark mode from the command line or use the custom compute menu.
Load imbalance is a fairly common problem in y-cruncher. The usual causes are:
Large Pages:
Large pages used to not matter in the past, but they do now in the post-Spectre/Meltdown world. Mitigations for the Meltdown vulnerability can have a noticeable performance drop for y-cruncher (up to 5% has been observed). It turns out that turning on large pages can mitigate the penalty for this mitigation. (pun intended)
Refer to the memory allocation guide on how to turn on large pages.
Swap Mode:
This is probably one of the most complicated features in y-cruncher.
Everything in this section is in the process of being re-verified and moved to: https://github.com/Mysticial/y-cruncher/issues
Performance Issues:
Pi and other Constants:
Program Usage:
Hardware and Overclocking:
Academia:
Programming:
Other:
Here are some interesting sites dedicated to the computation of Pi and other constants:
Contact me via e-mail. I'm pretty good with responding unless it gets caught in my school's junk mail filter.
You can also find me on Twitter as @Mysticial.