Article written by Savitha Pareek, Varun Bawa, & Ashish K Singh of HPC and AI Innovation Lab in June 2019
2nd Generation Intel® Xeon® Scalable processors (architecture codename: Cascade Lake) are Intel's successor to Skylake and are ready for prime time. The HPC engineering team at Dell EMC had access to a few engineering test units, and this blog presents the results of our initial benchmarking study.
The intent of this blog is to illustrate and analyze the performance obtained on the latest Intel® Xeon® Scalable family processors and compare it with that of their predecessors. We chose the STREAM, HPL and HPCG benchmarks for our analysis. The study covers performance on a single node as well as on multiple nodes. The tests were performed on Dell EMC PowerEdge C6420 (single-node study) and PowerEdge R740 (multi-node study) servers with the recommended BIOS settings for HPC workloads. Cascade Lake processors come with many enhancements, such as Intel® Deep Learning Boost (Intel DL Boost) with VNNI, higher memory bandwidth, and increased vector floating-point performance and efficiency.
Table 1: Testbed information

| Component | Single Node Configuration | Multi Node Configuration |
|---|---|---|
| Server | PowerEdge C6420 & PowerEdge R740 | PowerEdge R740 |
| Processors (Skylake) | Intel Xeon® 6142 [16C @ 2.6GHz], Intel Xeon® 6130 [16C @ 2.1GHz], Intel Xeon® 8180 [28C @ 2.5GHz] | |
| Processors (Cascade Lake) | Intel Xeon® 6242 [16C @ 2.8GHz], Intel Xeon® 6230 [20C @ 2.1GHz], Intel Xeon® 8280 [28C @ 2.7GHz] | Intel Xeon® 8268 [24C @ 2.90GHz] |

| Component | Details |
|---|---|
| Memory | Cascade Lake tests: 192GB (12 x 16GB 2933 MT/s DDR4). Skylake tests: 192GB (12 x 16GB 2933 MT/s DDR4, active at 2666 MT/s) |
| Operating System | Red Hat Enterprise Linux 7.6 |
| Kernel Version | 3.10.0-957.el7.x86_64 |
| BIOS Options | Turbo=Enabled, Logical Processor=Disabled, SubNumaCluster=Enabled, Virtualization Technology=Disabled |
| Interconnect | Intel Omni-Path with IFS 10.9.2 |
| Compiler | Intel Parallel Studio XE 2018 update 4 |

Applications

| Benchmark | Domain | Version | Test configuration |
|---|---|---|---|
| HPL | High Performance LINPACK (computational) | Intel MKL 2018 U4 | Problem size: 90% of total memory |
| HPCG | High Performance Conjugate Gradient (computational) | Intel MKL 2018 U4 | Problem size: 336 x 336 x 336 |
| STREAM | Memory bandwidth | 5.4 | Triad |
Tests were conducted to quantify performance in two cases, single node and multi-node, using the following benchmarks:
STREAM -
To measure the peak memory bandwidth of Intel Cascade Lake and Skylake processors, we chose the STREAM benchmark, the de facto industry standard in the HPC domain for measuring sustainable memory bandwidth (in GB/s). The TRIAD value is used to compare memory bandwidth.
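As a rough illustration of what the TRIAD figure represents, the sketch below shows how bandwidth is credited for one pass of the Triad kernel, a[i] = b[i] + q * c[i], which touches three arrays of double-precision values. The function name and the example numbers are hypothetical and are not measurements from this study.

```python
def triad_bandwidth_gbs(n_elements: int, seconds: float,
                        bytes_per_element: int = 8) -> float:
    """Bandwidth credited to one Triad pass, a[i] = b[i] + q * c[i].

    Triad reads arrays b and c and writes array a, so three arrays of
    n_elements each are counted as moved.
    """
    bytes_moved = 3 * bytes_per_element * n_elements
    return bytes_moved / seconds / 1e9


# Hypothetical example (not a measured result): 80 million doubles per
# array, with one Triad pass completing in 10 ms.
example_gbs = triad_bandwidth_gbs(80_000_000, 0.010)
print(f"{example_gbs:.1f} GB/s")
```

The real benchmark runs the kernel several times over arrays much larger than the caches and reports the best time, so this only illustrates the accounting.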
Figure1: STREAM – Skylake vs Cascade Lake
The maximum supported memory frequency is 2666 MT/s for Skylake and 2933 MT/s for Cascade Lake, which means a 10% higher memory frequency with Cascade Lake. As per Figure 1, Cascade Lake processors show 7-12% more memory bandwidth than Skylake. Memory bandwidth per core depends on the specific processor SKU. Because some Cascade Lake SKUs have more cores than their Skylake counterparts, the per-core memory bandwidth comparison differs from the total memory bandwidth comparison. As per Figure 1, both the 8280 and the 6242 have up to 7% higher memory bandwidth per core than their respective predecessors. However, the 6230 shows 11% less memory bandwidth per core than the 6130, because the 6230 has 25% more cores. Memory bandwidth per core can be an important factor for applications that are sensitive to memory bandwidth.
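The per-core arithmetic can be sketched as follows. The 11% total bandwidth gain used for the 6230 is an assumed illustrative value inside the 7-12% range reported above, not a figure read from Figure 1. The core counts come from Table 1.

```python
def per_core_change(total_bw_gain: float, cores_old: int, cores_new: int) -> float:
    """Relative change in memory bandwidth per core.

    total_bw_gain is the fractional gain in total bandwidth (0.11 = 11%).
    """
    return (1.0 + total_bw_gain) * cores_old / cores_new - 1.0


# Memory frequency uplift from 2666 MT/s to 2933 MT/s.
freq_gain = 2933 / 2666 - 1.0

# 6130 (16 cores) -> 6230 (20 cores), with an ASSUMED 11% total gain.
change_6230 = per_core_change(0.11, cores_old=16, cores_new=20)

print(f"frequency gain: {freq_gain:.1%}")
print(f"6230 per-core change: {change_6230:.1%}")
```

With equal core counts, as for the 8280 and 6242, the per-core change is the same as the total change.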
LINPACK -
We measured the computational capability of the processors using Intel LINPACK. The problem size (N) is chosen so that the matrix occupies 90% of system memory, and the block size (NB) is 384. This section covers both performance and scaling with Cascade Lake processors.
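A minimal sketch of one common way to derive N from a memory budget is shown below. It assumes a dense N x N matrix of 8-byte doubles, treats the 192GB of installed memory as 192 GiB, and rounds N down to a multiple of NB. The article does not state the N actually used, so the helper name and the resulting value are illustrative only.

```python
import math


def hpl_problem_size(mem_bytes: int, fraction: float = 0.90, nb: int = 384) -> int:
    """Largest N, a multiple of nb, whose N x N double matrix fits the budget."""
    # Number of 8-byte matrix elements that fit in the memory budget.
    elements = int(mem_bytes * fraction) // 8
    # Side length of the largest square matrix with that many elements.
    n = math.isqrt(elements)
    # Round down to a whole number of NB-sized blocks.
    return (n // nb) * nb


# Assumption: one node with 192 GiB of memory, as in the testbed.
n_single_node = hpl_problem_size(192 * 2**30)
print(n_single_node)
```

For a multi-node run, the same calculation would use the aggregate memory of all nodes.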
Skylake vs Cascade Lake –
Figure 2: LINPACK performance (Skylake vs Cascade Lake)
As per Figure 2, LINPACK shows a performance improvement of up to 15% with Cascade Lake processors. The comparison is based on CPU model number, pairing each Skylake processor with its successor in the Intel Xeon® Scalable family. The Intel Xeon® 6230, with 4 more cores per socket, gets a 15% boost in performance over the 6130, while the 8280 and 6242, which have the same core counts as their predecessors, gain performance from their higher CPU base frequency and higher memory bandwidth.
Multi-Node Performance - For the multi-node study, we used an 8-node cluster of PowerEdge R740 servers with Intel Xeon® 8268 processors and captured results for 1, 2, 4 and 8 nodes. The rest of the system configuration is listed in Table 1.
Figure 3: Multi-node LINPACK performance with 8268 @2.90GHz
As Figure 3 shows, LINPACK performance is 3059 GFLOPS for a single 8268 node and 23946 GFLOPS for 8 nodes, which means 7.83X scaling from 1 node to 8 nodes. Efficiency is ~69% for a single node and ~67% for 2, 4 and 8 nodes. Efficiency drops from 1 node to 2 nodes; beyond that, scaling is close to linear.
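The scaling and efficiency figures can be reproduced with the sketch below. It assumes that efficiency is measured performance divided by theoretical peak, and that peak is computed at the nominal 2.9 GHz base frequency with 32 double-precision FLOPs per cycle per core (two AVX-512 FMA units) on dual-socket, 24-core nodes. The article does not state how efficiency was calculated, so this is an assumption that happens to match the reported values.

```python
def peak_gflops(nodes: int, sockets: int = 2, cores: int = 24,
                ghz: float = 2.9, flops_per_cycle: int = 32) -> float:
    """Theoretical peak in GFLOPS (assumed: base clock, 2 AVX-512 FMA units)."""
    return nodes * sockets * cores * ghz * flops_per_cycle


def hpl_efficiency(measured_gflops: float, nodes: int) -> float:
    """Measured HPL performance as a fraction of theoretical peak."""
    return measured_gflops / peak_gflops(nodes)


# Measured values quoted in the text for 1 and 8 nodes.
rmax_1, rmax_8 = 3059.0, 23946.0

speedup = rmax_8 / rmax_1
eff_1 = hpl_efficiency(rmax_1, nodes=1)
eff_8 = hpl_efficiency(rmax_8, nodes=8)

print(f"speedup 1 -> 8 nodes: {speedup:.2f}X")
print(f"efficiency: 1 node {eff_1:.0%}, 8 nodes {eff_8:.0%}")
```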
HPCG Benchmark
The HPCG benchmark is based on a conjugate gradient solver in which the preconditioner is a three-level hierarchical multigrid (MG) method with Gauss-Seidel.
The HPCG benchmark constructs a logically global, physically distributed sparse linear system using a 27-point stencil at each grid point in a 3D domain, such that the equation at point (i, j, k) depends on its own value and the values of its 26 surrounding neighbours. The global domain computed by the benchmark is (NRx × Nx) × (NRy × Ny) × (NRz × Nz), where Nx, Ny and Nz are the dimensions of the local sub-grid assigned to each MPI process, and the number of MPI ranks is NR = NRx × NRy × NRz.
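The relationship between the local sub-grid, the rank decomposition and the global domain can be sketched as below. The 2 x 2 x 2 rank layout is a hypothetical example and is not the process layout used in this study.

```python
from typing import Tuple

Dims = Tuple[int, int, int]


def global_domain(local: Dims, ranks: Dims) -> Dims:
    """Global grid dimensions: (NRx * Nx, NRy * Ny, NRz * Nz)."""
    (nx, ny, nz), (nrx, nry, nrz) = local, ranks
    return (nrx * nx, nry * ny, nrz * nz)


def total_ranks(ranks: Dims) -> int:
    """Number of MPI ranks: NR = NRx * NRy * NRz."""
    nrx, nry, nrz = ranks
    return nrx * nry * nrz


# Hypothetical layout: 336^3 local grid on a 2 x 2 x 2 rank decomposition.
local_grid = (336, 336, 336)
rank_grid = (2, 2, 2)

domain = global_domain(local_grid, rank_grid)
grid_points = domain[0] * domain[1] * domain[2]

print(f"{total_ranks(rank_grid)} ranks, global domain {domain}, "
      f"{grid_points} grid points")
```

Because the local grid is fixed per MPI process, the global problem grows with the number of ranks.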
For our analysis, we have divided tests into 2 categories-
Skylake vs Cascade Lake – In this section, we compare Skylake with Cascade Lake using HPCG performance. We used a grid size of 336^3, which occupies more than one quarter of total system memory. The number of MPI processes per node and the number of threads were chosen based on the best results and memory utilization.
Figure 4: HPCG performance (Skylake vs Cascade Lake)
As per Figure 4, we observe a significant HPCG performance improvement with Cascade Lake processors over their predecessors. Because HPCG is a memory-bound application, the improvement with Cascade Lake processors is in line with the STREAM results: the 6230 performs 10% better than the 6130, the 6242 performs 12% better than the 6142, and the 8280 performs 7% better than the 8180.
HPCG with Multi-Node – For multi-node benchmarking, we chose a local grid size of 336^3 and the best-performing combination of MPI processes and OpenMP threads.
Figure 5: Multi-node HPCG performance with Cascade Lake
Figure 5 shows the performance of HPCG with the Cascade Lake 8268 @ 2.9GHz, scaling up to 8 nodes. HPCG performance is 43 GFLOPS for a single node and 84 GFLOPS for two nodes, a 1.96X performance improvement with two nodes. Scaling further to 4 and 8 nodes, performance improves by up to 7.7X over a single node.
Conclusion
With Cascade Lake processors, PowerEdge systems can now support memory speeds of up to 2933 MT/s. Our tests with Cascade Lake processors show a 7-12% improvement in memory bandwidth, a 4-15% improvement in HPL and a 7-12% improvement in HPCG on the CPU models we compared. Cascade Lake tests from 1 to 8 nodes show good scalability, as we have seen with Skylake in the past.
Additionally, Cascade Lake introduces VNNI instructions that can speed up deep learning inference workloads by 2x-3x, further discussed in this blog.
In future work, we plan to evaluate the performance advantage of Cascade Lake on different HPC applications such as WRF, NAMD, GROMACS, CP2K, and LAMMPS.