
HPC synthetic benchmark performance using 2nd Generation Intel® Xeon® Scalable Processors – STREAM, HPL and HPCG

Summary: This article presents HPC synthetic benchmark results (STREAM, HPL and HPCG) for 2nd Generation Intel® Xeon® Scalable Processors.


Symptoms

Article written by Savitha Pareek, Varun Bawa, & Ashish K Singh of HPC and AI Innovation Lab in June 2019

The 2nd Generation Intel® Xeon® Scalable processor family (architecture codenamed Cascade Lake) is Intel's successor to Skylake and is ready for prime time. The HPC engineering team at Dell EMC had access to a few engineering test units, and this blog presents the results of our initial benchmarking study.

The intent of this blog is to illustrate and analyze the performance obtained on the latest Intel® Xeon® Scalable family processors and compare it with that of their predecessors. We chose the STREAM, HPL and HPCG benchmarks for our analysis. The study covers performance on a single node as well as across multiple nodes. The tests were performed on Dell EMC PowerEdge C6420 (single-node study) and PowerEdge R740 (multi-node study) servers with the recommended BIOS settings for HPC workloads. Cascade Lake processors come with many enhancements, such as Intel® Deep Learning Boost (Intel DL Boost) with VNNI, higher memory bandwidth, and increased vector floating-point performance and efficiency.


Resolution

Table 1: Testbed information

Server: PowerEdge C6420 & PowerEdge R740

Processors:
  Single-node configuration (server: PowerEdge C6420 & PowerEdge R740)
    Skylake: Intel Xeon® 6142 [16C @ 2.6GHz], Intel Xeon® 6130 [16C @ 2.1GHz], Intel Xeon® 8180 [28C @ 2.5GHz]
    Cascade Lake: Intel Xeon® 6242 [16C @ 2.8GHz], Intel Xeon® 6230 [20C @ 2.1GHz], Intel Xeon® 8280 [28C @ 2.7GHz]
  Multi-node configuration (server: PowerEdge R740)
    Cascade Lake: Intel Xeon® 8268 [24C @ 2.90GHz]

Memory:
  Cascade Lake test: 192GB, 12 x 16GB 2933 MT/s DDR4
  Skylake test: 192GB, 12 x 16GB 2933 MT/s DDR4 (active at 2666 MT/s)

Operating System: Red Hat Enterprise Linux 7.6

Kernel Version: 3.10.0-957.el7.x86_64

BIOS Options: Turbo=Enabled, Logical Processor=Disabled, SubNumaCluster=Enabled, Virtualization Technology=Disabled

Interconnect: Intel Omni-Path with IFS 10.9.2

Compiler: Intel Parallel Studio XE 2018 update 4

Applications

Benchmark | Domain | Version | Test configuration
HPL | High Performance LINPACK (computational) | Intel MKL 2018 U4 | Problem size: 90% of total memory
HPCG | High Performance Conjugate Gradient (computational) | Intel MKL 2018 U4 | Problem size: 336 x 336 x 336
STREAM | Memory bandwidth | 5.4 | Triad

Tests were conducted to quantify the following two cases:

  • Performance improvement on a single node from Skylake to Cascade Lake
  • Performance scaling from a single node to multiple nodes

STREAM

To measure peak memory bandwidth on Cascade Lake and Skylake, we chose the STREAM benchmark, the de facto industry standard in the HPC domain for measuring sustainable memory bandwidth (in GB/s). We use the Triad value to compare memory bandwidth.
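For readers unfamiliar with the benchmark, the Triad kernel and the way a bandwidth figure is derived from it can be sketched as follows. This is an illustrative Python sketch only, not the STREAM 5.4 source code; the function names and the array size are hypothetical. Interpreted Python comes nowhere near real hardware bandwidth, so the point is the kernel and the byte accounting (three 8-byte array accesses per element: two reads and one write).

```python
import time
from array import array


def triad(a, b, c, scalar):
    """STREAM Triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bandwidth_gbs(n, seconds):
    """Bandwidth credited to one Triad pass over n elements, in GB/s.

    Counts 3 arrays x 8 bytes per element (read b, read c, write a).
    """
    return 3 * 8 * n / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical size; real runs use arrays far larger than the caches.
    n = 1_000_000
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n

    start = time.perf_counter()
    triad(a, b, c, 3.0)
    elapsed = time.perf_counter() - start

    # The number printed here reflects interpreter overhead, not memory speed.
    print(f"Triad: {triad_bandwidth_gbs(n, elapsed):.3f} GB/s")
```

The real benchmark is compiled C or Fortran, runs the kernel several times across all cores, and reports the best timing.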

Figure 1: STREAM – Skylake vs Cascade Lake

The maximum supported memory frequency is 2666 MT/s for Skylake and 2933 MT/s for Cascade Lake, a 10% increase. As per Figure 1, Cascade Lake processors show 7-12% more memory bandwidth than Skylake. Memory bandwidth per core depends on the specific processor SKU. Because some Cascade Lake SKUs have more cores than their Skylake counterparts, the per-core comparison differs from the total memory bandwidth comparison. As per Figure 1, both the 8280 and the 6242 have up to 7% higher memory bandwidth per core than their respective predecessors. However, the 6230 shows 11% less memory bandwidth per core than the 6130 because it has 25% more cores. Memory bandwidth per core can be an important factor for memory-bandwidth-sensitive applications.
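The arithmetic behind these percentages can be sketched in a few lines of Python. The helper names are hypothetical, and the +11% total bandwidth gain used for the 6230 is an assumed illustrative value within the 7-12% range reported above, not a measured figure taken from Figure 1.

```python
def relative_change(new, old):
    """Fractional change from old to new (0.10 means +10%)."""
    return new / old - 1.0


def per_core_bandwidth_change(total_change, old_cores, new_cores):
    """Change in bandwidth per core, given the change in total bandwidth
    and the core counts of the old and new processors."""
    return (1.0 + total_change) * old_cores / new_cores - 1.0


# Memory speed: 2666 MT/s (Skylake) vs 2933 MT/s (Cascade Lake).
print(f"memory speed:          {relative_change(2933, 2666):+.1%}")

# Core count: 16 cores (6130) vs 20 cores (6230).
print(f"core count:            {relative_change(20, 16):+.1%}")

# ASSUMPTION: +11% total bandwidth for the 6230 over the 6130 (illustrative).
print(f"6230 vs 6130 per core: {per_core_bandwidth_change(0.11, 16, 20):+.1%}")
```

With equal core counts the per-core change equals the total change, which is why the 8280 and 6242 keep their full gain per core.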

LINPACK

We measured the computational capability of the processors using Intel LINPACK. The problem size (N) was chosen so that the matrix occupies 90% of system memory, and the block size (NB) is 384. We cover both single-node performance and multi-node scaling with Cascade Lake processors.
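As a rough illustration of how a problem size follows from a memory fraction, the sketch below picks the largest N that is a multiple of NB and whose N x N matrix of 8-byte doubles fits in the chosen share of memory. The function name is hypothetical, and treating 192GB as 192 GiB is an assumption; the exact N used in our runs is not stated in this article.

```python
import math


def hpl_problem_size(mem_bytes_per_node, fraction=0.90, nb=384, nodes=1):
    """Largest N (a multiple of nb) such that an N x N matrix of
    8-byte doubles fits in `fraction` of the total memory."""
    usable_bytes = fraction * mem_bytes_per_node * nodes
    max_elements = int(usable_bytes // 8)   # 8 bytes per double
    n = math.isqrt(max_elements)            # side of the largest square matrix
    return n - n % nb                       # round down to a multiple of nb


# ASSUMPTION: 192GB per node is interpreted as 192 GiB.
mem_per_node = 192 * 2**30

for nodes in (1, 2, 4, 8):
    n = hpl_problem_size(mem_per_node, nodes=nodes)
    print(f"{nodes} node(s): N = {n}")
```

Rounding N down to a multiple of NB keeps every block full, which is a common convention rather than a requirement of the benchmark.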

Skylake vs Cascade Lake 


Figure 2: LINPACK performance (Skylake vs Cascade Lake)

As per Figure 2, LINPACK performance improves by up to 15% with Cascade Lake processors. The comparison is by CPU model number, pairing each Skylake processor with its successor in the Intel Xeon® Scalable family. The Intel Xeon® 6230, with 4 more cores per socket, gets a 15% boost in performance over the 6130, while the 8280 and 6242, with the same core counts as their predecessors, gain performance from their higher CPU base frequency and higher memory bandwidth.

Multi-Node Performance - For the multi-node study, we used an 8-node cluster of PowerEdge R740 servers with Intel Xeon® 8268 processors and captured results for 1, 2, 4 and 8 nodes. The rest of the system configuration is as listed in Table 1.

Figure 3: Multi-node LINPACK performance with 8268 @2.90GHz

As Figure 3 shows, LINPACK performance is 3059 GFLOPS for a single 8268 node and 23946 GFLOPS for 8 nodes, which means 7.83X scaling from 1 node to 8 nodes. Efficiency is ~69% for a single node and ~67% for 2, 4 and 8 nodes. Efficiency drops from 1 node to 2 nodes; however, scaling is mostly linear afterwards.
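The scaling and efficiency figures can be reproduced with a short calculation. The sketch below is only an illustration and rests on assumptions not stated in this article: two sockets per node, 32 double-precision FLOPs per cycle per core (two AVX-512 FMA units), and theoretical peak taken at the nominal 2.9 GHz base frequency. The function names are hypothetical.

```python
def peak_gflops(cores_per_socket, base_ghz, sockets=2, flops_per_cycle=32):
    """Theoretical peak of one node in GFLOPS.

    ASSUMPTIONS: dual-socket node, 32 DP FLOPs/cycle/core
    (two AVX-512 FMA units), nominal base frequency.
    """
    return sockets * cores_per_socket * base_ghz * flops_per_cycle


def efficiency(measured_gflops, peak_per_node, nodes=1):
    """Measured performance as a fraction of aggregate theoretical peak."""
    return measured_gflops / (peak_per_node * nodes)


rpeak = peak_gflops(cores_per_socket=24, base_ghz=2.9)  # Xeon 8268

one_node, eight_nodes = 3059.0, 23946.0  # GFLOPS, from the text above

print(f"peak per node:      {rpeak:.1f} GFLOPS")
print(f"1-node efficiency:  {efficiency(one_node, rpeak, 1):.1%}")
print(f"8-node efficiency:  {efficiency(eight_nodes, rpeak, 8):.1%}")
print(f"1 -> 8 node scaling: {eight_nodes / one_node:.2f}X")
```

Under these assumptions the computed efficiencies land on the ~69% and ~67% quoted above, which suggests the figures are relative to base-frequency peak.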

HPCG Benchmark

The HPCG benchmark is based on a conjugate gradient solver whose preconditioner is a three-level hierarchical multigrid (MG) method with a Gauss-Seidel smoother.

The HPCG benchmark constructs a logically global, physically distributed sparse linear system using a 27-point stencil at each grid point in a 3D domain, such that the equation at point (i, j, k) depends on its own value and those of its 26 surrounding neighbours. The global domain computed by the benchmark is (NRx × Nx) × (NRy × Ny) × (NRz × Nz), where Nx, Ny and Nz are the dimensions of the local sub-grid assigned to each MPI process, and the number of MPI ranks is NR = NRx × NRy × NRz.
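The relationship between the local sub-grid, the MPI rank grid and the global domain can be made concrete with a small sketch. The function names are hypothetical, and the 2 x 2 x 2 rank grid is an assumed example, not the rank layout used in our runs; only the 336 x 336 x 336 local dimensions come from this article.

```python
def hpcg_global_domain(local_dims, rank_grid):
    """Global grid dimensions: (NRx*Nx, NRy*Ny, NRz*Nz)."""
    nx, ny, nz = local_dims
    nrx, nry, nrz = rank_grid
    return (nrx * nx, nry * ny, nrz * nz)


def rank_count(rank_grid):
    """Total number of MPI ranks: NR = NRx * NRy * NRz."""
    nrx, nry, nrz = rank_grid
    return nrx * nry * nrz


def equation_count(global_dims):
    """One equation (matrix row) per global grid point."""
    gx, gy, gz = global_dims
    return gx * gy * gz


local_dims = (336, 336, 336)   # Nx, Ny, Nz from the test configuration
rank_grid = (2, 2, 2)          # ASSUMPTION: hypothetical NRx, NRy, NRz

global_dims = hpcg_global_domain(local_dims, rank_grid)
print(f"MPI ranks:     {rank_count(rank_grid)}")
print(f"global domain: {global_dims[0]} x {global_dims[1]} x {global_dims[2]}")
print(f"equations:     {equation_count(global_dims):,}")
```

Because each rank keeps the same local sub-grid, adding ranks grows the global problem, so multi-node HPCG runs are weak-scaling measurements.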

For our analysis, we have divided the tests into two categories.

Skylake vs Cascade Lake

In this section, we compare Skylake with Cascade Lake using HPCG performance. We used a grid size of 336^3, which occupies more than one quarter of total system memory. The number of MPI processes per node and the number of threads were chosen based on the best results and memory utilization.


Figure 4: HPCG performance (Skylake vs Cascade Lake)

As per Figure 4, we observe a significant HPCG performance improvement with Cascade Lake processors over their predecessors. Because HPCG is a memory-bound application, the improvement is in line with the STREAM benchmark results: the 6230 performs 10% better than the 6130, the 6242 performs 12% better than the 6142, and the 8280 performs 7% better than the 8180.

HPCG with Multi-Node – For multi-node benchmarking, we chose a local grid dimension of 336^3 and the best-performing combination of MPI processes and OpenMP threads.


Figure 5: Multi-node HPCG performance with Cascade Lake

Figure 5 shows the performance of HPCG with Cascade Lake 8268 @2.9GHz, scaling up to 8 nodes. HPCG performance is 43 GFLOPS for a single node and 84 GFLOPS for two nodes, a 1.96X improvement with two nodes. With 4 and 8 nodes, performance improves by up to 7.7X relative to a single node.

Conclusion

With Cascade Lake processors, PowerEdge systems can now support memory speeds of up to 2933 MT/s. Our tests with Cascade Lake processors show a 7-12% improvement in memory bandwidth, a 4-15% improvement in HPL and a 7-12% improvement in HPCG on the CPU models we compared. Cascade Lake tests from 1 to 8 nodes show good scalability, as we have seen with Skylake in the past.

Additionally, Cascade Lake introduces VNNI instructions that can speed up deep learning inference workloads by 2x-3x, further discussed in this blog.

In future work, we plan to evaluate the performance advantage of Cascade Lake on different HPC applications such as WRF, NAMD, GROMACS, CP2K, and LAMMPS.

Affected Products

High Performance Computing Solution Resources
Article Properties
Article Number: 000133009
Article Type: Solution
Last Modified: 18 May 2021
Version: 4