With the release of the 2nd Generation Intel® Xeon® Scalable processors (architecture codenamed "Cascade Lake"), Dell EMC has updated its PowerEdge 14th generation servers to take advantage of the increased core counts and higher memory speeds, benefiting HPC applications.
This blog presents a first set of results and discusses the impact of the different BIOS tuning options available on the Dell EMC PowerEdge C6420 with the latest Intel Xeon Cascade Lake processors for several HPC benchmarks and applications. A brief description of the Cascade Lake processor, the BIOS options, and the HPC applications used in this study is provided below.
Cascade Lake is Intel's successor to Skylake. The Cascade Lake processor supports up to 28 cores and six DDR4 memory channels at speeds up to 2933 MT/s. Like Skylake, Cascade Lake provides additional vectorization capability through the AVX-512 instruction set, allowing 32 double-precision FLOP per cycle per core. Cascade Lake also introduces the Vector Neural Network Instructions (VNNI), which accelerate AI and deep learning workloads such as image classification, speech recognition, language translation, and object detection. VNNI also supports 8-bit integer instructions to accelerate inference performance.
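To make the 32 DP FLOP/cycle figure concrete, here is a minimal sketch of the standard peak-FLOPS arithmetic. The core count and base clock match the Xeon Gold 6230 used later in this study; real sustained rates will be lower since AVX-512 code runs at reduced clock frequencies.

```c
#include <stdio.h>

int main(void) {
    /* AVX-512 per-core DP throughput: 2 FMA units x 8 DP lanes (512/64)
       x 2 FLOPs per FMA (multiply + add) = 32 FLOP/cycle */
    const int flop_per_cycle = 2 * (512 / 64) * 2;
    const int cores = 20;   /* Xeon Gold 6230 */
    const double ghz = 2.1; /* base frequency; AVX-512 clocks are lower */
    printf("Theoretical peak: %.0f DP GFLOPS per socket\n",
           cores * ghz * flop_per_cycle);
    return 0;
}
```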
Cascade Lake also includes hardware mitigations for some side-channel vulnerabilities. These are expected to improve performance on storage workloads; look for future studies from the Innovation Lab on this topic.
Since Skylake and Cascade Lake are socket compatible, the processor tuning knobs exposed in the system BIOS are similar across these processor generations. The following BIOS tuning options were explored in this study, similar to the work published previously on Skylake.
The system profile is a meta-option that, in turn, sets multiple performance and power management focused BIOS options such as Turbo mode, C-states, C1E, P-state management, uncore frequency, etc. The system profiles compared in this study include:
We used two HPC benchmarks and two HPC applications to understand the impact of these BIOS options on Cascade Lake performance. The configurations of the server and the HPC applications used for this study are described in Table 1 and Table 2.
| Application | Domain | Version | Benchmark |
|---|---|---|---|
| High Performance Linpack (HPL) | Computation – solves a dense system of linear equations | From Intel MKL 2019 Update 1 | Problem sizes of 90%, 92%, and 94% of total memory |
| STREAM | Memory bandwidth | 5.4 | Triad |
| WRF | Weather research and forecasting | 3.9.1 | Conus 2.5km |
| ANSYS® Fluent® | Fluid dynamics | 19.2 | Ice_2m, Combustor_12m, Aircraft_wing_14m, Exhaust_System_33m |

Table 1: Applications and benchmarks
| Component | Details |
|---|---|
| Server | Dell EMC PowerEdge C6420 |
| Processor | Intel® Xeon® Gold 6230 @ 2.1 GHz, 20 cores |
| Memory | 192 GB – 12 x 16 GB 2933 MT/s DDR4 |
| Operating System | Red Hat Enterprise Linux 7.6 |
| Kernel | 3.10.0-957.el7.x86_64 |
| Compiler | Intel Parallel Studio Cluster Edition 2019 Update 1 |

Table 2: Server configuration
All the results shown here are from single-server tests; cluster-level performance will be bounded by this single-server performance. The following notation is used for the BIOS configurations in the graphs:
- Perf – Performance system profile
- OS – PerformancePerWattOS system profile
- DAPC – PerformancePerWattDAPC system profile
- SNC – Sub-NUMA Clustering: SNC=0 (disabled), SNC=1 (enabled; shown striped in the graphs)
- SW – Software prefetcher: SW=0 (disabled), SW=1 (enabled)
Figure 2 compares the HPL results across the different BIOS options for a problem size of 90% of total memory, i.e., N = 144476. The graph plots the absolute GFLOPS obtained while running HPL under each BIOS configuration; GFLOPS are plotted on the y-axis, and higher is better.
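As background on how a percentage of memory translates to N, below is a minimal sketch of the usual HPL sizing arithmetic. The NB value is an assumption for illustration, and the study's N = 144476 is somewhat smaller than this estimate, consistent with leaving additional memory headroom for the OS.

```c
/* Build: gcc -O3 hpl_n.c -o hpl_n -lm */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double total_bytes = 192e9; /* 192 GB of system memory */
    const double fraction = 0.90;     /* fraction of memory given to HPL */
    const int nb = 192;               /* HPL block size NB (assumed value) */

    /* The HPL matrix is N x N doubles, so 8*N^2 <= fraction * total_bytes */
    long n = (long)sqrt(fraction * total_bytes / 8.0);
    n -= n % nb; /* N is typically rounded down to a multiple of NB */

    printf("N = %ld\n", n);
    return 0;
}
```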
Below are the observations from the graph:
Figure 3 compares the STREAM results across the different BIOS configurations.
The graph plots the memory bandwidth in GB/s obtained while running STREAM Triad; memory bandwidth is plotted on the y-axis, and higher is better. The BIOS configurations are plotted on the x-axis.
Below are the observations from the graph:
Figure 4 plots the STREAM Triad memory bandwidth in the SNC-enabled configuration. The full system memory bandwidth is ~220 GB/s. When the 20 cores on one socket access only local memory, the bandwidth is ~109 GB/s, half of the full system bandwidth. Half of that again, ~56 GB/s, is the bandwidth of 10 threads on one NUMA node accessing their own local memory; bandwidth drops when threads on one NUMA node access memory belonging to the other NUMA node on the same socket. There is a 42% drop in memory bandwidth, to ~33 GB/s, when the threads access remote memory across the UPI link on the remote socket. This tells us there is a significant bandwidth penalty in SNC mode when data is not local.
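The local-versus-remote gap can be reproduced with a simple Triad kernel pinned using numactl. Below is a minimal sketch, not the official STREAM source; a single timed run like this is noisier than STREAM's best-of-several-iterations reporting.

```c
/* triad.c - minimal STREAM-style Triad sketch (not the official STREAM code).
 * Build:         gcc -O3 -fopenmp triad.c -o triad
 * Local access:  numactl --cpunodebind=0 --membind=0 ./triad
 * Remote access: numactl --cpunodebind=0 --membind=1 ./triad
 */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 27) /* 128M doubles = 1 GiB per array */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double scalar = 3.0;

    /* First-touch initialization: pages land per the numactl memory policy */
#pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t = omp_get_wtime();
#pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];
    t = omp_get_wtime() - t;

    /* Triad touches three arrays per iteration: two reads and one write */
    printf("Triad: %.1f GB/s\n", 3.0 * N * sizeof(double) / t / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```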
Figure 5 compares the WRF results across the different BIOS options; the dataset used is conus2.5km with the default "namelist.input" file.
The graph plots the average time per timestep in seconds obtained while running the conus2.5km dataset under each BIOS configuration; average timestep is plotted on the y-axis, and lower is better. The BIOS profiles are plotted on the x-axis.
Below are the observations from the graph:
Figures 6 through 9 plot the Solver Rating obtained while running Fluent with the Ice_2m, Combustor_12m, Aircraft_wing_14m, and Exhaust_System_33m datasets, respectively. The Solver Rating is plotted on the y-axis, and higher is better. The BIOS profiles are plotted on the x-axis.
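For readers unfamiliar with the metric, Fluent's Solver Rating is typically defined as the number of benchmark jobs a machine could complete in a 24-hour period, which is why higher is better. A trivial sketch of the conversion follows; the wall-clock time is a hypothetical placeholder.

```c
#include <stdio.h>

int main(void) {
    /* Solver Rating: benchmark jobs completable in a 24-hour day,
       i.e., 86400 / wall-clock seconds per job. */
    const double wall_seconds = 1234.5; /* hypothetical time for one run */
    printf("Solver Rating = %.1f\n", 86400.0 / wall_seconds);
    return 0;
}
```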
Below are the overall observations from the above graphs:
In this study, we evaluated the impact of different BIOS tuning options on performance with the Intel Xeon Gold 6230 processor. Based on the performance of the different BIOS options across these benchmarks and applications, we conclude the following:
It is recommended that Hyper-Threading be turned off for general-purpose HPC clusters. For workloads that are known to benefit from Hyper-Threading, the feature should be tested and enabled as appropriate.
Not discussed in this study is a memory RAS feature called Adaptive Double DRAM Device Correction (ADDDC), which is available when a system is configured with DIMMs that use x4 DRAM organization (32 GB and 64 GB DIMMs). ADDDC is not available on systems with x8-based DIMMs (8 GB, 16 GB) and is immaterial in those configurations. For HPC workloads, it is recommended that ADDDC be set to Disabled where it is available as a tunable option.