Savitha Pareek, HPC and AI Innovation Lab, May 2020
The HPC and AI Innovation Lab at Dell is diving deeper into AMD-based systems with a growing evaluation program for the latest EPYC (Rome) microprocessors from AMD. In our previous blog (Molecular Dynamic Simulation with Gromacs on AMD EPYC Rome), we posted early benchmark data from a single-node GROMACS study and introduced Minerva: a 64-node, Rome-based PowerEdge C6525 cluster for our multi-node study.
That initial blog reported single-node performance numbers for GROMACS molecular dynamics simulations on a Rome-based server. By examining the move from 1st-generation Naples to 2nd-generation Rome, enabling and disabling Logical Processors, comparing different AMD EPYC SKUs, and tuning BIOS options, we established a baseline for the multi-node study on our new Minerva cluster. This blog walks through the multi-node scaling of GROMACS on AMD EPYC Rome.
The scaling of GROMACS on multiple nodes was evaluated using two-socket Dell EMC PowerEdge servers. For this study, we carried out all benchmarks on a 58-node cluster. The cluster configuration is listed in Table 1(a), and the benchmark datasets are listed in Table 1(b).
Table 1(a). Multi-node cluster configuration

| Component | Description |
|---|---|
| Processor | CPU: 7452; Cores: 32C; Config: 4C per CCX; Base frequency: 2.35 GHz; TDP: 155 W |
| Compute Nodes | 58 nodes |
| Memory | 256 GB, 16 x 16 GB 3200 MT/s DDR4 per node |
| Operating System | Red Hat Enterprise Linux 7.6 |
| Kernel | 3.10.0-957.27.2.el7.x86_64 |
| Application | GROMACS 2019.3 |
| BIOS Version | 1.0.0 |
| Compiler | AOCC 2.0.0 |
| FFTW | 3.3.8 |

Table 1(b). Benchmark datasets used for GROMACS performance evaluation on Rome

| Dataset | Details |
|---|---|
| Water | 1536K and 3072K |
| HecBioSim | 1400K and 3000K |
| Prace – Lignocellulose | 3M |

Figures 1 through 5 below are graphical excerpts from our multi-node analysis.
Figure 1. Multi-node performance with the Water 1536K dataset, Logical Processors disabled vs. enabled
Figure 2. Multi-node performance with the Water 3072K dataset, Logical Processors disabled vs. enabled
Figure 3. Multi-node performance with the HecBioSim 1.4M dataset, Logical Processors disabled vs. enabled
Figure 4. Multi-node performance with the HecBioSim 3M dataset, Logical Processors disabled vs. enabled
Figure 5. Multi-node performance with the Lignocellulose 3M dataset, Logical Processors disabled vs. enabled
For the multi-node study, we compiled GROMACS version 2019.3 with OpenMPI 4.0.0. We tested different compilers on the Rome platform with their associated high-level compiler flags, experimented with load balancing of the electrostatics (PME) work, multiple rank counts, separate PME ranks, and different nstlist values, and arrived at a run configuration for GROMACS that we first tested on a handful of nodes.
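A sweep over rank counts, separate PME ranks, and nstlist values like the one described here could be scripted roughly as follows. This is only an illustrative sketch, not the harness used for this study: the input file name `topol.tpr`, the candidate values, and the helpers `build_command` and `sweep` are hypothetical. The options `-npme`, `-ntomp`, and `-nstlist` are standard `gmx mdrun` options, and `gmx_mpi` is assumed to be the name of the MPI-enabled GROMACS binary.

```python
import itertools
import shlex


def build_command(total_ranks, pme_ranks, omp_threads, nstlist, tpr="topol.tpr"):
    """Return an mpirun + gmx_mpi mdrun command line as a list of arguments.

    All values are illustrative; tpr is a hypothetical input file name.
    """
    return [
        "mpirun", "-np", str(total_ranks),
        "gmx_mpi", "mdrun",
        "-s", tpr,
        "-npme", str(pme_ranks),     # number of separate PME ranks
        "-ntomp", str(omp_threads),  # OpenMP threads per MPI rank
        "-nstlist", str(nstlist),    # neighbor-list update interval
    ]


def sweep(cores, omp_choices, pme_fractions, nstlist_choices):
    """Yield one command per combination that fills `cores` exactly."""
    for omp, frac, nstlist in itertools.product(
            omp_choices, pme_fractions, nstlist_choices):
        if cores % omp:
            continue  # skip thread counts that do not divide the core count
        ranks = cores // omp
        pme = int(ranks * frac)  # fraction of ranks dedicated to PME
        yield build_command(ranks, pme, omp, nstlist)


if __name__ == "__main__":
    # Hypothetical sweep for one dual-socket 7452 node (64 physical cores).
    for cmd in sweep(64, omp_choices=[1, 2, 4], pme_fractions=[0.0, 0.25],
                     nstlist_choices=[10, 20, 40]):
        print(shlex.join(cmd))
```

Printing the commands rather than launching them keeps the sketch safe to run anywhere; on a real cluster each line would typically be submitted through the job scheduler and timed.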
To understand the performance gain from simultaneous multithreading (exposed as the "Logical Processor" option in the Dell BIOS on Rome-based systems), we ran several GROMACS benchmarks with Logical Processors enabled and with Logical Processors disabled. For example, with the 32-core 7452, a single dual-socket node ran 64 threads with Logical Processors disabled and 128 threads with Logical Processors enabled (using all logical cores in the system).
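The thread counts quoted above follow from simple arithmetic, sketched here. The function name `threads_per_node` is hypothetical; the only assumption beyond the text is that each EPYC core exposes two hardware threads when Logical Processors are enabled.

```python
def threads_per_node(sockets, cores_per_socket, logical_processors_enabled):
    """Hardware threads available on one node.

    Assumes two hardware threads per core when Logical Processors (SMT)
    are enabled, and one otherwise.
    """
    threads_per_core = 2 if logical_processors_enabled else 1
    return sockets * cores_per_socket * threads_per_core


# Dual-socket node with 32-core EPYC 7452 processors.
print(threads_per_node(2, 32, logical_processors_enabled=False))  # 64
print(threads_per_node(2, 32, logical_processors_enabled=True))   # 128
```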
The figures above show the parallel scalability of GROMACS on up to 58 nodes configured with AMD EPYC 7452 processors. All processor cores in each server were used when running these benchmarks. The performance at each node count is presented relative to the performance of a single node.
Across all datasets, performance improves with Logical Processors enabled as the node count increases. This is because some of the internal components of the core (called execution units) are frequently idle during each clock cycle. With Logical Processors enabled, the execution units can process instructions from two threads simultaneously, so fewer execution units sit idle during each clock cycle. As a result, enabling Logical Processors may significantly boost system performance.
To test this on the datasets listed in Table 1(b), we ran a few test cases with optimized compiler flags. GROMACS performed better with Logical Processors enabled up to 32 nodes, after which a drop of 9-9.5% is seen for the smaller datasets, such as Water 1536K and HecBioSim 1400K. In contrast, the larger datasets, such as Water 3072K and HecBioSim 3000K, showed significant gains up to 58 nodes. The scalability for these benchmarks is as expected, with the largest datasets demonstrating strong scaling.
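As a rough illustration of how performance relative to a single node can be derived from raw throughput, consider the sketch below. The function name `relative_performance` is hypothetical, and the ns/day values are made-up placeholders chosen only to show the arithmetic; they are not measurements from this study.

```python
def relative_performance(ns_per_day_by_nodes):
    """Map node count -> (speedup vs. one node, parallel efficiency).

    ns_per_day_by_nodes must contain an entry for a single node (key 1),
    which serves as the baseline.
    """
    baseline = ns_per_day_by_nodes[1]
    result = {}
    for nodes, perf in sorted(ns_per_day_by_nodes.items()):
        speedup = perf / baseline
        result[nodes] = (speedup, speedup / nodes)
    return result


# Placeholder throughput values (ns/day), NOT results from this blog.
example = {1: 10.0, 2: 19.0, 4: 36.0}
for nodes, (speedup, efficiency) in relative_performance(example).items():
    print(f"{nodes} nodes: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

Parallel efficiency (speedup divided by node count) is a convenient companion metric: a value that falls as nodes are added is one way to spot the point where a small dataset stops scaling.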
One of the biggest challenges in understanding the performance improvement from Logical Processors is how performance tools report processor utilization. "% Processor Time" is calculated by taking the percentage of time that the logical processors executed idle threads during the reporting interval and subtracting that amount from 100%. For applications like GROMACS, which have a high rate of memory I/O and run many threads across a large number of physical cores, the system with Logical Processors enabled performed better on the larger datasets, since those kept many threads active and drove high memory I/O. For molecular dynamics applications that combine MPI with thread-level parallelism, the ratio between the two parallel layers is also performance-critical and depends heavily on the hardware environment, such as the network interconnect.
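The "% Processor Time" calculation described above amounts to the following. This is a minimal sketch with a hypothetical function name and made-up sample values, not the implementation of any particular monitoring tool.

```python
def percent_processor_time(idle_seconds, interval_seconds):
    """Return 100% minus the share of the interval spent running idle threads."""
    if interval_seconds <= 0:
        raise ValueError("reporting interval must be positive")
    idle_percent = 100.0 * idle_seconds / interval_seconds
    return 100.0 - idle_percent


# Made-up sample: a logical processor idle for 2.5 s of a 10 s interval.
print(percent_processor_time(2.5, 10.0))  # 75.0
```

Note that this metric only says how long a logical processor was not idle. It does not show how much of the shared core's execution capacity a thread actually used, which is part of why it is hard to read when Logical Processors are enabled.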
Conclusion
The Minerva cluster at the HPC and AI Innovation Lab, equipped with the latest AMD Rome processors, offers significant multi-node performance gains for applications such as GROMACS. Enabling Logical Processors correlated strongly with higher overall performance on the larger datasets and only weakly on the smaller datasets. Watch this blog site for updates.