Abstract
Recently, Dell EMC added a new "Configuration M" option to the PowerEdge C4140. As this latest option joins the C4140 family, this article presents the results of a study evaluating the performance of Configuration M versus Configuration K for different HPC applications, including HPL, GROMACS and NAMD.
Overview
The PowerEdge C4140 is a 2-socket, 1U rack server. It supports Intel Skylake processors, up to 24 DIMM slots, and four double-width NVIDIA Volta GPU cards. In the C4140 server family, the two configurations that support NVLINK are Configuration K and Configuration M. A comparison of the two topologies is shown in Figure 1. The two major differences between these configurations are:
1) In Configuration K, all four GPUs share a single PCIe x16 link to one CPU through a PLX PCIe switch, while in Configuration M each GPU has its own dedicated PCIe x16 link and no PCIe switch is needed.
2) Configuration M balances the four GPUs across both CPUs, while Configuration K attaches all of them to a single CPU.
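On a running system, the GPU topology that the driver actually exposes can be checked with nvidia-smi topo -m, or queried from CUDA. The minimal sketch below (illustrative only, not part of the benchmarks used in this study) prints which GPU pairs can enable GPUDirect Peer-to-Peer:

```cuda
// Illustrative sketch: print which GPU pairs report peer-access capability
// (over NVLink or PCIe) on the system under test. Error handling omitted.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("P2P access matrix for %d GPUs (1 = peer access possible):\n", n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            int ok = 0;
            if (i != j)
                cudaDeviceCanAccessPeer(&ok, i, j);  // can GPU i access GPU j?
            printf(" %d", ok);
        }
        printf("\n");
    }
    return 0;
}
```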
p2pBandwidthLatencyTest
Figure 2: Card-to-card latency with P2P disabled on C4140 Configuration K and M
The p2pBandwidthLatencyTest is a micro-benchmark included in the CUDA SDK. It measures card-to-card latency and bandwidth with and without GPUDirect™ Peer-to-Peer enabled. The focus of this test is the latency part, since this program doesn't measure bandwidth concurrently across cards; the discussion of the real-world bandwidth available to applications is in the HPL section below. The numbers listed in Figure 2 are unidirectional card-to-card latencies in microseconds, averaged over 100 repetitions; each repetition sends one byte from one card to another. Only the P2P-disabled numbers are shown in this chart, because with P2P enabled the data travels over NVLINK instead of PCIe. The PCIe latency of Configuration M is 1.368 us lower than that of Configuration K due to the different PCIe topologies.
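Conceptually, the latency measurement boils down to timing a tiny device-to-device copy, averaged over many repetitions, without enabling peer access. The CUDA sketch below illustrates the idea; it is not the SDK sample itself, and the device indices and repetition count are illustrative:

```cuda
// Sketch of a P2P-disabled card-to-card latency measurement: time a 1-byte
// copy from GPU 0 to GPU 1, averaged over 100 repetitions. Because
// cudaDeviceEnablePeerAccess() is never called, the copy is staged through
// host memory over PCIe (the "P2P disabled" case in Figure 2).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int reps = 100;      // number of transfers to average
    const size_t bytes = 1;    // 1-byte payload, as in the latency test

    char *src = nullptr, *dst = nullptr;
    cudaSetDevice(0);
    cudaMalloc(&src, bytes);   // buffer on GPU 0
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);   // buffer on GPU 1

    cudaSetDevice(0);
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, stream);  // GPU0 -> GPU1
    cudaEventRecord(stop, stream);
    cudaStreamSynchronize(stream);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Average GPU0 -> GPU1 latency: %.3f us (P2P disabled)\n",
           ms * 1000.0f / reps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(src);
    cudaSetDevice(1);
    cudaFree(dst);
    return 0;
}
```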
High Performance Linpack (HPL)
Figure 3 (a) shows HPL performance on the C4140 platform with 1, 2, 4 and 8 V100-SXM2 GPUs. The 1-4 GPU results are from a single C4140; the 8-GPU result is across two servers. In this test, the HPL binary used is provided by NVIDIA and is compiled with the recently released CUDA 10 and OpenMPI. The following aspects can be observed from the HPL results:
1) Single node. With all 4 GPUs in the test, Configuration M is ~16% faster than Configuration K. Before HPL starts computing, it measures the available device-to-host (D2H) and host-to-device (H2D) PCIe bandwidth for each GPU card while all cards transfer data concurrently. This provides useful insight into the true PCIe bandwidth available to each card when HPL copies the N*N matrix to all GPU memories at the same time. As shown in Figure 3 (b), both the D2H and H2D numbers for Configuration M are much higher and approach the theoretical throughput of a PCIe x16 link. This matches its hardware topology, since each GPU in Configuration M has a dedicated PCIe x16 link to a CPU. In Configuration K, all four V100s have to share a single PCIe x16 link through the PLX PCIe switch, so only about 2.5 GB/s is available to each of them. Because of this bandwidth difference, Configuration M took 1.33 seconds to copy the four 16 GB pieces of the N*N matrix into each GPU's global memory, while Configuration K took 5.33 seconds. The entire HPL run takes around 23 to 25 seconds. Since the V100-SXM2 cards are identical, the compute time is the same, so the roughly 4 seconds saved on data copying makes Configuration M 16% faster (see the worked arithmetic after this list).
2) Multiple nodes. The results for two C4140 nodes with 8 GPUs show a 15%+ HPL improvement for Configuration M. This means Configuration M scales better across nodes than Configuration K, for the same reason as in the single-node, 4-GPU case above.
3) Efficiency. Power consumption was measured with iDRAC; Figure 3 (c) shows the wattage as a time series. Both systems reach around 1850 W at peak, so with its higher GFLOPS, Configuration M delivers better performance per watt as well as higher HPL efficiency.
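As a quick sanity check on the single-node copy-time numbers above, the copy time is essentially the per-GPU slice of the N*N matrix divided by the per-GPU host-to-device bandwidth, so the ratio of the two configurations' copy times should track the ratio of their per-GPU bandwidths:

```latex
t_{\mathrm{copy}} \;\approx\; \frac{\text{matrix data per GPU}}{\text{per-GPU H2D bandwidth}},
\qquad
\frac{t_{\mathrm{copy}}^{\,K}}{t_{\mathrm{copy}}^{\,M}} \;\approx\; \frac{5.33\ \mathrm{s}}{1.33\ \mathrm{s}} \;\approx\; 4
```

A factor of about four is exactly what the topology predicts: in Configuration K the four GPUs split one PCIe x16 link, while in Configuration M each GPU has a dedicated x16 link.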
HPL is a system-level benchmark, and its outcome is determined by components such as the CPU, GPU, memory and PCIe bandwidth. Configuration M has a balanced design across the two CPUs; therefore, it outperforms Configuration K in this HPL benchmark.
GROMACS
GROMACS is an open-source molecular dynamics application designed to simulate biochemical molecules like proteins, lipids and nucleic acids that have many complicated bonded interactions. Version 2018.3 was tested with the water 3072 dataset, which has 3 million atoms.
Figure 4: GROMACS Performance results with multiple V100 on C4140 Configuration K and M
Figure 4 shows the performance improvement of Configuration M over K. Single-card performance is the same across the two configurations, since there is no difference in the data path. With 2 and 4 GPUs, Configuration M is ~5% faster than K. When tested across 2 nodes, Configuration M has up to 10% better performance; the main reason is the larger number of PCIe links, which provides more bandwidth and feeds data to the GPUs more quickly. GROMACS is greatly accelerated by GPUs, but the application uses both CPUs and GPUs for calculation in parallel; therefore, if GROMACS is the top application in a cluster, a powerful CPU is recommended. The graph also shows GROMACS performance scaling with more servers and more GPUs. While performance does increase with additional GPUs and servers, the increase with additional GPUs is less than linear.
NAnoscale Molecular Dynamics (NAMD)
NAMD is a molecular dynamics code designed for high-performance simulation of large biomolecular systems. In these tests, the prebuilt binary wasn't used; instead, NAMD was built from the latest source code (NAMD_Git-2018-10-31_Source) with CUDA 10. Figure 5 plots the performance results using the STMV dataset (1,066,628 atoms, periodic, PME). Tests on smaller datasets like f1atpase (327,506 atoms, periodic, PME) and apoa1 (92,224 atoms, periodic, PME) showed similar comparisons between Configuration M and Configuration K but are not presented here for brevity.
Figure 5: NAMD Performance results with multiple V100s on C4140 Configuration K and M
As with GROMACS, the four times higher PCIe bandwidth helps NAMD performance. Figure 5 shows that, on the STMV dataset, Configuration M with 2 and 4 cards is 16% and 30% faster than Configuration K, respectively. Single-card performance is expected to be the same since, with only one GPU in the test, the PCIe bandwidth is identical.
Conclusions and Future Work
In this blog, HPC application performance with HPL, GROMACS and NAMD was compared across two different NVLINK configurations of the PowerEdge C4140. In these tests, HPL, GROMACS and NAMD perform roughly 16%, 5-10% and 16-30% better, respectively, on Configuration M than on Configuration K. At a minimum, Configuration M delivers the same performance as Configuration K, since it has all the good features of Configuration K plus more PCIe links and no PCIe switches. In the future, additional tests are planned with more applications like RELION, HOOMD and AMBER, as well as tests using the V100 32GB GPU.