Executive Summary
This blog presents the results of a study evaluating 8x V100S GPUs on the DSS8440 server across several HPC and deep learning applications, including HPL, LAMMPS, and the MLPerf v0.6 suite. In summary:

- For the compute-bound HPL benchmark, V100S delivers about the same performance as V100-PCIe.
- For the memory-bandwidth-bound LAMMPS application, V100S delivers up to 27% higher performance.
- Across the MLPerf v0.6 training suite, V100S shows 1-5% performance gains.
The rest of this blog lays out the details of this testing. In the future, the same applications will be run on the DSS8440 with RTX GPUs in place of the V100S, and other tests, such as V100S performance on an AMD platform, will also be explored.
Overview of the Testbed
The Dell EMC DSS8440 server is an accelerator-optimized server designed specifically for high-performance computing and deep learning workloads. The NVIDIA V100S is the latest member of the Tesla Volta series: a double-width, 32 GB, PCIe-based GPU card. This study compares 8x V100S against 8x V100-PCIe on the DSS8440 across HPC and deep learning applications, including HPL, LAMMPS, and the MLPerf v0.6 suite.
The hardware and software details of the DSS8440 server under test, and the specification differences between V100S and V100-PCIe, are listed in Table 1 and Table 2.
Table 1: The hardware and software details
Table 2: V100S and V100-PCIe specification differences
HPC Application Performance
Figure 1: V100S and V100-PCIe HPL results on DSS8440
Figure 1 shows the HPL performance numbers. There is little difference between V100S and V100-PCIe, because HPL is an extreme stress test: it leaves little thermal headroom for the GPU Boost feature, so the GPU frequency falls back to the base clock rate very quickly. Because V100S and V100-PCIe have almost the same base clock rate, V100S delivers about the same level of performance as V100-PCIe for compute-bound applications like HPL.
Figure 2: V100S and V100-PCIe LAMMPS results on DSS8440
Figure 2 shows the timesteps/s results of LAMMPS with the Lennard-Jones dataset. LAMMPS is a molecular dynamics code known to be bound by GPU memory bandwidth. V100S delivered 27% more performance than V100-PCIe in this test. The speedup comes not only from the 15% higher boost frequency and 26% higher memory bandwidth, but also from the newer software version: the V100-PCIe numbers were obtained with the older KOKKOS package in the 8Feb2019 release of LAMMPS, while the newer 24Jan2020 release added support for using cuFFT on the GPU with KOKKOS. More details can be found in the LAMMPS 24Jan2020 release notes.
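As a rough sanity check, the spec-sheet deltas between the two cards can be computed directly. The boost clock and memory bandwidth figures below are NVIDIA's published numbers for each card (assumed here, since Table 2 is not reproduced in this text):

```python
# Published NVIDIA spec-sheet values (assumed; compare with Table 2).
V100_PCIE = {"boost_mhz": 1380, "mem_bw_gbs": 900}   # V100-PCIe 32 GB
V100S     = {"boost_mhz": 1597, "mem_bw_gbs": 1134}  # V100S 32 GB

def pct_gain(new: float, old: float) -> float:
    """Percentage improvement of `new` over `old`."""
    return 100.0 * (new - old) / old

clock_gain = pct_gain(V100S["boost_mhz"], V100_PCIE["boost_mhz"])
bw_gain = pct_gain(V100S["mem_bw_gbs"], V100_PCIE["mem_bw_gbs"])

print(f"Boost clock gain:      {clock_gain:.1f}%")  # ~15.7%
print(f"Memory bandwidth gain: {bw_gain:.1f}%")     # 26.0%
```

Since the observed 27% LAMMPS speedup exceeds the roughly 16% boost-clock uplift alone, the extra gain is consistent with the application being bandwidth-bound, plus the cuFFT support added in the newer software.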
Deep Learning Application Performance
Figure 3: V100S and V100-PCIe MLPerf results on DSS8440
The MLPerf training (closed division) v0.6 suite has six sub-tests covering a wide range of deep learning domains: image classification (ResNet-50), object detection (Mask R-CNN and SSD), translation (NMT and Transformer), and reinforcement learning (MiniGo). The comparison results for both GPU cards are presented in Figure 3. Performance gains of roughly 1-5% were observed across the MLPerf suite for V100S, consistent with the 1-5% higher throughput recorded in the result log files. The GPU clock rate was monitored in real time during the runs, and the V100S GPUs ran at 1-5% higher clock rates in all of these tests, so the performance benefit comes from the higher boost frequency of V100S.
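The clock-rate monitoring described above can be done by polling `nvidia-smi`. The query flags below are standard `nvidia-smi` options, but the polling sketch and parsing helper are illustrative, not the exact script used in this study:

```python
import subprocess

def parse_sm_clocks(output: str) -> list:
    """Parse the output of
    `nvidia-smi --query-gpu=clocks.sm --format=csv,noheader,nounits`
    (one MHz value per line, one line per GPU) into a list of ints."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]

def sample_sm_clocks() -> list:
    """Query the current SM clock (MHz) of every GPU on the node."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=clocks.sm",
         "--format=csv,noheader,nounits"],
        text=True)
    return parse_sm_clocks(out)

# Example: poll once per second while a benchmark runs
# (requires a node with NVIDIA GPUs and driver installed):
#   while benchmark_is_running:
#       print(sample_sm_clocks())
#       time.sleep(1)
```

Comparing the sampled SM clocks between the two cards under the same workload is what reveals the 1-5% frequency advantage of V100S during the MLPerf runs.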
Conclusions and Future Work
In this blog, HPC performance with HPL and LAMMPS, and deep learning performance with MLPerf, were compared between V100S and V100-PCIe GPU cards on the same DSS8440 server. Applications limited by GPU memory bandwidth, like LAMMPS, can take advantage of the new V100S GPUs and see improved performance on both single and multiple GPUs. The deep learning applications tested in MLPerf also benefit from the higher boost clock and higher bandwidth of V100S. The compute-bound HPC benchmark HPL delivers the same performance on both cards. In the future, the same applications will be run on the DSS8440 with RTX GPUs, and other tests, such as V100S performance on an AMD platform, will also be explored.