Article Number: 000142617
Abstract
The Dell EMC Ready Solutions for AI – Deep Learning with NVIDIA architecture guide describes the ready solution in detail, including the design choices behind each carefully selected component. That guide introduced two types of Dell EMC PowerEdge C4140 GPU servers: PCIe and SXM2. This blog introduces another SXM2 configuration and quantifies the performance difference between the two SXM2 configurations.
Overview
The Dell EMC PowerEdge C4140 is a 2-socket, 1U rack server. The system features Intel Skylake processors, up to 24 DIMMs, and 4 double-width GPUs. There are two SXM2 configurations for the C4140 server: configuration K and configuration M. The comparison of the two configurations is shown in Figure 1. In configuration K, the two CPUs are connected to the four GPUs by only one PCIe link. In configuration M, however, each GPU has its own PCIe link to a CPU, giving four PCIe links between the two CPUs and the four GPUs. This blog presents the deep learning training performance difference between these two configurations. The deep learning frameworks we benchmarked are TensorFlow and Horovod. Horovod is a distributed training framework for TensorFlow. We used Horovod because its MPI-based implementation scales better than TensorFlow's native distributed implementation, as explained in the article "Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow".
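At the heart of Horovod's MPI-based scaling is a ring all-reduce of the gradients, which is also the collective whose communication cost the PCIe topology affects. As a rough illustration of the pattern only (not Horovod's actual API, which wraps the optimizer via `hvd.DistributedOptimizer` and runs the exchange over MPI/NCCL), here is a minimal pure-Python simulation in which each of `n` workers holds `n` gradient chunks and every worker ends up with the element-wise sum:

```python
def ring_allreduce(grads):
    """Simulate a ring all-reduce: every worker ends up holding the
    element-wise sum of all workers' gradient chunks.

    grads: list of n workers' gradients, each a list of n chunks
    (one scalar chunk per worker, for simplicity).
    """
    n = len(grads)
    data = [list(g) for g in grads]

    # Phase 1: scatter-reduce. In step s, worker i sends chunk (i - s) % n
    # to its ring neighbor, which accumulates it. After n - 1 steps,
    # worker i owns the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        updates = []
        for i in range(n):
            chunk = (i - step) % n
            updates.append(((i + 1) % n, chunk, data[i][chunk]))
        for dst, chunk, val in updates:  # apply "simultaneously"
            data[dst][chunk] += val

    # Phase 2: all-gather. Each reduced chunk circulates around the ring
    # so that after n - 1 more steps every worker has every reduced chunk.
    for step in range(n - 1):
        updates = []
        for i in range(n):
            chunk = (i + 1 - step) % n
            updates.append(((i + 1) % n, chunk, data[i][chunk]))
        for dst, chunk, val in updates:
            data[dst][chunk] = val

    return data
```

Each worker sends and receives only 2·(n−1)/n of the gradient volume regardless of worker count, which is why the bandwidth of the CPU-to-GPU links matters more than the number of workers.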
Table 1 shows the hardware configuration and software details we tested. The evaluation was done on up to 8 GPUs across two nodes. The well-known ILSVRC 2012 image dataset was used, which contains 1,281,167 training images and 50,000 validation images. This dataset was stored on Isilon F800 storage. The neural network models benchmarked were Resnet50 and VGG16. The benchmarking was performed in both FP32 (32-bit floating point) and FP16 (mixed precision of 32-bit and 16-bit floating point) mode. In FP32 mode, the batch size was 64 per GPU for the VGG16 model and 128 per GPU for the Resnet50 model. In FP16 mode, each batch size was doubled. The performance metric is the training speed in images/sec.
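The images/sec metric is simply the total per-step batch across all GPUs divided by the time per training step. A trivial sketch (the step times below are made-up illustrative numbers, not measured results from this study):

```python
def images_per_sec(batch_size_per_gpu, num_gpus, seconds_per_step):
    """Training throughput: total images processed per optimizer step,
    divided by the wall-clock time of one step."""
    return batch_size_per_gpu * num_gpus / seconds_per_step

# VGG16 in FP32 mode: batch size 64 per GPU (from Table 1), 4 GPUs,
# with a hypothetical 0.5 s step time.
vgg16_fp32 = images_per_sec(64, 4, 0.5)     # 512.0 images/sec

# FP16 mode doubles the per-GPU batch size (64 -> 128); whether the
# step time also changes is exactly what the benchmark measures.
vgg16_fp16 = images_per_sec(128, 4, 0.5)    # 1024.0 images/sec
```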
Performance Evaluation
Figure 2 and Figure 3 show the performance comparison of the two C4140 SXM2 configurations with the Resnet50 and VGG16 models. To simplify the notation, we denote configuration K and configuration M of the C4140 server as C4140-K and C4140-M, respectively. The conclusions drawn from these results are summarized in the next section.
Conclusions and Future Work
In this blog, we compared the deep learning performance of configuration K and configuration M of the Dell EMC PowerEdge C4140 server. Both the Resnet50 and VGG16 models were benchmarked. With Resnet50, C4140-M is 5% faster than C4140-K on a single node, and up to 10% faster on two nodes. With VGG16, C4140-M performs similarly to C4140-K on one node, but up to 10% improvement was measured on two nodes. In terms of scaling efficiency, C4140-M is ~7% higher than C4140-K for both models with 8 GPUs (two nodes). C4140-M has the performance advantage because it has 4x the PCIe links of C4140-K, which reduces both the CPU-to-GPU data transfer time and the all-reduce collective communication time among the GPUs across nodes. In future work, we will benchmark other types of neural networks, such as recurrent neural networks, and compare the performance of the two configurations.
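Scaling efficiency here means the measured multi-GPU throughput expressed as a fraction of the ideal linear extrapolation of single-GPU throughput. A small sketch with hypothetical throughput numbers (not the measured results from this study):

```python
def scaling_efficiency(multi_gpu_throughput, single_gpu_throughput, num_gpus):
    """Measured multi-GPU throughput divided by ideal linear scaling:
    1.0 means perfect scaling, lower values indicate communication
    or other overhead."""
    return multi_gpu_throughput / (num_gpus * single_gpu_throughput)

# Hypothetical example: one GPU trains at 300 images/sec, and 8 GPUs
# across two nodes reach 2040 images/sec instead of the ideal 2400.
eff = scaling_efficiency(2040.0, 300.0, 8)   # 0.85, i.e. 85% efficiency
```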
High Performance Computing Solution Resources, Poweredge C4140
04 Dec 2020