
Article Number: 000142617


Deep Learning Performance on PowerEdge C4140 Configuration M

Summary: HPC, High Performance Computing, HPC and AI Innovation Lab, Deep Learning, TensorFlow, Horovod, C4140, Resnet50, VGG16

Article Content


Symptoms

Article written by Rengan Xu, Frank Han and Quy Ta of Dell EMC HPC & AI Innovation Lab in December 2018.

Resolution

Abstract

The Dell EMC Ready Solutions for AI – Deep Learning with NVIDIA architecture guide describes the ready solution in detail, including the design choices behind each carefully selected component. That guide introduced two types of Dell EMC PowerEdge C4140 GPU servers: PCIe and SXM2. In this blog, another SXM2 configuration is introduced and the performance of the two SXM2 configurations is compared quantitatively.

Overview

The Dell EMC PowerEdge C4140 is a 2-socket, 1U rack server. The system features Intel Skylake processors, up to 24 DIMMs, and 4 double-width GPUs. There are two types of SXM2 configurations for the C4140 server: configuration K and configuration M. The two configurations are compared in Figure 1. In configuration K, the two CPUs are connected to the four GPUs through only one PCIe link. In configuration M, however, each GPU is connected to a CPU by its own PCIe link, so there are four PCIe links connecting the two CPUs to the four GPUs. This blog presents the deep learning training performance difference between these two configurations. The deep learning frameworks we benchmarked are TensorFlow and Horovod. Horovod is a distributed training framework for TensorFlow. We used Horovod because its MPI-based implementation scales better than TensorFlow's own distributed implementation, as explained in the article "Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow".
Figure 1: C4140 SXM2 configuration K vs. configuration M
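For readers unfamiliar with how Horovod attaches to TensorFlow, the sketch below shows the general pattern used for MPI-style data-parallel training: one process per GPU, an optimizer wrapped by hvd.DistributedOptimizer so that gradients are averaged with an allreduce (NCCL for GPU tensors), and a broadcast of the initial variables from rank 0. This is only a minimal TensorFlow 1.x illustration with a toy model standing in for Resnet50/VGG16 and the ImageNet input pipeline; it is not the benchmark script used for the results below.

    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()  # one MPI process per GPU

    # Pin each process to its own GPU using the local rank.
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # Toy stand-in for the ImageNet input pipeline and the Resnet50/VGG16 model.
    images = tf.random_normal([128, 224, 224, 3])
    labels = tf.random_uniform([128], maxval=1000, dtype=tf.int32)
    logits = tf.layers.dense(tf.layers.flatten(images), 1000)
    loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)

    # Scale the learning rate by the number of workers, then wrap the optimizer
    # so gradient averaging happens through Horovod's allreduce.
    opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
    opt = hvd.DistributedOptimizer(opt)
    global_step = tf.train.get_or_create_global_step()
    train_op = opt.minimize(loss, global_step=global_step)

    hooks = [
        hvd.BroadcastGlobalVariablesHook(0),     # start all ranks from identical weights
        tf.train.StopAtStepHook(last_step=100),  # short run for illustration
    ]

    with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
        while not sess.should_stop():
            sess.run(train_op)

A script like this would be launched with one rank per GPU, for example mpirun -np 8 -H node1:4,node2:4 python train_sketch.py for two 4-GPU nodes (the host names and script name here are hypothetical).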

 

Table 1 shows the hardware configuration and software details we tested. The evaluation was done on up to 8 GPUs across two nodes. The well-known ILSVRC 2012 image dataset was used, which contains 1,281,167 training images and 50,000 validation images. This dataset was stored on Isilon F800 storage. The neural network models benchmarked were Resnet50 and VGG16. The benchmarking was performed in both FP32 (32-bit floating point) and FP16 (mixed precision of 32-bit and 16-bit floating point) modes. In FP32 mode, the batch size was 64 per GPU for the VGG16 model and 128 per GPU for the Resnet50 model. In FP16 mode, the corresponding batch size was doubled. The performance metric is the training speed in images/sec.
Table 1: Hardware configuration and software details
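As a point of reference for how the images/sec metric can be derived, the short helper below divides the number of images processed per step (per-GPU batch size times number of GPUs) by the average measured step time. The function name and the example numbers are illustrative only; they are not the scripts or measurements behind the results that follow.

    def images_per_sec(step_times_sec, batch_size_per_gpu, num_gpus):
        # Throughput = images processed per step / average step time.
        avg_step = sum(step_times_sec) / len(step_times_sec)
        return batch_size_per_gpu * num_gpus / avg_step

    # Example: Resnet50 in FP16 doubles the FP32 per-GPU batch size (128 -> 256).
    print(images_per_sec([0.35, 0.36, 0.34], batch_size_per_gpu=256, num_gpus=4))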

 

Performance Evaluation

Figure 2 and Figure 3 show the performance comparison for both C4140 SXM2 configurations with Resnet50 and VGG16 models. To simplify the notation, we denote the configuration K and configuration M of C4140 server as C4140-K and C4140-M, respectively. The following conclusions can be made based on these results:

  • There was no performance difference when either 1 GPU or 2 GPUs within a node were used, even though with 2 GPUs C4140-M uses two PCIe links while C4140-K uses only one. This indicates that, for both models, one PCIe link is fast enough to feed two GPUs.
  • There is a performance difference starting from 4 GPUs, and C4140-M is better than C4140-K in most cases because it has three additional PCIe links. The detailed performance improvement is shown in Table 2. When 4 GPUs in a single node were used, there was no performance difference for the VGG16 model but there was for the Resnet50 model. This means one PCIe link is fast enough to feed four GPUs for VGG16, but not for Resnet50. For the Resnet50 model, C4140-M is 5.4% faster in FP32 and 7.4% faster in FP16 than C4140-K. To find the reason for this improvement, profiling was performed on C4140-K with 4 GPUs (single node). We measured that ~8% of the time in FP32 and ~9% of the time in FP16 was spent on data transfer from the CPU to the GPUs. C4140-M can improve this transfer time by 4x since it has 4x the PCIe links of C4140-K (a back-of-envelope estimate of the resulting speedup is sketched after this list).
  • When two nodes (8 GPUs) were used, the performance improvement of C4140-M over C4140-K is much higher than with 4 GPUs within one node. This is because not only is the data transfer time from CPU to GPU improved, but the performance of the AllReduce collective communication with the NCCL library is also improved significantly.
  • Because C4140-M has better multi-node performance than C4140-K, it can achieve higher scaling efficiency. For the Resnet50 model, the scaling efficiency of C4140-M versus C4140-K is 96.6% vs. 90.0% in FP32, and 96.7% vs. 87.1% in FP16. For the VGG16 model, the scaling efficiency of C4140-M versus C4140-K is 76.9% vs. 70.4% in FP32 and 77.8% vs. 70.5% in FP16. Overall, for both models in both FP32 and FP16 modes, C4140-M has ~7% higher scaling efficiency than C4140-K when two nodes are used (the calculation behind these percentages is illustrated in the sketch after Table 2).
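As a rough sanity check on the single-node Resnet50 numbers above, the sketch below applies a simple Amdahl-style estimate: if a fraction f of the step time is spent on CPU-to-GPU transfer and that transfer becomes s times faster, the expected overall speedup is 1 / (1 - f + f/s). The ~8% fraction and the 4x link count come from the profiling discussion above; the formula and the resulting estimate are our simplification, not a measurement from the article.

    def expected_speedup(transfer_fraction, transfer_speedup):
        # Amdahl-style estimate: only the transfer portion of the step time shrinks.
        return 1.0 / (1.0 - transfer_fraction + transfer_fraction / transfer_speedup)

    # ~8% of FP32 step time in transfer, transfer ~4x faster on C4140-M:
    print(expected_speedup(0.08, 4.0))  # ~1.06, in line with the measured 5.4% gain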
Figure 2 and Figure 3: Training performance of C4140-K and C4140-M with the Resnet50 and VGG16 models

Table 2: Performance improvement of C4140-M over C4140-K
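For clarity, the scaling-efficiency percentages quoted above are computed as measured throughput divided by ideal linear scaling of a baseline run. The article does not spell out the baseline (single GPU or single node), so the helper below takes it as a parameter; the example throughput numbers are placeholders, not values taken from Figure 2 or Figure 3.

    def scaling_efficiency(throughput_scaled, throughput_baseline, scale_factor):
        # Measured throughput relative to perfect linear scaling of the baseline, in percent.
        return throughput_scaled / (scale_factor * throughput_baseline) * 100.0

    # Example: if one node delivers 1200 images/sec and two nodes deliver 2320 images/sec,
    # the two-node scaling efficiency is 2320 / (2 * 1200) = ~96.7%.
    print(scaling_efficiency(2320.0, 1200.0, 2))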

Conclusions and Future Work

In this blog, we compared the deep learning performance of configuration K and configuration M of the Dell EMC PowerEdge C4140 server. Both the Resnet50 and VGG16 models were benchmarked. For a single node and the Resnet50 model, C4140-M is 5% better than C4140-K, and up to 10% improvement was measured for two nodes. For the VGG16 model, C4140-M performs similarly to C4140-K on one node, but up to 10% improvement was noted on two nodes. From the perspective of scaling efficiency, for both models C4140-M is ~7% higher than C4140-K with 8 GPUs (two nodes). C4140-M has a performance advantage because it has 4x the PCIe links of C4140-K, which reduces both the CPU-to-GPU data transfer time and the AllReduce collective communication time among the GPUs across nodes. In future work, we will benchmark other types of neural networks, such as recurrent neural networks, and compare the performance difference between the two configurations.

 

Article Properties


Affected Product

High Performance Computing Solution Resources, Poweredge C4140

Last Published Date

04 Dec 2020

Version

2

Article Type

Solution