Abstract
Dell EMC Ready Solutions for AI – Deep Learning with NVIDIA v1.1 and the corresponding reference architecture guide were released in February 2019. This blog quantifies deep learning training performance on this reference architecture using the ResNet-50 model, scaling the evaluation up to eight nodes.
Overview
In August 2018, version 1.0 of Dell EMC Ready Solutions for AI – Deep Learning with NVIDIA was released; in February 2019, the solution was updated to version 1.1. The main difference is the CPU and GPU connection topology, which changed from configuration K to configuration M. Figure 1 compares the two configurations. Unlike configuration K, which has only one PCIe link between the two CPUs and the four GPUs, configuration M has four PCIe links between them. In addition, the memory of each GPU has grown from 16 GB in Ready Solution v1.0 to 32 GB in v1.1.
The ResNet-50 model was used to evaluate the performance of this Ready Solution. ResNet-50 is one of the models in the MLPerf benchmark suite, which aims to establish a benchmark standard for the machine learning field. Following the MLPerf philosophy, we measured the wall-clock time to train the ResNet-50 model until it converges to the target Top-1 evaluation accuracy of 74.9%. The benchmark implementation is from the NVIDIA Deep Learning Examples git repository; we added the distributed launch script from the MXNet repository to run this model across distributed servers. The hardware and software details of this evaluation are listed in Table 1.
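The time-to-accuracy measurement described above can be sketched as the loop below. Note that `train_one_epoch` and `evaluate` here are hypothetical stand-ins for the real MXNet training and evaluation steps, not functions from the benchmark code; the toy callables at the bottom only illustrate the stopping condition.

```python
import time

TARGET_TOP1 = 0.749  # target Top-1 evaluation accuracy for the MLPerf ResNet-50 task

def time_to_accuracy(train_one_epoch, evaluate, max_epochs=100):
    """Measure wall-clock time until evaluation accuracy reaches the target.

    train_one_epoch and evaluate are placeholders for the real training
    and evaluation steps; any callables with these shapes will work.
    """
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(epoch)
        acc = evaluate()
        if acc >= TARGET_TOP1:
            return epoch, time.perf_counter() - start
    return None, time.perf_counter() - start

# Toy example: "accuracy" improves by 0.1 per epoch, so the target
# of 0.749 is first reached on the 8th epoch.
state = {"acc": 0.0}
def fake_train(epoch):
    state["acc"] += 0.1
def fake_eval():
    return state["acc"]

epochs, seconds = time_to_accuracy(fake_train, fake_eval)
print(epochs)  # 8
```

In the real benchmark, the wall clock keeps running through data loading, training and evaluation, which is why time-to-accuracy and raw throughput can rank systems differently.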
Table 1: The hardware configuration and software details

Platform | PowerEdge C4140
CPU | 2 x Intel® Xeon® Gold 6148 @ 3.0 GHz (Skylake)
Memory | 384 GB DDR4 @ 2666 MHz
Storage | 96 TB Isilon F800
GPU | V100-SXM2 with 32 GB memory

OS and Firmware
Operating System | Red Hat® Enterprise Linux® 7.5 x86_64
Linux Kernel | 3.10.0-693.el7.x86_64
BIOS | 1.6.12

Deep Learning related
MXNet | Nvidia-mxnet-18.12-py3 container
ResNet-50 v1.5 | https://github.com/NVIDIA/DeepLearningExamples/tree/master/MxNet/Classification/RN50v1.5 at commit 0e66c6dabb8b4c90bd637e27aeb4e67722ca95fc
Performance Evaluation
Figure 3 shows the ResNet-50 training time to the target accuracy of 74.9% on the C4140-M in Ready Solution v1.1, and Figure 4 compares its throughput to the C4140-K in Ready Solution v1.0. Both throughput and time-to-accuracy results are reported because these two metrics are not always correlated. The testing was scaled from one node (4 V100) to eight nodes (32 V100). The Dell EMC Ready Solution is a scale-out solution, which can utilize more resources as more nodes are added. An alternative approach from other vendors is the scale-up solution, which puts more GPUs into a single server. Figure 3 also compares our scale-out solution with other vendors' scale-up solutions. The following conclusions can be made from Figure 3 and Figure 4:
* The data for both scale-up systems was publicly available on the MLPerf v0.5 results web page.
Figure 3: The time to accuracy comparison
Figure 4: The throughput comparison
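One common way to summarize multi-node throughput results like those in Figure 4 is scaling efficiency: the measured speedup over the single-node baseline divided by the ideal linear speedup. A minimal sketch, using illustrative throughput numbers rather than our measured results:

```python
def scaling_efficiency(baseline_tput, baseline_gpus, tput, gpus):
    """Throughput scaling efficiency relative to a single-node baseline:
    actual speedup divided by the ideal (linear) speedup."""
    ideal_speedup = gpus / baseline_gpus
    actual_speedup = tput / baseline_tput
    return actual_speedup / ideal_speedup

# Illustrative numbers only (not measured results): perfect linear
# scaling from 4 to 32 GPUs would give 8x the single-node throughput;
# 7.2x corresponds to 90% scaling efficiency.
eff = scaling_efficiency(1000.0, 4, 7200.0, 32)
print(eff)  # 0.9
```

An efficiency near 1.0 indicates the interconnect and storage are not bottlenecking the additional nodes.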
Storage and Network Analysis
This section analyzes how the storage and network are utilized. The Isilon InsightIQ tool was used to monitor Isilon storage usage, and the Mellanox Unified Fabric Manager (UFM) was used to monitor InfiniBand EDR usage. Figure 5 shows the Isilon disk throughput with 1, 2, 4 and 8 nodes, respectively. The following conclusions can be made from this figure:
(a) 1 node
(b) 2 nodes
(c) 4 nodes
(d) 8 nodes
Figure 5: The disk throughput from Isilon storage
Figure 6 shows the InfiniBand EDR send and receive throughput with 1, 2, 4 and 8 nodes, respectively. The following conclusions can be made from this figure:
(a) 1 node
(b) 2 nodes
(c) 4 nodes
(d) 8 nodes
Figure 6: The InfiniBand EDR throughput
Conclusions and Future Work
In this blog, we quantified the performance of Dell EMC Ready Solution v1.1 with the ResNet-50 v1.5 model. The results show that this scale-out solution achieves performance comparable to other vendors' scale-up solutions, and that it delivers much higher training throughput than Ready Solution v1.0. The storage and network usage were also profiled: each time the number of nodes doubled, the peak disk throughput increased by ~66% and the network throughput increased by 100 MB/s. In future work, we will further evaluate the Ready Solution with other benchmarks in the MLPerf suite, such as object detection, translation and recommendation.
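The ~66% growth in peak disk throughput per doubling of node count implies sub-linear storage scaling (linear scaling would be 2x per doubling). The small sketch below projects peaks under that observed growth rate; the 1000 MB/s single-node base is an illustrative placeholder, not a measured value.

```python
def project_disk_peaks(base_peak, doublings, growth=1.66):
    """Project peak disk throughput (MB/s) assuming the observed ~66%
    increase per doubling of node count (vs. 2x for linear scaling)."""
    peaks = [base_peak]
    for _ in range(doublings):
        peaks.append(round(peaks[-1] * growth, 1))
    return peaks

# Illustrative 1000 MB/s base at one node, projected to 2, 4 and 8 nodes:
print(project_disk_peaks(1000.0, 3))  # [1000.0, 1660.0, 2755.6, 4574.3]
```

At 8 nodes this projection is roughly 57% of the 8000 MB/s a perfectly linear scale-out would deliver, which is consistent with the training pipeline, not the storage, remaining the limiting factor.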