Article was written by Rakshith Vasudev & John Lockman - HPC AI Innovation Lab in October 2019
| | NGC Container nvcr.io/nvidia/tensorflow:19.06-py3 | Conda env Versions |
|---|---|---|
| Framework | TensorFlow 1.13.1 | TensorFlow 1.12.0 |
| Horovod | 0.15.1 | 0.16.1 |
| MPI | OpenMPI 3.1.3 | OpenMPI 4.0.0 |
| CUDA | 10.2 | 10.1 |
| CUDA Driver | 430.26 | 418.40.04 |
| NCCL | 2.4.7 | 2.4.7 |
| CUDNN | 7.6.0 | 7.6.0 |
| Python | 3.5.2 | 3.6.8 |
| Operating System | Ubuntu 16.04.6 | RHEL 7.4 |
| GCC | 5.4.0 | 7.2.0 |
Table 1
As introduced previously, CheXNet is an AI radiologist assistant model that uses DenseNet to identify up to 14 pathologies from a given chest x-ray image. Several approaches were explored to scale out the training of a model that could perform as well as or better than the original CheXNet-121, with ResNet-50 demonstrating promise in both scalability and increased training accuracy (positive AUROC). The authors demonstrated scalability on CPU systems; however, we are interested in exploiting the parallelism of GPUs to accelerate the training process. The Dell EMC PowerEdge C4140 provides both density and performance with four NVIDIA V100 GPUs in the SXM2 configuration.
| | Bare Metal System | Kubernetes System |
|---|---|---|
| Platform | PowerEdge C4140 | PowerEdge C4140 |
| CPU | 2 x Intel® Xeon® Gold 6148 @ 2.4GHz | 2 x Intel® Xeon® Gold 6148 @ 2.4GHz |
| Memory | 384 GB DDR4 @ 2666MHz | 384 GB DDR4 @ 2666MHz |
| Storage | Lustre | NFS |
| GPU | V100-SXM2 32GB | V100-SXM2 32GB |
| Operating System | RHEL 7.4 x86_64 | CentOS 7.6 |
| Linux Kernel | 3.10.0-693.x86_64 | 3.10.0-957.21.3.el7.x86_64 |
| Network | Mellanox EDR InfiniBand | Mellanox EDR InfiniBand (IP over IB) |
Table 2
Image throughput, in images per second, was measured while training CheXNet on 1, 2, 3, 4, and 8 GPUs across two C4140 nodes on both systems described in Table 2. The specifications of the run, including the model architecture and input data, are detailed in this article. Figure 1 compares the measured performance of the Kubernetes system and the bare metal system.
Figure 1: Running CheXNet training on K8s vs Bare Metal
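As a rough illustration of how an images-per-second figure of this kind can be derived, the sketch below divides the total number of images processed by the elapsed wall-clock time. The function name `images_per_second` and every numeric value are hypothetical placeholders chosen for illustration; they are not the batch sizes or measurements used in this study.

```python
def images_per_second(batch_size_per_gpu, num_gpus, num_steps, elapsed_seconds):
    """Aggregate throughput: total images processed divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # In data-parallel training every GPU processes its own batch each step.
    total_images = batch_size_per_gpu * num_gpus * num_steps
    return total_images / elapsed_seconds


# Placeholder values for illustration only, not measurements from this article.
throughput = images_per_second(
    batch_size_per_gpu=64, num_gpus=8, num_steps=100, elapsed_seconds=50.0
)
print(f"{throughput:.1f} images/sec")
```

In practice the first few steps are usually excluded from the timed window, since they tend to include one-time startup costs.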
The bare metal system demonstrates about 8% higher performance as we scale out to 8 GPUs. However, differences in the system architecture, beyond the container versus bare metal question, could account for this slight gap. The bare metal system can take advantage of the full bandwidth and low latency of the raw InfiniBand connection and does not incur the overhead of a software-defined network such as Flannel. The K8s system also uses IP over InfiniBand, which can reduce available bandwidth.
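The comparison above comes down to simple arithmetic on two throughput figures. The following minimal sketch shows one way to compute a relative difference and a scaling efficiency; the helper names and the throughput values are hypothetical placeholders, not results from Figure 1.

```python
def relative_difference_percent(baseline, candidate):
    """Percentage by which `candidate` differs from `baseline`."""
    return (candidate - baseline) / baseline * 100.0


def scaling_efficiency(throughput_n, throughput_1, num_gpus):
    """Fraction of ideal linear speedup achieved on `num_gpus` GPUs."""
    return throughput_n / (num_gpus * throughput_1)


# Placeholder throughputs (images/sec) for illustration only.
k8s_8gpu = 1000.0
bare_metal_8gpu = 1080.0

gain = relative_difference_percent(k8s_8gpu, bare_metal_8gpu)
penalty = relative_difference_percent(bare_metal_8gpu, k8s_8gpu)
print(f"bare metal vs K8s: {gain:+.1f}%")
print(f"K8s vs bare metal: {penalty:+.1f}%")
```

Note that the choice of baseline matters: with these placeholder numbers, a gain of 8% measured against the Kubernetes system corresponds to a penalty of roughly 7.4% measured against the bare metal system.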
These numbers may vary with the workload and the communication patterns of the applications being run. In an image classification problem such as this one, GPUs communicate frequently, so a large amount of data is exchanged during training. Which approach to use, however, depends on the needs of the workload. Although our Kubernetes-based system has a small performance penalty, ~8% in this case, it relieves users and administrators from setting up libraries, configurations, environments, and other dependencies. This approach empowers data scientists to be more productive and focus on solving core business problems such as data wrangling and model building.
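To give a feel for why the interconnect matters in data-parallel training, the sketch below estimates how much gradient data each GPU sends per training step, assuming a bandwidth-optimal ring allreduce. This is an assumption made for illustration: the article does not state which allreduce algorithm was used in these runs, and the parameter count is a round placeholder rather than the size of the model trained here.

```python
def ring_allreduce_bytes_per_gpu(num_params, num_gpus, bytes_per_param=4):
    """Bytes each GPU sends per step under a ring allreduce (assumed algorithm).

    A ring allreduce sends 2 * (N - 1) / N times the gradient payload per GPU:
    (N - 1) / N during reduce-scatter and (N - 1) / N during all-gather.
    """
    if num_gpus < 2:
        return 0.0  # a single GPU has nothing to exchange
    payload = num_params * bytes_per_param
    return 2 * (num_gpus - 1) / num_gpus * payload


# Hypothetical model size: 25 million float32 parameters (placeholder value).
for gpus in (2, 4, 8):
    sent = ring_allreduce_bytes_per_gpu(25_000_000, gpus)
    print(f"{gpus} GPUs: {sent / 1e6:.0f} MB sent per GPU per step")
```

Under this assumption the per-GPU traffic approaches twice the gradient size as GPUs are added, and it recurs on every step, which is why reduced bandwidth from IP over InfiniBand or a network overlay can show up in the measured throughput.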