Bare Metal vs Kubernetes : Distributed Training with TensorFlow

This article was written by Rakshith Vasudev & John Lockman of the HPC AI Innovation Lab in October 2019.

Introduction

In this article, we evaluate scaling performance when training CheXNet on Nvidia V100 SXM2 GPUs in Dell EMC C4140 servers using two approaches common in modern data centers: a traditional HPC "Bare Metal" system with an environment built by Anaconda, and a containerized system with Nvidia GPU Cloud (NGC) containers running in an on-prem Kubernetes environment.

Bare Metal
A bare metal system is a traditional HPC cluster where software stacks are installed directly on the local hard disk or a shared network mount. Software environments are managed by a system administrator, and users are restricted to building software in a shared /home filesystem. User code is batch scheduled by the Slurm workload manager.

Kubernetes
Our Kubernetes (K8s) system uses Nvidia’s NGC containers to provide all required software prerequisites, environment configurations, and so on. The system administrator installs only the base operating system, drivers, and K8s. These Docker-based containers can be downloaded from NGC during the run or stored in a local registry. K8s handles workload management, resource availability, launching of distributed jobs, and scaling on demand.

Software Versions

	NGC Container nvcr.io/nvidia/tensorflow:19.06-py3	Conda env Versions
Framework	TensorFlow 1.13.1	TensorFlow 1.12.0
Horovod	0.15.1	0.16.1
MPI	OpenMPI 3.1.3	OpenMPI 4.0.0
CUDA	10.2	10.1
CUDA Driver	430.26	418.40.04
NCCL	2.4.7	2.4.7
CUDNN	7.6.0	7.6.0
Python	3.5.2	3.6.8
Operating System	Ubuntu 16.04.6	RHEL 7.4
GCC	5.4.0	7.2.0

Table 1
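
When comparing two software stacks like those in Table 1, it can help to record the versions actually present in each environment at run time. The following is a minimal sketch, not the procedure used for this article. It assumes Python 3.8 or newer (for `importlib.metadata`), and the helper name `collect_versions` and the distribution names `tensorflow` and `horovod` are illustrative assumptions; a given environment may install the framework under a different distribution name.

```python
import platform
from importlib import metadata
from typing import Dict, Iterable, Optional


def collect_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return the installed version of each package, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed under this distribution name.
            versions[name] = None
    return versions


# Hypothetical package list; adjust to the names used in your environment.
report = collect_versions(["tensorflow", "horovod"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Running the same script inside the container and inside the Conda environment gives two reports that can be compared line by line.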

Real World Use Case: CheXNet

As introduced previously, CheXNet is an AI radiologist assistant model that uses DenseNet to identify up to 14 pathologies from a given chest x-ray image. Several approaches were explored to scale out the training of a model that could perform as well as or better than the original CheXNet-121, with ResNet-50 demonstrating promise in both scalability and increased training accuracy (positive AUROC). The authors demonstrated scalability on CPU systems; however, we are interested in exploiting the parallelism of GPUs to accelerate the training process. The Dell EMC PowerEdge C4140 provides both density and performance with four Nvidia V100 GPUs in the SXM2 configuration.

Hardware Specifications

	Bare Metal System	Kubernetes System
Platform	PowerEdge C4140	PowerEdge C4140
CPU	2 x Intel® Xeon® Gold 6148 @2.4GHz	2 x Intel® Xeon® Gold 6148 @2.4GHz
Memory	384 GB DDR4 @ 2666MHz	384 GB DDR4 @ 2666MHz
Storage	Lustre	NFS
GPU	V100-SXM2 32GB	V100-SXM2 32GB
Operating System	RHEL 7.4 x86_64	CentOS 7.6
Linux Kernel	3.10.0-693.x86_64	3.10.0-957.21.3.el7.x86_64
Network	Mellanox EDR InfiniBand	Mellanox EDR InfiniBand (IP over IB)

Table 2

Performance

Image throughput, in images per second, was measured when training CheXNet on 1, 2, 3, 4, and 8 GPUs across 2 C4140 nodes on both systems described in Table 2. The specifications of the run, including the model architecture, input data, etc., are detailed in this article. Figure 1 compares the measured performance of the Kubernetes system and the bare metal system.

Figure 1: Running CheXNet training on K8s vs Bare Metal
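
For readers who want to reproduce this kind of comparison, the sketch below shows one common way to turn raw timings into images per second and into scaling efficiency relative to the single-GPU run. It is an illustration only: the function names are hypothetical, and the throughput values are placeholders, not the measurements plotted in Figure 1.

```python
from typing import Dict


def images_per_second(num_images: int, elapsed_seconds: float) -> float:
    """Throughput for a timed training interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_images / elapsed_seconds


def scaling_efficiency(throughput_by_gpus: Dict[int, float]) -> Dict[int, float]:
    """Fraction of ideal linear scaling achieved, using the 1-GPU run as baseline."""
    base = throughput_by_gpus[1]
    return {n: t / (n * base) for n, t in sorted(throughput_by_gpus.items())}


# Placeholder values for illustration only; NOT measurements from this article.
example = {1: 100.0, 2: 190.0, 4: 360.0, 8: 680.0}
efficiency = scaling_efficiency(example)
for gpus, eff in efficiency.items():
    print(f"{gpus} GPU(s): {eff:.0%} of linear scaling")
```

An efficiency of 1.0 means throughput grew in direct proportion to the GPU count; lower values indicate overhead from communication or input handling.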

Summary

The bare metal system delivers about 8% higher performance than the Kubernetes system as we scale out to 8 GPUs. However, differences in the design of the two system architectures, beyond the container versus bare metal question alone, could account for this slight difference. The bare metal system can take advantage of the full bandwidth and low latency of the raw InfiniBand connection and does not have to deal with the overhead created by software-defined networks such as Flannel. In addition, the K8s system uses IP over InfiniBand, which can reduce available bandwidth.
These numbers may vary depending on the workload and the communication patterns of the applications that are run. In an image classification problem such as this one, GPUs communicate with each other frequently, so the volume of data exchanged is high. Whether to use one approach over the other depends on the needs of the workload. Although our Kubernetes-based system has a small performance penalty, ~8% in this case, it relieves users and administrators from setting up libraries, configs, environments, and other dependencies. This approach empowers data scientists to be more productive and to focus on solving core business problems such as data wrangling and model building.
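
A figure of roughly 8% can be expressed either as an increase relative to the slower system or as a penalty relative to the faster one, and the two conventions give slightly different numbers. The sketch below illustrates the arithmetic with placeholder throughputs chosen so that the increase is exactly 8%; the values and function names are assumptions for illustration, not measurements from this study.

```python
def percent_increase(bare_metal: float, k8s: float) -> float:
    """How much faster bare metal is, relative to the K8s throughput."""
    return (bare_metal - k8s) / k8s * 100.0


def percent_penalty(bare_metal: float, k8s: float) -> float:
    """How much slower K8s is, relative to the bare metal throughput."""
    return (bare_metal - k8s) / bare_metal * 100.0


# Placeholder throughputs in images/sec; NOT values from this article.
bare_metal_ips = 1080.0
k8s_ips = 1000.0

increase = percent_increase(bare_metal_ips, k8s_ips)
penalty = percent_penalty(bare_metal_ips, k8s_ips)
print(f"Bare metal increase over K8s: {increase:.1f}%")
print(f"K8s penalty vs bare metal:    {penalty:.1f}%")
```

For gaps of this size the two conventions differ by well under one percentage point, so either description is a reasonable summary.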
