メイン コンテンツに進む
  • すばやく簡単にご注文が可能
  • 注文内容の表示、配送状況をトラック
  • 会員限定の特典や割引のご利用
  • 製品リストの作成とアクセスが可能

Dell EMC DSS 8440 Server Powered by NVIDIA RTX GPUs for HPC and AI Workloads

概要: The Dell EMC DSS8440 server is a 2 Socket, 4U server designed for High Performance Computing, Machine Learning (ML) and Deep Learning workloads. This article compares the performance of various GPUs such as NVIDIA Volta V100S  and NVIDIA Tesla T4  Tensor Core GPUs as well as NVIDIA quadro RTX GPUs in this system. ...

この記事は次に適用されます: この記事は次には適用されません: この記事は、特定の製品に関連付けられていません。 すべての製品パージョンがこの記事に記載されているわけではありません。

現象

 

Deepthi Cherlopalle and Frank Han

 

Dell EMC HPC and AI Innovation Lab June 2020

 

The Dell EMC DSS8440 server is a 2 Socket, 4U server designed for High Performance Computing, Machine Learning (ML) and Deep Learning workloads. It supports various GPUs such as NVIDIA Volta V100S SLN321776_en_US__1iC_External_Link_BD_v1 and NVIDIA Tesla T4 SLN321776_en_US__1iC_External_Link_BD_v1 Tensor Core GPUs as well as NVIDIA quadro RTX GPUs SLN321776_en_US__1iC_External_Link_BD_v1.

SLN321776_en_US__4image(18426)

(Figure.1 Dell EMC DSS840 Server)

In this blog, we  evaluate the performance of the cost-effective NVIDIA Quadro RTX 6000 and the NVIDIA Quadro RTX 8000 GPUs compared to the top tier accelerator V100S GPU by using various industry standard benchmarking tools. This includes testing against single vs double precision workloads. While the Quadro series has existed for a long time, RTX GPUs with NVIDIA Turing Architecture launched in late 2018. The specifications in Table 1 show that the RTX 8000 GPU is superior to the RTX 6000 in terms of higher memory configuration. However, the RTX 8000 and RTX 6000 GPUs have higher power needs compared to the V100S GPU. For workloads that require a higher memory capacity, the RTX 8000 is the better choice.

SPECIFICATIONS RTX 6000 RTX 8000 V100S-32 GB
Architecture Turing Volta
Memory 24 GB GDDR6 48 GB GDDR6 32 GB HBM2
Default clock rate (MHz) 1395 1245
GPU Maximum clock rate (MHz) 1770 1597
CUDA Cores 4608 5120
FP32(TFLOPS maximum) 16.3 16.4
Memory Bandwidth (GB/s) 672 1134
Power 295 W 250 W

Table.1 GPU Specifications

Server DellEMC PowerEdge DSS8440
Processor 2 x Intel Xeon 6248, 20 C @ 2.5 GHz
Memory 24 x 32 GB @ 2933 MT/s (768 GB Total)
GPU  8 x Quadro RTX 6000    8 x Quadro RTX 8000   8 x Volta V100S - PCIe 
Storage 1 x Dell Express Flash NVMe 1 TB 2.5" U.2 (P4500)
Power Supplies 4 x 2400 W

Table.2 Server configuration details

BIOS 2.5.4
OS RHEL 7.6
Kernel 3.10.0-957,ek7.x86_64
System Profile Performance Optimized
CUDA Toolkit
CUDA Driver
10.1
440.33.01

Table.3 System Firmware details

Application Version
HPL hpl_cuda_10.1_ompi-3.1_volta_pascal_kepler_3-14-19_ext
Intel MKL 2018 Update 4
LAMMPS March 3 2020
OpenMPI – 4.0.3
MLPERF v0.6 Training SLN321776_en_US__1iC_External_Link_BD_v1
docker 19.03

Table.4 Application Information

原因

LAMMPS

LAMMPS SLN321776_en_US__6iC_External_Link_BD_v1 is a Molecular Dynamics application that is maintained by researchers at Sandia National Laboratories and Temple University. LAMMPS was compiled with the KOKKOS package SLN321776_en_US__6iC_External_Link_BD_v1 to run efficiently on NVIDIA GPUs. Lennard Jones dataset was used for performance comparison and Timesteps/s being the metric as shown in Figure 2:

SLN321776_en_US__8image(18427)

(Figure.2 Lennard Jones Graph)

 As listed in Table 1, the RTX 6000 and RTX 8000 GPUs have the same number of cores, single precision performance and GPU bandwidth but different GPU memory. Because both RTX GPUs have a similar configuration the performance is also in the same range. RTX GPUs scale well for this application and the performance for both the GPUs is identical.

The Volta V100S GPU performance is approximately three times faster than the Quadro RTX GPUs. The key factor for this higher performance is the greater GPU memory bandwidth of the V100S GPU.


High Performance Linpack (HPL) Performance

HPL is a standard HPC benchmark that  measures computing performance. It is used as a reference benchmark by the TOP500 list to rank supercomputers worldwide.

The following figure shows the performance of RTX 6000, RTX 8000, and V100S  GPUs using DSS 8440 server. As you can see, the performance of the RTX GPUs are significantly lower than the V100S GPU. This is to be expected as the HPL performs a matrix LU factorization which is primarily double precision floating point operations.

SLN321776_en_US__9image(18428)

(Figure.3 HPL Performance with different GPUs)

If we compare the theoretical floating-point performance, that is, Rpeak of both GPUs, we see that the V100S GPU performance is much higher. The theoretical Rpeak value on a single RTX GPU is approximately 500GFlops. This value yields less performance (Rmax) per GPU. The Rpeak value for Volta V100S GPU is 8.2TFlops, which results in much higher performance from each card.


MLPerf

The need for industry-standard performance benchmarks for ML led to the development of the MLPerf suite. This suite includes benchmarks for evaluating training and inference performance of ML hardware and software.This section only addresses the training performance of GPUs. The following table lists the Deep Learning workloads, datasets, and target criteria that is used for evaluating the GPUs.

Benchmark Dataset Quality Target Reference Implementation Model
Image classification ImageNet (224x224) 75.9% Top-1 Accuracy Resnet-50 v1.5
Object detection
(light weight)
COCO 2017 23% mAP SSD-ResNet34
Object detection
(heavy weight)
COCO 2017 0.377 Box minimum AP
0.339 Mask minimum AP
Mask R-CNN
Translation
(recurrent)
WMT English-German 24.0 BLEU GNMT
Translation
(non-recurrent)
 WMT English-German  25.0 BLEU Transformer
Reinforcement learning N/A Pre-trained checkpoint Mini Go

Table.5 MLPerf datasets and target criteria (Source:https://mlperf.org/training-overview/#overview SLN321776_en_US__6iC_External_Link_BD_v1)

The following figure shows the time to meet the target criteria for both the RTX and V100S GPUs:

SLN321776_en_US__11image(18441)
(Figure.4 MLPERF Performance)

The results are considered after performing multiple runs, discarding the highest and lowest value, and averaging the other runs as per the guidelines listed. The performance for both the RTX GPUs is similar. The percentage of variance between both the RTX GPUs is minimal and within the acceptance range according to MLPerf guidelines. While Volta V100 GPU provides the best performance, the RTX GPUs also perform well except for object detection benchmark.

At the time of publication, the Image classification benchmark in MLPerf failed with RTX GPUs due to a convolution error. This issue is expected to be fixed in a future cuDNN release.

解決方法

Summary

In this blog, we discussed the performance of the Dell EMC DSS 8440 GPU Server and NVIDIA RTX GPUs for HPC and AI workloads. Performance for both RTX GPUs is similar, however , the RTX 8000 GPU would be a best choice for applications that require a higher amount of memory. For double precision workloads, or workloads that require high memory bandwidth Volta V100S and the new NVIDIA A100 GPU are the best choice.

In the future, we plan to provide a performance study on RTX GPUs with other single precision applications and an Inference study on RTX and A100 GPUs.


対象製品

High Performance Computing Solution Resources
文書のプロパティ
文書番号: 000132886
文書の種類: Solution
最終更新: 25 2月 2021
バージョン:  4
質問に対する他のDellユーザーからの回答を見つける
サポート サービス
お使いのデバイスがサポート サービスの対象かどうかを確認してください。