
Article Number: 000130065


Deep Learning Performance on T4 GPUs with MLPerf Inference v0.5 Benchmarks

Article Content


Symptoms

Article written by Rengan Xu, Frank Han and Quy Ta of the HPC and AI Innovation Lab in November 2019.

Resolution

Abstract

Deep learning inference performance was evaluated on the Dell EMC PowerEdge R740 using the MLPerf inference v0.5 benchmarks. The evaluation was performed on 4x Nvidia Tesla T4 GPUs within a single R740 server. The results indicate that the system delivered the top inference performance normalized to processor count among commercially available results.

Overview

Inference is the goal of deep learning after neural network model training. Inferencing can be done in data centers, at the edge, and in IoT devices. Each of these environments has different requirements, so it is difficult to evaluate their performance with a single unified benchmark. MLPerf is a new industry-standard benchmark suite whose goal is to measure both training and inference performance of machine learning systems. The first MLPerf inference v0.5 benchmarks and results were published recently. Table 1 lists all benchmarks and datasets available in MLPerf inference v0.5.

In the MLPerf inference evaluation framework, a load generator called LoadGen sends inference queries to the system under test (SUT), and then the SUT utilizes a backend (e.g. TensorRT, TensorFlow, PyTorch) to do the inferencing and sends the results back to LoadGen. There are four scenarios regarding how the queries are sent and received:

  • Server: Queries are sent to the SUT following a Poisson distribution (to model real-world random events). Each query contains one sample. The metric is queries per second (QPS) within a latency bound (see the sketch after this list).
  • Offline: A single query containing all samples is sent to the SUT. The SUT can return the results once or multiple times, in any order. The metric is samples per second.
  • Single-Stream: One sample per query is sent to the SUT. The next query is not sent until the previous response is received. The metric is 90th-percentile latency.
  • Multi-Stream: A query with N samples is sent at a fixed interval. The metric is the maximum N for which the latency of all queries stays within a latency bound.

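To make the difference between the query patterns concrete, here is a minimal, self-contained Python sketch that imitates the Server scenario's Poisson query arrivals and reports achieved QPS against an illustrative latency bound. It does not use the real LoadGen library; run_inference, the 15 ms bound, and the tail percentile are placeholders chosen only for illustration.

```python
import random
import time

def run_inference(sample):
    """Placeholder for the real backend call (e.g. a TensorRT execution); not part of MLPerf LoadGen."""
    time.sleep(0.002)  # pretend the model needs ~2 ms per sample
    return sample

def simulate_server_scenario(target_qps=200, num_queries=1000, latency_bound_s=0.015):
    """Send single-sample queries with Poisson (exponential inter-arrival) spacing,
    as the MLPerf Server scenario does, then check a tail-latency bound."""
    latencies = []
    start = time.time()
    for i in range(num_queries):
        # Exponential inter-arrival times produce a Poisson arrival process.
        time.sleep(random.expovariate(target_qps))
        t0 = time.time()
        run_inference(i)
        latencies.append(time.time() - t0)
    elapsed = time.time() - start

    latencies.sort()
    p99 = latencies[int(0.99 * len(latencies)) - 1]  # illustrative tail percentile
    print(f"achieved QPS: {num_queries / elapsed:.1f}, "
          f"p99 latency: {p99 * 1000:.2f} ms, "
          f"within bound: {p99 <= latency_bound_s}")

if __name__ == "__main__":
    simulate_server_scenario()
```

In the real harness, LoadGen generates this traffic and collects the latencies itself; the Offline scenario instead hands the SUT one large query and simply measures samples per second.
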
The detailed inference rules and latency constraints are described here. This blog focuses only on the Server and Offline scenarios, as they are designed more towards data center environments, while Single-Stream and Multi-Stream target edge and IoT devices.
[Table 1: MLPerf inference v0.5 benchmarks and datasets]

Figure 1 shows the hardware topology of the Dell EMC PowerEdge R740 used in the inference evaluation. It has dual Intel Xeon Skylake CPUs and four Nvidia Tesla T4 GPUs. Each CPU is connected to two GPUs through two PCIe x16 buses. This ensures a balanced configuration, and the high number of PCIe lanes guarantees fast data transfer between CPU and GPU. In the performance evaluation, the Nvidia TensorRT 6.0 library was used as the inference backend. The library was included in the NGC TensorRT 19.09 container.

TensorRT 6.0 adds support for new features such as reformat-free I/O and layer fusions. These features help accelerate inference in the MLPerf benchmarks (a rough sketch of a typical TensorRT engine build follows Table 2). Table 2 is a detailed list of the hardware and software used in the inference evaluation.
[Figure 1: Hardware topology of the Dell EMC PowerEdge R740 with dual Intel Xeon Skylake CPUs and four Nvidia Tesla T4 GPUs]


[Table 2: Hardware and software configuration used in the inference evaluation]
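
The harness code published for these results is not reproduced here; as a rough illustration of how a TensorRT backend is typically set up, the sketch below builds an engine from an ONNX model with an FP16 flag, using the TensorRT 6/7-era Python API. The model file name and workspace size are placeholders, and exact API names vary between TensorRT versions.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path="resnet50.onnx", use_fp16=True):
    """Build a TensorRT engine from an ONNX model (TensorRT 6/7-era Python API)."""
    explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(explicit_batch)
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the ONNX model into a TensorRT network definition.
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse the ONNX model")

    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30  # 1 GiB of workspace for fusion/tactic selection
    if use_fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # reduced precision on the T4 Tensor Cores

    return builder.build_engine(network, config)

if __name__ == "__main__":
    engine = build_engine()
    print("engine bindings:", engine.num_bindings)
```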
 

Performance Evaluation

To achieve optimal inference results, some parameter tuning is necessary. As shown in our previous blog "Deep Learning Inference on P40 vs P4 with Skylake", inference throughput increases with batch size; however, it may reach a plateau or even decrease beyond a certain point. Therefore, the optimal batch size needs to be found for both the Server and Offline scenarios. For the Server scenario, the optimal batch size must also satisfy the latency constraint.
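
In practice this tuning can be as simple as sweeping candidate batch sizes while recording throughput and tail latency, then keeping the fastest batch size that still meets the latency bound for the Server scenario. The sketch below shows this pattern with a hypothetical timed_batch function standing in for a real TensorRT execution call; the timing model and the 15 ms bound are purely illustrative.

```python
import time

def timed_batch(batch_size):
    """Placeholder for running one batch through the real backend.
    Returns the wall-clock time of the batch in seconds."""
    t0 = time.time()
    # ... run the batch on the GPU here ...
    time.sleep(0.001 * batch_size ** 0.8)  # fake sub-linear scaling with batch size
    return time.time() - t0

def sweep_batch_sizes(candidates=(1, 2, 4, 8, 16, 32, 64, 128), latency_bound_s=0.015):
    best = None
    for bs in candidates:
        times = sorted(timed_batch(bs) for _ in range(20))
        p90 = times[int(0.9 * len(times)) - 1]
        throughput = bs / (sum(times) / len(times))
        print(f"batch={bs:4d} throughput={throughput:8.1f} samples/s p90={p90 * 1000:6.2f} ms")
        # Offline cares only about throughput; Server must also respect the latency bound.
        if p90 <= latency_bound_s and (best is None or throughput > best[1]):
            best = (bs, throughput)
    return best

if __name__ == "__main__":
    print("best server-scenario batch:", sweep_batch_sizes())
```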

Table 3 shows the results of all MLPerf inference benchmarks for the Server and Offline scenarios. The Dell EMC R740 with four T4 GPUs delivered the top inference performance normalized to processor count among commercially available results. All publicly available MLPerf inference v0.5 results are available here.


[Table 3: MLPerf inference v0.5 results for the Server and Offline scenarios]

Conclusions

In this blog, we quantified the inference performance on a Dell EMC PowerEdge R740 server with four Nvidia Tesla T4 GPUs, using MLPerf Inference v0.5 benchmarks. The system delivered the top inference performance normalized to processor count among commercially available results.

Article Properties


Affected Products

High Performance Computing Solution Resources

Last Published Date

21 Feb 2021

Version

3

Article Type

Solution