Abstract
This blog evaluates deep learning inference performance on the Dell EMC PowerEdge R740 using the MLPerf inference v0.5 benchmarks. The evaluation was run on four Nvidia Tesla T4 GPUs in a single R740 server. The results show that the system delivered the top inference performance, normalized to processor count, among commercially available results.
Overview
Inference is the goal of deep learning once a neural network model has been trained. It can run in data centers, at the edge, and in IoT devices. Each of these environments has different requirements, so it is difficult to evaluate their performance with a single benchmark. MLPerf is a new industry-standard benchmark suite that aims to measure both training and inference performance of machine learning systems. The first MLPerf inference v0.5 benchmarks and results were published recently. Table 1 lists all benchmarks and datasets available in MLPerf inference v0.5.
In the MLPerf inference evaluation framework, a load generator called LoadGen sends inference queries to the system under test (SUT). The SUT uses a backend (e.g. TensorRT, TensorFlow, PyTorch) to run the inference and sends the results back to LoadGen. Four scenarios define how the queries are sent and received: Single-Stream, Multi-Stream, Server, and Offline.
The detailed inference rules and the latency constraints are described here. This blog focuses only on the Server and Offline scenarios, which are geared more toward data center environments, while Single-Stream and Multi-Stream are geared toward edge and IoT devices.
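To make the difference between the two data center scenarios concrete, the sketch below generates query arrival schedules the way each scenario is commonly described: Offline issues a single query containing every sample at the start, while Server issues one-sample queries whose arrival times follow a Poisson process. This is an illustration only, not the LoadGen API; the function names `offline_arrivals` and `server_arrivals` and the `target_qps` parameter are hypothetical.

```python
import random


def offline_arrivals(num_samples):
    """Offline: a single query carrying every sample, issued at t = 0.

    Returns a list of (arrival_time_seconds, samples_in_query) tuples.
    """
    return [(0.0, num_samples)]


def server_arrivals(num_queries, target_qps, seed=0):
    """Server: one-sample queries arriving as a Poisson process.

    Gaps between consecutive queries are drawn from an exponential
    distribution with rate `target_qps` (hypothetical parameter name).
    """
    rng = random.Random(seed)  # fixed seed keeps the schedule reproducible
    t = 0.0
    arrivals = []
    for _ in range(num_queries):
        t += rng.expovariate(target_qps)
        arrivals.append((t, 1))
    return arrivals


if __name__ == "__main__":
    print(offline_arrivals(8))
    for arrival_time, samples in server_arrivals(5, target_qps=100.0):
        print(f"t={arrival_time:.4f}s samples={samples}")
```

Because Offline hands the SUT all samples at once, the SUT is free to batch them however it likes; in Server, each query must also be answered within the scenario's latency bound, which limits how long the SUT can wait to fill a batch.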
Figure 1 shows the hardware topology of the Dell EMC PowerEdge R740 used in the inference evaluation. It has dual Intel Xeon Skylake CPUs and four Nvidia Tesla T4 GPUs. Each CPU is connected to two GPUs through two PCIe x16 buses. This gives a balanced configuration, and the high number of PCIe lanes provides fast data transfer between CPU and GPU. In the performance evaluation, the Nvidia TensorRT 6.0 library was used as the inference backend. The library is included in the NGC TensorRT 19.09 container.
TensorRT 6.0 adds support for new features, including reformat-free I/O and layer fusions. These features help accelerate inference in the MLPerf benchmarks. Table 2 lists in detail the hardware and software used in the inference evaluation.
Performance Evaluation
Achieving optimal inference results requires some parameter tuning. As shown in our previous blog "Deep Learning Inference on P40 vs P4 with Skylake", inference throughput increases with batch size, but it may plateau or even decrease beyond a certain point. The optimal batch size therefore needs to be found for both the Server and Offline scenarios. For the Server scenario, the optimal batch size must also satisfy the latency constraint.
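The tuning step described above can be sketched as a simple selection over a batch size sweep: keep only the batch sizes that meet the latency bound (if there is one), then take the one with the highest throughput. The function name `pick_batch_size` is hypothetical, and the sweep numbers are made-up placeholders for illustration, not measurements from the R740.

```python
def pick_batch_size(measurements, latency_bound_ms=None):
    """Return the batch size with the highest throughput.

    measurements: dict mapping batch_size -> (throughput, latency_ms)
    latency_bound_ms: if given, batch sizes whose latency exceeds the
        bound are excluded (as in the Server scenario); if None, only
        throughput matters (as in the Offline scenario).
    """
    eligible = {
        batch: throughput
        for batch, (throughput, latency_ms) in measurements.items()
        if latency_bound_ms is None or latency_ms <= latency_bound_ms
    }
    if not eligible:
        raise ValueError("no batch size satisfies the latency bound")
    return max(eligible, key=eligible.get)


# Placeholder sweep (NOT measured data): throughput rises, plateaus,
# then dips, while latency keeps growing with batch size.
sweep = {
    1: (900.0, 2.0),
    8: (3000.0, 6.0),
    32: (5200.0, 14.0),
    64: (5400.0, 27.0),
    128: (5300.0, 55.0),
}

offline_choice = pick_batch_size(sweep)                        # 64
server_choice = pick_batch_size(sweep, latency_bound_ms=15.0)  # 32
```

With these placeholder values the Offline choice is the throughput peak, whereas the Server choice is a smaller batch because the larger ones exceed the assumed 15 ms bound.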
Table 3 shows the results of all MLPerf inference benchmarks for the Server and Offline scenarios. The Dell EMC R740 with four T4 GPUs delivered the top inference performance normalized to processor count among commercially available results. All published MLPerf inference v0.5 results are available here.
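Normalizing to processor count simply means dividing a system's total throughput by the number of accelerators that produced it, so that systems of different sizes can be compared. A minimal sketch follows; the function name `per_accelerator` is hypothetical, and the two systems and their numbers are placeholders, not published MLPerf results.

```python
def per_accelerator(total_throughput, accelerator_count):
    """Throughput normalized to the number of accelerators."""
    if accelerator_count <= 0:
        raise ValueError("accelerator_count must be positive")
    return total_throughput / accelerator_count


# Placeholder systems (NOT real results): (total throughput, accelerators)
systems = {
    "system_a": (20000.0, 4),
    "system_b": (36000.0, 8),
}

normalized = {
    name: per_accelerator(total, count)
    for name, (total, count) in systems.items()
}
# system_b has the higher total, but system_a is higher per accelerator.
best = max(normalized, key=normalized.get)
```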
Conclusions
In this blog, we quantified the inference performance on a Dell EMC PowerEdge R740 server with four Nvidia Tesla T4 GPUs, using MLPerf Inference v0.5 benchmarks. The system delivered the top inference performance normalized to processor count among commercially available results.