WRF Performance on AMD ROME Platform - Multi-node Study

Symptoms

Puneet Singh, HPC and AI Innovation Lab, June 2020

The Weather Research and Forecasting (WRF) model is an open-source mesoscale weather prediction model used, predominantly in multi-node compute environments, for atmospheric research and operational forecasting. This blog, in which we show WRF performance for simulations spanning multiple compute nodes, is a follow-up to a blog published on the single-node performance of the WRF model.

Resolution

WRF is known to exhibit adequate scalability on multi-node systems. To verify its scalability on the AMD platform, we used dual-socket Dell EMC PowerEdge servers from our Minerva HPC cluster, connected via the Mellanox HDR 100 interconnect. All nodes used for benchmarking were connected to a single switch. Details of the server hardware and software used for the tests are listed in Table 1.

Table 1: Testbed hardware and software details

| Platform | Dell EMC PowerEdge 2-socket servers (with AMD Rome processors) | Dell EMC PowerEdge 2-socket servers (with AMD Naples processors) |
|---|---|---|
| CPU | 2x AMD EPYC 7452 @ 2.35 GHz (Rome), TDP 155 W | 2x AMD EPYC 7601 @ 2.2 GHz (Naples), TDP 180 W |
| Memory | 256 GB, 16 x 16 GB DIMMs @ 3200 MT/s | 256 GB, 16 x 16 GB DIMMs @ 2400 MT/s |
| Storage | NFS | NFS |
| Operating System | RHEL 7.6 | RHEL 7.5 |
| Linux Kernel | 3.10.0-957.27.2.el7.x86_64 | 3.10.0-862.el7.x86_64 |
| Interconnect | HDR 100 (OFED 4.7-3.2.9) | EDR (OFED 4.4-1.0.0) |
| BIOS Version | 1.0.1 | 1.10.6 |
| Compiler & MPI | Intel 2018 Update 4 | Intel 2018 Update 4 |
| Application | WRF 3.9.1.1 | WRF 3.9.1.1 |
| Benchmark Dataset | Conus 2.5km, Maria 3km | Conus 2.5km, Maria 3km |

WRF was compiled with the "dm+sm" (hybrid MPI + OpenMP) configuration. To optimize performance, we tried different process-thread combinations, tiling schemes (WRF_NUM_TILES), and Transparent Huge Pages (THP) settings. We found that one MPI process per CPU Complex (CCX) with THP disabled gave the best results.
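The exact launch line depends on the scheduler and MPI stack in use; the sketch below only illustrates the one-process-per-CCX layout described above, assuming Intel MPI's mpirun, 64 cores per node on the dual-socket EPYC 7452 (16 four-core CCXs), and hypothetical paths for the host file and wrf.exe.

```python
import os
import subprocess

# Minimal launch sketch (not the exact command used in this study).
# Assumptions: dual-socket EPYC 7452 node = 64 cores = 16 CCXs of 4 cores,
# Intel MPI "mpirun", and a "dm+sm" (MPI + OpenMP) build of wrf.exe.
nodes = 16
ranks_per_node = 16        # one MPI rank per CCX
threads_per_rank = 4       # one OpenMP thread per core within the CCX

env = {
    **os.environ,
    "OMP_NUM_THREADS": str(threads_per_rank),
    "WRF_NUM_TILES": "64",        # illustrative value; the optimal tile count is dataset dependent
    "I_MPI_PIN_DOMAIN": "omp",    # pin each rank to OMP_NUM_THREADS cores, i.e. one CCX
}

cmd = [
    "mpirun",
    "-np", str(nodes * ranks_per_node),
    "-ppn", str(ranks_per_node),
    "-hostfile", "hosts.txt",     # hypothetical host file
    "./wrf.exe",
]

# Transparent Huge Pages are disabled at the OS level (e.g. via
# /sys/kernel/mm/transparent_hugepage/enabled), not through the launcher.
subprocess.run(cmd, env=env, check=True)
```

With this layout a 7452 node runs 16 ranks of 4 threads each, matching the 4-threads-per-process recommendation in the conclusions.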

Figure 1: Scalability with Conus 2.5km dataset on Rome 7452 CPU model (HDR 100)

Figure 1 shows the scalability of the WRF application with the Conus 2.5km dataset up to 64 nodes. During the single-node runs (with the AMD EPYC 7452 CPU model), it was observed that an optimal tile count can improve application performance by up to ~47%. Different tile counts, ranging from 4 (default) to 108, were tried during the multi-node runs with the Conus dataset. For this dataset, the total data exchanged between the MPI processes increased by ~28x from 2 nodes to 64 nodes.
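The article does not state how the timings for the tile sweep were collected; as one possible approach, the sketch below averages the per-step "Timing for main ... elapsed seconds" lines that WRF writes to rsl.error.0000, with the run directory names being hypothetical.

```python
import re
import statistics

# Sketch: compare completed WRF runs that differ only in WRF_NUM_TILES.
# Assumes each run directory contains an rsl.error.0000 file with the
# per-step "Timing for main ... elapsed seconds" lines that WRF prints.
STEP_RE = re.compile(r"Timing for main.*:\s*([0-9.]+)\s+elapsed seconds")

def mean_step_time(rsl_path):
    """Average elapsed seconds per model time step in one rsl file."""
    with open(rsl_path) as f:
        times = [float(m.group(1)) for line in f if (m := STEP_RE.search(line))]
    return statistics.mean(times)

# Hypothetical run directories, one per tile count tried.
runs = {4: "run_tiles_4", 32: "run_tiles_32", 64: "run_tiles_64", 108: "run_tiles_108"}

for tiles, rundir in sorted(runs.items()):
    t = mean_step_time(f"{rundir}/rsl.error.0000")
    print(f"WRF_NUM_TILES={tiles:<4} mean step time: {t:.3f} s")
```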

Figure 2: Scalability with Maria 3km dataset on Rome 7452 CPU model (HDR 100)

For the Maria dataset, the tile counts selected for the simulations ranged from 4 (default) to 200, and with the optimal tile count (on a single node) there was a ~43% improvement in application performance. The total data exchanged between the MPI processes increased by ~3x from 2 nodes to 16 nodes.

WRF Profiling – Overall MPI Usage

In addition to the optimal tile size, CPU core count, and CPU frequency, MPI communication is another factor that impacts application performance during multi-node runs. Here is the breakdown of the top 5 most used MPI functions for the Conus and Maria datasets:

Figure 3: Breakdown of the top 5 MPI functions for the Conus 2.5km dataset simulation. Bars are plotted against the primary axis and the line is plotted against the secondary axis

Figure 4: Breakdown of the top 5 MPI functions for the Maria 3km dataset simulation. Bars are plotted against the primary axis and the line is plotted against the secondary axis

For both the Conus and Maria datasets (Figure 3, Figure 4), it was observed that the total MPI time increases by ~5x and ~4x, respectively, as the benchmark is scaled up to 16 nodes. The broadcast time rises by ~2x on 16 nodes, so different broadcast algorithms were tried via I_MPI_ADJUST_BCAST to check whether any of them could take better advantage of the lower latency and higher throughput offered by the HDR 100 interconnect. The broadcast and wait times improved by ~3 to 4.5%, but the overall gain in application performance was not significant (~1%).
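Intel MPI selects its broadcast algorithm through the I_MPI_ADJUST_BCAST environment variable, where an integer index picks a specific algorithm and 0 leaves the choice to the library. A sketch of that kind of sweep, with the index range, core counts, and file paths treated as assumptions:

```python
import os
import subprocess

# Sketch of sweeping Intel MPI broadcast algorithms via I_MPI_ADJUST_BCAST.
# The valid index range depends on the Intel MPI release; 0 lets the library
# choose. The launch command and paths below are hypothetical.
launch = ["mpirun", "-np", "256", "-ppn", "16", "-hostfile", "hosts.txt", "./wrf.exe"]

for algo in range(0, 9):
    env = {
        **os.environ,
        "OMP_NUM_THREADS": "4",
        "I_MPI_ADJUST_BCAST": str(algo),
    }
    print(f"Running WRF with I_MPI_ADJUST_BCAST={algo}")
    subprocess.run(launch, env=env, check=False)
    # Compare the MPI_Bcast time reported by an MPI profiler (for example,
    # Intel Trace Analyzer or IPM) across the sweep to pick the best setting.
```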

Performance comparison with Naples platform

A larger cache, higher memory throughput, better AVX2 support, and higher CPU clock frequencies are some of the features that make the AMD EPYC 7452 (Rome) processor architecturally superior to the AMD EPYC 7601 (Naples) processor. The details of the systems under test are listed in Table 1.

Figure 5: Scalability comparison of WRF on Rome and Naples platform for Conus 2.5km dataset. Bars are plotted against the primary axis and line is plotted against the secondary axis

Figure 6: Scalability comparison of WRF on Rome and Naples platform for Maria 3 km dataset. Bars are plotted against the primary axis and line is plotted against the secondary axis

In Figures 5 and 6, with the Naples performance numbers as the baseline, the Rome platform delivers ~17-32% better performance on 8 nodes with the Maria and Conus datasets.

Conclusions and Recommendations

WRF scales well with the Conus dataset, and scalability may vary depending on the dataset used. For WRF simulations on the AMD Rome platform, in addition to the settings described here, we recommend the Mellanox HDR 100 interconnect with Intel compilers (4 threads per process and an optimal tile size) to obtain maximum performance. Watch this blog site for updates.

Affected Products

PowerEdge C6525