WRF Performance on AMD ROME Platform - Multi-node Study

Symptoms

Puneet Singh, HPC and AI Innovation Lab, June 2020

The Weather Research and Forecasting (WRF) model is an open-source mesoscale weather prediction model used, predominantly in multi-node compute environments, for atmospheric research and operational forecasting. This blog, in which we show WRF performance for simulations spanning multiple compute nodes, is a follow-up to a blog published on the single-node performance of the WRF model.

Resolution

WRF is known to exhibit adequate scalability on multi-node systems. To verify its scalability on the AMD platform, we used dual-socket Dell EMC PowerEdge servers from our Minerva HPC cluster, connected via the Mellanox HDR 100 interconnect. All nodes used for benchmarking were connected to a single switch. Details of the server hardware and software used for the tests are listed in Table 1.

Table 1: Testbed hardware and software details

| Platform | Dell EMC PowerEdge 2-socket servers (with AMD Rome processors) | Dell EMC PowerEdge 2-socket servers (with AMD Naples processors) |
|---|---|---|
| CPU | 2x AMD EPYC 7452 @ 2.35 GHz (Rome), TDP 155 W | 2x AMD EPYC 7601 @ 2.2 GHz (Naples), TDP 180 W |
| Memory | 256 GB, 16 x 16 GB DIMMs @ 3200 MT/s | 256 GB, 16 x 16 GB DIMMs @ 2400 MT/s |
| Storage | NFS | NFS |
| Operating System | RHEL 7.6 | RHEL 7.5 |
| Linux Kernel | 3.10.0-957.27.2.el7.x86_64 | 3.10.0-862.el7.x86_64 |
| Interconnect | HDR 100 (OFED 4.7-3.2.9) | EDR (OFED 4.4-1.0.0) |
| BIOS Version | 1.0.1 | 1.10.6 |
| Compiler & MPI | Intel 2018 Update 4 | Intel 2018 Update 4 |
| Application | WRF 3.9.1.1 | WRF 3.9.1.1 |
| Benchmark Dataset | Conus 2.5km, Maria 3km | Conus 2.5km, Maria 3km |

WRF was compiled with the "dm+sm" (hybrid MPI + OpenMP) configuration. To optimize performance, we tried different process-thread combinations, tiling schemes (WRF_NUM_TILES), and Transparent Huge Pages (THP) settings. We found that one MPI process per CPU Complex (CCX) with THP disabled gave the best results.
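The exact launch line depends on the scheduler and MPI stack in use; the sketch below only illustrates the one-process-per-CCX layout described above, assuming Intel MPI's mpirun, 64 cores per node on the dual-socket EPYC 7452 (16 four-core CCXs), and hypothetical paths for the host file and wrf.exe.

```python
import os
import subprocess

# Minimal launch sketch (not the exact command used in this study).
# Assumptions: dual-socket EPYC 7452 node = 64 cores = 16 CCXs of 4 cores,
# Intel MPI "mpirun", and a "dm+sm" (MPI + OpenMP) build of wrf.exe.
nodes = 16
ranks_per_node = 16        # one MPI rank per CCX
threads_per_rank = 4       # one OpenMP thread per core within the CCX

env = {
    **os.environ,
    "OMP_NUM_THREADS": str(threads_per_rank),
    "WRF_NUM_TILES": "64",        # illustrative value; the optimal tile count is dataset dependent
    "I_MPI_PIN_DOMAIN": "omp",    # pin each rank to OMP_NUM_THREADS cores, i.e. one CCX
}

cmd = [
    "mpirun",
    "-np", str(nodes * ranks_per_node),
    "-ppn", str(ranks_per_node),
    "-hostfile", "hosts.txt",     # hypothetical host file
    "./wrf.exe",
]

# Transparent Huge Pages are disabled at the OS level (e.g. via
# /sys/kernel/mm/transparent_hugepage/enabled), not through the launcher.
subprocess.run(cmd, env=env, check=True)
```

With this layout a 7452 node runs 16 ranks of 4 threads each, matching the 4-threads-per-process recommendation in the conclusions.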

Figure 1: Scalability with Conus 2.5km dataset on Rome 7452 CPU model (HDR 100)

Figure 1 shows the scalability of the WRF application with the Conus 2.5km dataset up to 64 nodes. During the single-node runs (with the AMD EPYC 7452 CPU model), it was observed that an optimal tile count can improve application performance by up to ~47%. Different tile counts, ranging from 4 (default) to 108, were tried during the multi-node runs with the Conus dataset. For this dataset, the total data exchanged between the MPI processes increased by ~28x from 2 nodes to 64 nodes.
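The article does not state how the timings for the tile sweep were collected; as one possible approach, the sketch below averages the per-step "Timing for main ... elapsed seconds" lines that WRF writes to rsl.error.0000, with the run directory names being hypothetical.

```python
import re
import statistics

# Sketch: compare completed WRF runs that differ only in WRF_NUM_TILES.
# Assumes each run directory contains an rsl.error.0000 file with the
# per-step "Timing for main ... elapsed seconds" lines that WRF prints.
STEP_RE = re.compile(r"Timing for main.*:\s*([0-9.]+)\s+elapsed seconds")

def mean_step_time(rsl_path):
    """Average elapsed seconds per model time step in one rsl file."""
    with open(rsl_path) as f:
        times = [float(m.group(1)) for line in f if (m := STEP_RE.search(line))]
    return statistics.mean(times)

# Hypothetical run directories, one per tile count tried.
runs = {4: "run_tiles_4", 32: "run_tiles_32", 64: "run_tiles_64", 108: "run_tiles_108"}

for tiles, rundir in sorted(runs.items()):
    t = mean_step_time(f"{rundir}/rsl.error.0000")
    print(f"WRF_NUM_TILES={tiles:<4} mean step time: {t:.3f} s")
```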

Figure 2: Scalability with Maria 3km dataset on Rome 7452 CPU model (HDR 100)

For the Maria dataset, the tile counts selected for the simulations ranged from 4 (default) to 200, and with the optimal tile count (on a single node) there was a ~43% improvement in application performance. The total data exchanged between the MPI processes increased by ~3x from 2 nodes to 16 nodes.

WRF Profiling – Overall MPI Usage

In addition to the optimal tile size, CPU core count, and CPU frequency, MPI communication is another factor that impacts application performance during multi-node runs. Here is the breakdown of the top 5 most used MPI functions for the Conus and Maria datasets:

Figure 3: Breakdown of the top 5 MPI functions for the Conus 2.5km dataset simulation. Bars are plotted against the primary axis and the line is plotted against the secondary axis

Figure 4: Breakdown of the top 5 MPI functions for the Maria 3km dataset simulation. Bars are plotted against the primary axis and the line is plotted against the secondary axis

For both the Conus and Maria datasets (Figure 3, Figure 4), it was observed that the total MPI time increases by ~5x and ~4x, respectively, as the benchmark is scaled up to 16 nodes. The broadcast time rises by ~2x on 16 nodes, so different broadcast algorithms were tried via I_MPI_ADJUST_BCAST to check whether any of them could take better advantage of the lower latency and higher throughput offered by the HDR 100 interconnect. The broadcast and wait times improved by ~3 to 4.5%, but the overall gain in application performance was not significant (~1%).
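Intel MPI selects its broadcast algorithm through the I_MPI_ADJUST_BCAST environment variable, where an integer index picks a specific algorithm and 0 leaves the choice to the library. A sketch of that kind of sweep, with the index range, core counts, and file paths treated as assumptions:

```python
import os
import subprocess

# Sketch of sweeping Intel MPI broadcast algorithms via I_MPI_ADJUST_BCAST.
# The valid index range depends on the Intel MPI release; 0 lets the library
# choose. The launch command and paths below are hypothetical.
launch = ["mpirun", "-np", "256", "-ppn", "16", "-hostfile", "hosts.txt", "./wrf.exe"]

for algo in range(0, 9):
    env = {
        **os.environ,
        "OMP_NUM_THREADS": "4",
        "I_MPI_ADJUST_BCAST": str(algo),
    }
    print(f"Running WRF with I_MPI_ADJUST_BCAST={algo}")
    subprocess.run(launch, env=env, check=False)
    # Compare the MPI_Bcast time reported by an MPI profiler (for example,
    # Intel Trace Analyzer or IPM) across the sweep to pick the best setting.
```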

Performance comparison with Naples platform

A larger cache, higher memory throughput, better AVX2 support, and higher CPU clock frequencies are some of the features that make the AMD EPYC 7452 (Rome) processor architecturally superior to the AMD EPYC 7601 (Naples) processor. The details of the systems under test are listed in Table 1.

Figure 5: Scalability comparison of WRF on Rome and Naples platform for Conus 2.5km dataset. Bars are plotted against the primary axis and line is plotted against the secondary axis

Figure 6: Scalability comparison of WRF on Rome and Naples platform for Maria 3 km dataset. Bars are plotted against the primary axis and line is plotted against the secondary axis

In Figures 5 and 6, with the Naples performance numbers as the baseline, the Rome platform delivers ~17-32% better performance on 8 nodes with the Maria and Conus datasets.

Conclusions and Recommendations

WRF scales well with the Conus dataset, and scalability may vary depending on the dataset used. For WRF simulations on the AMD Rome platform, in addition to the settings described here, we recommend the Mellanox HDR 100 interconnect with Intel compilers (4 threads per process and an optimal tile size) to obtain maximum performance. Watch this blog site for updates.

Affected Products

PowerEdge C6525