Solution Overview
This blog describes the Dell EMC HPC NFS Storage Solution version 7.4 (NSS7.4-HA), which leverages Intel's second-generation Xeon® Scalable Processors, codenamed "Cascade Lake". These processors feature up to 28 cores, up to 38.5 MB of last-level cache, and six 2933 MT/s memory channels per socket. Key features of the Cascade Lake processors include integrated hardware mitigations against side-channel attacks, Intel DL Boost (VNNI), and support for higher clock and memory speeds.
Cascade Lake and its predecessor Skylake include a feature called ADDDC (Adaptive Double DRAM Device Correction). ADDDC is deployed at runtime to dynamically map out failing DRAM devices while continuing to provide Single Device Data Correction (SDDC) ECC memory, translating to increased DIMM longevity. The feature is available only with x4 DRAM devices and has no effect with x8 DRAM devices. Since the latest NSS-HA version 7.4 uses only 16GB DIMMs, which are x8 organized, ADDDC is greyed out and is not a tunable option in the BIOS. If 32GB DIMMs (x4 organized) are used instead, ADDDC becomes a tunable option, and it is recommended to set it to disabled to favor performance over this RAS feature.
It is recommended to configure the NFS servers with the HPC profile described in the blog "BIOS characterization for Intel Cascade Lake processors". This includes setting Sub-NUMA Cluster to enabled, Logical Processor to disabled, and the System Profile to "Performance". If upgrading an existing system, first update the BIOS to a version that supports Cascade Lake CPUs before installing the Cascade Lake processors. The latest iDRAC firmware is also recommended.
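For reference, below is a minimal sketch of applying these BIOS settings through iDRAC's racadm utility. The BIOS attribute names used here (BIOS.ProcSettings.SubNumaCluster, BIOS.ProcSettings.LogicalProc, BIOS.SysProfileSettings.SysProfile) are assumptions based on typical 14th-generation PowerEdge attribute naming and may differ by BIOS/iDRAC version; verify them with `racadm get BIOS` before applying.

```python
# Hedged sketch: push the HPC BIOS profile described above via iDRAC racadm.
# Attribute names are assumptions for a 14G PowerEdge BIOS; verify before use.
import subprocess

BIOS_SETTINGS = {
    "BIOS.ProcSettings.SubNumaCluster": "Enabled",        # Sub-NUMA clustering on
    "BIOS.ProcSettings.LogicalProc": "Disabled",          # logical processors (HT) off
    "BIOS.SysProfileSettings.SysProfile": "PerfOptimized",  # "Performance" system profile
}

def apply_bios_settings(settings: dict) -> None:
    for attribute, value in settings.items():
        subprocess.run(["racadm", "set", attribute, value], check=True)
    # Queue a BIOS configuration job; the settings take effect on the next reboot.
    subprocess.run(["racadm", "jobqueue", "create", "BIOS.Setup.1-1"], check=True)

if __name__ == "__main__":
    apply_bios_settings(BIOS_SETTINGS)
```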
The HPC Engineering team at the HPC and AI Innovation Lab performed a series of benchmark tests on NSS servers equipped with Cascade Lake processors and compared the results to those previously obtained from the NSS7.3-HA solution, which used PowerEdge servers equipped with the previous-generation "Skylake-SP" Xeon family of processors. The benchmark results and the comparison are presented in this blog.
The NFS storage solution provided by Dell EMC is optimized and tuned for best performance. While setting up the NSS7.4-HA solution, the following points should be noted:
- The minimum supported operating system for Cascade Lake processors is Red Hat Enterprise Linux 7.6. However, with kernel version 3.10.0-957.el7, NFS shares can hang with a task, such as kworker, consuming 100% of a CPU. The root cause is the TCP layer getting out of sync with the sunrpc layer's transport state. The issue is resolved in kernel-3.10.0-957.5.1.el7 or later, so the base operating system used for this solution is RHEL 7.6 with kernel-3.10.0-957.5.1.el7. Refer to https://access.redhat.com/solutions/3742871 for more details.
- For the NSS7.4-HA solution, unless the following packages are installed, the nfsserver resource fails to start because nfs-idmapd.service fails to start. Refer to https://access.redhat.com/solutions/3746891 for more details.
- resource-agents-4.1.1-12.el7_6.4
- resource-agents-aliyun-4.1.1-12.el7_6.4
- resource-agents-gcp-4.1.1-12.el7_6.4 or later.
- The RHEL 7.6 release notes draw attention to a bug in the I/O layer of LVM that causes data corruption in the first 128KB of allocatable space of a physical volume. The problem is solved in lvm2-2.02.180-10.el7_6.2 or later, so make sure the lvm2 package is updated to the latest version. If updating lvm2 is not an option, the workaround is to avoid LVM commands that change volume group (VG) metadata, such as lvcreate or lvextend, while logical volumes in the VG are in use. A minimal version-check sketch covering these requirements follows this list.
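The sketch below is a simple pre-deployment check for the kernel and package versions called out above. It only reports what is installed next to the documented minimums (taken from this blog) rather than attempting a full RPM version comparison.

```python
# Minimal sketch: report installed versions of the packages called out above
# alongside the documented minimums. Run on the NFS servers before deployment.
import platform
import subprocess

REQUIRED_MINIMUMS = {
    "resource-agents":        "4.1.1-12.el7_6.4",
    "resource-agents-aliyun": "4.1.1-12.el7_6.4",
    "resource-agents-gcp":    "4.1.1-12.el7_6.4",
    "lvm2":                   "2.02.180-10.el7_6.2",
}

def installed_version(package: str) -> str:
    """Return 'version-release' of an installed RPM, or 'not installed'."""
    result = subprocess.run(
        ["rpm", "-q", "--qf", "%{VERSION}-%{RELEASE}", package],
        capture_output=True, text=True)
    return result.stdout.strip() if result.returncode == 0 else "not installed"

print(f"running kernel: {platform.release()}  (required >= 3.10.0-957.5.1.el7)")
for package, minimum in REQUIRED_MINIMUMS.items():
    print(f"{package:24s} installed: {installed_version(package):24s} required >= {minimum}")
```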
NSS7.4-HA Architecture
Figure 1 shows the design of NSS7.4-HA. Except for necessary software and firmware updates, NSS7.4-HA and NSS7.3-HA share the same HA cluster configuration and storage configuration. The pair of NFS servers in an active-passive high-availability configuration is attached to the PowerVault ME4084. Each NFS server has dual SAS cards, and each card has a SAS cable to each controller in the shared storage, so a single SAS card or SAS cable failure does not impact data availability; a simple path-count check is sketched below. (Refer to the NSS7.3-HA white paper for more detailed information about the configuration.)
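As an illustration of that path redundancy, the sketch below counts how many SCSI block devices share the same WWN, which is one rough way to confirm that each shared ME4084 volume is reachable over more than one SAS path. The expected path count depends on the actual cabling and on whether dm-multipath has already claimed the devices; local RAID virtual disks will legitimately show a single path.

```python
# Hedged sketch: group sd devices by WWN and flag volumes visible on only one path.
import subprocess
from collections import defaultdict

def sas_paths_by_wwn() -> dict:
    """Map each disk WWN to the list of kernel block devices that expose it."""
    out = subprocess.run(
        ["lsblk", "-d", "-n", "-o", "NAME,WWN"],
        capture_output=True, text=True, check=True).stdout
    paths = defaultdict(list)
    for line in out.splitlines():
        fields = line.split()
        if len(fields) == 2:           # skip devices that report no WWN
            name, wwn = fields
            paths[wwn].append(name)
    return paths

for wwn, devices in sas_paths_by_wwn().items():
    status = "OK" if len(devices) >= 2 else "SINGLE PATH"
    print(f"{wwn}: {len(devices)} path(s) via {devices}  [{status}]")
```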
Figure 1: NSS7.4-HA Architecture
Comparison of components in NSS7.4-HA vs NSS7.3-HA
Although Dell NSS-HA solutions have received many hardware and software upgrades since the first NSS-HA release to offer higher availability, higher performance, and larger storage capacity, the architectural design and deployment guidelines of the NSS-HA solution family remain unchanged. The latest version and the earlier NSS7.3-HA share the same storage backend, the PowerVault ME4084. The following table compares the components of the latest NSS7.4-HA solution and the earlier NSS7.3-HA solution.
Table 1: Comparison of the components in NSS7.4-HA vs NSS7.3-HA
| Solution | NSS7.4-HA Release (June 2019) | NSS7.3-HA Release (October 2018) |
|---|---|---|
| NFS Server Model | 2x Dell EMC PowerEdge R740 | 2x Dell EMC PowerEdge R740 |
| Internal Connectivity | Gigabit Ethernet using Dell Networking S3048-ON | Gigabit Ethernet using Dell Networking S3048-ON |
| Storage Subsystem | Dell EMC PowerVault ME4084; 84 x 3.5" NL SAS drives, up to 12TB each; supports up to 1008TB (raw space); 8 LUNs, linear 8+2 RAID 6, chunk size 128KiB; 4 global HDD spares | Dell EMC PowerVault ME4084; 84 x 3.5" NL SAS drives, up to 12TB each; supports up to 1008TB (raw space); 8 LUNs, linear 8+2 RAID 6, chunk size 128KiB; 4 global HDD spares |
| Storage Connection | 12 Gbps SAS connections | 12 Gbps SAS connections |
| Processor | 2x Intel Xeon Gold 6240 @ 2.6 GHz, 18 cores per processor | 2x Intel Xeon Gold 6136 @ 3.0 GHz, 12 cores per processor |
| Memory | 12 x 16GiB 2933 MT/s RDIMMs | 12 x 16GiB 2666 MT/s RDIMMs |
| Operating System | Red Hat Enterprise Linux 7.6 | Red Hat Enterprise Linux 7.5 |
| Kernel version | 3.10.0-957.5.1.el7.x86_64 | 3.10.0-862.el7.x86_64 |
| Red Hat Scalable File System (XFS) | v4.5.0-18 | v4.5.0-15 |
| External Network Connectivity | Mellanox ConnectX-5 InfiniBand EDR/100 GbE and 10 GbE | Mellanox ConnectX-5 InfiniBand EDR and 10 GbE (the NSS7.3-HA solution blog used Mellanox ConnectX-4 IB EDR/100 GbE) |
| OFED Version | Mellanox OFED 4.5-1.0.1.0 | Mellanox OFED 4.4-1.0.0 |
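A quick back-of-the-envelope check of the storage layout in Table 1, assuming the 10TB drives used in the test bed (Table 2): 8 linear 8+2 RAID-6 LUNs plus 4 global spares account for all 84 drives, giving 840TB raw and 640TB of data capacity before RAID metadata and file-system overhead.

```python
# Capacity arithmetic for the ME4084 layout: 8 x (8+2) RAID-6 LUNs + 4 global spares.
DRIVE_TB = 10                 # 10TB NL SAS drives used in the test bed
TOTAL_DRIVES = 84
LUNS = 8
DATA_DRIVES_PER_LUN = 8       # 8+2 RAID 6: 8 data + 2 parity drives per LUN
PARITY_DRIVES_PER_LUN = 2
GLOBAL_SPARES = 4

drives_in_luns = LUNS * (DATA_DRIVES_PER_LUN + PARITY_DRIVES_PER_LUN)
assert drives_in_luns + GLOBAL_SPARES == TOTAL_DRIVES   # 80 + 4 = 84 drives

raw_tb = TOTAL_DRIVES * DRIVE_TB                        # 840 TB raw
usable_tb = LUNS * DATA_DRIVES_PER_LUN * DRIVE_TB       # 640 TB before overhead
print(f"raw: {raw_tb} TB, usable (pre-filesystem): {usable_tb} TB")
```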
In the rest of this blog, the test bed and the I/O performance of NSS7.4-HA are presented. To show the performance difference between NSS7.4-HA and the previous release, the corresponding NSS7.3-HA performance numbers are also presented.
Testbed Configuration
The testbed used to evaluate the performance and functionality of the NSS7.4-HA solution is described here. Note that the CPUs used for performance testing differ from those selected for the solution, since the Xeon Gold 6240 CPUs did not arrive in time for this work. The plan is to repeat some of the testing once the 6240 processors are available and amend this report as needed.
Table 2: NSS7.4-HA Hardware Configuration
| Server Configuration | |
|---|---|
| NFS Server Model | Dell PowerEdge R740 |
| Processor | 2x Intel Xeon Gold 6244 CPU @ 3.60GHz with 8 cores each |
| Memory | 12 x 16GiB 2933 MT/s RDIMMs |
| Local disks and RAID Controller | PERC H730P with five 300GB 15K SAS hard drives. Two drives are configured in RAID 1 for the OS, two drives are configured in RAID 0 for swap space, and the fifth drive is a hot spare for the RAID 1 disk group. |
| Mellanox EDR card (slot 8) | Mellanox ConnectX-5 EDR card |
| 1GbE Ethernet card (daughter card slot) | Broadcom 5720 QP 1 Gigabit Ethernet network daughter card, or Intel(R) Gigabit 4P I350-t rNDC |
| External storage controller (slot 1 and slot 2) | Two Dell 12Gbps SAS HBAs |
| Systems Management | iDRAC9 Enterprise |
| Storage Configuration | |
| Storage Enclosure | 1x Dell PowerVault ME4084 enclosure |
| RAID controllers | Duplex RAID controllers in the Dell ME4084 |
| Hard Disk Drives | 84 x 10TB 7.2K NL SAS drives per array, 84 x 10TB disks in total |
| Other Components | |
| Private Gigabit Ethernet switch | Dell Networking S3048-ON |
| Power Distribution Unit | Two APC switched rack PDUs, model AP7921B |
Table 3: NSS7.4-HA Server Software versions
| Component | Description |
|---|---|
| Operating System | Red Hat Enterprise Linux (RHEL) 7.6 x86_64 errata |
| Kernel version | 3.10.0-957.5.1.el7.x86_64 |
| Cluster Suite | Red Hat Cluster Suite from RHEL 7.6 |
| Filesystem | Red Hat Scalable File System (XFS) 4.5.0-18 |
| Systems Management tool | Dell OpenManage Server Administrator 9.3.0-3407_A00 |
Table 4: NSS7.4-HA Client Configuration
| Component | Description |
|---|---|
| Servers | 32x Dell EMC PowerEdge C6420 compute nodes |
| CPU | 2x Intel Xeon Gold 6148 CPU @ 2.40GHz with 20 cores per processor |
| Memory | 12 x 16GiB 2666 MT/s RDIMMs |
| Operating System | Red Hat Enterprise Linux Server release 7.6 |
| Kernel Version | 3.10.0-957.el7.x86_64 |
| Interconnect | Mellanox InfiniBand EDR |
| OFED version | 4.3-1.0.1.0 |
| ConnectX-4 firmware | 12.17.2052 |
NSS7.4-HA I/O performance summary
This section presents the results of the I/O performance tests for the current NSS7.4 solution. All performance tests were conducted in a failure-free scenario to measure the maximum capability of the solution. The tests focused on three types of I/O patterns: large sequential reads and writes, small random reads and writes, and three metadata operations (file create, stat, and remove). Like the previous version, NSS7.3-HA, the solution uses the deadline I/O scheduler and 256 NFS daemons.
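A rough sketch of how those two server-side tunings could be applied at runtime is shown below. The device names are placeholders, and on RHEL 7 the thread count is normally made persistent via RPCNFSDCOUNT in /etc/sysconfig/nfs rather than written to /proc as done here.

```python
# Hedged sketch: select the deadline elevator on the shared block devices and
# raise the running nfsd thread count to 256. Requires root; device names are
# placeholders for the ME4084 LUNs actually presented to the server.
import pathlib

SHARED_DEVICES = ["sdb", "sdc"]        # placeholder: the shared ME4084 LUNs
NFSD_THREADS = 256

for dev in SHARED_DEVICES:
    # Select the deadline I/O scheduler for each backing device.
    pathlib.Path(f"/sys/block/{dev}/queue/scheduler").write_text("deadline\n")

# Set the number of running nfsd threads (the NFS server must already be started).
pathlib.Path("/proc/fs/nfsd/threads").write_text(f"{NFSD_THREADS}\n")
```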
An 840TB (raw storage size) configuration was benchmarked with IPoIB network connectivity over EDR. A 32-node compute cluster was used to generate workload for the benchmarking tests. Each test was run over a range of clients to test the scalability of the solution.
The IOzone and mdtest benchmarks were used in this study. IOzone was used for the sequential and random tests. For sequential tests, a request size of 1024KiB was used, and the total amount of data transferred was 2TB to ensure that the NFS server cache was saturated. Random tests used a 4KiB request size, and each client read and wrote a 4GiB file. Metadata tests were performed using the mdtest benchmark with OpenMPI and included file create, stat, and remove operations. (Refer to Appendix A of the NSS7.3-HA white paper for the complete commands used in the tests.)
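The snippet below builds representative IOzone command lines for the sequential and random N-N tests described above. These are illustrative only, not the exact commands from the white paper's Appendix A; the client list file name and mount paths are placeholders, and the 2TB aggregate is treated as 2048 GiB so it divides evenly across power-of-two thread counts.

```python
# Hedged sketch: representative distributed IOzone invocations for the N-N tests.
# "clients.ioz" is a placeholder client list file with one
# "hostname  /nfs/mountpoint  /usr/bin/iozone" line per thread.
def sequential_cmd(threads: int, total_gib: int = 2048) -> str:
    size_per_thread = total_gib // threads          # 2 TiB aggregate, split evenly
    return (f"iozone -i 0 -i 1 -c -e -w -r 1024k "
            f"-s {size_per_thread}g -t {threads} -+n -+m clients.ioz")

def random_cmd(threads: int) -> str:
    # 4 KiB records, 4 GiB file per thread, random read/write (-i 2), results in ops/s.
    # Assumes the files were laid out by a prior write pass and kept with -w.
    return f"iozone -i 2 -w -O -r 4k -s 4g -t {threads} -+n -+m clients.ioz"

for n in (1, 2, 4, 8, 16, 32, 64):
    print(sequential_cmd(n))
    print(random_cmd(n))
```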
IPoIB sequential writes and reads N-N
To evaluate sequential reads and writes, the IOzone benchmark (version 3.487) was used in sequential read and write mode. The tests were conducted at multiple thread counts, starting at 1 thread and increasing in powers of 2 up to 64 threads. At each thread count an equal number of files was generated, since this test works on one file per thread (the N-N case). An aggregate file size of 2TB was selected and divided equally among the threads in any given test.
Figure 2 compares the sequential I/O performance of NSS7.4-HA with that of NSS7.3-HA. The figure shows that the latest NSS7.4 and the previous NSS7.3 have similar peak performance, with peak reads at ~7 GB/s and peak writes at ~5 GB/s. However, at some thread counts a 15-20% decrease in write performance was measured relative to NSS7.3-HA; investigation of this difference is in progress. Read performance increased by almost 45% at thread counts 1 and 2 and by 18% at thread count 8. For thread counts higher than 8, read performance is similar to that of NSS7.3-HA. The increase in read performance at lower thread counts is likely due to the hardware mitigations against side-channel attacks built into the Cascade Lake processors.
Figure 2: IPoIB large sequential I/O performance
IPoIB random writes and reads N-N
To evaluate random I/O performance, IOzone version 3.487 was used in random mode. Tests were conducted at thread counts from 1 up to 64, increasing in powers of 2, with a 4KiB record size. Each client read or wrote a 4GiB file to simulate small random data accesses. Since the cluster had only 32 nodes, the 64-thread data point was obtained with 32 clients running 2 threads each.
Figure 3 compares the random write and read I/O performance of NSS7.4-HA with that of NSS7.3-HA. The figure shows that NSS7.4 has a random write peak similar to NSS7.3-HA, ~7,300 IOPS. At the lower thread counts of 1 and 2, NSS7.4-HA write performance is approximately 14% lower than in the previous version of the solution; this is under investigation. Random read performance increases steadily on NSS7.4 and reaches a peak of 16,607 IOPS at 64 threads. In the previous release (NSS7.3-HA), a peak of 28,811 IOPS was achieved at 32 threads, so the NSS7.4-HA random read peak is about 42% lower.
Figure 3: IPoIB random I/O performance
IPoIB metadata operations
To evaluate the metadata performance of the system, the MDtest tool version 1.9.3 was used with OpenMPI version 1.10.7. The metadata tests created 960000 files for thread counts up to 32 and then increased the number of files to test the scalability of the solution, as tabulated in Table 5.
Table 5: Metadata Tests: Distribution of files and directories across threads
| Number of Threads | Number of Files per directory | Number of Directories per thread | Total number of files |
|---|---|---|---|
| 1 | 3000 | 320 | 960000 |
| 2 | 3000 | 160 | 960000 |
| 4 | 3000 | 80 | 960000 |
| 8 | 3000 | 40 | 960000 |
| 16 | 3000 | 20 | 960000 |
| 32 | 3000 | 10 | 960000 |
| 64 | 3000 | 8 | 1536000 |
| 128 | 3000 | 4 | 1536000 |
| 256 | 3000 | 4 | 3072000 |
| 512 | 3000 | 4 | 6144000 |
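A quick arithmetic check of Table 5: each total is simply threads x directories per thread x files per directory.

```python
# Recompute the "Total number of files" column of Table 5.
rows = [  # (threads, files per directory, directories per thread)
    (1, 3000, 320), (2, 3000, 160), (4, 3000, 80), (8, 3000, 40),
    (16, 3000, 20), (32, 3000, 10), (64, 3000, 8),
    (128, 3000, 4), (256, 3000, 4), (512, 3000, 4),
]
for threads, files_per_dir, dirs_per_thread in rows:
    total = threads * dirs_per_thread * files_per_dir
    print(f"{threads:4d} threads -> {total:>9d} files")
```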
Figure 4, Figure 5, and Figure 6 show the results of file create, stat, and remove operations, respectively. Since the HPC compute cluster has 32 compute nodes, in the graphs below each node executed at most one thread for thread counts up to 32. For thread counts of 64, 128, 256, and 512, each node executed 2, 4, 8, or 16 simultaneous operations, respectively.
For file creates, there is a 20% improvement in performance up to 16 threads; from 32 threads onwards, the performance of the two versions is similar.
Stat operations on NSS7.4 registered a 10% improvement at the lower thread counts (1, 2, 8, and 16) and a >30% decrease at the higher thread counts (64 to 512 threads).
Finally, remove operations showed a 14% decrease in performance up to 64 threads and a >20% decrease at the higher thread counts of 128, 256, and 512.
Figure 4: IPoIB file create performance
Figure 5: IPoIB file stat performance
Figure 6: IPoIB file remove performance
Conclusion
The following table summarizes the performance differences observed between the latest NSS7.4-HA and the NSS7.3-HA solutions.
Table 6: Comparison of the performance of NSS7.4-HA and NSS7.3-HA
| Dell EMC HPC NFS Storage | NSS7.4-HA | NSS7.3-HA |
|---|---|---|
| Seq. 1MB Writes Peak: 1.4% decrease | 4,834 MB/s | 4,906 MB/s |
| Seq. 1MB Reads Peak: 0.7% decrease | 7,024 MB/s | 7,073 MB/s |
| Random 4KB Writes Peak: 0.7% decrease | 7,290 IOPS | 7,341 IOPS |
| Random 4KB Reads Peak: 42% decrease | 16,607 IOPS | 28,811 IOPS |
| Create operations/second Peak: 1.1% decrease | 54,197 Op/s | 54,795 Op/s |
| Stat operations/second Peak: 35% decrease | 522,231 Op/s | 808,317 Op/s |
| Remove operations/second Peak: 35% decrease | 47,345 Op/s | 73,320 Op/s |
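For clarity, the "decrease" percentages in the table are the relative change of the NSS7.4-HA peak against the NSS7.3-HA peak, computed as shown below from the values in Table 6.

```python
# Derive the peak-difference percentages in Table 6 as (NSS7.3 - NSS7.4) / NSS7.3.
PEAKS = {  # metric: (NSS7.4-HA, NSS7.3-HA)
    "Seq. 1MB writes (MB/s)":   (4834, 4906),
    "Seq. 1MB reads (MB/s)":    (7024, 7073),
    "Random 4KB writes (IOPS)": (7290, 7341),
    "Random 4KB reads (IOPS)":  (16607, 28811),
    "File create (Op/s)":       (54197, 54795),
    "File stat (Op/s)":         (522231, 808317),
    "File remove (Op/s)":       (47345, 73320),
}
for metric, (new, old) in PEAKS.items():
    print(f"{metric:26s} {100 * (old - new) / old:5.1f}% decrease")
```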
From the above results, we can conclude that the current NSS7.4-HA solution provides performance comparable to that of its predecessor, NSS7.3-HA. We plan to repeat the benchmark tests with Xeon Gold 6240 CPUs (18 cores per processor) to determine whether the lower random read performance, and the decrease at higher thread counts in the file stat and file remove operations, are attributable to the smaller core count of the Xeon Gold 6244 CPUs (8 cores per processor) used to benchmark the NSS7.4-HA solution.
References
For detailed information about NSS-HA solutions, please refer to our published white papers.