Article written by Nirmala Sundararajan of the Dell EMC HPC and AI Innovation Lab in November 2019
Dell EMC Ready Solutions for HPC BeeGFS High Performance Storage
Tables 1 and 2 describe the hardware specifications of the management server and the metadata/storage servers, respectively. Table 3 describes the software versions used for the solution.
Table 1 PowerEdge R640 Configuration (Management Server)

Component | Details
---|---
Server | Dell EMC PowerEdge R640
Processor | 2x Intel Xeon Gold 5218 @ 2.3 GHz, 16 cores
Memory | 12x 8GB DDR4 2666 MT/s DIMMs - 96GB
Local Disks | 6x 300GB 15K RPM SAS 2.5in HDDs
RAID Controller | PERC H740P Integrated RAID Controller
Out of Band Management | iDRAC9 Enterprise with Lifecycle Controller
Power Supplies | Dual 1100W Power Supply Units
BIOS Version | 2.2.11
Operating System | CentOS™ 7.6
Kernel Version | 3.10.0-957.27.2.el7.x86_64
Table 2 PowerEdge R740xd Configuration (Metadata and Storage Servers)

Component | Details
---|---
Server | Dell EMC PowerEdge R740xd
Processor | 2x Intel Xeon Platinum 8268 @ 2.90 GHz, 24 cores
Memory | 12x 32GB DDR4 2933 MT/s DIMMs - 384GB
BOSS Card | 2x 240GB M.2 SATA SSDs in RAID 1 for OS
Local Drives | 24x Dell Express Flash NVMe P4600 1.6TB 2.5" U.2
Mellanox EDR Card | 2x Mellanox ConnectX-5 EDR cards (Slots 1 & 8)
Out of Band Management | iDRAC9 Enterprise with Lifecycle Controller
Power Supplies | Dual 2000W Power Supply Units
Table 3 Software Configuration (Metadata and Storage Servers)

Component | Version
---|---
BIOS | 2.2.11
CPLD | 1.1.3
Operating System | CentOS™ 7.6
Kernel Version | 3.10.0-957.el7.x86_64
iDRAC | 3.34.34.34
Systems Management Tool | OpenManage Server Administrator 9.3.0-3407_A00
Mellanox OFED | 4.5-1.0.1.0
NVMe SSD Firmware | QDV1DP13
Intel® Data Center Tool | 3.0.19
BeeGFS | 7.1.3
Grafana | 6.3.2
InfluxDB | 1.7.7
IOzone Benchmark | 3.487
$ cat /etc/beegfs/beegfs-mounts.conf
/mnt/beegfs-medium /etc/beegfs/beegfs-client-medium.conf
/mnt/beegfs-small /etc/beegfs/beegfs-client-small.conf

The above example shows two different file systems mounted on the same client. For the purpose of this testing, 32x C6420 nodes were used as clients.
Figure 8 shows the testbed, where the InfiniBand connections to the NUMA zones are highlighted. Each server has two IP links: traffic through the NUMA 0 zone is handled by interface IB0, while traffic through the NUMA 1 zone is handled by interface IB1. Automatic NUMA balancing was disabled (kernel.numa_balancing = 0), as shown below:

# cat /proc/sys/kernel/numa_balancing
0
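The NUMA node that each InfiniBand adapter is attached to can be verified through sysfs. The mlx5_0/mlx5_1 device names below are illustrative and depend on the system; for the topology described above, one adapter would report NUMA node 0 and the other NUMA node 1:

# cat /sys/class/infiniband/mlx5_0/device/numa_node
0
# cat /sys/class/infiniband/mlx5_1/device/numa_node
1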
Table 4 Client Configuration

Component | Details
---|---
Clients | 32x Dell EMC PowerEdge C6420 Compute Nodes
BIOS | 2.2.9
Processor | 2x Intel Xeon Gold 6148 @ 2.40 GHz with 20 cores per processor
Memory | 12x 16GB DDR4 2666 MT/s DIMMs - 192GB
BOSS Card | 2x 120GB M.2 boot drives in RAID 1 for OS
Operating System | Red Hat Enterprise Linux Server release 7.6
Kernel Version | 3.10.0-957.el7.x86_64
Interconnect | 1x Mellanox ConnectX-4 EDR card
OFED Version | 4.5-1.0.1.0
To evaluate sequential reads and writes, the IOzone benchmark was used in the sequential read and write mode. These tests were conducted at multiple thread counts, starting at 1 thread and increasing in powers of 2 up to 1024 threads. At each thread count an equal number of files was generated, since this test works on one file per thread, that is, the N clients to N files (N-N) case. The processes were distributed across the 32 physical client nodes in a round-robin fashion so that the requests were equally distributed and load balanced. An aggregate file size of 8TB was selected and divided equally among the threads within any given test. The aggregate file size was chosen to be large enough to minimize the effects of caching from the servers as well as from the BeeGFS clients. IOzone was run in a combined mode of write then read (-i 0, -i 1) to allow it to coordinate the boundaries between the operations. For this testing and the reported results, a 1MiB record size was used for every run. The commands used for the sequential N-N tests are given below:
Sequential Writes and Reads: iozone -i 0 -i 1 -c -e -w -r 1m -I -s $Size -t $Thread -+n -+m /path/to/threadlist
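A minimal sketch of how the -+m thread list and the per-thread file size can be generated is shown below. The hostnames (node001-node032), the iozone binary path and the 64-thread run are illustrative assumptions, not values taken from the solution:

#!/bin/bash
# Sketch: build the IOzone "-+m" thread list, one line per thread in the form
#   <client hostname> <working directory> <path to iozone binary>
# Threads are assigned to the clients in round-robin order.
THREADS=64                                # thread count for this run (assumed)
CLIENTS=(node{001..032})                  # client hostnames (assumed)
: > threadlist
for i in $(seq 0 $((THREADS - 1))); do
    host=${CLIENTS[$((i % ${#CLIENTS[@]}))]}
    echo "${host} /mnt/beegfs/benchmark /usr/bin/iozone" >> threadlist
done

# Per-thread file size: the 8TB aggregate (treated here as 8 TiB) divided
# equally among the threads, e.g. 64 threads -> 131072m (128GiB) per file.
SIZE="$((8 * 1024 * 1024 / THREADS))m"
iozone -i 0 -i 1 -c -e -w -r 1m -I -s "$SIZE" -t "$THREADS" -+n -+m ./threadlist

Each line of the thread list tells IOzone which client to launch a thread on, which directory that thread works in, and where to find the iozone executable on that client.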
OS caches were also dropped on the client nodes between iterations, as well as between the write and read tests, by running the following command:
# sync && echo 3 > /proc/sys/vm/drop_caches
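For example, to drop the caches on all 32 clients in one step, a parallel shell such as pdsh can be used (the node names are illustrative):

# pdsh -w node[001-032] 'sync && echo 3 > /proc/sys/vm/drop_caches'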
The default stripe count for BeeGFS is 4. However, the chunk size and the number of targets per file (stripe count) can be configured on a per-directory basis. For all these tests, the BeeGFS chunk size was set to 2MB and the stripe count to 3, since there are three storage targets per NUMA zone, as shown below:
$ beegfs-ctl --getentryinfo --mount=/mnt/beegfs /mnt/beegfs/benchmark --verbose
EntryID: 0-5D9BA1BC-1
ParentID: root
Metadata node: node001-numa0-4 [ID: 4]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 2M
+ Number of storage targets: desired: 3
+ Storage Pool: 1 (Default)
Inode hash path: 7/5E/0-5D9BA1BC-1
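The stripe pattern shown above is set on the benchmark directory with beegfs-ctl; a minimal sketch, assuming the directory has already been created on the mounted file system:

# beegfs-ctl --setpattern --chunksize=2m --numtargets=3 --mount=/mnt/beegfs /mnt/beegfs/benchmark

The --mount option selects the correct client configuration when, as here, more than one BeeGFS instance is mounted on the client.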
Transparent huge pages were disabled, and the following tuning options were in place on the metadata and storage servers:
- vm.dirty_background_ratio = 5
- vm.dirty_ratio = 20
- vm.min_free_kbytes = 262144
- vm.vfs_cache_pressure = 50
- vm.zone_reclaim_mode = 2
- kernel.numa_balancing = 0
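One possible way to apply and persist these kernel settings is through a sysctl drop-in file (the file name below is illustrative):

# cat /etc/sysctl.d/99-beegfs-tuning.conf
vm.dirty_background_ratio = 5
vm.dirty_ratio = 20
vm.min_free_kbytes = 262144
vm.vfs_cache_pressure = 50
vm.zone_reclaim_mode = 2
kernel.numa_balancing = 0
# sysctl -p /etc/sysctl.d/99-beegfs-tuning.conf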
In addition to the above, the following BeeGFS tuning options were used:
In Figure 9, we see that the peak read performance is 132 GB/s at 1024 threads and the peak write performance is 121 GB/s at 256 threads. Each drive can provide 3.2 GB/s peak read and 1.3 GB/s peak write performance, which gives a theoretical aggregate of 422 GB/s for reads and 172 GB/s for writes. However, here the network is the limiting factor: the storage servers have a total of 11 InfiniBand EDR links in the setup, and each link can provide a theoretical peak of 12.4 GB/s, for an aggregate theoretical peak of 136.4 GB/s. The achieved peak read and write performance are therefore 97% and 89%, respectively, of the theoretical peak.
The single-thread write performance is observed to be ~3 GB/s, and the single-thread read performance is also ~3 GB/s. The write performance scales linearly, peaks at 256 threads and then starts decreasing. At lower thread counts, read and write performance are the same, because up to 8 threads we have 8 clients writing 8 files across 24 targets, which means not all storage targets are being utilized. There are 33 storage targets in the system, so at least 11 threads are needed to utilize all of them (each file is striped across 3 targets). The read performance registers a steady linear increase with the number of concurrent threads, and we observe almost the same performance at 512 and 1024 threads.
We also observe that the read performance is lower than the write performance for thread counts from 16 to 128, after which the read performance starts scaling. This is because a PCIe read operation is a non-posted operation, requiring both a request and a completion, while a PCIe write operation is a "posted", fire-and-forget operation consisting of a request only: once the Transaction Layer Packet is handed over to the Data Link Layer, the operation completes.
Read throughput is typically lower than write throughput because reads require two transactions instead of a single write for the same amount of data. PCI Express uses a split transaction model for reads: the requester issues a read request, and the completer later returns the requested data in one or more completion packets.
The read throughput therefore depends on the delay between the time the read request is issued and the time the completer returns the data. When the application issues enough read requests to cover this delay, throughput is maximized. That is why, although the read performance is lower than the write performance from 16 to 128 threads, we measure increasing throughput as the number of outstanding requests grows: lower throughput is measured when the requester waits for a completion before issuing subsequent requests, and higher throughput is registered when multiple requests are issued to amortize the delay after the first data returns.
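To put a rough, purely illustrative number on this, assume an end-to-end read completion latency of about 100 µs (the request reaches the storage server, data is read from NVMe, and the completion returns to the client); this latency figure is an assumption, not a measured value. A single EDR link at 12.4 GB/s would then need roughly 12.4 GB/s × 100 µs ≈ 1.2 MB of read data outstanding at all times, or about 13 MB across the 11 links, i.e. more than a dozen 1MiB requests continuously in flight. That level of outstanding IO is only reached once enough client threads are issuing reads concurrently.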
To evaluate random IO performance, IOzone was used in random mode. Tests were conducted at thread counts from 4 up to 1024 threads. The Direct IO option (-I) was used so that all operations bypass the buffer cache and go directly to the disks. A BeeGFS stripe count of 3 and a chunk size of 2MB were used, along with a 4KiB request size in IOzone. Performance is measured in I/O operations per second (IOPS). The OS caches were dropped between runs on the BeeGFS servers as well as the BeeGFS clients. The command used for the random write and read tests is given below:
Random reads and writes: iozone -i 2 -w -c -O -I -r 4K -s $Size -t $Thread -+n -+m /path/to/threadlist
Figure 10: Random Read and Write Performance using IOzone with 8TB aggregate file size
The random writes peak at ~3.6 million IOPS at 512 threads and the random reads peak at ~3.5 million IOPS at 1024 threads, as shown in Figure 10. Both write and read performance are higher when there is a larger number of IO requests in flight. This is because the NVMe standard supports up to 64K I/O queues and up to 64K commands per queue. This large pool of NVMe queues provides a high level of I/O parallelism, which is why we observe IOPS exceeding 3 million.
This blog announces the release of the Dell EMC High Performance BeeGFS Storage Solution and highlights its performance characteristics. The solution has a peak sequential read and write performance of ~132 GB/s and ~121 GB/s respectively and the random writes peak at ~3.6 Million IOPS and random reads at ~3.5 Million IOPS.
This blog is part one of the BeeGFS storage solution series; the solution has been designed with a focus on high-performance scratch space. Stay tuned for Part 2 of the blog series, which will describe how the solution can be scaled by adding servers to increase performance and capacity. Part 3 of the blog series will discuss additional features of BeeGFS and will highlight the use of "StorageBench", the built-in storage targets benchmark of BeeGFS.
As a next step, we will publish a white paper covering the metadata performance and the N threads to 1 file (N-1) IOR performance, along with additional details about design considerations, tuning and configuration.