The Data Accelerator (DAC) is a non-proprietary solid-state burst buffer developed through a collaboration between the University of Cambridge, Dell EMC, Intel, and StackHPC. Its goal is to accelerate data-intensive workloads that exceed the bandwidth and I/O operation capabilities of magnetic-media local or centralized network storage. The DAC integrates with the job scheduler so that multiple users can each access their own temporary burst buffer based on requested capacity and available resources. Workflows can be further enhanced by access to the features offered by a high-performance parallel file system. The transient nature of this solution also enables other use cases, such as isolating "noisy neighbor" workloads that consume resources and interfere with other users within the cluster ecosystem.
This technical blog discusses the reference architecture, performance results from initial IOZone benchmarks, preliminary insights, and future work.
In Figure 1 below, the base hardware configuration for this DAC reference architecture is one Dell EMC PowerEdge R740xd with 24 Intel P4610 solid-state drives and two Mellanox InfiniBand ConnectX-6 HDR100 adapters connected to a Mellanox Quantum HDR switch with passive copper HDR100 "splitter" cables. Identical server configurations are added based on capacity and performance requirements. The area within the green-dotted frame in Figures 1 and 2 indicates the DAC components that integrate into an existing HPC cluster.
The Technical Specifications section lists the details of the hardware. The core components in this example include:
Figure 1 Hardware Architecture
Figure 2 below extends the hardware layout of Figure 1 with a high-level logical workflow of the Data Accelerator software. The resource management component is provided by etcd, an open-source distributed key-value store. The dacctl (Slurm interface) and dacd (DAC server) processes interface with etcd for state cohesion as SSDs and DAC servers are allocated before buffer creation and released after buffer teardown. The DAC controls integrate with the Slurm burst buffer DataWarp plugin. End users submit a job on a login node using conventional Slurm mechanisms such as sbatch and srun, which contain the interface directives (examples are provided in the Performance section). These requests activate the burst buffer features of the Slurm plugin. A callout is issued using the dacctl command-line tool, which in turn activates the Orchestrator mechanisms on the dacd daemon running on the DAC servers.
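From a login node, the buffer pools known to the plugin and the state of a submitted job can be checked with standard Slurm commands. A minimal sketch (the job ID is a placeholder):

$ scontrol show burstbuffer
$ squeue -j <jobid> -o "%.10i %.10T %.30r"

The first command lists the burst buffer pools and allocations visible to Slurm; the second shows whether the job is still PENDING (for example, during buffer creation or stage in) or RUNNING.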
There are two lifecycles that DAC burst buffers have available:
In addition to the lifecycles, there are three modes of operation for each buffer:
For this reference architecture, we cover the use of a Per Job Lustre file system in Striped mode, accessed through the $DW_JOB_STRIPED environment variable. Because the buffers are transient and intended to enhance the performance of existing central storage, no drive redundancy features are used. This ensures maximum utilization of SSD capacity and eliminates the overhead of RAID calculations.
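Because each buffer is a regular Lustre namespace, standard Lustre client tooling can be used inside it. For example, a minimal sketch (the results directory is hypothetical) that stripes new files across all OSTs in the buffer and then verifies the layout:

$ mkdir -p $DW_JOB_STRIPED/results
$ lfs setstripe -c -1 $DW_JOB_STRIPED/results
$ lfs getstripe -d $DW_JOB_STRIPED/results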
Figure 2 Software Architecture
Administrators manually mount any existing file system on the DAC servers for the optional stage in (SI/ingress) and stage out (SO/egress) of user data. Once Slurm determines the next job to run, it calls out the request to the Orchestrator. While the job remains in a PENDING state, dacd initiates the creation of a transient Lustre file system (burst buffer) along with the optional stage in of data specified by the user. The Orchestrator uses Ansible within a Python virtual environment to run playbooks that create a parallel file system of the size requested in the end user's job submission. After creating the file system, the Orchestrator mounts (using restricted sudo privileges) the newly created buffer onto the compute nodes allocated by Slurm, and the job is set to a RUNNING state. The job's environment contains the path to the file system, which the user references to access the namespace. Once the user job completes or is terminated, Slurm calls out using dacctl to unmount the file system on the designated compute nodes, move the data to the central data storage, tear down the burst buffer, and deallocate the devices in etcd. The optional staging of data is controlled by directives in the batch script like so:
#DW stage_in type=directory source=/home/user/data-in/ destination=$DW_JOB_STRIPED/data
#DW stage_out type=directory source=$DW_JOB_STRIPED/data destination=/home/user/data-out/
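Putting these pieces together, a minimal sketch of a job script is shown below; the paths and application name (my_app) are hypothetical, and the buffer itself is requested on the sbatch command line exactly as shown in the performance tests later in this blog:

#!/bin/bash
#SBATCH --nodes=8
#SBATCH --job-name=dac-example
#DW stage_in type=directory source=/home/user/data-in/ destination=$DW_JOB_STRIPED/data
#DW stage_out type=directory source=$DW_JOB_STRIPED/data destination=/home/user/data-out/

# The per-job striped Lustre buffer is mounted at $DW_JOB_STRIPED on every allocated compute node
cd $DW_JOB_STRIPED/data
srun ./my_app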
In the current release of DAC with Lustre, each server acts as the MDS, OSS, and client (for data ingress/egress), with each SSD partitioned to contain an MDT and an OST. The MDT size is a dacd configurable setting that an administrator can override based on anticipated inode requirements; the remainder of the space is allocated to the OST.
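Once a buffer is mounted, the resulting MDT and OST layout, including how much of each SSD was assigned to metadata versus data, can be inspected from any client with standard Lustre tooling, for example:

$ lfs df -h $DW_JOB_STRIPED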
Table 1 lists the specifications used in this reference architecture.
Table 1 Technical Specifications
Specifications

Server Configuration
Component | Details
Server Model | Dell EMC PowerEdge R740xd
Processor |
Memory |
System |
Local Disks (Storage) | 24x 1.6 TB Intel P4610 NVMe SSD
Network Adapter | 2x Mellanox ConnectX-6 InfiniBand HDR100
IPMI |

Software Configuration
Component | Details
Operating System |
Kernel |
InfiniBand |
Orchestrator | Data Accelerator (dacctl / dacd)
Batch Scheduler | Slurm
Resource Manager | etcd
Parallel Filesystem | Lustre
Virtual Environment | Python virtual environment (Ansible)

Network Configuration
Component | Details
Storage | Mellanox InfiniBand HDR100 (Mellanox Quantum HDR switch)
Administration |
DAC's device allocation is dynamic because the file system layout is based on the burst buffer capacity requested by the end user. A base DAC configuration requires only a single server, but to demonstrate scaling, this blog includes two- and four-server configurations as well. These three server quantities were chosen based on the available maximum of eight compute nodes. All eight compute nodes are used with each server quantity, which produces different network and device ratios. Each iteration within the test series increases the number of processes by powers of 2, from 1 to 512, divided between all compute nodes. Identical IOZone commands were used while changing only the quantity of DAC servers. The IOPS shown are 4K random block accesses; no metadata operations are implied.
Each chart’s x-axis includes a data table summary with raw results. The contents are represented as follows:
Chart Element | Measurement | Color
Lines | 1 MB sequential block I/O performance of OSTs | Write (Blue)
Columns | 4K random block I/O performance of OSTs | Write (Gray)
Max Line | Maximum theoretical network bandwidth | White
Note that these results are all based on system default settings, using a standard set of IOZone options that is used across Dell EMC storage solutions for comparative purposes (a representative invocation is sketched after the option list below). Lustre optimizations and NUMA binding will be performed later. All tests in this demonstration use the DirectIO option to avoid effects from caching features intrinsic to the file system. Also, the SSD capacity granularity in dacd was kept at 1.4 TB, lower than the available capacity, to allow for other tests involving MDT sizing.
-i 0=Write, 1=Read, 2=Random Access Test
-c Include file closure times
-e Includes flush in timing calculations
-w Does not unlink (delete) temporary files
-r Record size
-s File size (1024g / num_proc)
-t Number of processes
-+n No retest
-+m Machine file
-I Use O_DIRECT, bypass client cache
-O Give results in ops/sec
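For reference, the sketch below shows one way these options could be assembled into the test sweep described above, run from inside the job script. The machine-file layout (hostname, working directory, path to the IOZone binary), the IOZone install path, and the loop structure are illustrative assumptions, not the exact script used for these results:

# Build an IOZone machine file from the Slurm allocation, interleaving hosts so that
# processes are divided between all compute nodes (format: hostname  work-dir  iozone-path)
: > machinefile
for i in $(seq 1 64); do
    for host in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
        echo "$host $DW_JOB_STRIPED /usr/local/bin/iozone" >> machinefile
    done
done

# Sweep process counts by powers of 2; per-process file size keeps the aggregate at 1024 GiB
for NPROC in 1 2 4 8 16 32 64 128 256 512; do
    SIZE=$((1024 / NPROC))
    # Sequential write/read with 1 MB records
    iozone -i 0 -i 1 -c -e -w -+n -I -O -r 1m -s ${SIZE}g -t $NPROC -+m ./machinefile
    # Random access with 4 KB records (test 0 creates the files that test 2 uses)
    iozone -i 0 -i 2 -c -e -w -+n -I -O -r 4k -s ${SIZE}g -t $NPROC -+m ./machinefile
done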
One server has a maximum theoretical performance of 25 GB/s with two HDR100 ports, indicated by the white line. The eight compute nodes, each with a single EDR port, can saturate all ports on the server by 2:1. The sequential bandwidth reached ~23.9 GB/s Write and ~23.5 GB/s Read with an aggregate of 128 processes, or 128/8 = 16 processes per compute node. The random block operations approached sustained performance at 128 processes. The test was initiated by activating the dacd service on a single server only, requesting a buffer size equal to the aggregate configured capacity of all 24 SSDs, and requesting all eight compute nodes. This ratio of server to compute nodes appears sufficient to saturate the available performance.
$ sbatch --nodes 8 --job-name="1server" --bb "capacity=33600GiB access_mode=striped type=scratch" IOZone_cli.batch
Chart 1 One Server Performance
Building on the one-server configuration, the same set of IOZone tests was run; the only change is that one additional DAC server was activated and added to the unallocated buffer pool. The ratio of network interfaces between servers and clients is now 1:1. Sequential bandwidth was again saturated at around ~95%, with 47.6 GB/s Write and 47.1 GB/s Read. With only 512 processes, the random operations appear to start leveling off, but additional compute nodes and processes are needed to find the sustained rate. The test was initiated by activating the dacd service on two servers only, requesting a buffer size equal to the aggregate configured capacity of all 48 SSDs, and requesting all eight compute nodes. This ratio of servers to compute nodes appears sufficient to saturate network bandwidth, but insufficient to sustain a saturation level for IOPS.
$ sbatch --nodes 8 --job-name="2server" --bb "capacity=67200GiB access_mode=striped type=scratch" IOZone_cli.batch
Chart 2 Two Server Performance
Again building on the one-server configuration, the same set of IOZone tests was launched; the only change is that three additional DAC servers were activated and added to the unallocated buffer pool. The ratio of the sixteen server and eight client network interfaces is now 0.5:1. Sequential bandwidth peaked at ~92.7 GB/s Write and ~76.8 GB/s Read. The bandwidth is bound by the number of clients and processes, with insights explained later in this blog. Based on previous results, four to eight additional clients would approach sequential bandwidth saturation. Random operations continue an upward trend but do not achieve sustained levels with only 512 processes; more compute nodes and processes are needed to find the sustained levels of random operations. Perhaps 32-64 compute nodes with a couple of thousand processes would be ideal to observe the trend for block IOPS. This test was initiated by activating the dacd service on four servers only, requesting a buffer size equal to the aggregate configured capacity of all 96 SSDs, and requesting all eight compute nodes.
$ sbatch --nodes 8 --job-name="4server" --bb "capacity=134400GiB access_mode=striped type=scratch" IOZone_cli.batch
Chart 3 Four Server Performance
This section is meant to provide a preliminary explanation of some performance aspects.
Read performance and PCI Express:
At process counts of 16 and greater, we observe read performance lower than write. This is due to the Non-Posted nature of a PCIe read, which requires two transactions, a request and a completion, as opposed to the single transaction of a Posted PCIe write: once the write's Transaction Layer Packet is handed over to the Data Link Layer, the operation completes. The effect diminishes when more requests are in flight, which compensates for the average latency of the split transactions. The read latency effect becomes more prominent as more storage resources are added while the quantity of compute nodes and processes stays fixed.
PCIe Backplane:
Each server contains two PLX PCIe switches; each switch connects 12 Intel P4610 NVMe SSDs to one CPU over an x16 PCIe uplink. Based on PCIe lanes, theoretically four NVMe devices match the lanes of the PCIe expander and avoid over-subscription. Based on bandwidth, five devices (at ~3.2 GB/s, the raw read/write performance of the P4610) will saturate what is available to each PCIe switch. These aspects should be considered along with the number of active burst buffers and the workload type, as one size does not fit all.
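As a rough check, assuming approximately 16 GB/s of usable bandwidth for a PCIe Gen3 x16 uplink: four x4 NVMe devices consume all 16 lanes, and 16 GB/s divided by ~3.2 GB/s per P4610 is roughly five drives, which is why about five concurrently active drives can saturate each PLX switch even though twelve are attached.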
Based on the results of the One Server and Two Server tests, the performance scales linearly, and the Four Server test supports this trend. Linear scaling is a factor to consider, but there are other aspects worth mentioning.
A few examples to consider for scaling center on workload type: capacity, sequential performance, and random-access performance. Regarding capacity, some highly sequential workloads may only need access to higher-throughput storage with sufficient space to contain large amounts of transient data; in this case, the configuration can fill all 24 storage slots of the R740xd with larger-capacity drives. For workloads that require consistently high sequential throughput, network interfaces are increased by adding DAC servers to meet the performance expectations derived from the single-server baseline; for example, an estimated 45 GB/s can be provided by two DAC servers with four HDR interfaces. Workloads with random-access operations must take into account the total quantity of compute nodes accessing the servers simultaneously. The baseline results with DirectIO indicate that a smaller quantity of servers, network interfaces, and drives is sufficient for sequential I/O, but cluster size contributes to scaling for random operations; in the IOZone results above, more than eight compute nodes are required to saturate the IOPS of four DAC servers.
There is also the possibility of hybrid configurations, which require consideration of all three workload types. Those server adjustments ultimately depend on the factors explained above, together with rack space, network switch capacity, and cost.
Users requesting a burst buffer need some preliminary capacity estimates. Since the file system resources are allocated when the job is marked next for launch, a resize of the buffer is not possible. It is important to request a capacity high enough to prevent a job failure from out-of-space errors and low enough not to waste resources. Prior to using the buffer, running the Linux "du -s" tool on input directories can help estimate the capacity needed to stage into the buffer, and periodic capture of this data during a sample job will also help gauge consumed capacity. Alternatively, if the source file system supports quotas, the user can query that information periodically during the job. Usually, the user needs the aggregate of the capacity of the data staged in and the data produced for staging back to the central data storage.
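A minimal sketch of both estimation steps, with hypothetical paths, a hypothetical application (my_app), and a ten-minute sampling interval:

# Before submitting: size of the data to be staged in
$ du -sh /home/user/data-in/

# Inside the job script: sample buffer usage while the application runs
( while true; do df -h $DW_JOB_STRIPED >> $SLURM_SUBMIT_DIR/buffer_usage.log; sleep 600; done ) &
SAMPLER_PID=$!
srun ./my_app
kill $SAMPLER_PID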
Another consideration is for user workloads known to exceed inode availability. When more inodes are required, a larger buffer size allocates additional SSDs and inherently scales up the metadata capacity. If this condition is frequent, the administrator can adjust the MDT sizing as required to optimize the balance between capacity and inodes. One example is a job with a high rate of creates and deletes, on the order of millions of files, that can exceed the available inodes. Another is unpacking several large source trees, where the required inode count is not obvious in advance.
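When inode pressure is suspected, inode usage and availability per MDT in the buffer can be sampled during the job with standard Lustre tooling, for example:

$ lfs df -i $DW_JOB_STRIPED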
In this blog, we have demonstrated that a single DAC server with 24 1.6 TB NVMe SSDs can produce an IOZone DirectIO baseline performance of ~23.9 GB/s sequential write, ~23.5 GB/s sequential read, ~346K random write IOPS, and ~478K random read IOPS of block I/O (non-metadata). The transient nature of multiple burst buffers allows the solution to avoid the overhead of RAID features and maximize both performance and the usable capacity of 38 TB.
Next steps for this solution are to expand on alternative configurations and to upgrade the test hardware and software. More compute nodes will be added, and the Mellanox OFED and Lustre versions will be upgraded to versions determined at a future date. As these performance results provide a default baseline, additional efforts will focus on performance optimizations to eliminate the impact of NUMA misses.
The dependencies listed above allow for an expanded performance profile with the addition of compute nodes for the following tests:
IOZone: with and without DirectIO
IOR: Nto1 (Single Shared File)
MDTest: for metadata OPs
This article was written by Ari Martinez, HPC and AI Innovation Lab, June 2020