
Article Number: 000122853


Dell EMC Data Accelerator Reference Architecture

Article Content


Symptoms

This article was written by Ari Martinez, HPC and AI Innovation Lab, June 2020

Resolution

Table of Contents

Introduction
Reference Architecture
Hardware
Software
Technical Specifications
Performance Evaluation
Summary of Testing Methodology
IOZone - One Server
IOZone - Two Servers
IOZone - Four Servers
Insights
Scalability
Best Practices
Conclusion and Future Work
References

 

Introduction

The Data Accelerator (DAC) is a non-proprietary solid-state burst buffer developed in collaboration between the University of Cambridge, Dell EMC, Intel, and StackHPC.  Its goal is to significantly boost data-intensive workloads that exceed the bandwidth and I/O operation capabilities of magnetic-media-based local or centralized network storage.  The DAC integrates with the job scheduler so that multiple users can each access their own temporary burst buffer, sized according to requested capacity and available resources.  Workflows can be further enhanced by access to features offered by high-performance parallel filesystems.  The transient nature of this solution offers other possible use cases, such as isolating "noisy neighbor" workloads that consume resources and interfere with other users within the cluster ecosystem.

This technical blog discusses the reference architecture, performance results from initial IOZone benchmarks, preliminary insights, and future work.

 

Reference Architecture

Hardware

As shown in Figure 1 below, the base hardware configuration for this DAC reference architecture is one Dell EMC PowerEdge R740xd with 24 Intel P4610 solid-state drives and two Mellanox InfiniBand ConnectX-6 HDR100 adapters, connected to a Mellanox Quantum HDR switch with passive copper HDR100 "splitter" cables.  Identical server configurations are added based on capacity and performance requirements.  The area within the green-dotted frame in Figures 1 and 2 indicates the DAC components that integrate into an existing HPC cluster.

The Technical Specifications section lists the details of the hardware.  The high-level summary of core components in this example includes:

 

  • Job Scheduler:  Slurm resides on a PowerEdge R640 with Bright Cluster Manager 8.1
  • Compute Nodes: 8 x Dell C6420 with Mellanox ConnectX-5 EDR
  • DAC Servers:  4 x Dell EMC PowerEdge R740xd
  • High Speed Network: 1 x Mellanox HDR MQM8700-HS2R Switch
  • Management/Administration Network:  1 x Dell EMC PowerConnect Gigabit Switch
     


Figure 1 Hardware Architecture

 

Software

Figure 2 below extends the hardware layout of Figure 1 with a high-level logical workflow of the Data Accelerator software.  The resource management component is provided by etcd, an open-source distributed key-value store.  The dacctl (Slurm interface) and dacd (DAC server) processes interface with etcd for state cohesion as SSDs and DAC servers are allocated before buffer creation and released after buffer teardown.  The DAC controls integrate with the Slurm burst buffer DataWarp plugin.  End users submit a job on a login node via conventional Slurm mechanisms such as sbatch and srun, with job scripts containing the interface directives (examples are provided in the Performance Evaluation section).  These requests activate the burst buffer features of the Slurm plugin: a callout is issued via the dacctl command-line tool, which in turn drives the Orchestrator mechanisms through the dacd daemon on the DAC servers.

There are two lifecycles that DAC burst buffers have available:

  1. Per Job: a namespace created for one user job
  2. Persistent: a namespace created and shared across multiple jobs designated by the same user
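
As a hedged illustration of the persistent lifecycle (the directive syntax below follows the Slurm burst buffer/DataWarp documentation referenced at the end of this blog; the buffer name and capacity are hypothetical, and support should be verified against the DAC release in use), one batch script creates the shared buffer:

#BB create_persistent name=mybuffer capacity=3200GiB access=striped type=scratch

and subsequent batch scripts from the same user attach to it with:

#DW persistentdw name=mybuffer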

In addition to the lifecycles, there are three modes of operation for each buffer:

  1. Striped: a single global namespace is mounted on all allocated compute nodes via a directory containing the job id number
  2. Private: a namespace is granted to all allocated nodes via a subdirectory for each compute node
  3. Swap: a loopback file created to extend memory (under development)

For this reference architecture we cover the use of per-job Lustre filesystems in Striped mode, accessed through the $DW_JOB_STRIPED environment variable.  Because the buffer lifecycles are transient and intended to augment the performance of existing central storage, no drive redundancy features are used.  This ensures maximum utilization of SSD capacity and eliminates the overhead of RAID calculations.


Figure 2 Software Architecture

Administrators manually mount any existing filesystems on the DAC servers for the optional stage in (SI/ingress) and stage out (SO/egress) of user data.  Once Slurm determines the next job to run, it calls out the request to the Orchestrator.  While the job remains in a PENDING state, dacd initiates the creation of a transient Lustre filesystem (burst buffer) along with the optional stage in of data specified by the user.  The Orchestrator uses Ansible within a Python virtual environment to run playbooks that create a parallel filesystem of the size requested in the end user's job submission.  After creating the filesystem, the Orchestrator mounts (via restricted sudo privileges) the newly created buffer onto the compute nodes allocated by Slurm, and the job is set to a RUNNING state.  The job's environment contains the path to the filesystem, which the user references to access the namespace.  Once the user job completes or is terminated, Slurm calls out via dacctl to unmount the filesystem on the designated compute nodes, move the data to the central data storage, tear down the burst buffer, and deallocate the devices in etcd.  The optional staging of data is controlled by directives in the batch script like so:

#DW stage_in  type=directory source=/home/user/data-in/  destination=$DW_JOB_STRIPED/data

#DW stage_out type=directory source=$DW_JOB_STRIPED/data  destination=/home/user/data-out/
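
Putting the pieces together, a minimal per-job workflow might look like the following sketch.  The paths, application name, script name, and capacity value are hypothetical; the capacity, access mode, and type are requested with the --bb option at submission time (exactly as in the Performance Evaluation section), while the staging directives live in the batch script:

#!/bin/bash
#SBATCH --nodes=8
#DW stage_in  type=directory source=/home/user/data-in/  destination=$DW_JOB_STRIPED/data
#DW stage_out type=directory source=$DW_JOB_STRIPED/data destination=/home/user/data-out/

cd $DW_JOB_STRIPED/data            # transient Lustre buffer mounted on all allocated nodes
srun ./my_application              # application reads and writes inside the buffer

The script (saved, for example, as job.batch) would then be submitted with a buffer request:

$ sbatch --bb "capacity=3200GiB access_mode=striped type=scratch" job.batch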

In the current release of DAC with Lustre, each server acts as an MDS, OSS, and client (for data ingress/egress), with each SSD partitioned to contain an MDT and an OST.  The MDT size is a configurable dacd setting that an administrator can override based on anticipated inode requirements; the remainder of the space is allocated to the OST.
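
Once a buffer is mounted, the standard Lustre client tools can be used from an allocated compute node to confirm this layout; for example, assuming the Lustre utilities are installed on the compute nodes, the following lists each MDT and OST of the transient filesystem along with its capacity:

$ lfs df -h $DW_JOB_STRIPED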

 

Technical Specifications

Table 1 lists the specifications used in this reference architecture.

Table 1 Technical Specifications

Server Configuration

Server Model: Dell EMC PowerEdge R740xd
Processor: 2 x Intel Xeon 6248R, 3.0 GHz, 24 cores
Memory: 24 x 16 GB RDIMM, 2933 MT/s, dual rank
System: HBA330 Controller Adapter (low profile); BOSS-S1 Controller with 2 x 240 GB M.2, performance optimized
Local Disks (Storage): 24 x Dell Express Flash NVMe P4610 1.6 TB SFF
Network Adapter: 2 x Mellanox ConnectX-6 HDR100
IPMI: iDRAC Enterprise

Software Configuration

Operating System: Red Hat Enterprise Linux 7.6
Kernel: 3.10.0-957
InfiniBand: Mellanox OFED 4.6-1.0.1.0
Orchestrator: data-acc 2.3
Batch Scheduler: Slurm 19.05
Resource Manager: etcd 3.2
Parallel Filesystem: Lustre 2.12.2
Virtual Environment: Python 2.7.5

Network Configuration

Storage: Mellanox Quantum HDR MQM8700-HS2R
Administration: PowerConnect Gigabit Ethernet 3048-ON

 

Performance Evaluation

Summary of Testing Methodology

DAC's device allocation is dynamic, as the filesystem layout is based on the burst buffer capacity requested by the end user.  A base DAC configuration requires only a single server, but to demonstrate scaling this blog also includes two- and four-server configurations.  These three server quantities were chosen based on the available maximum of eight compute nodes.  All compute nodes are used with all three server quantities, which produces different network and device ratios.  Each iteration within a test series increases the number of processes by powers of 2, from 1 to 512, divided across all compute nodes.  Identical IOZone commands were used, changing only the quantity of DAC servers.  The IOPS shown are 4K block random accesses, not metadata operations.

Each chart’s x-axis includes a data table summary with raw results. The contents are represented as follows:

 

Lines: 1 MB sequential block IO performance of OSTs - Write (blue), Read (orange)

Columns: 4K random block IO performance of OSTs - Write (gray), Read (yellow)

Max Line: Maximum theoretical network bandwidth (white)

 

Note that these results are based on system default settings, using a standard set of IOZone options used across Dell EMC storage solutions for comparative purposes.  Lustre optimizations and NUMA binding will be performed later.  All tests in this demonstration use the DirectIO option to avoid effects from caching features intrinsic to the filesystem.  Also, the SSD capacity granularity in dacd was kept at 1.4 TB, lower than the available capacity, to allow for other tests involving MDT sizing.

SEQUENTIAL WRITE:   iozone -i 0 -c -e -I -w -r 1M -s <size>g -t <num_proc> -+n -+m <machine_file>

SEQUENTIAL READ:    iozone -i 1 -c -e -I -w -r 1M -s <size>g -t <num_proc> -+n -+m <machine_file>

RANDOM WRITE/READ:  iozone -i 2 -c -I -w -O -r 4K -s <size>g -t <num_proc> -+n -+m <machine_file>

 

-i          0=Write, 1=Read, 2=Random Access Test

-c         Include file closure times

-e         Includes flush in timing calculations

-w        Does not unlink (delete) temporary files

-r          Record size

-s         File size (1024g / num_proc)

-t          Number of processes

-+n       No retest

-+m     Machine file

-I          Use O_DIRECT, bypass client cache

-O        Give results in ops/sec
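
For illustration, the process-count sweep described above could be driven by a simple wrapper along the following lines (a hypothetical sketch for the sequential write series; the machine file path is a placeholder, and the other series follow the same pattern with their respective options):

for NPROC in 1 2 4 8 16 32 64 128 256 512; do
    SIZE=$((1024 / NPROC))    # per-process file size in GB so the aggregate stays at 1024g
    iozone -i 0 -c -e -I -w -r 1M -s ${SIZE}g -t ${NPROC} -+n -+m ./machinefile
done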

 

IOZone - One Server

One server has a maximum theoretical network bandwidth of 25 GB/s across its two HDR100 ports, indicated by the white line.  The eight compute nodes, each with a single EDR port, can saturate all ports on the server by 2:1.  Sequential bandwidth reached ~23.9 GB/s write and ~23.5 GB/s read at an aggregate of 128 processes, or 128/8 = 16 processes per compute node.  The random block operations approached sustained performance at 128 processes.  The test was initiated by activating the dacd service on a single server only, requesting a buffer size equal to the aggregate configured capacity of all 24 SSDs (24 x 1,400 GiB = 33,600 GiB), and requesting all 8 compute nodes.  This ratio of server to compute nodes appears sufficient to saturate the available performance.

$ sbatch --nodes 8 --job-name="1server" --bb "capacity=33600GiB access_mode=striped type=scratch"  IOZone_cli.batch

 


Chart 1 One Server Performance

 

IOZone - Two Servers

Building on the one-server configuration, the same set of IOZone tests was run, the only change being that one additional DAC server was activated and added into the unallocated buffer pool.  The ratio of network interfaces between servers and clients is now 1:1.  Sequential bandwidth again saturated at around ~95% of the theoretical maximum, with 47.6 GB/s write and 47.1 GB/s read.  With only 512 processes, the random operations appear to start leveling off, but additional compute nodes and processes are needed to find the sustained rate.  The test was initiated by activating the dacd service on two servers only, requesting a buffer size equal to the aggregate configured capacity of all 48 SSDs (48 x 1,400 GiB = 67,200 GiB), and requesting all 8 compute nodes.  This ratio of servers to compute nodes appears sufficient to saturate network bandwidth, but insufficient to sustain a saturation level for IOPS.

$ sbatch --nodes 8 --job-name="2server" --bb "capacity=67200GiB access_mode=striped type=scratch"  IOZone_cli.batch

 


Chart 2 Two Server Performance

 

IOZone - Four Servers

Again building on the one-server configuration, the same set of IOZone tests was launched, the only change being that three additional DAC servers were activated and added into the unallocated buffer pool.  The ratio of client to server network interfaces is now 0.5:1 (eight client interfaces to sixteen server interfaces).  Sequential bandwidth reached a maximum of ~92.7 GB/s for writes and ~76.8 GB/s for reads.  The bandwidth is bound by the number of clients and processes, with insights explained later in this blog.  Based on previous results, four to eight additional clients would approach sequential bandwidth saturation.  Random operations continue on an upward trend, but do not achieve sustained levels with only 512 processes; more compute nodes and processes are needed to find the sustained levels of random operations.  Perhaps 32-64 compute nodes with a couple of thousand processes would be ideal to observe the trend for block IOPS.  This test was initiated by activating the dacd service on four servers only, requesting a buffer size equal to the aggregate configured capacity of all 96 SSDs (96 x 1,400 GiB = 134,400 GiB), and requesting all 8 compute nodes.

$ sbatch --nodes 8 --job-name="4server" --bb "capacity=134400GiB access_mode=striped type=scratch"  IOZone_cli.batch

 


Chart 3 Four Server Performance

 

Insights

This section is meant to provide a preliminary explanation of some performance aspects.

Read performance and PCI Express:

At process counts of 16 and greater, we observe read performance lower than write performance.  This is because a PCIe read is a Non-Posted Operation requiring two transactions, a request and a completion, whereas a PCIe write is a Posted Operation consisting of a request only: once the Transaction Layer Packet is handed over to the Data Link Layer, the write operation completes.  The phenomenon is reduced when more requests are in flight, which compensates for the average latency of the split transactions.  The read latency effect becomes more prominent as more storage resources are added while the quantity of compute nodes and processes stays fixed.

PCIe Backplane:

Each server contains two PLX PCIe switches; each switch connects 12 Intel P4610 NVMe SSDs to one CPU over x16 PCIe lanes.  Based on PCIe lanes, four NVMe devices (x4 each) theoretically match the expander's x16 uplink and avoid oversubscription.  Based on bandwidth, five devices (at ~3.2 GB/s, the P4610's raw read/write performance) will saturate what is available to each PCIe switch.  These aspects should be considered along with the number of active burst buffers and the workload type, as one size does not fit all.
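
A rough back-of-the-envelope check of those two statements, assuming a PCIe Gen3 x16 uplink per PLX switch (about 15.8 GB/s of usable bandwidth):

4 drives x 4 lanes    = 16 lanes   (matches the x16 uplink)
4 drives x 3.2 GB/s   = 12.8 GB/s  (below the ~15.8 GB/s uplink)
5 drives x 3.2 GB/s   = 16.0 GB/s  (saturates the uplink)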

 

Scalability

Based on the results of the one-server and two-server tests, performance scales linearly, and the four-server test supports this trend.  Linear scaling is one factor to consider, but other aspects are also worth covering.

A few examples to consider for scaling center around workload type: capacity, sequential performance, and random-access performance.  Regarding capacity, some highly sequential workloads may only need access to higher-throughput storage with sufficient space to contain large amounts of transient data; in this case, the configuration can fill all 24 storage slots of the R740xd with larger-capacity drives.  For workloads that require consistently high sequential throughput, network interfaces are increased by adding DAC servers to meet the performance expectations derived from the baseline single-server results; for example, an estimated 45 GB/s can be provided by two DAC servers with four HDR interfaces.  Workloads with random-access operations must take into consideration the total quantity of compute nodes accessing the servers simultaneously.  The baseline DirectIO results indicate that a smaller quantity of servers, network interfaces, and drives is sufficient for sequential IO, but cluster size contributes to scaling for random operations: in the IOZone results above, more than eight compute nodes are required to saturate the IOPS of four DAC servers.

There is also the possibility of hybrid configurations, which require consideration of all three workload types.  Those server adjustments will ultimately be based on the factors explained above, along with rack space, network switch ports, and cost.

 

Best Practices

Users requesting a burst buffer will need some preliminary capacity estimates.  Since the filesystem resources are allocated when the job is marked next for launch, a resize of the buffer is not possible.  It is important to request a capacity high enough to prevent job failures from out-of-space errors, yet low enough not to waste resources.  Prior to using the buffer, running the Linux "du -s" tool on the input directories can assist in estimating the capacity needed to stage into the buffer, and periodic capture of this data during a sample job will also help gauge actively consumed capacity.  Alternatively, if the source filesystem supports quotas, the user can query that information periodically during the job.  In most cases, the user will need the aggregate of the capacity of the data staged in and the data produced for staging back into the central data storage.
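
A hedged example of such an estimate, using hypothetical paths (the lfs quota form applies only if the central filesystem is Lustre with quotas enabled):

$ du -sh /home/user/data-in/              # size of the data to be staged into the buffer
$ lfs quota -u $USER /central/scratch     # current usage on the central filesystem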

Another consideration is user workloads known to exceed inode availability.  When more inodes are required, a larger buffer size will allocate additional SSDs and inherently scale up the metadata capacity.  If this type of condition is frequent, the administrator can adjust the MDT sizing as required to optimize the balance between capacity and inodes.  One example is a job with a high rate of creates and deletes, on the order of millions of files, that can exceed the available inodes.  Another example is unpacking several large source trees, where the required inode count is not obvious from the data size alone.
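
To estimate and monitor inode demand, a user could count the files that will land in the buffer ahead of time and watch inode consumption while a sample job runs (paths are hypothetical, and lfs must be available on the compute node):

$ find /home/user/data-in/ -type f | wc -l     # rough count of files to be created in the buffer
$ lfs df -i $DW_JOB_STRIPED                    # inode usage of the transient filesystem during the job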

 

Conclusion and Future Work

In this blog we have demonstrated that a single DAC server with 24 x 1.6 TB NVMe SSDs can produce an IOZone DirectIO baseline performance of ~23.9 GB/s sequential write, ~23.5 GB/s sequential read, ~346K random write IOPS, and ~478K random read IOPS of block IO (non-metadata).  The transient nature of multiple burst buffers allows the solution to avoid the overhead of RAID features and maximize both performance and the usable capacity of 38 TB.

Next steps for this solution are to expand on alternative configurations and upgrades of the test hardware and software.  Additional compute nodes will be added, and the Mellanox OFED and Lustre versions will be upgraded to versions determined at a future date.  As these performance results provide a default baseline, additional efforts will focus on performance optimizations to eliminate the impact of NUMA misses.

The dependencies listed above allow for an expanded performance profile, with the addition of compute nodes, for the following tests:

  • IOZone: with and without DirectIO
  • IOR: Nto1 (Single Shared File)
  • MDTest: for metadata OPs

 

References

  1. https://github.com/RSE-Cambridge/data-acc
  2. https://slurm.schedmd.com/burst_buffer.html
  3. https://github.com/SchedMD/slurm/blob/master/src/plugins/burst_buffer/datawarp/dw_wlm_cli
  4. https://www.intel.com/content/www/us/en/programmable/documentation/nik1412547570040.html

Article Properties


Last Published Date

21 Feb 2021

Version

3

Article Type

Solution