Overview
The purpose of this blog is to provide performance information for the BWA-GATK pipeline benchmarked on Dell EMC Ready Solutions for HPC BeeGFS Storage. Unfortunately, we were not able to set up enough compute nodes or a BeeGFS storage system large enough to compare against the performance results previously published for Lustre storage. However, the results are still helpful for estimating the amount of computational resources required for a given variant calling workload.
The test cluster configurations are summarized in Table 1.
Table 1 Tested compute node configuration
| Dell EMC PowerEdge C6420 | |
| --- | --- |
| CPU | 2x Intel Xeon Gold 6248, 20 cores, 2.5 GHz (Cascade Lake) |
| RAM | 12x 16 GB at 2933 MT/s |
| OS | Red Hat Enterprise Linux Server release 7.4 (Maipo) |
| Interconnect | Mellanox EDR InfiniBand |
| BIOS System Profile | Performance Optimized |
| Logical Processor | Disabled |
| Virtualization Technology | Disabled |
| BWA | 0.7.15-r1140 |
| Sambamba | 0.7.0 |
| Samtools | 1.6 |
| GATK | 3.6-0-g89b7209 |
The tested compute nodes were connected to the BeeGFS storage through Mellanox EDR InfiniBand switches: the BeeGFS storage is attached to a bridge EDR switch, and this bridge is connected to an additional EDR switch to which all compute nodes are connected. The storage configuration is summarized in Table 2.
Table 2 BeeGFS solution hardware and software specifications
| Specification | |
| --- | --- |
| Management server | 1x Dell EMC PowerEdge R640 |
| MDS | 2x Dell EMC PowerEdge R740 |
| Storage servers (SS) | 2x Dell EMC PowerEdge R740 |
| Processors | Management server: Dual Intel Xeon Gold 5218; MDS and SS servers: Dual Intel Xeon Gold 6230 |
| Memory | Management server: 12x 8 GB 2666 MT/s DDR4 RDIMMs; MDS and SS servers: 12x 32 GB 2933 MT/s DDR4 RDIMMs |
| Local disks and RAID controller | Management server: PERC H740P integrated RAID with 8 GB NV cache, 6x 300 GB 15K SAS HDDs in RAID 10; MDS and SS servers: PERC H330+ integrated RAID, 2x 300 GB 15K SAS HDDs in RAID 1 for the OS |
| InfiniBand HCA | Mellanox ConnectX-6 HDR100 InfiniBand adapter |
| External storage controllers | On each MDS: 2x Dell 12 Gb/s SAS HBAs; on each SS: 4x Dell 12 Gb/s SAS HBAs |
| Object storage enclosures | 4x Dell EMC PowerVault ME4084, fully populated with a total of 336 drives |
| Metadata storage enclosure | 1x Dell EMC PowerVault ME4024 with 24 SSDs |
| RAID controllers | Duplex RAID controllers in the ME4084 and ME4024 enclosures |
| Drives | Each ME4084 enclosure: 84x 8 TB 3.5 in. 7.2K RPM NL SAS3 HDDs; ME4024 enclosure: 24x 960 GB SAS3 SSDs |
| Operating system | CentOS Linux release 8.1.1911 (Core) |
| Kernel version | 4.18.0-147.5.1.el8_1.x86_64 |
| Mellanox OFED version | 4.7-3.2.9.0 |
| BeeGFS file system version | 7.2 (beta2) |
The test data was chosen from one of Illumina's Platinum Genomes. ERR194161, sequenced on an Illumina HiSeq 2000 and submitted by Illumina, can be obtained from EMBL-EBI. The DNA identifier for this individual is NA12878. The data description on the linked website lists a depth of coverage greater than 30x; in practice, the sample reaches roughly 53x.
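As a quick sanity check, the depth of coverage can be estimated from the read count and read length. The sketch below uses the standard coverage formula with illustrative values, not the exact ERR194161 metadata; substitute the actual numbers from the ENA record.

```python
# Rough depth-of-coverage estimate: coverage = (reads * read_length) / genome_size.
# The read count and read length below are illustrative assumptions, not the
# exact ERR194161 metadata.
reads = 1.6e9            # total reads (assumed for illustration)
read_length = 101        # bp per read (assumed; typical for HiSeq 2000 runs)
genome_size = 3.1e9      # approximate human genome size in bp

coverage = reads * read_length / genome_size
print(f"Estimated depth of coverage: {coverage:.1f}x")  # ~52x with these assumptions
```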
Performance Evaluation
Multiple-Sample/Multiple-Node Performance
A typical way of running an NGS pipeline is to process multiple samples on each compute node and to use multiple compute nodes to maximize throughput. The tests used eight C6420 compute nodes with seven samples per node; hence, up to 56 samples were processed concurrently to estimate the maximum number of genomes per day achievable without a job failure.
As shown in Figure 1, a single C6420 compute node can process 3.69 50x whole human genomes per day when 7 samples are processed concurrently; each sample is allocated 5 cores and 20 GB of memory.
Figure 1 Throughput tests with up to 8x C6420s with BeeGFS
Fifty-six 50x whole human genomes can be processed on 8 C6420 compute nodes in roughly 54 hours. In other words, the test configuration delivers 25.11 genomes per day for whole human genomes with 50x depth of coverage.
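For context, the per-node packing behind these numbers follows directly from Table 1. The short Python sketch below simply tallies what the 7-samples-per-node configuration consumes against the node's core and memory totals.

```python
# Resources consumed on one C6420 node by the test configuration:
# 7 concurrent samples, each allocated 5 cores and 20 GB of memory.
node_cores = 2 * 20          # 2x Xeon Gold 6248, 20 cores each (Table 1)
node_memory_gb = 12 * 16     # 12x 16 GB DIMMs (Table 1)

samples_per_node = 7
cores_used = samples_per_node * 5        # 35 of 40 cores
memory_used_gb = samples_per_node * 20   # 140 of 192 GB

print(f"Cores:  {cores_used}/{node_cores}")
print(f"Memory: {memory_used_gb}/{node_memory_gb} GB")
```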
Conclusion
The data size of whole genome sequencing (WGS) has been growing constantly. The current average depth is about 55x, roughly five times larger than that of a typical WGS sample four years ago, when we started benchmarking the BWA-GATK pipeline. The increasing data size does not strain the storage side, since most applications in the pipeline are bound by CPU clock speed. Hence, the pipeline runs longer with larger data rather than generating heavier I/O.
However, more temporary files are generated during processing because the larger data must be split for parallelization, and the increased number of temporary files open at the same time can exhaust the open-file limit of the Linux operating system. One of the applications silently fails to complete when it hits this limit. A simple solution is to raise the limit to more than 150K.
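As a hedged illustration (exact values and mechanisms depend on the distribution and job scheduler), the per-process limit can be inspected and, where the hard limit allows, raised from within a Python driver script; a permanent, system-wide change would normally go through /etc/security/limits.conf or the scheduler's prolog instead.

```python
import resource

# Inspect the current per-process open-file limits (RLIMIT_NOFILE).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Raise the soft limit toward the suggested >150K, capped at the hard limit.
# Raising the hard limit itself requires root (or an entry in
# /etc/security/limits.conf), so this only helps if the hard limit is already high.
target = 160_000
new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
print(f"soft limit raised to {new_soft}")
```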
The results in Figure 1 show that the throughput tests did not reach the maximum capacity of the system. Since there was no sign of a significant slowdown as more samples were added, it should be possible to process more than 7 samples per node if the compute nodes are set up with more memory. Overall, the BeeGFS storage is a suitable scratch storage for NGS data processing.