
Accelerating Genomic Data Analysis With NVIDIA Clara Parabricks With The Dell EMC DSS 8440 Server & NVIDIA T4 GPUs

Summary: This article provides information about accelerating genomic data analysis using NVIDIA Clara Parabricks on the Dell EMC DSS 8440 server with NVIDIA T4 GPUs.


Article Content


Instructions

Overview

The first step for processing Next Generation Sequencing (NGS) data is called Primary Analysis. This step is specific to the sequencing instrument and generates multiple FASTQ files containing sequencing reads. In the next step, known as Secondary Analysis, the FASTQ sequencing reads are mapped to a reference genome or a reference transcriptome. Additional processing identifies variants, or differences, between the sample of interest and a reference. The variants are annotated and interpreted in subsequent downstream steps. The secondary analysis time for a single sample ranges from hours to days, depending on data size, available computing resources, software, and analytical workflow. 

Secondary analysis is a compute- and storage-intensive process, especially when processing hundreds to thousands of genomes. Many strategies exist to avoid secondary analysis bottlenecks. Until recently, adoption of hardware acceleration using GPUs or FPGAs remained low because hardware accelerators require customized software. The Parabricks genomics software, which NVIDIA acquired in 2019, pioneered a software stack that runs various genomic analysis workflows on GPUs. We tested Parabricks on a Dell EMC PowerEdge C4140 with 4x NVIDIA® Tesla® V100 GPUs about two years ago. Since then, Dell has introduced many technological advances in its servers and storage solutions, and NVIDIA Clara Parabricks has released more robust versions with enhanced acceleration and additional variant callers. For example, a multi-GPU server design based on the Dell EMC DSS 8440 server with NVIDIA® Tesla® T4 GPUs looked promising for accelerating secondary analysis while offering an attractive balance between price and performance. This blog reports a new reference architecture and benchmark results for NVIDIA Clara Parabricks secondary analysis on a DSS 8440 server with multiple Tesla® T4 GPUs and Dell EMC Isilon F800 storage.

Reference Architecture

Figure 1 illustrates the tested reference architecture. The architecture is modular and easy to scale. The NVIDIA Clara Parabricks application software can use one or more GPUs, which keeps scale-out simple. The hardware building blocks consist of a Dell EMC PowerEdge R640 as the management node, a DSS 8440 server for GPU computing, and Dell EMC Isilon F800 storage.


Figure 1 Reference architecture tested
 


The DSS 8440 is a two-socket, 4U server that can hold up to 10 NVIDIA® Tesla® V100S Tensor Core GPUs, up to 10 NVIDIA® Quadro RTX™ GPUs, or up to 16 NVIDIA Tesla T4 GPUs. The detailed configuration of the DSS 8440 is listed in Table 1.

 
Table 1 Dell EMC DSS 8440 configuration

CPU                         2x Xeon® Gold 6248R, 24 cores, 3.0 GHz
RAM                         24x 64 GB at 2933 MT/s
Operating System            Red Hat Enterprise Linux Server release 7.4 (Maipo)
BIOS System Profile         Performance Optimized
Logical Processor           Disabled
Virtualization Technology   Disabled
Accelerators                16x NVIDIA® Tesla® T4 GPUs
Parabricks                  v3.0.0.05

Two Z9100-ON switches provide the interconnect between the compute node and the Isilon F800 storage cluster. An additional switch, an N2248X-ON, is used for management.
 

NGS Data

Data for benchmarking secondary analysis runtime consisted of three human whole-genome sequencing (WGS) data sets, ERR091571, SRR3124837, and ERR194161, representing 10x, 30x, and 50x coverage, respectively. These data sets are available at the European Nucleotide Archive (ENA).
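
As a rough illustration of what 10x, 30x, and 50x coverage mean, mean coverage is commonly estimated as the total number of sequenced bases divided by the genome size. The sketch below assumes that estimate. The function names, the read counts, the read length, and the approximate 3.1 Gbp human genome size are illustrative assumptions, not values taken from the benchmark data sets.

```python
import math

# Assumed approximate size of the human genome in base pairs (illustrative).
HUMAN_GENOME_BP = 3_100_000_000


def mean_coverage(read_count, read_length_bp, genome_bp=HUMAN_GENOME_BP):
    """Estimate mean coverage as total sequenced bases / genome size."""
    return read_count * read_length_bp / genome_bp


def reads_for_coverage(target_coverage, read_length_bp, genome_bp=HUMAN_GENOME_BP):
    """Number of reads needed to reach a target mean coverage (rounded up)."""
    return math.ceil(target_coverage * genome_bp / read_length_bp)


if __name__ == "__main__":
    # Hypothetical 100 bp reads; not the read length of the ENA data sets.
    for target in (10, 30, 50):
        print(target, "x needs about", reads_for_coverage(target, 100), "reads")
```

Under these assumptions, moving from 10x to 50x coverage means roughly five times as many reads to map and process per sample, which is why runtime is reported separately for each data set.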

Performance Evaluation

Software Improvements Reduce Runtime
NVIDIA continues to introduce software improvements to NVIDIA Clara Parabricks. Figure 2 shows the runtime reduction between two versions of Parabricks running the germline pipeline in the Dell PowerEdge C4140 test environment with 4x V100 GPUs. Moving from v2.1.0 to v3.0.0 reduced the runtime by 42%.


Figure 2 Latest version of Parabricks germline variant calling pipeline runtime.

Performance of DSS 8440 with 16x T4s

The runtime for an NVIDIA Clara Parabricks secondary analysis using a single T4 GPU is approximately 30% slower than using one V100 GPU. However, two T4 GPUs provide about 10% more TFLOPS than one V100 GPU at approximately half the cost. The DSS 8440 provides up to 16 PCIe slots, which opens the possibility of designing a T4 GPU-based server that delivers runtime performance similar to a C4140 system with four V100 GPUs, but at a lower cost.
The Parabricks germline analysis was performed using a Dell EMC DSS 8440 with 16 T4 GPUs. For each WGS sample data set described earlier, the runtime was recorded using 1, 2, 4, 8, and 16 T4 GPUs per secondary analysis. The results are plotted in Figures 3 through 5. Overall, the runtime does not scale linearly as the number of GPUs per analysis increases. The scaling pattern remains similar as the amount of data per sample increases from 10x to 50x coverage.
Although not presented here, an earlier Dell EMC investigation found that Parabricks runtimes using eight or more V100 GPUs per analysis did not scale as efficiently as with the T4 GPUs. Additional testing demonstrated that 6 T4 GPUs generated runtime results nearly identical to 4 V100 GPUs.
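
One way to quantify the non-linear scaling described above is to compute speedup and parallel efficiency relative to the single-GPU runtime. The following is a minimal sketch, not the method used for this benchmark. The function name is hypothetical, and the runtimes in the usage example are made-up placeholder values chosen only to exercise the code. They are not the measured results shown in Figures 3 through 5.

```python
def scaling_table(runtimes_min):
    """Return {gpus: (speedup, efficiency)} relative to the smallest GPU count.

    runtimes_min maps the number of GPUs per analysis to runtime in minutes.
    Speedup is baseline runtime / runtime; efficiency is speedup divided by
    the factor by which the GPU count grew (1.0 means linear scaling).
    """
    base_gpus = min(runtimes_min)
    base_time = runtimes_min[base_gpus]
    table = {}
    for gpus in sorted(runtimes_min):
        speedup = base_time / runtimes_min[gpus]
        efficiency = speedup / (gpus / base_gpus)
        table[gpus] = (speedup, efficiency)
    return table


if __name__ == "__main__":
    # Placeholder runtimes in minutes (hypothetical, NOT benchmark data).
    placeholder = {1: 160.0, 2: 80.0, 4: 50.0, 8: 40.0, 16: 32.0}
    for gpus, (speedup, eff) in scaling_table(placeholder).items():
        print(f"{gpus:>2} GPUs: speedup {speedup:.2f}, efficiency {eff:.2f}")
```

An efficiency that falls well below 1.0 at higher GPU counts is the signal that extra GPUs may be better spent on additional samples than on a single analysis.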


Figure 3 Performance comparisons with 10x WGS
 


Figure 4 Performance comparisons with 30x WGS


Figure 5 Performance comparisons with 50x WGS


Conclusion

A DSS 8440 with sixteen T4 GPUs can process thirty 50x human genomes per day. A similar daily analysis throughput using a traditional x86 CPU architecture requires ten PowerEdge C6420 compute nodes. The complete architecture is discussed in a previous Dell publication.
However, dedicating all 16 T4 GPUs to one sample offers little benefit, because using 16 GPUs per analysis is at best 10% faster than using 8 GPUs. The design of the DSS 8440 allows multiple secondary analyses to run in parallel. Assigning eight T4 GPUs per sample increases the daily analysis throughput to ~50 genomes per day. Using four GPUs per sample increases it to ~70 genomes per day. More importantly, this daily output using T4 GPUs comes at less than half the cost of a V100 GPU design.
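
The throughput figures above follow from simple arithmetic: the number of analyses that run concurrently multiplied by the number of runs each GPU group completes in 24 hours. The following minimal sketch assumes analyses run back to back with no storage or CPU contention. The function names are hypothetical, and the per-sample runtimes it derives come only from the rounded throughput values quoted in this article, not from measured results.

```python
MINUTES_PER_DAY = 24 * 60


def genomes_per_day(total_gpus, gpus_per_sample, runtime_min):
    """Daily throughput when the server is split into equal GPU groups."""
    concurrent = total_gpus // gpus_per_sample  # analyses running in parallel
    return concurrent * MINUTES_PER_DAY / runtime_min


def implied_runtime_min(total_gpus, gpus_per_sample, daily_genomes):
    """Per-sample runtime (minutes) implied by a quoted daily throughput."""
    concurrent = total_gpus // gpus_per_sample
    return concurrent * MINUTES_PER_DAY / daily_genomes


if __name__ == "__main__":
    # Approximate throughput values quoted in the text for 50x genomes.
    quoted = {16: 30, 8: 50, 4: 70}
    for gpus_per_sample, daily in quoted.items():
        minutes = implied_runtime_min(16, gpus_per_sample, daily)
        print(f"{gpus_per_sample:>2} GPUs/sample: ~{minutes:.0f} min per genome")
```

Because the quoted throughputs are approximate, the implied runtimes are only rough estimates. The point of the sketch is the trade-off: fewer GPUs per sample lengthens each analysis but lets more analyses run at once.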
In addition to speed, compatibility with other analysis tools is essential for the comparability of results. In prior testing, the Parabricks germline analysis results were nearly identical to those of the well-known BWA-GATK HaplotypeCaller analysis. We also wanted to compare the Parabricks variant calling results to other toolsets such as samtools/mpileup. These two completely different tools reach ~90% overall agreement for identified variants, and variants in many well-known genomic regions containing important genes agree at more than 99%.
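
The article does not state how agreement between the two call sets was computed. As one plausible illustration, the sketch below treats each variant as a (chromosome, position, reference allele, alternate allele) key read from the standard VCF columns and reports the shared fraction of all variants found by either tool. The function names, the choice of this metric, and the toy records are assumptions made for illustration only.

```python
def parse_vcf_lines(lines):
    """Collect (chrom, pos, ref, alt) keys from tab-separated VCF lines."""
    variants = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):  # split multi-allelic records
            variants.add((chrom, pos, ref, alt))
    return variants


def overall_agreement(calls_a, calls_b):
    """Shared variants divided by all variants seen in either call set."""
    union = calls_a | calls_b
    if not union:
        return 1.0  # two empty call sets agree trivially
    return len(calls_a & calls_b) / len(union)


if __name__ == "__main__":
    # Toy records (hypothetical, not output from Parabricks or samtools).
    tool_a = ["chr1\t100\t.\tA\tG", "chr1\t200\t.\tC\tT,G", "chr2\t50\t.\tG\tA"]
    tool_b = ["##fileformat=VCFv4.2", "chr1\t100\t.\tA\tG",
              "chr1\t200\t.\tC\tT", "chr2\t75\t.\tT\tC"]
    a, b = parse_vcf_lines(tool_a), parse_vcf_lines(tool_b)
    print(f"overall agreement: {overall_agreement(a, b):.2f}")
```

A real comparison would likely also need to normalize variant representation and handle genotypes, which this sketch does not attempt.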

Article Properties


Affected Product

DSS 8440, Isilon F800, PowerEdge C4140, PowerEdge R640

Last Published Date

03 Dec 2020

Version

1

Article Type

How To