High Performance Secondary Analysis of Genomic Data

Summary: HPC High Performance Computing, HPC and AI Innovation Lab, Variant calling, BWA-GATK, BWA, GATK, HaplotypeCaller, Mutect2, CNVKit, Google DeepVariant, PowerEdge C4140

Symptoms

Article written by Ankit Sethia of Parabricks and Kihoon Yoon of HPC and AI Innovation Lab in October 2018

Resolution

This blog post describes Parabricks NGS secondary analysis on a Dell PowerEdge server.

Advancements in Next Generation Sequencing (NGS) technologies have jump-started the personalized medicine revolution, in which medical treatment can be customized based on a patient's DNA. This is driving increased research and clinical applications. As a result, the number of human genomes sequenced is predicted to double every year and to transform the diagnosis and treatment of diseases, leading to a disruptive change in modern medicine.

Parabricks brings high performance computing technologies that are tailored for NGS analyses and reduces the run time of standard NGS software from several days to approximately one hour. The accelerated software is a drop-in replacement for existing tools that does not sacrifice output accuracy or configurability. Parabricks provides 30-50 times faster secondary analysis, from the FASTQ files produced by the sequencer to the variant call files (VCFs) used for tertiary analysis. The standard pipeline shown below consists of three steps and follows the Genome Analysis Toolkit (GATK) best practices. Parabricks accelerates the existing GATK 4 best practices and generates results equivalent to the baseline. The image below (Figure 1) shows the pipeline currently supported by Parabricks.

Figure 1 Parabricks GPU accelerated pipeline

The FASTQ files that come out of the sequencer, along with the reference genome, are input to the GPU-accelerated BWA-Mem alignment. The aligned output is then coordinate sorted, and duplicates are marked. The result, in binary alignment map (BAM) format, is the first output of the standard pipeline. This BAM file is then used for base quality score recalibration (BQSR), after which Apply BQSR updates the base qualities in the BAM file. Finally, a variant caller is chosen depending on the task at hand. Parabricks has accelerated several variant callers: GATK HaplotypeCaller, GATK Mutect2, and CNVKit. Google DeepVariant is in the development phase.
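The stage order described above can be written down as a small data structure, which makes the data dependencies between stages explicit. This is only an illustrative sketch: the `Stage` class, the artifact labels, and the `check_chain` helper are assumptions introduced here, and none of them correspond to Parabricks, BWA, or GATK commands or options.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Stage:
    """One pipeline step with the artifacts it reads and the artifact it writes."""
    name: str
    consumes: Tuple[str, ...]
    produces: str


# Stage order as described in the text. Artifact labels are illustrative only.
PIPELINE = [
    Stage("BWA-Mem alignment", ("FASTQ", "reference genome"), "aligned reads"),
    Stage("Coordinate sort", ("aligned reads",), "sorted reads"),
    Stage("Mark duplicates", ("sorted reads",), "BAM"),
    Stage("BQSR", ("BAM",), "recalibration data"),
    Stage("Apply BQSR", ("BAM", "recalibration data"), "recalibrated BAM"),
    Stage("Variant caller", ("recalibrated BAM",), "VCF"),
]


def check_chain(stages: Iterable[Stage]) -> bool:
    """Return True if each stage only reads artifacts produced earlier
    (or the two initial inputs, FASTQ and the reference genome)."""
    available = {"FASTQ", "reference genome"}
    for stage in stages:
        if not set(stage.consumes) <= available:
            return False
        available.add(stage.produces)
    return True


if __name__ == "__main__":
    for number, stage in enumerate(PIPELINE, start=1):
        inputs = " + ".join(stage.consumes)
        print(f"{number}. {stage.name}: {inputs} -> {stage.produces}")
    print("stage order is consistent:", check_chain(PIPELINE))
```

Reordering the list so that a stage runs before its input exists, for example placing Apply BQSR before BQSR, makes `check_chain` return `False`.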

Dell Hardware Configuration

The PowerEdge C4140 Server is an accelerator-optimized server with support for two Intel Xeon Scalable processors and four NVIDIA Tesla GPUs (PCIe or NVLink) in a 1U form factor. The tested server was equipped with the PCIe version of the GPUs (standard PCIe Gen3 connections between GPU and CPU) and set up in GPU configuration B (shown in Figure 2 below), one of four available configurations: B, C, K, and G. The hardware and system software configurations are summarized below.

Table 1 Hardware Configuration

Server	Dell EMC PowerEdge C4140
Processor	Intel Xeon Gold 6148, 20 cores, 2.40 GHz
Memory	384 GB @ 2667 MT/s
GPU	NVIDIA V100-16GB PCIe
Storage	1x Samsung Electronics Co Ltd NVMe SSD Controller 172Xa (rev 01), 1.2TB
Power Supplies	Dual 2000W

Table 2 Software/Firmware Configuration

Component	Version
BIOS	1.1.7
OS	Red Hat Enterprise Linux 7.4
Kernel	3.10.0-693.17.1.el7.x86_64
System Profile	Performance optimized (Turbo enabled, C-States disabled, Power management set to Max Performance)
CUDA Driver	390.46
CUDA Toolkit	9.1
Compilers	gcc 4.8.5, OpenMPI 1.10.2
Intel MKL	From Intel Parallel Studio 2017

Figure 2 PowerEdge C4140 in Configuration B with 4x V100

Performance Evaluation

On a c3.8xlarge AWS node, secondary analysis of 30x WGS data can take up to 30-40 hours when running the pipeline shown above with HaplotypeCaller for variant calling. The raw run times in minutes for the Parabricks software on a Dell EMC PowerEdge C4140 are presented below for three DNA samples with different coverages (10x, 38x, 53x).

Table 3 Run times in minutes (*Others includes coordinate sorting, marking duplicates, BQSR, and Apply BQSR)

Benchmark	Coverage	BWA-Mem	Others*	HaplotypeCaller	Total
ERR091571	10X	16.5	6	7.5	30
SRR12837	38X	61	14.5	14	89.5
ERR194161	53X	89	23.5	20	132.5

Figure 3 Variant calling pipeline benchmark on 3 different DNA samples
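To see where the time goes in Table 3, the per-stage share of each run and the total minutes per unit of coverage can be computed directly from the reported numbers. The sketch below copies the values from Table 3; the function names `stage_shares` and `minutes_per_coverage` are assumptions made for illustration.

```python
# Run times in minutes, copied from Table 3.
RUNS = {
    "ERR091571": {"coverage": 10, "BWA-Mem": 16.5, "Others": 6.0,
                  "HaplotypeCaller": 7.5, "Total": 30.0},
    "SRR12837": {"coverage": 38, "BWA-Mem": 61.0, "Others": 14.5,
                 "HaplotypeCaller": 14.0, "Total": 89.5},
    "ERR194161": {"coverage": 53, "BWA-Mem": 89.0, "Others": 23.5,
                  "HaplotypeCaller": 20.0, "Total": 132.5},
}

STAGES = ("BWA-Mem", "Others", "HaplotypeCaller")


def stage_shares(row):
    """Fraction of the summed stage time spent in each stage."""
    total = sum(row[stage] for stage in STAGES)
    return {stage: row[stage] / total for stage in STAGES}


def minutes_per_coverage(row):
    """Total run time divided by sequencing coverage (minutes per 1x)."""
    return row["Total"] / row["coverage"]


if __name__ == "__main__":
    for sample, row in RUNS.items():
        # Sanity check: the stage columns should add up to the Total column.
        assert abs(sum(row[s] for s in STAGES) - row["Total"]) < 1e-9
        shares = ", ".join(f"{s} {share:.0%}"
                           for s, share in stage_shares(row).items())
        print(f"{sample} ({row['coverage']}x): {shares}; "
              f"{minutes_per_coverage(row):.2f} min per 1x")
```

In all three samples the stage columns add up to the reported total, and BWA-Mem alignment accounts for more than half of the run time.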

Throughput Evaluation

The Parabricks GPU solution with 4 V100 GPUs on a Dell PowerEdge C4140 Server showed significantly improved throughput. One such server can analyze 48 whole genomes at 10x coverage per day. In comparison, a similar CPU-only solution can process only about 8 genomes per day. This 6-fold increase in throughput with the Parabricks GPU solution results in large savings in the Total Cost of Ownership by reducing hardware, IT management, cooling, power, and maintenance costs for centers processing large volumes of genomic data.
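The throughput figures above follow from the 10x run time in Table 3. The sketch below reproduces the arithmetic under one simplifying assumption that is not stated in the article: samples are processed back to back, with no idle time between runs. The CPU-only figure of 8 genomes per day is taken from the text, not derived.

```python
MINUTES_PER_DAY = 24 * 60


def genomes_per_day(minutes_per_genome):
    """Genomes processed per day if runs are scheduled back to back
    (assumption: no idle time or setup overhead between samples)."""
    return MINUTES_PER_DAY / minutes_per_genome


# Total run time for the 10x sample (ERR091571) from Table 3, in minutes.
GPU_MINUTES_10X = 30.0
# CPU-only throughput as stated in the text.
CPU_GENOMES_PER_DAY = 8

gpu_genomes_per_day = genomes_per_day(GPU_MINUTES_10X)
throughput_ratio = gpu_genomes_per_day / CPU_GENOMES_PER_DAY

if __name__ == "__main__":
    print(f"GPU server: {gpu_genomes_per_day:.0f} genomes/day at 10x")  # 48
    print(f"Ratio vs CPU-only: {throughput_ratio:.0f}x")                # 6
```

The same function can be applied to the 38x and 53x totals from Table 3 to estimate daily throughput at higher coverage.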

Features of Parabricks software

25-30 times faster analysis: Compared to a CPU-only solution, Parabricks accelerates secondary analysis by orders of magnitude.
100% Deterministic and Reproducible: Parabricks software generates exactly the same results on every execution, regardless of the platform and the number or type of resources.
Equivalent Results: Parabricks' pipeline generates results equivalent to the reference Broad Institute GATK 4 best practices pipeline because the same algorithm is used.
Up to Date Support of All Tool Versions: Parabricks' accelerated software supports multiple versions of BWA-Mem, Picard, and GATK and will support all future versions of these tools.
Visualization: Parabricks generates several key visualizations in real time while performing secondary analysis, which can improve the user's understanding of the data.
Single Node Execution: The entire pipeline is run using one computing node and does not incur any overhead of distributing data and work across multiple servers.
Turnkey Solution: Parabricks software runs on standard CPU and GPU nodes available in the cloud or on premises and requires no additional setup steps by the user.
On-Premises and Cloud: Parabricks software can run on local servers, AWS, Google Cloud, and Azure.

Please contact info@parabricks.com for further information.

Affected Products

High Performance Computing Solution Resources, PowerEdge C4140

Article Number: 000142563

Article Type: Solution

Last Modified: 21 Feb 2021

Version: 3
