High Performance Secondary Analysis of Genomic Data

Summary: HPC, High Performance Computing, HPC and AI Innovation Lab, Variant calling, BWA-GATK, BWA, GATK, HaplotypeCaller, Mutect2, CNVKit, Google DeepVariant, PowerEdge C4140

Symptoms

Article written by Ankit Sethia of Parabricks and Kihoon Yoon of HPC and AI Innovation Lab in October 2018

Resolution

This blog post describes Parabricks NGS secondary analysis on a Dell PowerEdge server.

Advancements in Next Generation Sequencing (NGS) technologies have jump-started the personalized medicine revolution, in which medical treatment can be customized based on a patient’s DNA. This is driving increased research and clinical applications. As a result, the number of human genomes sequenced is predicted to double every year and transform the diagnosis and treatment of diseases, leading to a disruptive change in modern medicine.

Parabricks brings high performance computing technologies tailored to NGS analysis and reduces the run time of standard NGS software from several days to approximately one hour. The accelerated software is a drop-in replacement for existing tools that does not sacrifice output accuracy or configurability. Parabricks provides 30-50 times faster secondary analysis, from the FASTQ files coming out of the sequencer to the variant call files (VCFs) used for tertiary analysis. The standard pipeline shown below consists of three steps and follows the Genome Analysis Toolkit (GATK) best practices. Parabricks accelerates the existing GATK 4 best practices and generates results equivalent to the baseline. Figure 1 shows the pipeline currently supported by Parabricks.


Figure 1 Parabricks GPU accelerated pipeline

The FASTQ files that come out of the sequencer, along with the reference genome, are the input to the GPU-accelerated BWA-Mem alignment. The aligned output is then coordinate sorted, and duplicates are marked. This produces the first output of the standard pipeline, in binary alignment map (BAM) format. This BAM file is then used for base quality score recalibration (BQSR), after which Apply BQSR updates the base qualities in the BAM. Finally, a variant caller is chosen depending on the task at hand. Parabricks has accelerated several variant callers: GATK HaplotypeCaller, GATK Mutect2, and CNVKit. Google DeepVariant is in the development phase.
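
To make the order of the stages concrete, the sketch below lists the open-source CPU baseline commands (BWA and GATK 4) that correspond to each step described above. It is not the Parabricks command line, which this article does not show. The file names, the sample name, the read group fields, and the helper names `Step` and `baseline_pipeline` are assumptions made for illustration, and the function only builds the commands without running them.

```python
import shlex
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One pipeline stage: a name, its argv, and an optional stdout file."""
    name: str
    argv: List[str]
    stdout: Optional[str] = None  # bwa mem writes SAM to stdout by default


def baseline_pipeline(ref: str, fq1: str, fq2: str, known_sites: str,
                      sample: str = "sample1", threads: int = 16) -> List[Step]:
    """Build (but do not run) the CPU baseline steps, in pipeline order.

    All paths and the sample name are hypothetical placeholders.
    """
    # GATK needs a read group; bwa mem accepts a literal "\t" in -R.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    recal_table = f"{sample}.recal.table"
    recal_bam = f"{sample}.recal.bam"
    vcf = f"{sample}.vcf.gz"
    return [
        Step("align", ["bwa", "mem", "-t", str(threads), "-R", read_group,
                       ref, fq1, fq2], stdout=sam),
        Step("sort", ["gatk", "SortSam", "-I", sam, "-O", sorted_bam,
                      "--SORT_ORDER", "coordinate"]),
        Step("mark_duplicates", ["gatk", "MarkDuplicates", "-I", sorted_bam,
                                 "-O", dedup_bam,
                                 "-M", f"{sample}.dup_metrics.txt"]),
        Step("bqsr", ["gatk", "BaseRecalibrator", "-R", ref, "-I", dedup_bam,
                      "--known-sites", known_sites, "-O", recal_table]),
        Step("apply_bqsr", ["gatk", "ApplyBQSR", "-R", ref, "-I", dedup_bam,
                            "--bqsr-recal-file", recal_table,
                            "-O", recal_bam]),
        Step("haplotypecaller", ["gatk", "HaplotypeCaller", "-R", ref,
                                 "-I", recal_bam, "-O", vcf]),
    ]


if __name__ == "__main__":
    for step in baseline_pipeline("ref.fa", "r1.fastq.gz", "r2.fastq.gz",
                                  "known_sites.vcf.gz"):
        line = " ".join(shlex.quote(arg) for arg in step.argv)
        if step.stdout:
            line += " > " + shlex.quote(step.stdout)
        print(f"{step.name}: {line}")
```

Keeping each stage as data rather than running it immediately makes it easy to print, log, or time the stages individually. A somatic or copy-number workflow would replace only the last step with a different variant caller.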

Dell Hardware Configuration

The PowerEdge C4140 Server is an accelerator-optimized server with support for two Intel Xeon Scalable processors and four NVIDIA Tesla GPUs (PCIe or NVLink) in a 1U form factor. The tested server was equipped with the PCIe version of the GPUs (standard PCIe Gen3 connections between GPU and CPU) and set up in GPU Configuration B (shown in Figure 2 below), one of four available configurations: B, C, K, and G. The hardware and system software configurations are summarized below.

Table 1 Hardware Configuration
Component | Specification
Server | Dell EMC PowerEdge C4140
Processor | Intel Xeon Gold 6148, 20 cores, 2.40 GHz
Memory | 384 GB @ 2667 MT/s
GPU | NVIDIA V100-16GB PCIe
Storage | 1x Samsung Electronics Co Ltd NVMe SSD Controller 172Xa (rev 01), 1.2 TB
Power Supplies | Dual 2000 W
 
Table 2 Software/Firmware Configuration
Component | Version
BIOS | 1.1.7
OS | Red Hat Enterprise Linux 7.4
Kernel | 3.10.0-693.17.1.el7.x86_64
System Profile | Performance optimized (Turbo enabled, C-States disabled, Power management set to Max Performance)
CUDA Driver | 390.46
CUDA Toolkit | 9.1
Compilers | gcc 4.8.5, OpenMPI 1.10.2
Intel MKL | From Intel Parallel Studio 2017


Figure 2 PowerEdge C4140 in Configuration B with 4x V100
 

Performance Evaluation

On a c3.8xlarge AWS node, secondary analysis of 30x whole-genome sequencing (WGS) data can take 30-40 hours to run the pipeline shown above with HaplotypeCaller for variant calling. The raw run times, in minutes, of the Parabricks software on a Dell EMC PowerEdge C4140 are presented below for three DNA samples with different coverages (10x, 38x, 53x).
 
Table 3 Run times in minutes for each pipeline stage
Benchmark | Coverage | BWA-Mem | Others* | HaplotypeCaller | Total
ERR091571 | 10x | 16.5 | 6 | 7.5 | 30
SRR12837 | 38x | 61 | 14.5 | 14 | 89.5
ERR194161 | 53x | 89 | 23.5 | 20 | 132.5
* Others include coordinate sorting, marking duplicates, BQSR, and Apply BQSR.

Figure 3 Variant calling pipeline benchmark on 3 different DNA samples
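
As a quick check on the numbers in Table 3, the sketch below re-adds the per-stage minutes and reports the share of each run spent in each stage. The values are copied from the table. The names `RUNS`, `STAGES`, and `stage_shares` are introduced here only for illustration.

```python
import math

# Per-stage run times in minutes, copied from Table 3.
RUNS = {
    "ERR091571": {"coverage": 10, "bwa_mem": 16.5, "others": 6.0,
                  "haplotypecaller": 7.5, "total": 30.0},
    "SRR12837": {"coverage": 38, "bwa_mem": 61.0, "others": 14.5,
                 "haplotypecaller": 14.0, "total": 89.5},
    "ERR194161": {"coverage": 53, "bwa_mem": 89.0, "others": 23.5,
                  "haplotypecaller": 20.0, "total": 132.5},
}

STAGES = ("bwa_mem", "others", "haplotypecaller")


def stage_shares(run):
    """Return the fraction of the summed stage time spent in each stage."""
    total = sum(run[stage] for stage in STAGES)
    return {stage: run[stage] / total for stage in STAGES}


if __name__ == "__main__":
    for name, run in RUNS.items():
        summed = sum(run[stage] for stage in STAGES)
        # The stage times should add up to the reported total.
        assert math.isclose(summed, run["total"])
        shares = stage_shares(run)
        pretty = ", ".join(f"{s}={shares[s]:.0%}" for s in STAGES)
        print(f"{name} ({run['coverage']}x): {summed} min; {pretty}")
```

In all three samples the alignment stage accounts for more than half of the total run time, so it is the stage where further tuning would matter most.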

Throughput Evaluation

The Parabricks GPU solution with four V100 GPUs on a Dell PowerEdge C4140 Server delivers substantially higher throughput than a CPU-only system. At about 30 minutes per genome at 10x coverage (Table 3), one such server can analyze 48 whole genomes per day. In comparison, a similar CPU-only solution can process only about 8 genomes per day. This 6-fold increase in throughput with the Parabricks GPU solution results in large savings in the total cost of ownership by reducing hardware, IT management, cooling, power, and maintenance costs for centers processing large volumes of genomic data.
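
The throughput figures follow from simple arithmetic, sketched below. The sketch assumes that genomes are processed back to back with no idle time between runs, which the article does not state explicitly. The 30-minute run time is the 10x total from Table 3, and the figure of 8 genomes per day for the CPU-only system is taken from the text. The function name `genomes_per_day` is illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def genomes_per_day(minutes_per_genome: float) -> float:
    """Genomes processed in 24 hours, assuming back-to-back runs."""
    return MINUTES_PER_DAY / minutes_per_genome


# 10x sample total from Table 3.
gpu_throughput = genomes_per_day(30.0)          # 1440 / 30 = 48.0
cpu_throughput = 8.0                            # figure quoted in the text
throughput_gain = gpu_throughput / cpu_throughput  # 48 / 8 = 6.0

# Run time per genome implied by the CPU figure (derived, not measured).
implied_cpu_minutes = MINUTES_PER_DAY / cpu_throughput  # 180.0

if __name__ == "__main__":
    print(f"GPU server: {gpu_throughput:.0f} genomes/day")
    print(f"Throughput gain over CPU-only: {throughput_gain:.0f}x")
    print(f"Implied CPU-only time per genome: {implied_cpu_minutes:.0f} min")
```

The same function applied to the other totals in Table 3 gives the daily capacity at higher coverage, for example `genomes_per_day(89.5)` for the 38x sample.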

Features of Parabricks software

  • 25-30 times faster analysis: Compared to a CPU-only solution, Parabricks accelerates secondary analysis by more than an order of magnitude.
  • 100% Deterministic and Reproducible: Parabricks software generates exactly the same results on every execution, regardless of the platform and the number or type of resources.
  • Equivalent Results: The Parabricks pipeline generates results equivalent to those of the reference Broad Institute GATK 4 best practices pipeline because the same algorithm is used.
  • Up to Date Support of All Tool Versions: Parabricks’ accelerated software supports multiple versions of BWA-Mem, Picard and GATK and will support all future versions of these tools.
  • Visualization: While performing secondary analysis, Parabricks generates several key visualizations in real time that can improve the user’s understanding of the data.
  • Single Node Execution: The entire pipeline is run using one computing node and does not incur any overhead of distributing data and work across multiple servers.
  • Turnkey Solution: Parabricks software runs on standard CPU and GPU nodes available in the cloud or on premises and requires no additional setup steps by the user.
  • On-Premise and Cloud: Parabricks software can run on local servers, AWS, Google Cloud, and Azure.

  Please contact info@parabricks.com for further information.

Affected Products

High Performance Computing Solution Resources, Poweredge C4140
Article Properties
Article Number: 000142563
Article Type: Solution
Last Modified: 21 Feb 2021
Version:  3