Genomics Variant Ingestion & QC Pipeline

Scalable PySpark pipeline on Databricks for ingesting, validating, and annotating VCF files from whole-genome and targeted sequencing panels, landing the results in a Delta Lake variant catalog on Azure.

September 1, 2024
Databricks · PySpark · Delta Lake · VCF · Genomics · ADLS Gen2 · Python · Azure


Overview

Designed a cloud-native pipeline to ingest VCF (Variant Call Format) files produced by sequencing instruments and external genomics labs, apply quality control filters, and populate a queryable variant catalog in Delta Lake.

Architecture

  • Ingestion: VCF files land in ADLS Gen2 via an ADF copy activity triggered by blob-creation events
  • Parsing: PySpark jobs on Databricks parse multi-sample VCF headers and records, flattening INFO and FORMAT fields into structured Delta tables
  • QC Filters: Applied GATK-style hard filters (QUAL, DP, GQ thresholds) and flagged low-confidence calls with a qc_pass boolean column
  • Annotation: Variant records joined against ClinVar and gnomAD frequency tables (pre-loaded as Delta tables) to enrich with clinical significance and population AF
  • Gold Layer: Per-patient variant summary tables partitioned by gene panel and sequencing run, optimized with Delta Z-ordering on chromosome/position
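The INFO-field flattening in the parsing step can be sketched with a plain-Python helper; the production job applies equivalent logic inside PySpark, and the type-casting rules here are illustrative:

```python
def parse_info(info: str) -> dict:
    """Parse a VCF INFO string like 'DP=30;AF=0.012;DB' into a dict.

    Key=value entries are split on '='; flag fields (no '=') become True.
    Numeric values are cast where possible so they land as typed columns
    downstream, and anything uncastable (e.g. comma-separated lists for
    multi-allelic sites) stays a string.
    """
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue
        if "=" in entry:
            key, value = entry.split("=", 1)
            try:
                value = float(value) if "." in value else int(value)
            except ValueError:
                pass  # keep as string
            fields[key] = value
        else:
            fields[entry] = True  # flag field, e.g. DB
    return fields
```

Per-sample FORMAT fields flatten the same way, with the FORMAT column supplying the keys and each sample column supplying the values.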
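The QC step reduces to a per-call predicate over the hard-filter metrics; a minimal sketch, with illustrative thresholds (the actual cutoffs were tuned per panel):

```python
# Illustrative GATK-style hard-filter thresholds (assumed values).
MIN_QUAL = 30.0   # Phred-scaled variant quality
MIN_DP = 10       # read depth at the site
MIN_GQ = 20       # genotype quality

def qc_pass(qual: float, dp: int, gq: int) -> bool:
    """Return True when a call clears all hard filters.

    This predicate populates the qc_pass boolean column; low-confidence
    calls are flagged rather than dropped, so they remain queryable.
    """
    return qual >= MIN_QUAL and dp >= MIN_DP and gq >= MIN_GQ
```

Keeping the failing calls with a flag, instead of filtering them out, is what lets later reanalysis revisit borderline variants without re-ingesting the source VCFs.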
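The annotation step keys on variant identity. A minimal sketch of that enrichment in plain Python, mirroring the left-join semantics of the Spark job (field names and the use of `(chrom, pos, ref, alt)` tuples as keys are assumptions):

```python
def annotate(variants, gnomad_af, clinvar_sig):
    """Enrich variant records with population AF and clinical significance.

    Lookup tables are dicts keyed by (chrom, pos, ref, alt); a missing
    annotation defaults to None, like an unmatched row in a left join.
    """
    out = []
    for v in variants:
        key = (v["chrom"], v["pos"], v["ref"], v["alt"])
        enriched = dict(v)
        enriched["gnomad_af"] = gnomad_af.get(key)
        enriched["clinvar_significance"] = clinvar_sig.get(key)
        out.append(enriched)
    return out
```

In the pipeline itself this is a DataFrame join against the pre-loaded ClinVar and gnomAD Delta tables rather than an in-memory lookup.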

Key Outcomes

  • Processed 500+ VCF files per batch with consistent sub-30-minute end-to-end latency
  • QC filter logic reduced downstream false-positive variant review burden by ~40%
  • Used Delta time travel to reproduce historical variant calls for audit and reanalysis requests