Genomics Variant Ingestion & QC Pipeline

Scalable PySpark pipeline on Databricks for ingesting, validating, and annotating VCF files from whole-genome and targeted sequencing panels, landing the results in a Delta Lake variant catalog on Azure.

September 1, 2024
Databricks · PySpark · Delta Lake · VCF · Genomics · ADLS Gen2 · Python · Azure


Overview

Designed a cloud-native pipeline to ingest VCF (Variant Call Format) files produced by sequencing instruments and external genomics labs, apply quality control filters, and populate a queryable variant catalog in Delta Lake.

Architecture

  • Ingestion: VCF files land in ADLS Gen2 via an ADF copy activity triggered by blob-creation events
  • Parsing: PySpark jobs on Databricks parse multi-sample VCF headers and records, flattening INFO and FORMAT fields into structured Delta tables
  • QC Filters: Applied GATK-style hard filters (QUAL, DP, GQ thresholds) and flagged low-confidence calls with a qc_pass boolean column
  • Annotation: Variant records joined against ClinVar and gnomAD frequency tables (pre-loaded as Delta tables) to enrich with clinical significance and population AF
  • Gold Layer: Per-patient variant summary tables partitioned by gene panel and sequencing run, optimized with Delta Z-ordering on chromosome/position
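The INFO-field flattening in the parsing step can be sketched with a plain-Python helper; the production job applies equivalent logic inside PySpark, and the type-casting rules here are illustrative:

```python
def parse_info(info: str) -> dict:
    """Parse a VCF INFO string like 'DP=30;AF=0.012;DB' into a dict.

    Key=value entries are split on '='; flag fields (no '=') become True.
    Numeric values are cast where possible so they land as typed columns
    downstream, and anything uncastable (e.g. comma-separated lists for
    multi-allelic sites) stays a string.
    """
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue
        if "=" in entry:
            key, value = entry.split("=", 1)
            try:
                value = float(value) if "." in value else int(value)
            except ValueError:
                pass  # keep as string
            fields[key] = value
        else:
            fields[entry] = True  # flag field, e.g. DB
    return fields
```

Per-sample FORMAT fields flatten the same way, with the FORMAT column supplying the keys and each sample column supplying the values.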
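The QC step reduces to a per-call predicate over the hard-filter metrics; a minimal sketch, with illustrative thresholds (the actual cutoffs were tuned per panel):

```python
# Illustrative GATK-style hard-filter thresholds (assumed values).
MIN_QUAL = 30.0   # Phred-scaled variant quality
MIN_DP = 10       # read depth at the site
MIN_GQ = 20       # genotype quality

def qc_pass(qual: float, dp: int, gq: int) -> bool:
    """Return True when a call clears all hard filters.

    This predicate populates the qc_pass boolean column; low-confidence
    calls are flagged rather than dropped, so they remain queryable.
    """
    return qual >= MIN_QUAL and dp >= MIN_DP and gq >= MIN_GQ
```

Keeping the failing calls with a flag, instead of filtering them out, is what lets later reanalysis revisit borderline variants without re-ingesting the source VCFs.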
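The annotation step keys on variant identity. A minimal sketch of that enrichment in plain Python, mirroring the left-join semantics of the Spark job (field names and the use of `(chrom, pos, ref, alt)` tuples as keys are assumptions):

```python
def annotate(variants, gnomad_af, clinvar_sig):
    """Enrich variant records with population AF and clinical significance.

    Lookup tables are dicts keyed by (chrom, pos, ref, alt); a missing
    annotation defaults to None, like an unmatched row in a left join.
    """
    out = []
    for v in variants:
        key = (v["chrom"], v["pos"], v["ref"], v["alt"])
        enriched = dict(v)
        enriched["gnomad_af"] = gnomad_af.get(key)
        enriched["clinvar_significance"] = clinvar_sig.get(key)
        out.append(enriched)
    return out
```

In the pipeline itself this is a DataFrame join against the pre-loaded ClinVar and gnomAD Delta tables rather than an in-memory lookup.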

Key Outcomes

  • Processed 500+ VCF files per batch with consistent sub-30-minute end-to-end latency
  • QC filter logic reduced downstream false-positive variant review burden by ~40%
  • Used Delta time travel to reproduce historical variant calls for audit and reanalysis requests