Genomics Variant Ingestion & QC Pipeline
Overview
Designed a cloud-native pipeline to ingest VCF (Variant Call Format) files produced by sequencing instruments and external genomics labs, apply quality control filters, and populate a queryable variant catalog in Delta Lake.
Architecture
- Ingestion: VCF files land in ADLS Gen2 via an Azure Data Factory (ADF) copy activity triggered on blob creation events
- Parsing: PySpark jobs on Databricks parse multi-sample VCF headers and records, flattening INFO and FORMAT fields into structured Delta tables
- QC Filters: Applied GATK-style hard filters (QUAL, DP, GQ thresholds) and flagged low-confidence calls with a qc_pass boolean column
- Annotation: Variant records joined against ClinVar and gnomAD frequency tables (pre-loaded as Delta tables) to enrich with clinical significance and population AF
- Gold Layer: Per-patient variant summary tables partitioned by gene panel and sequencing run, optimized with Delta Z-ordering on chromosome/position
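
The parsing and QC steps above can be sketched in plain Python for a single VCF data line. This is a minimal illustration, not the production PySpark job: the function names, the output column names, and the threshold values (MIN_QUAL, MIN_DP, MIN_GQ) are all assumptions chosen for the example, since the pipeline's actual cutoffs are not stated here.

```python
from typing import Dict, List, Optional

# Hypothetical thresholds for illustration only; the real cutoffs are not given.
MIN_QUAL = 30.0
MIN_DP = 10
MIN_GQ = 20


def parse_info(info: str) -> Dict[str, str]:
    """Flatten a VCF INFO string (key=value pairs and bare flags) into a dict."""
    fields: Dict[str, str] = {}
    if info == ".":
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Flags such as DB carry no value; record them as "true".
        fields[key] = value if sep else "true"
    return fields


def to_number(value: Optional[str]) -> Optional[float]:
    """Convert a VCF field to float, treating '.' or a missing value as None."""
    if value is None or value in ("", "."):
        return None
    try:
        return float(value)
    except ValueError:
        return None


def flatten_record(line: str, sample_names: List[str]) -> List[dict]:
    """Turn one multi-sample VCF data line into one flat row per sample."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, _vid, ref, alt, qual, _filt, info = cols[:8]
    format_keys = cols[8].split(":") if len(cols) > 8 else []
    info_fields = parse_info(info)
    qual_value = to_number(qual)

    rows = []
    for name, sample in zip(sample_names, cols[9:]):
        # Pair FORMAT keys with this sample's values (trailing fields may be dropped).
        fmt = dict(zip(format_keys, sample.split(":")))
        dp = to_number(fmt.get("DP"))
        gq = to_number(fmt.get("GQ"))
        # A call passes only if every metric is present and meets its threshold.
        qc_pass = (
            qual_value is not None and qual_value >= MIN_QUAL
            and dp is not None and dp >= MIN_DP
            and gq is not None and gq >= MIN_GQ
        )
        rows.append({
            "chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
            "sample": name, "gt": fmt.get("GT"),
            "qual": qual_value, "dp": dp, "gq": gq,
            "info": info_fields, "qc_pass": qc_pass,
        })
    return rows
```

In this sketch a missing QUAL, DP, or GQ value fails the check rather than being ignored, so low-confidence calls stay in the table with qc_pass set to False instead of being dropped. Keeping the failed rows is one reasonable design choice when later audit or reanalysis needs to see what was filtered.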
Key Outcomes
- Processed 500+ VCF files per batch with consistent sub-30-minute end-to-end latency
- QC filtering reduced the downstream burden of reviewing false-positive variants by ~40%
- Used Delta time travel to reproduce historical variant calls for audit and reanalysis requests