Multi-Source EHR Pipeline Orchestration with Airflow

Apache Airflow DAGs orchestrating a multi-source EHR ingestion platform that coordinates Mirth Connect polling, ADF triggers, Databricks job runs, and data quality checks across a clinical data lakehouse.

December 1, 2023
Apache Airflow · Databricks · Azure Data Factory · Python · EHR · Delta Lake · Azure · HIPAA

Overview

Designed and deployed an Apache Airflow orchestration layer to coordinate a complex clinical data platform with multiple ingestion sources, dependent transformation jobs, and data quality gates — replacing a fragile mix of cron jobs and ADF triggers.

Architecture

  • DAG Design: Modular DAGs per data source (hospital ADT feed, reference lab ORU feed, instrument LIS export) with shared task groups for Bronze → Silver → Gold promotion
  • ADF Integration: Airflow AzureDataFactoryRunPipelineOperator triggers ADF pipelines for bulk file ingestion; sensor tasks poll for completion before downstream Databricks jobs run
  • Databricks Operator: DatabricksRunNowOperator triggers pre-defined Databricks jobs (notebook tasks) for transformation and FHIR conversion, passing parameterized run configs via notebook_params
  • Data Quality Gates: Custom Airflow operators run Great Expectations checkpoints against Delta tables; DAG branches to a quarantine flow on suite failure rather than halting the entire pipeline
  • Alerting: Task failure callbacks post to a clinical ops Slack channel with run context and link to Airflow logs; PagerDuty alert on SLA miss for critical ADT feeds
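The ADF-trigger, sensor, and Databricks steps above can be sketched in a single per-source DAG. This is a minimal sketch, not the deployed code: all identifiers (DAG id, connection ids, pipeline name, Databricks job ids, schedule, notebook parameters) are illustrative assumptions.

```python
# Sketch of one per-source DAG: fire an ADF bulk-ingest pipeline without
# blocking, poll it with a status sensor, then run a shared
# Bronze -> Silver -> Gold promotion task group on Databricks.
# All ids (connections, pipeline/job names, schedule) are illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models.xcom_arg import XComArg
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator
from airflow.providers.microsoft.azure.operators.data_factory import (
    AzureDataFactoryRunPipelineOperator,
)
from airflow.providers.microsoft.azure.sensors.data_factory import (
    AzureDataFactoryPipelineRunStatusSensor,
)
from airflow.utils.task_group import TaskGroup

with DAG(
    dag_id="adt_feed_ingestion",
    start_date=datetime(2023, 11, 1),
    schedule="*/15 * * * *",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    # Kick off the ADF bulk-ingestion pipeline and return immediately,
    # so the worker slot is not held while ADF runs.
    run_adf = AzureDataFactoryRunPipelineOperator(
        task_id="run_adf_ingest",
        pipeline_name="pl_adt_bulk_ingest",
        azure_data_factory_conn_id="adf_default",
        wait_for_termination=False,
    )

    # Sensor polls the ADF run id the operator pushed to XCom.
    wait_adf = AzureDataFactoryPipelineRunStatusSensor(
        task_id="wait_adf_ingest",
        run_id=XComArg(run_adf, key="run_id"),
        azure_data_factory_conn_id="adf_default",
        poke_interval=60,
        timeout=60 * 30,
    )

    # Shared promotion steps, reusable across source DAGs as a task group.
    with TaskGroup(group_id="promote") as promote:
        bronze_to_silver = DatabricksRunNowOperator(
            task_id="bronze_to_silver",
            databricks_conn_id="databricks_default",
            job_id=101,  # illustrative job id
            notebook_params={"source": "adt", "run_date": "{{ ds }}"},
        )
        silver_to_gold = DatabricksRunNowOperator(
            task_id="silver_to_gold",
            databricks_conn_id="databricks_default",
            job_id=102,  # illustrative job id
            notebook_params={"source": "adt", "run_date": "{{ ds }}"},
        )
        bronze_to_silver >> silver_to_gold

    run_adf >> wait_adf >> promote
```

Decoupling the trigger (`wait_for_termination=False`) from the sensor keeps long ADF runs from occupying worker slots, at the cost of one extra task per pipeline.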
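The quarantine branch in the quality-gate bullet reduces to a small routing decision that a BranchPythonOperator can call. The task ids and result shape below are illustrative assumptions, not the project's actual names; the function only relies on the `success` flag that a Great Expectations checkpoint result exposes.

```python
# Pure decision function for the quality-gate branch: given a
# Great Expectations checkpoint result (as a dict), pick the downstream
# task id. Task ids "promote_to_gold" and "quarantine_batch" are
# illustrative; a missing or falsy "success" flag routes to quarantine,
# so the pipeline diverts bad batches instead of halting.

def choose_downstream(checkpoint_result: dict) -> str:
    """Route to gold promotion on suite success, else to the quarantine flow."""
    if checkpoint_result.get("success"):
        return "promote_to_gold"
    return "quarantine_batch"

# In the DAG this would be wired through a BranchPythonOperator, e.g.:
# quality_gate = BranchPythonOperator(
#     task_id="quality_gate",
#     python_callable=lambda ti: choose_downstream(
#         ti.xcom_pull(task_ids="run_checkpoint")),
# )
```

Branching (rather than failing the gate task) means only the bad batch is diverted; healthy sources in the same DAG run continue to promote.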
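The Slack alerting callback can be split into a testable payload builder plus a thin Airflow hook-up. The message layout, emoji, and webhook wiring below are illustrative assumptions; the Airflow context keys referenced in the comment (`task_instance`, `run_id`, `log_url`) are standard.

```python
# Build the Slack webhook body for a task-failure alert, keeping the
# formatting logic separate from Airflow so it can be unit tested.
# Message layout is illustrative.
import json


def build_failure_payload(dag_id: str, task_id: str, run_id: str, log_url: str) -> str:
    """Return a JSON string for Slack's incoming-webhook `text` format."""
    text = (
        f":rotating_light: *{dag_id}.{task_id}* failed for run `{run_id}`\n"
        f"<{log_url}|Airflow task logs>"
    )
    return json.dumps({"text": text})


# As an Airflow on_failure_callback this would be posted roughly like:
# def notify_slack(context):
#     ti = context["task_instance"]
#     requests.post(
#         SLACK_WEBHOOK_URL,  # hypothetical secret/env var
#         data=build_failure_payload(ti.dag_id, ti.task_id,
#                                    context["run_id"], ti.log_url),
#         headers={"Content-Type": "application/json"},
#     )
```

Setting the callback in `default_args` applies it to every task in the DAG, so one definition covers all sources.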

Key Outcomes

  • Consolidated 12 disparate cron/ADF trigger sequences into 4 maintainable Airflow DAGs
  • Data quality gates prevented two upstream schema breaks from reaching gold-layer tables consumed by BI dashboards
  • SLA monitoring surfaced a recurring 3 AM feed delay that traced back to an upstream hospital maintenance window