Multi-Source EHR Pipeline Orchestration with Airflow
Overview
Designed and deployed an Apache Airflow orchestration layer to coordinate a complex clinical data platform with multiple ingestion sources, dependent transformation jobs, and data quality gates — replacing a fragile mix of cron jobs and ADF triggers.
Architecture
- DAG Design: Modular DAGs per data source (hospital ADT feed, reference lab ORU feed, instrument LIS export) with shared task groups for Bronze → Silver → Gold promotion
- ADF Integration: Airflow's `AzureDataFactoryRunPipelineOperator` triggers ADF pipelines for bulk file ingestion; sensor tasks poll for completion before downstream Databricks jobs run
- Databricks Operator: `DatabricksRunNowOperator` kicks off Databricks notebooks for transformation and FHIR conversion with parameterized run configs
- Data Quality Gates: Custom Airflow operators run Great Expectations checkpoints against Delta tables; the DAG branches to a quarantine flow on suite failure rather than halting the entire pipeline
- Alerting: Task failure callbacks post to a clinical ops Slack channel with run context and link to Airflow logs; PagerDuty alert on SLA miss for critical ADT feeds
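The ingest → transform → quality-gate → promote/quarantine flow described above can be sketched as a single DAG. This is a minimal illustration, not the production code: the DAG id, ADF pipeline name, Databricks job id, connection ids, and the `run_checkpoint` helper are all assumptions, and the custom Great Expectations operator is stood in for by a `BranchPythonOperator` that calls a checkpoint and returns the downstream task id.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator
from airflow.providers.microsoft.azure.operators.data_factory import (
    AzureDataFactoryRunPipelineOperator,
)


def _notify_slack(context):
    """Failure callback: post run context to the clinical-ops channel (stub).

    Production would use a Slack webhook or the Slack provider hook.
    """
    print(f"FAILED: {context['task_instance'].task_id} run {context['run_id']}")


def _check_quality(**_):
    """Run a GE checkpoint and pick the downstream branch.

    `run_checkpoint` is a hypothetical helper wrapping Great Expectations'
    checkpoint API; on suite failure the batch is quarantined instead of
    halting the whole DAG.
    """
    result = run_checkpoint("silver_adt_suite")  # hypothetical helper
    return "promote_to_gold" if result.success else "quarantine_batch"


with DAG(
    dag_id="adt_feed_pipeline",  # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
    default_args={
        "on_failure_callback": _notify_slack,
        "sla": timedelta(hours=1),  # SLA miss escalates to PagerDuty
    },
) as dag:
    # Bulk file ingestion via ADF. wait_for_termination=True blocks until the
    # ADF run finishes; a deferrable AzureDataFactoryPipelineRunStatusSensor
    # is the polling alternative described in the architecture notes.
    ingest = AzureDataFactoryRunPipelineOperator(
        task_id="adf_bulk_ingest",
        pipeline_name="pl_adt_bulk_ingest",  # illustrative ADF pipeline name
        azure_data_factory_conn_id="adf_default",
        wait_for_termination=True,
    )

    # Parameterized Databricks transformation / FHIR conversion job.
    transform = DatabricksRunNowOperator(
        task_id="dbx_fhir_transform",
        databricks_conn_id="databricks_default",
        job_id=1234,  # illustrative Databricks job id
        notebook_params={"layer": "silver", "source": "adt"},
    )

    quality_gate = BranchPythonOperator(
        task_id="quality_gate",
        python_callable=_check_quality,
    )

    promote = EmptyOperator(task_id="promote_to_gold")
    quarantine = EmptyOperator(task_id="quarantine_batch")

    ingest >> transform >> quality_gate >> [promote, quarantine]
```

One modular DAG like this per source (ADT, ORU, LIS export), with the Bronze → Silver → Gold steps factored into shared task groups, is what keeps the per-source DAG count at four.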
Key Outcomes
- Consolidated 12 disparate cron/ADF trigger sequences into 4 maintainable Airflow DAGs
- Data quality gates prevented two upstream schema breaks from reaching gold-layer tables consumed by BI dashboards
- SLA monitoring surfaced a recurring 3 AM feed delay that traced back to an upstream hospital maintenance window