Clinical Data Warehouse — OMOP CDM with dbt
Overview
Implemented an OMOP CDM v5.4 transformation layer on top of a clinical Delta Lake using dbt (data build tool) running on Databricks SQL warehouses. Research and analytics teams can now run standardized phenotyping queries without touching raw EHR data.
Architecture
- Source: Silver-layer Delta tables containing normalized patient, encounter, order, and result data from the HL7 lakehouse pipeline
- Transformation: dbt models map source clinical concepts to OMOP domains (Person, Visit Occurrence, Condition Occurrence, Measurement, Drug Exposure) using custom concept mapping tables maintained in Delta
- Vocabulary: OMOP standard vocabularies (SNOMED, LOINC, RxNorm) loaded into Databricks Unity Catalog and joined at transformation time
- Testing: dbt tests enforce referential integrity, concept coverage thresholds, and null constraints on mandatory OMOP fields
- Orchestration: dbt runs are scheduled via Azure Data Factory, and models use incremental materialization strategies to limit Databricks compute cost
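The concept mapping and coverage-threshold ideas in the Transformation and Testing items can be illustrated with a minimal Python sketch. This is not the project's dbt code: the mapping keys, the placeholder concept IDs (1001, 1002), and the function names are all hypothetical. The only convention relied on is that OMOP uses concept_id 0 for source values with no matching concept.

```python
from typing import Dict, Iterable, List, Tuple

# OMOP convention: concept_id 0 marks a source value with no matching concept.
UNMAPPED_CONCEPT_ID = 0

SourceKey = Tuple[str, str]  # (source vocabulary, source code)


def map_to_standard(rows: Iterable[SourceKey],
                    concept_map: Dict[SourceKey, int]) -> List[int]:
    """Look up each source code; fall back to 0 when no mapping exists."""
    return [concept_map.get(row, UNMAPPED_CONCEPT_ID) for row in rows]


def concept_coverage(concept_ids: Iterable[int]) -> float:
    """Fraction of rows that received a non-zero concept_id."""
    ids = list(concept_ids)
    if not ids:
        # Assumption: an empty table passes vacuously rather than failing.
        return 1.0
    mapped = sum(1 for cid in ids if cid != UNMAPPED_CONCEPT_ID)
    return mapped / len(ids)


def passes_coverage(concept_ids: Iterable[int], threshold: float) -> bool:
    """Mimics a threshold-style data test: pass when coverage >= threshold."""
    return concept_coverage(concept_ids) >= threshold


# Hypothetical mapping table and source rows (placeholder IDs, not real concepts).
concept_map = {("LOCAL_DX", "DM2"): 1001, ("LOCAL_DX", "HTN"): 1002}
source_rows = [
    ("LOCAL_DX", "DM2"),
    ("LOCAL_DX", "HTN"),
    ("LOCAL_DX", "???"),  # no mapping, so this row becomes concept_id 0
    ("LOCAL_DX", "DM2"),
]

condition_concept_ids = map_to_standard(source_rows, concept_map)
coverage = concept_coverage(condition_concept_ids)  # 3 of 4 rows mapped = 0.75
```

Keeping unmapped rows with concept_id 0 instead of dropping them lets a coverage test measure mapping quality directly. The threshold value itself is a per-domain choice that the write-up does not specify.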
Key Outcomes
- OMOP CDM enabled cross-site cohort queries that previously required manual data extraction requests
- The dbt test suite caught three upstream schema changes before they propagated to research consumers
- Incremental dbt models reduced daily transformation compute time by 65% compared with a full refresh
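The saving from incremental models comes from processing only rows that changed since the last run. A rough Python sketch of that high-water-mark pattern follows. The column names (`updated_at`, `visit_occurrence_id` as the merge key) and the function names are assumptions made for illustration, and the write-up does not say which incremental strategy or watermark column the project used.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]


def high_water_mark(target_rows: List[Row]) -> Optional[datetime]:
    """Latest updated_at already present in the target, or None if it is empty."""
    return max((r["updated_at"] for r in target_rows), default=None)


def incremental_merge(target_rows: List[Row], source_rows: List[Row],
                      key: str = "visit_occurrence_id") -> Tuple[List[Row], int]:
    """Upsert only source rows newer than the target's high-water mark."""
    watermark = high_water_mark(target_rows)
    changed = [
        r for r in source_rows
        if watermark is None or r["updated_at"] > watermark
    ]
    merged = {r[key]: r for r in target_rows}
    for r in changed:
        merged[r[key]] = r  # insert new keys, overwrite existing ones
    return list(merged.values()), len(changed)


# Hypothetical data: the target holds two visits, and the source has one
# unchanged row, one updated row, and one new row.
target = [
    {"visit_occurrence_id": 1, "updated_at": datetime(2024, 1, 1)},
    {"visit_occurrence_id": 2, "updated_at": datetime(2024, 1, 2)},
]
source = [
    {"visit_occurrence_id": 1, "updated_at": datetime(2024, 1, 1)},
    {"visit_occurrence_id": 2, "updated_at": datetime(2024, 1, 3)},
    {"visit_occurrence_id": 3, "updated_at": datetime(2024, 1, 4)},
]

merged_rows, rows_processed = incremental_merge(target, source)
```

Here only two of the three source rows are processed. In dbt, the equivalent filter is typically written inside an `is_incremental()` block that compares the source timestamp against the maximum already present in `{{ this }}`. A strict `>` comparison can miss late-arriving rows that share the watermark timestamp, so a lookback window is a common refinement.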