Clinical Data Warehouse — OMOP CDM with dbt

Transformed a multi-source clinical Delta Lake into an OMOP Common Data Model using dbt on Databricks, enabling standardized cohort analysis and federated research queries across patient populations.

March 1, 2024
dbt Databricks OMOP CDM SQL Delta Lake Data Modeling Azure Python


Overview

Implemented an OMOP CDM v5.4 transformation layer on top of a clinical Delta Lake using dbt (data build tool) running on Databricks SQL warehouses, enabling research and analytics teams to run standardized phenotyping queries without touching raw EHR data.
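As a sketch of what the transformation layer looks like, a minimal dbt staging model selects from a silver-layer Delta table and renames columns toward OMOP Person conventions. The source, table, and column names here are illustrative assumptions, not the project's actual schema:

```sql
-- models/staging/stg_patients.sql (illustrative sketch; real source/column names differ)
-- Light staging layer: rename silver-layer fields toward OMOP Person terminology
-- so downstream OMOP models never reference raw EHR column names directly.
select
    patient_id   as person_source_value,
    gender_code  as gender_source_value,
    birth_date   as birth_datetime,
    race_code    as race_source_value,
    updated_at
from {{ source('silver', 'patients') }}
```

Keeping staging models this thin is a common dbt convention: all concept mapping and business logic lives in later layers, so schema drift in the silver tables is absorbed in one place.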

Architecture

  • Source: Silver-layer Delta tables containing normalized patient, encounter, order, and result data from the HL7 lakehouse pipeline
  • Transformation: dbt models map source clinical concepts to OMOP domains (Person, Visit Occurrence, Condition Occurrence, Measurement, Drug Exposure) using custom concept mapping tables maintained in Delta
  • Vocabulary: OMOP standard vocabularies (SNOMED, LOINC, RxNorm) loaded into Databricks Unity Catalog and joined at transformation time
  • Testing: dbt tests enforce referential integrity, concept coverage thresholds, and null constraints on mandatory OMOP fields
  • Orchestration: dbt runs scheduled via Azure Data Factory with incremental materialization strategies to limit Databricks compute cost
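The mapping and incremental strategy described above can be sketched as a single dbt model. This is a hedged example under assumed names (`stg_diagnoses`, `concept_map`, and their columns are illustrative, not the production code); it joins a silver-layer diagnoses staging model to a Delta-maintained concept mapping table and materializes incrementally via merge, which the dbt-databricks adapter supports for Delta tables:

```sql
-- models/omop/condition_occurrence.sql (illustrative sketch)
{{
    config(
        materialized='incremental',
        unique_key='condition_occurrence_id',
        incremental_strategy='merge'  -- Delta merge via the dbt-databricks adapter
    )
}}

select
    dx.diagnosis_id                    as condition_occurrence_id,
    p.person_id,
    coalesce(cm.target_concept_id, 0)  as condition_concept_id,  -- 0 = unmapped, per OMOP convention
    dx.diagnosis_date                  as condition_start_date,
    dx.source_code                     as condition_source_value,
    dx.updated_at
from {{ ref('stg_diagnoses') }} dx
join {{ ref('person') }} p
    on dx.patient_source_value = p.person_source_value
left join {{ ref('concept_map') }} cm
    on dx.source_code = cm.source_code
{% if is_incremental() %}
  -- On scheduled runs, only reprocess rows changed since the last materialization
  where dx.updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

The left join plus `coalesce(..., 0)` keeps unmapped source codes visible (concept_id 0) rather than silently dropping them, which is what makes concept-coverage tests meaningful downstream.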

Key Outcomes

  • The standardized OMOP layer enabled cross-site cohort queries that previously required manual data-extraction requests
  • dbt test suite caught 3 upstream schema changes before they propagated to research consumers
  • Incremental dbt models reduced daily transformation compute time by 65% vs. full refresh
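The referential-integrity and null-constraint tests that caught those upstream changes are declared in dbt's standard YAML form. A sketch with illustrative model and column names:

```yaml
# models/omop/schema.yml (illustrative sketch)
version: 2
models:
  - name: condition_occurrence
    columns:
      - name: condition_occurrence_id
        tests:
          - unique
          - not_null
      - name: person_id
        tests:
          - not_null
          - relationships:          # referential integrity back to the Person table
              to: ref('person')
              field: person_id
      - name: condition_concept_id
        tests:
          - not_null                # mandatory OMOP field
```

Because `dbt test` runs these on every scheduled build, a renamed or dropped upstream column fails the run before the broken data reaches research consumers.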