Implementing Change Data Capture (CDC) in a Unity Catalog-Based Lakehouse Architecture
As organizations scale, full data reload pipelines quickly become inefficient and risky. Reporting refresh windows grow longer, source systems experience increased load, and data duplication issues begin to surface.
In our recent Unity Catalog-based Lakehouse implementation, we modernized incremental data processing using a structured Change Data Capture (CDC) strategy. Instead of reloading entire datasets daily, we captured only incremental changes across CRM, ERP, HR, and finance systems and governed them through Unity Catalog.
This blog explains how we designed and implemented CDC in a production-ready Lakehouse architecture, the decisions behind our approach, and the technical patterns that made it scalable.
Centralized Incremental Control Using Metadata Configuration
One of the first challenges in CDC implementations is avoiding hardcoded logic for every entity.
Instead of embedding incremental rules inside notebooks, we designed a centralized configuration table that drives CDC dynamically.
Each record in this control table defines:
- Source system (CRM, Business Central, Zoho People, Zoho Books, QuickBooks, etc.)
- Entity name
- Incremental field (e.g., modifiedon, systemCreatedAt)
- Primary key
- Data purge flag
- Ingestion timestamp
This allowed us to manage incremental extraction logic centrally without modifying pipeline code for every new table.
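As a rough sketch of how the pipeline consumes this control table, the snippet below reads one entry and derives the source-side incremental filter from it. This is plain Python with illustrative field names; the actual column names and storage (an Azure Storage Table in our case) differ in production.

```python
from datetime import datetime

# Illustrative shape of one control-table record; real column names
# in the Azure Storage Table may differ.
config = [
    {
        "source_system": "CRM",
        "entity": "contacts",
        "incremental_field": "modifiedon",
        "primary_key": "contactid",
        "purge": False,
        "last_ingestion_ts": datetime(2024, 1, 1, 0, 0, 0),
    },
]

def build_incremental_filter(entry: dict) -> str:
    """Build the source-side WHERE filter from a control-table entry."""
    ts = entry["last_ingestion_ts"].strftime("%Y-%m-%dT%H:%M:%S")
    return f"{entry['incremental_field']} > '{ts}'"

for entry in config:
    print(entry["entity"], "->", build_incremental_filter(entry))
```

Because the incremental field and watermark live in data rather than code, onboarding a new entity is a config insert, not a notebook change.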
Fig – Azure Storage Table showing IncrementalField and Timestamp columns
Why This Matters
This configuration-driven design enabled:
- Adding new entities without code changes
- Supporting different incremental fields per source
- Managing ingestion behavior centrally
- Improving maintainability and scalability
Most CDC blogs discuss theory. Few show how incremental control is actually governed in production.
Bronze Layer: Append-Only Incremental Capture
Once incremental records are identified, they land in the bronze layer in Delta format.
Key design decisions:
- Append-only ingestion
- No transformations
- Ingestion metadata added (batch_id, ingestion_timestamp)
- Raw source structure preserved
The Bronze layer acts as the immutable change log of the system.
This ensures:
- Reprocessing capability
- Audit traceability
- Recovery from downstream failures
Bronze is not for reporting. It is for reliability.
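The stamping step can be sketched in plain Python. In production this is a Delta append in Spark; the function name and record shape here are illustrative, but the principle is the same: source fields are never modified, and only ingestion metadata is added.

```python
import uuid
from datetime import datetime, timezone

def stamp_for_bronze(records, batch_id=None):
    """Attach ingestion metadata without touching source fields
    (append-only, no transformations)."""
    batch_id = batch_id or uuid.uuid4().hex
    ts = datetime.now(timezone.utc).isoformat()
    return [
        {**rec, "batch_id": batch_id, "ingestion_timestamp": ts}
        for rec in records
    ]

incoming = [{"contactid": "C-1", "name": "Alice"}]
bronze_rows = stamp_for_bronze(incoming, batch_id="batch-001")
```

The batch_id makes any bronze slice replayable: a failed downstream run can reprocess exactly the records of one batch.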
Structuring CDC Layers with Unity Catalog
To ensure proper governance and separation of concerns, we structured our Lakehouse using Unity Catalog with domain-based schemas.
Each environment (dev, test, prod) had its own catalog. Within each catalog:
- Bronze schema for raw incremental data
- Silver schema for curated tables
- Gold schema for reporting-ready datasets
(Unity Catalog Bronze schema view)
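A small helper illustrates how the three-level Unity Catalog namespace (catalog.schema.table) maps onto this layout. The "<env>_lakehouse" catalog naming convention below is an assumption for illustration, not our actual catalog names.

```python
# Unity Catalog uses a three-level namespace: catalog.schema.table.
# Environment-per-catalog, layer-per-schema (naming is illustrative).
ENVIRONMENTS = ("dev", "test", "prod")
LAYERS = ("bronze", "silver", "gold")

def fq_table(env: str, layer: str, table: str) -> str:
    """Return the fully qualified table name for an env/layer pair."""
    if env not in ENVIRONMENTS or layer not in LAYERS:
        raise ValueError(f"unknown environment/layer: {env}.{layer}")
    return f"{env}_lakehouse.{layer}.{table}"

print(fq_table("prod", "gold", "sales"))
```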
Why Unity Catalog Was Critical
Unity Catalog ensured:
- Controlled access to the Bronze and Silver layers
- Clear ownership of schemas
- Isolation across environments
- Power BI access restricted to Gold only
CDC without governance can become fragile. Unity Catalog added structure and security to the incremental architecture.
Silver Layer: Applying CDC with Delta MERGE
The Silver layer is where CDC logic is applied.
We implemented Type 1 Change Data Capture using Delta Lake MERGE operations.
The logic follows:
MERGE INTO silver.target_table t
USING bronze.source_table s
ON t.primary_key = s.primary_key
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
This approach ensures:
- No duplicate records
- Updates overwrite previous values
- Pipelines remain idempotent
- Safe reprocessing
If a job runs twice, the data remains consistent.
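The idempotency claim can be demonstrated with a small Python simulation of the MERGE semantics, where a dictionary keyed by primary key stands in for the Delta table (merge_type1 is a hypothetical helper, not Delta Lake API):

```python
def merge_type1(target: dict, changes: list, pk: str) -> dict:
    """Simulate Type 1 CDC MERGE semantics: matched keys are
    overwritten (UPDATE SET *), new keys inserted (INSERT *)."""
    for row in changes:
        target[row[pk]] = row
    return target

silver = {"C-1": {"contactid": "C-1", "city": "Pune"}}
batch = [
    {"contactid": "C-1", "city": "Mumbai"},  # update existing key
    {"contactid": "C-2", "city": "Nagpur"},  # insert new key
]

merge_type1(silver, batch, pk="contactid")
merge_type1(silver, batch, pk="contactid")  # rerun: state unchanged
```

Running the same batch twice leaves the target in the same state, which is exactly why a retried or duplicated pipeline run is safe.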
We intentionally chose Type 1 because reporting required the latest operational state rather than historical tracking.
Handling Late-Arriving Data
One common CDC failure point is late-arriving records.
If extraction logic strictly uses modified_timestamp > last_run_time, some records may be missed due to clock drift or processing delays.
To mitigate this, we:
- Implemented a small overlap window
- Relied on MERGE logic to prevent duplicates
- Periodically validated record counts
This ensured no silent data loss.
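The overlap window itself is a one-line adjustment to the watermark. The 15-minute window and function name below are illustrative; the right size depends on source latency and clock skew.

```python
from datetime import datetime, timedelta

OVERLAP = timedelta(minutes=15)  # illustrative; tune per source system

def extraction_lower_bound(last_run_time: datetime) -> datetime:
    """Shift the watermark back by a small overlap so records that
    arrive late (clock drift, processing delay) are re-read.
    Duplicates from the overlap are absorbed by the Silver MERGE."""
    return last_run_time - OVERLAP

lb = extraction_lower_bound(datetime(2024, 5, 1, 12, 0))
```

The overlap deliberately re-reads a few records per run; that cost is negligible because the idempotent MERGE deduplicates them.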
Governance and Power BI Integration
A key architectural decision was limiting Power BI access strictly to Gold tables.
Through Unity Catalog:
- Bronze and Silver layers were restricted
- Access was managed through Azure AD groups
- Service principals were used for BI authentication
- Environment isolation prevented cross-environment contamination
This ensured reporting teams could not accidentally query raw incremental data.
The result was a clean, governed reporting layer powered by curated Delta tables.
Performance Optimization Considerations
To maintain optimal performance:
- Delta tables were periodically optimized
- Small file management was monitored
- Partitioning aligned with business date columns
- Incremental loads reduced overall compute consumption
Compared to full data reloads, incremental CDC significantly reduced cluster runtime and improved refresh stability.
Common CDC Mistakes We Avoided
During implementation, we intentionally avoided:
- Insert-only incremental logic
- Granting BI access to Bronze tables
- Ignoring delete handling
- Hardcoding incremental timestamps
- Full table overwrites
These mistakes often appear only after production failures.
Designing CDC carefully from the start prevented costly refactoring later.
Business Impact
By implementing CDC within a Unity Catalog-governed Lakehouse:
- Reporting refresh times became predictable
- Source system load reduced significantly
- Data duplication issues were eliminated
- Governance improved across environments
- New entities could be onboarded quickly
The architecture is now scalable and future-ready.
Conclusion
Change Data Capture is not just an incremental filter. It is a disciplined architectural pattern.
When combined with:
1. Delta Lake MERGE
2. Metadata-driven incremental control
3. Unity Catalog governance
4. Structured Bronze-Silver-Gold layering
It becomes a powerful foundation for enterprise analytics.
Organizations modernizing their reporting platforms must move beyond full reload pipelines and adopt structured CDC approaches that prioritize scalability, reliability, and governance.
Ready to modernize your reporting platform with a governed, CDC-driven Lakehouse? Get in touch with CloudFronts at transform@cloudfronts.com.