Implementing Change Data Capture (CDC) in a Unity Catalog-Based Lakehouse Architecture

As organizations scale, full data reload pipelines quickly become inefficient and risky. Reporting refresh windows grow longer, source systems experience increased load, and data duplication issues begin to surface.

In our recent Unity Catalog-based Lakehouse implementation, we modernized incremental data processing using a structured Change Data Capture (CDC) strategy. Instead of reloading entire datasets daily, we captured only incremental changes across CRM, ERP, HR, and finance systems and governed them through Unity Catalog.

This blog explains how we designed and implemented CDC in a production-ready Lakehouse architecture, the decisions behind our approach, and the technical patterns that made it scalable.

Centralized Incremental Control Using Metadata Configuration

One of the first challenges in CDC implementations is avoiding hardcoded logic for every entity.

Instead of embedding incremental rules inside notebooks, we designed a centralized configuration table that drives CDC dynamically.

Each record in this control table defines:

  1. Source system (CRM, Business Central, Zoho People, Zoho Books, QuickBooks, etc.)
  2. Entity name
  3. Incremental field (e.g., modifiedon, systemCreatedAt)
  4. Primary key
  5. Data purge flag
  6. Ingestion timestamp

This allowed us to manage incremental extraction logic centrally without modifying pipeline code for every new table.
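As an illustration, a control-table record and the incremental filter derived from it can be sketched as follows. The column names and the helper `build_extraction_filter` are hypothetical, chosen only to convey the pattern, not our actual schema:

```python
from datetime import datetime, timezone

# Hypothetical sketch of one control-table record per entity.
# Field names mirror the list above; real column names will differ.
control_table = [
    {
        "source_system": "CRM",
        "entity_name": "account",
        "incremental_field": "modifiedon",
        "primary_key": "accountid",
        "purge_flag": False,
        "last_ingestion_ts": datetime(2024, 1, 1, tzinfo=timezone.utc),
    },
]

def build_extraction_filter(entry):
    """Derive the incremental WHERE clause for one entity from its config row."""
    ts = entry["last_ingestion_ts"].strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{entry['incremental_field']} > '{ts}'"

# Each run reads the config and builds its filter dynamically, so onboarding
# a new entity is a config insert, not a code change.
print(build_extraction_filter(control_table[0]))
# modifiedon > '2024-01-01T00:00:00Z'
```

Because the incremental field and watermark live in configuration, the same extraction code serves every source system.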

Fig – Azure Storage Table showing IncrementalField and Timestamp columns

Why This Matters

This configuration-driven design enabled:

  • Adding new entities without code changes
  • Supporting different incremental fields per source
  • Managing ingestion behavior centrally
  • Improving maintainability and scalability

Most CDC blogs discuss theory. Few show how incremental control is actually governed in production.

Bronze Layer: Append-Only Incremental Capture

Once incremental records are identified, they land in the bronze layer in Delta format.

Key design decisions:

  • Append-only ingestion
  • No transformations
  • Ingestion metadata added (batch_id, ingestion_timestamp)
  • Raw source structure preserved

The Bronze layer acts as the immutable change log of the system.

This ensures:

  • Reprocessing capability
  • Audit traceability
  • Recovery from downstream failures

Bronze is not for reporting. It is for reliability.
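The Bronze ingestion step can be sketched in plain Python as follows. The function name `stamp_for_bronze` and the sample fields are hypothetical; the point is that lineage metadata is added while source fields are left untouched:

```python
import uuid
from datetime import datetime, timezone

def stamp_for_bronze(records, batch_id=None):
    """Append-only Bronze sketch: add lineage metadata, never mutate source fields."""
    batch_id = batch_id or str(uuid.uuid4())
    ingestion_ts = datetime.now(timezone.utc).isoformat()
    # New dicts are built so the raw source records are preserved as-is.
    return [
        {**rec, "batch_id": batch_id, "ingestion_timestamp": ingestion_ts}
        for rec in records
    ]

incoming = [{"accountid": "A1", "name": "Contoso", "modifiedon": "2024-01-02T10:00:00Z"}]
bronze_rows = stamp_for_bronze(incoming, batch_id="batch-001")
# Source structure is preserved; only the two metadata columns are added.
```

In the real pipeline this corresponds to appending the incremental batch to a Delta table with `batch_id` and `ingestion_timestamp` columns, never updating or deleting existing Bronze rows.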

Structuring CDC Layers with Unity Catalog

To ensure proper governance and separation of concerns, we structured our Lakehouse using Unity Catalog with domain-based schemas.

Each environment (dev, test, prod) had its own catalog. Within each catalog:

  • Bronze schema for raw incremental data
  • Silver schema for curated tables
  • Gold schema for reporting-ready datasets

Fig – Unity Catalog Bronze schema view
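A small helper illustrates the resulting three-level namespace. The catalog naming convention (`lakehouse_<env>`) is a hypothetical example, not necessarily the names we used:

```python
# Hypothetical naming helper: one catalog per environment, layer schemas within.
LAYERS = {"bronze", "silver", "gold"}

def qualified_name(env, layer, table):
    """Return the three-level Unity Catalog name <catalog>.<schema>.<table>."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"lakehouse_{env}.{layer}.{table}"

print(qualified_name("prod", "silver", "accounts"))
# lakehouse_prod.silver.accounts
```

Keeping environments as separate catalogs means a dev pipeline can never accidentally write to a prod schema, because the catalog name itself is parameterized per environment.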

Why Unity Catalog Was Critical

Unity Catalog ensured:

  • Controlled access to Bronze and Silver layers
  • Clear ownership of schemas
  • Isolation across environments
  • Power BI access restricted to Gold only

CDC without governance can become fragile. Unity Catalog added structure and security to the incremental architecture.

Silver Layer: Applying CDC with Delta MERGE

The Silver layer is where CDC logic is applied.

We implemented Type 1 Change Data Capture using Delta Lake MERGE operations.

The logic follows:

MERGE INTO silver.target_table t
USING bronze.source_table s
ON t.primary_key = s.primary_key
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

This approach ensures:
  • No duplicate records
  • Updates overwrite previous values
  • Idempotent pipelines
  • Safe reprocessing

If a job runs twice, the data remains consistent.
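The idempotency claim can be demonstrated with a minimal in-memory sketch of the Type 1 upsert, standing in for what Delta MERGE does at scale (the function and sample data are illustrative only):

```python
def type1_merge(target, changes, key):
    """Simulate Type 1 MERGE: matched rows are overwritten, new rows inserted."""
    for row in changes:
        # WHEN MATCHED THEN UPDATE / WHEN NOT MATCHED THEN INSERT
        target[row[key]] = row
    return target

silver = {"A1": {"accountid": "A1", "name": "Contoso"}}
batch = [
    {"accountid": "A1", "name": "Contoso Ltd"},  # update to an existing key
    {"accountid": "A2", "name": "Fabrikam"},     # insert of a new key
]

once = type1_merge(dict(silver), batch, "accountid")
twice = type1_merge(dict(once), batch, "accountid")
assert once == twice  # replaying the same batch leaves the data unchanged
```

Because the operation is keyed on the primary key, applying the same change batch any number of times converges to the same state, which is exactly why a rerun after a failure is safe.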

We intentionally chose Type 1 because reporting required the latest operational state rather than historical tracking.

Handling Late-Arriving Data

One common CDC failure point is late-arriving records.

If extraction logic strictly uses modified_timestamp > last_run_time, some records may be missed due to clock drift or processing delays.

To mitigate this, we:

  • Implemented a small overlap window
  • Relied on MERGE logic to prevent duplicates
  • Periodically validated record counts

This ensured no silent data loss.
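The overlap window itself is a one-line adjustment to the watermark. The 15-minute size below is a hypothetical value; the right window depends on how far source timestamps can drift:

```python
from datetime import datetime, timedelta, timezone

OVERLAP = timedelta(minutes=15)  # hypothetical overlap window size

def extraction_lower_bound(last_run_time):
    """Shift the watermark back by a small overlap so late-arriving rows are re-read.

    Any duplicates this re-introduces are absorbed by the idempotent
    MERGE applied downstream in the Silver layer.
    """
    return last_run_time - OVERLAP

last_run = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
print(extraction_lower_bound(last_run).isoformat())
# 2024-01-02T11:45:00+00:00
```

The trade-off is deliberate: a slightly wider extraction window costs a few duplicate reads, which the MERGE discards, in exchange for never silently dropping a late record.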

Governance and Power BI Integration

A key architectural decision was limiting Power BI access strictly to Gold tables.

Through Unity Catalog:

  • Bronze and Silver layers were restricted
  • Access was managed through Azure AD groups
  • Service principals were used for BI authentication
  • Environment isolation prevented cross-environment contamination

This ensured reporting teams could not accidentally query raw incremental data.

The result was a clean, governed reporting layer powered by curated Delta tables.

Performance Optimization Considerations

To maintain optimal performance:

  • Delta tables were periodically optimized
  • Small file management was monitored
  • Partitioning aligned with business date columns
  • Incremental loads reduced overall compute consumption

Compared to full data reloads, incremental CDC significantly reduced cluster runtime and improved refresh stability.

Common CDC Mistakes We Avoided

During implementation, we intentionally avoided:

  • Insert-only incremental logic
  • Granting BI access to Bronze tables
  • Ignoring delete handling
  • Hardcoding incremental timestamps
  • Full table overwrites

These mistakes often appear only after production failures.

Designing CDC carefully from the start prevented costly refactoring later.

Business Impact

By implementing CDC within a Unity Catalog-governed Lakehouse:

  • Reporting refresh times became predictable
  • Source system load was reduced significantly
  • Data duplication issues were eliminated
  • Governance improved across environments
  • New entities could be onboarded quickly

The architecture is now scalable and future-ready.

Conclusion

Change Data Capture is not just an incremental filter. It is a disciplined architectural pattern.

When combined with:

  1. Delta Lake MERGE
  2. Metadata-driven incremental control
  3. Unity Catalog governance
  4. Structured Bronze-Silver-Gold layering

It becomes a powerful foundation for enterprise analytics.

Organizations modernizing their reporting platforms must move beyond full reload pipelines and adopt structured CDC approaches that prioritize scalability, reliability, and governance.

Ready to deploy AIS to seamlessly connect systems and improve operational cost and efficiency? Get in touch with CloudFronts at transform@cloudfronts.com.

