Call Us: (212) 336-1440-556

CDC(Change Data Capture) with Relational Databases

Post last modified:July 16, 2022
Reading time:3 mins read

Introduction

When we build a data pipeline the first step in the process is to identify the data source and how to extract that data in order to process.

During data extraction INSERT, UPDATE, SELECT queries can be performed without any concern but DELETE operation might be something we should think twice. DELETE operation might be hard delete and once data is deleted from the source we cannot capture it in data pipeline.

In this article, I will explain 3 ways to capture DELETE operation during the data extraction process.

Common Solution

If the source table in the database has update_timestamp then we can query the database based on what has changed since your last job run.

SELECT * FROM DB.table WHERE update_timestamp>last_run_time

In this case, you can extract soft delete(generally with _isdeleted=true flag) operation but we cannot capture hard delete operation.

Also, if update_timestamp is not indexed, running the above query can cause performance issues as well.

Trigger Table

If the database performs hard_delete operation then one approach can be set up triggers on delete in the database.

So if the hard delete operation performed in the table then the trigger will write down necessary information into another table. see below example

As you can see above, as soon as UPDATE & DELETE operation performed on the original table, it will cause the trigger to create the record in the trigger_table.

Now to get the whole picture of the data we just have to perform right join of the original table on trigger_table and that shall solve our problem of capturing hard delete.

Since now we have incremental data, we can update data in our data lake or data warehouse by applying changes from the original data source.

Use redo/bin logs

A most robust approach to capture hard delete operation would be using bin logs (redo logs). Basically database writes all the DB operations(INSERT, UPDATE, DELETE, SELECT) in the logs for various reasons. So Instead of bothering the database for the data, it is always good practice to use bin logs (redo logs).

With the help of bin logs (redo logs) data can be constructed and hence deleted data can be captured.

Debezium is the project that extracts data from various data sources in realtime using bin logs (redo logs).

Conclusion

Using bin logs /redo logs for capturing hard delete operation is always better to approach than creating a trigger. Various ETL tools use bin log to stream data changes from source to destination example (https://www.alooma.com/).

But if utilizing redo logs or bin logs is not possible then the trigger table might alternative for it.

I appreciate your time. I hope it was beneficial to read this article.

Thank you!

Tags: CDC, CDC to RDBMS, change data capture, realtime-replication

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

CDC(Change Data Capture) with Relational Databases

Introduction

Common Solution

Trigger Table

Use redo/bin logs

Conclusion

Leave a Reply Cancel reply

Recent Posts

Introduction

Common Solution

Trigger Table

Use redo/bin logs

Conclusion

Please Share This Share this content

You Might Also Like

Longest Substring Without Repeating Characters — Leetcode #3 ( Ruby Solution )

Remove LinkedList Elements – Leetcode 203

Spring Data JPA: How to bulk insert data

Leave a Reply Cancel reply

Share this content