When we build a data pipeline the first step in the process is to identify the data source and how to extract that data in order to process.
During data extraction INSERT, UPDATE, SELECT queries can be performed without any concern but DELETE operation might be something we should think twice. DELETE operation might be hard delete and once data is deleted from the source we cannot capture it in data pipeline.
In this article, I will explain 3 ways to capture DELETE operation during the data extraction process.
If the source table in the database has update_timestamp then we can query the database based on what has changed since your last job run.
SELECT * FROM DB.table WHERE update_timestamp>last_run_time
In this case, you can extract soft delete(generally with _isdeleted=true flag) operation but we cannot capture hard delete operation.
Also, if update_timestamp is not indexed, running the above query can cause performance issues as well.
If the database performs hard_delete operation then one approach can be set up triggers on delete in the database.
So if the hard delete operation performed in the table then the trigger will write down necessary information into another table. see below example
As you can see above, as soon as UPDATE & DELETE operation performed on the original table, it will cause the trigger to create the record in the trigger_table.
Now to get the whole picture of the data we just have to perform right join of the original table on trigger_table and that shall solve our problem of capturing hard delete.
Since now we have incremental data, we can update data in our data lake or data warehouse by applying changes from the original data source.
Use redo/bin logs
A most robust approach to capture hard delete operation would be using bin logs (redo logs). Basically database writes all the DB operations(INSERT, UPDATE, DELETE, SELECT) in the logs for various reasons. So Instead of bothering the database for the data, it is always good practice to use bin logs (redo logs).
With the help of bin logs (redo logs) data can be constructed and hence deleted data can be captured.
Debezium is the project that extracts data from various data sources in realtime using bin logs (redo logs).
Using bin logs /redo logs for capturing hard delete operation is always better to approach than creating a trigger. Various ETL tools use bin log to stream data changes from source to destination example (https://www.alooma.com/).
But if utilizing redo logs or bin logs is not possible then the trigger table might alternative for it.
I appreciate your time. I hope it was beneficial to read this article.