Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hello everyone,
I want to connect to Oracle Database and monitor the changes on a source table and need to capture the changes every 5 mins and put the updates and inserts in a separate mysql tables. I also want to do a few data quality checks in between too. What is the best CDC approach for this? Trigger Based or Log based? And can someone show me an sample example of the dataflow.
Thanks in advance.
Hi Pras1258,
Since it is a trigger, the data will be moving to the interim table as soon as it is loaded into source database. You will not be able to control that aspect. Also if we are trying to fetch the change data from source every 10 minutes, we are actually diluting the core essence of CDC.
For example, any change to a record which might have happened in between those 10 minutes will not be recorded in CDC table.
If you want to poll the source tables, every 10 minutes, a good idea will be to use tDBInput and add a where clause to pick only the interval data.
Coming to your second query, if the data is not extracted from interim table by tOracleCDC component, it will remain there in the interim table. Another scenario is that you have extracted from stage table using that component but your job failed before loading to target. In that case, better to extract and keep to a temp file as first stage so that you can refer that file to restart the job rather than extracting all the way from source (by default, CDC interim table will be automatically cleared once you extract data using tOracleCDC). You can go to advanced option and select the tick box for Keep data in CDC table. But in that has, the onus of deleting data will be on you.
Warm Regards,
Nikhil Thampi
Thanks for the reply Nikhil.
My company maintains Oracle 10g and 11g databases. The tables I am watching contains around half a million rows of data and I want to capture the changes once every 15 minutes only. My DBA wants us to make sure there is no impact on the performance of the database. At this point, what would be your suggestion? Log based or Trigger based?
Do we use some sort of a scheduler to bring the Inserts,updates once very 15 mins or does have any functionality for that?
Ok. Let me reconfirm this. So, the CDC monitors the changes to the source table and published in a change table. So the change table wants to get the changes once every 15 mins. Can you provide me a sample ETL job to help me on this? That would be awesome.
Hi,
Unfortunately I do not have a Talend code handy at this point of time. But I have done Trigger based CDC sometime back for a customer and it worked fine. The steps mentioned in the Talend document is sufficient to create sample jobs.
Again I would like to give a word of caution that ultimately the amount of change data in underlying database will determine the performance of CDC. So I would strongly recommend to do performance tests before going to production.
Warm Regards,
Nikhil Thampi
Thanks Nikhil.
Can I slow down the CDC Triggers to fetch the data from source only every 10 mins and then load that modified data into a Staging Database and then I want to push that data into a target data warehouse?
What would you do if data didn't successfully load from stage to target but it made from source to staging? What would you do to only move from stage to target when such situation occurs.
Hi Pras1258,
Since it is a trigger, the data will be moving to the interim table as soon as it is loaded into source database. You will not be able to control that aspect. Also if we are trying to fetch the change data from source every 10 minutes, we are actually diluting the core essence of CDC.
For example, any change to a record which might have happened in between those 10 minutes will not be recorded in CDC table.
If you want to poll the source tables, every 10 minutes, a good idea will be to use tDBInput and add a where clause to pick only the interval data.
Coming to your second query, if the data is not extracted from interim table by tOracleCDC component, it will remain there in the interim table. Another scenario is that you have extracted from stage table using that component but your job failed before loading to target. In that case, better to extract and keep to a temp file as first stage so that you can refer that file to restart the job rather than extracting all the way from source (by default, CDC interim table will be automatically cleared once you extract data using tOracleCDC). You can go to advanced option and select the tick box for Keep data in CDC table. But in that has, the onus of deleting data will be on you.
Warm Regards,
Nikhil Thampi