Hello guys!
I'm having issues with Qlik Replicate and would like to get some insights on it.
We have an on-premises SQL Server (MS-CDC) configured as the source, and the target is an Azure Hyperscale database.
We have a job scheduled for a full load at 2 AM, which executes flawlessly and is followed by the CDC activation.
The issue is that at around 6 AM Qlik Replicate stops applying any changes to the target, accumulating them on disk while the Azure SQL DB CPU sits at 100% non-stop.
Below are a few images:
This Azure DB is dedicated to this Qlik Replicate task; you can clearly see the 2 AM full load, and from 6 AM this non-stop CPU utilization where changes are no longer applied.
Replicate side:
It happens every day around 6 AM.
On the source side, a few jobs run on the server between 2 AM and 6 AM, but could these be responsible for breaking the task? Is there any kind of configuration that could be made to bypass this? I know that some tables are truncated and recreated on the source side (if that helps in understanding the root cause).
There are no logs or messages that shed light on it; the task seems to be running normally, but something here might be causing issues on the Azure side, which stops receiving CDC changes.
Cheers
Seems to me you are misinterpreting what you are seeing.
>> Replicate stops applying any changes to the Target.
What makes you say that? It did not stop! It is just waiting for the target to be done before it adds the next change request - but the target is never going to finish that change, doing its loopydyloop.
>> but would this be the responsible for breaking the Task?
The task is not broken. The target fails to perform the requested applies. It is trying... but spinning its wheels. Replicate gave it work and it never responded. Your debug task is to enable detailed (verbose) TARGET_APPLY logging in Replicate to hopefully catch what it is asking the DB to do, and at the same time ask the DB what the top-CPU active query is. Do those two match, and does that query give a clue as to why it never finishes?
Can you run that query in isolation (after taking a copy of the CSV file or attrep_changes table, perhaps)? I'm not saying Replicate might not be doing something wrong, generating a wrong query, but for now Replicate is right to wait for the DB to finish what it was asked to do.
>> the task seems to be running normally,
Correct.
>> Azure issues which stop receiving CDC changes.
It didn't stop receiving - it failed to finish the last request as it is still working on it.
>> ? I know that some tables are truncated and recreated
That might be trivial on the source, but from the graphs you provided it looks like 40M changes. At, for example (just a WAG), 10,000 changes/second that would be 4,000 seconds = 1+ hour of work. Not instant, but not the 4+ hours you appear to see either.
Good luck,
Hein
Hello Hein
Thanks for the reply. I've looked a bit deeper into Azure, and it seems this query is responsible for the whole thing:
Speaking with the client, this table's content is deleted and re-inserted every morning, so there is a DELETE FROM / INSERT INTO statement - and the target probably got stuck at the DELETE FROM statement.
Does this help?
Since it only happens at 6 AM, I will schedule a full load earlier in the morning with verbose logging enabled to see if we get more information. Will keep in touch.
Thanks again,
Matte
No need for a verbose log from Replicate - you have the query causing the issue from the DB. Good work.
It appears that the table in question, [dbo].[fact_devicereadings], does NOT have a proper primary key - only a unique key allowing NULL. I suspect the DB was doing a tablescan for each change, because you need a tablescan to find a NULL value, as there are no index entries for NULL.
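To illustrate the point, here's a minimal sketch using SQLite as a stand-in for Azure SQL (a simplified version of the table, with column names taken from the query below; the exact plan wording is SQLite's, not SQL Server's):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Simplified stand-in for [dbo].[fact_devicereadings], initially without any key
cur.execute(
    "CREATE TABLE fact_devicereadings (ReadingType TEXT, DeviceKey TEXT, ID INTEGER)"
)

# Without a usable key, locating the row for each applied change is a full scan
plan_no_key = cur.execute(
    "EXPLAIN QUERY PLAN DELETE FROM fact_devicereadings WHERE ID = ?", (42,)
).fetchall()
print(plan_no_key[0][3])   # plan detail reads 'SCAN ...'

# With a proper (indexed, NOT NULL) key, the same lookup becomes an index seek
cur.execute("CREATE UNIQUE INDEX pk_id ON fact_devicereadings (ID)")
plan_with_key = cur.execute(
    "EXPLAIN QUERY PLAN DELETE FROM fact_devicereadings WHERE ID = ?", (42,)
).fetchall()
print(plan_with_key[0][3])  # plan detail reads 'SEARCH ...'
```

Same idea on the real target: without a seekable key, every per-row change the task applies pays a full scan of the table.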
Just clone this table in DEV, fill it with (real) sample values, and test. Set the CDC task setting "Limit the number of changes applied per change processing statement to" to a low number like 100 or 1000, see how long each chunk takes in the reptask log (set to TARGET_APPLY DEBUG), and compare with the target DB's top SQL statistics.
You wrote "tables are truncated" - was it really a proper TRUNCATE, or perhaps a full/partial table DELETE? Can they use a TRUNCATE on the source if a DELETE was being used? (Maybe this is a consequence of using MS-CDC - I don't know.)
Can they change to a real primary key?
As it is, you may want to exclude the table from the CDC task and stick it into a full-load-only task to be run after the source-side daily reload (at 6 AM).
Hope this helps some,
Hein.
Hello Hein
They have added an ID as primary key to the table. I've also isolated it in a single task for better troubleshooting, but still, whenever it gets to the reload part of the table, the latency builds up and nothing happens.
But now the query using the CPU changed from a DELETE to an INSERT:
(@P1 int,@P2 int)
INSERT INTO [dbo].[fact_devicereadings] ([ReadingType],[DateTime],[DeviceKey],[Value],[DateKey],[TimeKey],[ID])
SELECT CAST([attrep_changes391EBA80F118EA57].[col1] as nvarchar(4000)) COLLATE Latin1_General_CI_AS,
       CAST([attrep_changes391EBA80F118EA57].[col2] as datetime2(7)),
       CAST([attrep_changes391EBA80F118EA57].[col3] as uniqueidentifier),
       CAST([attrep_changes391EBA80F118EA57].[col4] as float),
       CAST([attrep_changes391EBA80F118EA57].[col5] as int),
       CAST([attrep_changes391EBA80F118EA57].[col6] as int),
       CAST([attrep_changes391EBA80F118EA57].[col7] as int)
FROM [attrep_changes391EBA80F118EA57]
WHERE [attrep_changes391EBA80F118EA57].[seq] >= @P1 and [attrep_changes391EBA80F118EA57].[seq] <= @P2
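For what it's worth, the pattern in that statement - a bulk INSERT ... SELECT out of the attrep_changes net-changes table, bounded by a seq range per chunk - can be mimicked like this (SQLite stand-in; table and column names simplified, chunk size assumed):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE fact_devicereadings (ID INTEGER PRIMARY KEY, Value REAL)")
# Simplified stand-in for the attrep_changes... net-changes staging table
cur.execute("CREATE TABLE attrep_changes (seq INTEGER, col1 INTEGER, col2 REAL)")
cur.executemany("INSERT INTO attrep_changes VALUES (?, ?, ?)",
                [(i, i, i * 1.5) for i in range(1, 10001)])

# Apply changes in bounded seq chunks, like "Limit the number of changes
# applied per change processing statement to" set to 1000
chunk = 1000
lo = 1
while True:
    hi = lo + chunk - 1
    cur.execute(
        "INSERT INTO fact_devicereadings (ID, Value) "
        "SELECT col1, col2 FROM attrep_changes WHERE seq >= ? AND seq <= ?",
        (lo, hi))
    if cur.rowcount < 1:   # empty chunk: all staged changes applied
        break
    lo = hi + 1

total = cur.execute("SELECT COUNT(*) FROM fact_devicereadings").fetchone()[0]
print(total)  # 10000
```

Lowering that chunk size, as suggested above, makes each statement cheaper and lets you time individual chunks in the TARGET_APPLY log.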
They are not using a TRUNCATE on the source but a DELETE statement (would a TRUNCATE work with MS-CDC? I thought it was one of the limitations).
Would it be a solution to schedule a reload for this table after they run their delete/insert on the source?
Kind regards
I've separated the table into another task, and apparently it ran alright this morning. The primary key might have solved the problem? Let me add the table back to the regular task and wait for the next morning; will keep you posted and mark the answer. Weirdly enough, on Friday we still had the issue AND the primary key.
>> They are not using a TRUNCATE on the source, but a DELETE statement (would a Truncate work with MS-CDC? I thought it was one of the limitations)
Right. That's what I thought. The DELETE will come in and is treated as one delete per row. Without a PK this is a tablescan per row, which is very slow. With a PK it is still slow, but better.
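The difference is easy to reproduce. A rough sketch (SQLite as a stand-in; row count and timings are illustrative only), replaying a source-side DELETE as one target DELETE per row, with and without a key:

```python
import sqlite3
import time

def replay_row_deletes(indexed: bool) -> float:
    """Replay a bulk source DELETE as per-row deletes; return elapsed seconds."""
    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    cur.execute("CREATE TABLE t (k INTEGER, v REAL)")
    cur.executemany("INSERT INTO t VALUES (?, ?)",
                    [(i, float(i)) for i in range(5000)])
    if indexed:
        cur.execute("CREATE UNIQUE INDEX t_k ON t (k)")  # stand-in for a real PK
    start = time.perf_counter()
    # Each captured row arrives as its own DELETE on the target
    for i in range(5000):
        cur.execute("DELETE FROM t WHERE k = ?", (i,))
    return time.perf_counter() - start

t_scan = replay_row_deletes(indexed=False)  # tablescan per row
t_seek = replay_row_deletes(indexed=True)   # index seek per row
print(f"no key: {t_scan:.3f}s  with key: {t_seek:.3f}s")
```

The unindexed version pays a shrinking-but-repeated scan for every single row, which is the quadratic behavior behind a 3-hour delete on a large table.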
Indeed, TRUNCATE may well be an MS-CDC restriction. I don't use/understand MS-CDC, so I don't know for a fact.
Another thing to consider is to pre-delete (or indeed truncate) the rows on the target manually (scripted, of course). You could possibly even do this on the fly when you see it is 'stuck', to release it out of its misery; those tablescans will still happen, but the table will be empty. Much easier :-).
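A sketch of that pre-delete idea (again SQLite as a stand-in, names assumed): empty the target first, and the replayed per-row deletes then scan an empty table and finish almost instantly:

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Worst case: target table has no key at all
cur.execute("CREATE TABLE fact_devicereadings (ID INTEGER, Value REAL)")
cur.executemany("INSERT INTO fact_devicereadings VALUES (?, ?)",
                [(i, float(i)) for i in range(5000)])

# Scripted pre-delete of the target, run before the CDC delete storm arrives
# (on SQL Server this could be a TRUNCATE; SQLite only has DELETE)
cur.execute("DELETE FROM fact_devicereadings")

start = time.perf_counter()
for i in range(5000):  # the replayed per-row deletes now scan an empty table
    cur.execute("DELETE FROM fact_devicereadings WHERE ID = ?", (i,))
elapsed = time.perf_counter() - start
print(f"replayed deletes against empty table: {elapsed:.3f}s")
```

Against the real target you would script this via your usual SQL tooling; the point is just that scanning an empty table costs next to nothing.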
Hein.
Perfectly put. I've checked the isolated table today, and it does indeed seem to be a slow delete even with the PK in place: it takes about 3 hours to scan and delete the 30M rows. I'm going to tweak the configs a bit and maybe schedule a reload for those DELETE/INSERT tables from the task, or maybe schedule a script as you proposed!
Anyway, this issue seems to be resolved. Thank you for the help and input on the matter.
Kind Regards!