svraousa
Contributor

Using Global Rules to get source and target commit timestamps

When we use the global rules AR_H_COMMIT_TIMESTAMP or AR_H_TIMESTAMP for the source DB commit timestamp, and datetime('now','localtime') for the target commit timestamp, the difference between these two timestamps in the BQ data is only a matter of seconds. In Change Processing Tuning we set 300 secs and 301 secs for the "Longer than" and "but less than" fields. Is this considered a latency of 5 minutes, and does it therefore show a difference of 5 minutes between batches going from Qlik to BQ?
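For reference, a minimal sketch of how that difference can be checked directly in BigQuery. The project, dataset, table and column names below (my_project.my_dataset.my_table, source_commit_ts, target_commit_ts) are placeholders for whatever the global rules actually populate, and the columns are assumed to land as TIMESTAMP.

```python
# Minimal sketch: measure the gap between the column populated from
# AR_H_COMMIT_TIMESTAMP (source commit) and the column populated from
# datetime('now','localtime') (target-side write time) in BigQuery.
# All names are placeholders; if the columns land as DATETIME rather than
# TIMESTAMP, use DATETIME_DIFF instead of TIMESTAMP_DIFF.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  TIMESTAMP_DIFF(target_commit_ts, source_commit_ts, SECOND) AS latency_seconds,
  COUNT(*) AS row_count
FROM `my_project.my_dataset.my_table`
WHERE target_commit_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY latency_seconds
ORDER BY latency_seconds DESC
"""

for row in client.query(sql).result():
    print(f"{row.latency_seconds:>6}s  {row.row_count} rows")
```

Note that datetime('now','localtime') is local time while the header columns are typically UTC, so any constant offset in the result may simply be the timezone difference.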

Also,"Is there a way we can setup a trigger that can kill over-running ingestion jobs to BQ"?

Example: let's say I have a Qlik Replicate task running every 5 minutes. If job 1 (at 10 AM) hasn't finished writing data to BQ before job 2 starts (at 10:05 AM), can we kill job 1 before job 2 starts?

 

Thanks,

Venkat

5 Replies
Heinvandenheuvel
Specialist III

Your interpretation of the latency seems correct.

'BQ' does not mean anything to the Qlik world in general. May we assume that's just a given name for the target DB?

Why are you stopping and starting tasks? Is this a CSV source? What is the source?

If you are concerned about a latency of 5 minutes or less, then you should leave the task(s) running.

Instead of worrying about how to work around 'over-running', you may want to step back and consider whether the multi-task design is the appropriate way to go. What is the high-level problem being solved?

Hein.

svraousa
Contributor
Author

Hello from Detroit, and thanks for the response, Hein.

The 1st part of our question was: if we have a built-in latency of 5 minutes based on the above settings for "Longer than" and "but less than", should the target DB reflect that latency as well, meaning 5 minutes plus the time Qlik Replicate took to write to the target? What we observe is that the target commit timestamp doesn't show the 5 minutes plus; it only shows the time it took to write to the target. The only thing is that the data is not reflected in the target until the 5 minutes have elapsed. So we wanted to confirm the expected behavior when using the above settings in an Apply Changes task.

2. This is a separate question from the above. When we have a scheduled task running, say, every 30 minutes, and for whatever reason a batch is not committed to the target for more than 30 minutes, it appears that a later batch committed on the source can appear in the target ahead of the older batch of transactions, thereby upsetting some required sequencing of events. Our team wanted to know how to prevent this from happening, whether by killing the job or by some other method. This is the higher-level problem.

We do have separate tasks: one for processing changes (a CDC task) and another for Full Load only, scheduled to run at intervals, if this is what is interpreted as a multi-task design. And if it isn't, what would be a recommended approach?

The source is SQL Server.

Heinvandenheuvel
Specialist III

Sorry, I read your base note too quickly, because it all sounded rather stupid. And on re-reading I do still think it is all rather silly.

So when a change comes in and is committed, it is sent from the SORTER engine to the TARGET_APPLY thread, with the transformations already applied. Thus indeed I would expect just a second, or a few seconds, between the COMMIT time and the added column with the transformation datetime('now','localtime'), as long as the SOURCE_CAPTURE thread can keep up with the change generation. Now, the TARGET_APPLY thread can hold up sending data to the target, typically only to potentially improve performance by collecting bigger batches of changes to apply, which causes target latency while collecting/waiting. If you want to measure the true latency to the target for inserts, you'll have to add a column on the target, unknown to Replicate, defaulting to the current time. For (non-merge) updates you'd have to use a trigger.
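A rough sketch of that "extra column, unknown to Replicate" idea for a BigQuery target, assuming column default expressions are available in the project; the table and column names are placeholders. BigQuery has no triggers, so the non-merge update case Hein mentions is not covered here.

```python
# Rough sketch: add a target-only column that defaults to the load time.
# Because Replicate does not know about the column, its INSERTs omit it and
# BigQuery fills in the default, recording when the row actually landed.
# Names are placeholders; verify that default value expressions are supported
# in your BigQuery project before relying on this.
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
ALTER TABLE `my_project.my_dataset.my_table`
  ADD COLUMN IF NOT EXISTS bq_loaded_at TIMESTAMP;

ALTER TABLE `my_project.my_dataset.my_table`
  ALTER COLUMN bq_loaded_at SET DEFAULT CURRENT_TIMESTAMP();
"""

client.query(ddl).result()  # runs both DDL statements as one script
```

Comparing bq_loaded_at with the AR_H_COMMIT_TIMESTAMP column then gives the true insert latency, independent of the transformation-time column.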

Over the years I've seen several folks trying to play clever tricks with the tuning parameters for other, non-intended purposes, like the 'update every 5 minutes' you seem to desire, but that's mostly been rather futile in the end. Good luck with it anyway!

>> We do have separate tasks - one for processing changes, a CDC task  and another for Full Load only  scheduled to run at intervals,

WHY? Do you not trust the CDC process? HOW do you switch back to CDC without loss?

And if you do not, then have no mercy. Just KILL -9 (Linux) or TASKKILL /F /PID (Windows).

Next, run the task again with FULL LOAD and CDC enabled. Never just FULL LOAD.
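If a hard process kill feels too blunt and Qlik Enterprise Manager is in the picture, a stop request through its REST API is a gentler option. The sketch below is only an outline under that assumption: the base path, login flow, header name and the action=stop parameter are recalled from the QEM REST API and should be verified against the documentation for your release; host, server and task names are placeholders.

```python
# Assumption-laden sketch: stop a Replicate task via the Qlik Enterprise Manager
# REST API instead of killing the Replicate process outright. Verify the exact
# endpoints against the QEM REST API documentation for your version.
import requests

QEM = "https://qem.example.com/attunityenterprisemanager/api/v1"  # placeholder host
SERVER = "my_replicate_server"   # server name as registered in QEM (placeholder)
TASK = "my_full_load_task"       # task name (placeholder)

# 1. Log in with Basic auth; QEM is expected to return a session token header.
login = requests.get(f"{QEM}/login", auth=("domain\\user", "password"))
login.raise_for_status()
session = {"EnterpriseManager.APISessionID": login.headers["EnterpriseManager.APISessionID"]}

# 2. Ask QEM to stop the task (the equivalent of "Stop Processing" in the console).
stop = requests.post(
    f"{QEM}/servers/{SERVER}/tasks/{TASK}",
    params={"action": "stop"},
    headers=session,
)
stop.raise_for_status()
print("Stop request accepted:", stop.status_code)
```

A scheduler wrapper could call this (or the hard kill as a last resort) whenever the previous run is still active at the next scheduled start.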

If you are still convinced that just running it as a scheduled task is a good idea, then kindly try to explain that to us so we can all learn. Yes, I'm being sarcastic. I can be, because I do not work for Qlik. Merrily retired. Just trying to help!

Cheers!

Hein.

svraousa
Contributor
Author

Hi Hein, thanks again for taking these questions and responding, even though you are retired, to help us understand how the Qlik tool works. Your contributions are always considered important to us in the Qlik Community.

Just to add some clarity to the questions: the Full Load only and CDC tasks have different sets of tables. Tables were included in the Full Load task based on 'not having PKs' or on 'limitations documented in the Release Notes'.

The CDC task (Apply Changes with SQL Merge and Cluster by PKs) and the Full Load task are both writing data to a BigQuery (BQ) target endpoint.

The primary motivation for asking these questions was to confirm or deny:

a. whether slot consumption increases because of the way SQL Merge works, deleting and inserting batches in the BigQuery target, and hence whether Store Changes is the way to go to reduce slot consumption (see the sketch after this list);

b. whether the data displayed in the target adds the 5 minutes of latency set by the tuning params [is it correct to assume from your response that these 5 minutes are used only to collect bigger batches, so data shows up in the target only after 5 minutes plus any additional time taken to commit, BUT the 5 minutes are not added to the target commit timestamp?];

c. if a Full Load only task running every 1 hour, say starting at 10 AM, isn't completed (writing data to BQ) before job 2 starts (at 11 AM), can we kill job 1 (the 10 AM run) before job 2 starts? From the response it looks like we can, using KILL -9 (Linux) or TASKKILL /F /PID (Windows).
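For (a), the slot cost of the MERGE statements can be read from BigQuery's own job metadata rather than guessed. A small sketch, assuming the jobs run in the `us` region and that the Replicate-issued statements show up with statement_type 'MERGE'; adjust the region qualifier and the date window as needed.

```python
# Sketch for question (a): total slot consumption of MERGE jobs over the last
# week, read from BigQuery's INFORMATION_SCHEMA.JOBS view. Adjust the region
# qualifier (`region-us`) and the time window to match your project.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  DATE(creation_time) AS run_date,
  COUNT(*) AS merge_jobs,
  SUM(total_slot_ms) / 1000 / 3600 AS slot_hours
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE statement_type = 'MERGE'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY run_date
ORDER BY run_date
"""

for row in client.query(sql).result():
    print(row.run_date, row.merge_jobs, round(row.slot_hours, 2))
```

Running the same query before and after switching a table to Store Changes would show whether the MERGE pattern is really what drives the slot usage.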

Thanks.

 

 

Heinvandenheuvel
Specialist III

@svraousa it's good to see you took my reply in good spirit. 

>> Full Load only and CDC tasks have different sets of tables.

Ah! That's critical information, and something that was not known to the readers unless I missed something. May we assume the full loads are scheduled with a good margin, i.e. they are expected to run for a few minutes or even half an hour, so re-scheduling every hour allows them to run two or more times longer than usual before hitting trouble? This had better be the case, because otherwise you would rarely have a stable target table state, as you'd be truncating the 'first' tables again soon after the 'last' ones were loaded. That feels wrong.

 

In this case it is tempting to 'kill' and HOPE that whatever made the last run get stuck or slow down has gone away, and that a re-run will finish in time.

If the schedule is 'tight', normally leaving only minutes to spare, and the task over-runs a bit, then it would seem a shame not to allow it to finish and have that dataset be available.

I think you should just test (with a test table, and maybe re-scheduling in minutes) what happens when the Replicate scheduler finds a task still running. Just see how it reacts and what shows up in the logs. I expect the re-activation while active to fail. You want some operational method in place to detect that and decide whether it was 'bad luck' or indicative of a problem which needs to be addressed.

Hein.