I have been investigating high latency in my solution and it seems to be caused by a specific, high-activity table (RESB in the SAP database).
I noticed this message in the log and wondered if it could point me to a more specific cause that I could investigate:
00018076: 2022-12-28T09:39:44 [TARGET_APPLY ]T: Finish bulk because stream timeout (bulk_apply.c:1012)
Thanks
Martin
Hi @Martin11
Not that I am aware of. I encourage you to submit a feature request here if this is something you would like to see in the product: https://community.qlik.com/t5/Ideas/idb-p/qlik-ideas
Thanks,
Dana
Thanks for all the great info, Hein!
I'm going to wait for a time when the latencies diverge and provide the documentation that you suggest.
Thanks,
Martin
Hi Hein
Apologies for the delay.
I can confirm that these tasks are using the same source and target systems.
We have multiple tasks in an effort to troubleshoot / isolate latency issues.
I have attached a file with the JSON of the 'good' and 'bad' tasks as well as their respective logs for roughly a 15-minute timeframe. During this time the latency of the 'bad' task was 7+ hours and the 'good' task was under 5 minutes.
Thanks,
Martin
Thanks for the detailed info.
At first glance this is strictly volume related: the source change reading for the RESB table cannot keep up.
The 'bad' task runs at 1000+ changes/sec, frequently running out of its default 500 MB memory for batches of 72,000+ changes.
The 'good' task runs at 70 changes/sec, never running out of its 1800 MB 'tuned' memory, with no more than 2500 changes/batch.
Summary of TARGET_APPLY log lines for BAD:
Task <bad__PE3-3 RESB_to_QLIK.txt> from 01/03 11:34:38 thru 01/03 11:50:01. (923 seconds)
Total 37 Batches in log, largest 72508, longest duration 60
1125.1 changes/second overall. 1038481 total, gap 63 @ 01/03 11:48:08. 35 Statement not found.
no_bulk:0, no pk:0, 69 applies, 0 singles, max 36254. noPK-time=0
35 truncates for attrep_changes tables taking 0 seconds
"Start Handling Table" took 17 times more than 1 second. Total 86 seconds
LATENCY| Begin| End| Min| Max| Net|
-------|-------|-------|-------|-------|-------|
Source | 29490| 29761| 29490| 29770| 271|
Target | 285| 418| 0| 525| 133|
Elapsed| | | | | 923|
Summary of TARGET_APPLY log lines for GOOD:
Task <good__PE3_to_QLIK_NOSQL.txt> from 01/03 11:34:00 thru 01/03 11:49:45. (945 seconds)
Total 158 Batches in log, largest 2504, longest duration 13
72.8 changes/second overall. 68767 total, gap 12 @ 01/03 11:36:30. 4339 Statement not found.
no_bulk:0, no pk:0, 8953 applies, 3396 singles, max 509. noPK-time=0
157 truncates for attrep_changes tables taking 4 seconds
LATENCY| Begin| End| Min| Max| Net|
-------|-------|-------|-------|-------|-------|
Source | 4| 7| 2| 22| 3|
Target | 4| 7| 0| 9| 3|
Elapsed| | | | | 945|
Mind you, I don't think giving the 'bad' task more memory will help it significantly, as the delay appears to be in gathering the changes.
Another way to see the source reader delay is to look in the log at the commit timestamp of the change that is currently being read:
GOOD: 03T11:39:11 [SOURCE_CAPTURE ]T: > txn_id: 00065B58071E, timestamp: 2023-01-03 16:39:08
BAD: 03T11:39:11 [SOURCE_CAPTURE ]T: > txn_id: 00065AD8B438, timestamp: 2023-01-03 08:24:56
GOOD shows 5 hours and 3 seconds difference between the Replicate log time and the source timestamp: surely 5 hours due to the time zone and 3 seconds of latency. BAD shows a timestamp 8 hours 15 minutes earlier than GOOD, matching the latency reported by 03T11:39:01 [PERFORMANCE ]T: Source latency 29656: 29656/3600 = 8.24 hours.
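If it helps, here is a minimal sketch of that arithmetic (my own illustration, not Replicate tooling): it adds an assumed 5-hour clock offset to the Replicate log time and subtracts the committed-change timestamp. The offset and the full dates are assumptions you would adjust for your own servers.

from datetime import datetime, timedelta

TZ_OFFSET = timedelta(hours=5)  # assumed Replicate-vs-source clock offset

def source_lag(log_time: str, source_ts: str) -> timedelta:
    """log_time: time of the Replicate log line; source_ts: committed-change timestamp."""
    log_dt = datetime.strptime(log_time, "%Y-%m-%dT%H:%M:%S")
    src_dt = datetime.strptime(source_ts, "%Y-%m-%d %H:%M:%S")
    return (log_dt + TZ_OFFSET) - src_dt  # lag left over after removing the time-zone offset

# GOOD: ~3 seconds; BAD: ~8h14m, matching the PERFORMANCE source latency of 29656 seconds.
print(source_lag("2023-01-03T11:39:11", "2023-01-03 16:39:08"))
print(source_lag("2023-01-03T11:39:11", "2023-01-03 08:24:56"))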
On the SOURCE_CAPTURE side, the BAD task, which has no 'stream_buffer' tuning, reports 1127 'FULL STREAM' messages, whereas the GOOD task, which has 'stream_buffer' tuning, never reports that. Mind you, considering the frequency and the sub-second 'recovery' time, I again do not expect a significant improvement if those FULL STREAM events are tuned away, but one should attempt to tune that just in case, so to speak.
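If you want to compare the tuned and untuned tasks yourself, a simple count of those messages per log file is enough. This is just a quick sketch of mine; the exact message text is assumed from the log excerpt, so adjust the pattern if your logs word it differently.

import re
import sys

FULL_STREAM = re.compile(r"SOURCE_CAPTURE.*FULL STREAM")  # message text assumed from the log excerpt

def count_full_stream(path: str) -> int:
    """Count SOURCE_CAPTURE lines that mention FULL STREAM in one task log."""
    with open(path, errors="replace") as fh:
        return sum(1 for line in fh if FULL_STREAM.search(line))

if __name__ == "__main__":
    for log in sys.argv[1:]:
        print(f"{log}: {count_full_stream(log)} FULL STREAM messages")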
Could the source server have 'punished' (Linux NICE) the source reader task/thread for the RESB task because it was consuming disproportionate resources?
I think it is worth your while to check out LOG_STREAM so that you have one, well-tuned, source reader task; you can still keep your 'control' with multiple slave tasks reading the log stream, each selecting its own tables.
I would also be tempted to run the RESB (aka BAD) task in isolation, with no other source change log reading and with a NULL target, to see what its absolute maximum change throughput is.
hth,
Hein.
Hi Hein,
Some questions:
1. Can you give a brief explanation of how you created the charts for the Source and Target rows?
2. The logs uploaded for this thread contain the message "[SOURCE_CAPTURE ]T: Transactions in progress counter is negative". Does this indicate a problem?
3. GOOD shows 5 hours and 3 seconds difference between the Replicate log time and the source timestamp: surely 5 hours due to the time zone and 3 seconds of latency. BAD shows a timestamp 8 hours 15 minutes earlier than GOOD, matching the latency reported by 03T11:39:01 [PERFORMANCE ]T: Source latency 29656: 29656/3600 = 8.24 hours.
Since both tasks are reading the same source, wouldn't they both have the same time-zone offset?
Thanks,
Martin
>> 1. Can you give a brief explanation of how you created the charts for the Source and Target rows?
Just to be clear, I think you know this, those are LATENCY numbers, not row numbers. They come from a (Perl) script parsing the [PERFORMANCE] lines. I'll attach the script; a rough Python equivalent is sketched below. It also creates the other summary statements.
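Not the attached Perl script, just a rough Python sketch of the same idea, assuming the PERFORMANCE lines carry 'Source latency <seconds>' and a corresponding 'Target latency' figure; adjust the two patterns to the exact wording in your logs.

import re
import sys

# Latency figures assumed to appear on [PERFORMANCE] lines; adjust to your log wording.
SRC = re.compile(r"Source latency\s+(\d+)")
TGT = re.compile(r"Target latency\s+(\d+)")

def summarize(path: str) -> None:
    """Print Begin/End/Min/Max/Net latency per side, like the tables above."""
    src, tgt = [], []
    with open(path, errors="replace") as fh:
        for line in fh:
            if "[PERFORMANCE" not in line:
                continue
            m = SRC.search(line)
            if m:
                src.append(int(m.group(1)))
            m = TGT.search(line)
            if m:
                tgt.append(int(m.group(1)))
    for name, vals in (("Source", src), ("Target", tgt)):
        if vals:
            print(f"{name:7}| Begin {vals[0]:6}| End {vals[-1]:6}| "
                  f"Min {min(vals):6}| Max {max(vals):6}| Net {vals[-1] - vals[0]:6}|")

if __name__ == "__main__":
    for log in sys.argv[1:]:
        print(f"== {log} ==")
        summarize(log)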
>> 2. The logs uploaded for this thread contain the message "[SOURCE_CAPTURE ]T: Transactions in progress counter is negative".
I noticed those but have no good explanation. It feels like a 'start by timestamp' failing to 'wait for transaction consistency'. In that case the source reader could easily, in fact is guaranteed to, see more commits than transactions created, and thus the count would go negative. I don't recall having seen other source drivers report this. I don't think they are critical warnings, just 'debug' observations.
>> 3. wouldn't they both have the same offset of time zone ?
Yes, that's why I wrote "BAD shows 8 hours 15 minutes earlier than GOOD", so relative to one another, both being 5 hours away from the Replicate server (can you confirm that?).
@Martin11
Hi Martin,
Just wanted to get your input/suggestions on this.
Were you able to solve the issue?
We are facing the same issue on the same table (RESB) due to high transaction volume on the source.
Latency is reaching over 24 hours in my case.
Thanks,
Kiran
@kishetty03 see Hein's comment #20782345
Hi Kiran,
We still have sporadically high latency on our RESB table. We have 5 tasks all pointed to the same db source, and sometimes they are all at about the same latency, but often RESB (which is in its own dedicated task) is several hours latent while the other 4 tasks are near zero or at much lower latencies.
Martin
@Martin11 try enabling verbose logging on the source and target to see which side is slowing things down, the source or the target, and then narrow it down to the exact cause of the slowness.