
STT - Troubleshooting Qlik Replicate Latency

Last Update:

Dec 21, 2022 7:42:46 AM

Updated By:

Troy_Raney

Created date:

Dec 21, 2022 7:42:46 AM


Environment

  • Qlik Replicate

 

Transcript

Hello everyone, and welcome to this special Techspert Talks session. I'm Troy Raney, and I'll be your host. Today's presentation is Troubleshooting Qlik Replicate Latency with our own Kelly Hobson. Kelly, why don't you tell us a little bit about yourself?
Hey Troy. Thank you. My name is Kelly Hobson and I am a Tech Support Engineer at Qlik. I've been here for 1 year now and I support Qlik Replicate which is a popular tool in our Data Integration suite, and I also support Qlik AutoML.
Great. All right, and today we're going to be taking a look at Replicate specifically; and we're going to be talking about latency, and what that means; go into the details of some best practices; how to troubleshoot it; and we're definitely going to walk through a demo. So, Kelly, for those of us who aren't that familiar, what is Qlik Replicate and how does it work?
Sure. So, Qlik Replicate is a tool for moving data from point A to point B.
Right.
It's a very powerful tool for data replication and streaming data across a wide variety of Source and Target endpoints. The main functionality comes in two parts: the full load, which queries the data directly from the Source and brings it to the Target; and then our Change Data Capture or CDC technology, which remotely scans transaction logs to bring changes over in real time. That CDC capability is the real bread and butter of the tool.
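As a loose illustration only, not how Replicate actually implements it, the two phases roughly correspond to the following SQL against a hypothetical source table: the full load is comparable to a one-time bulk read of the table's current contents, while CDC corresponds to the stream of committed changes captured from the transaction log afterwards.

  -- Full load: a one-time bulk copy of the table's current state
  SELECT * FROM my_schema.my_table;

  -- CDC: the committed changes that happen afterwards, read from the
  -- transaction log and replayed on the target in order, for example:
  INSERT INTO my_schema.my_table (id, val) VALUES (1, 'new');
  UPDATE my_schema.my_table SET val = 'changed' WHERE id = 1;
  DELETE FROM my_schema.my_table WHERE id = 1;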
Right. That's like the power. It keeps everything up to date and live, and I guess that brings us into the topic for today. So, how would you define latency then?
Sure. When the data is behind expectation or schedule, that's when we encounter what we call Latency.
Right.
And on a technical level, the overall latency is defined as the time gap between when a change is committed on the Source and when it's visible and live on your target database.
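As a rough way to picture that definition, and purely as an illustration rather than a Replicate feature, you could commit a timestamped heartbeat row on the source and compare its commit time with the clock on the target once it arrives. The heartbeat table and its columns below are hypothetical, and the check assumes the two database clocks are reasonably in sync.

  -- On the source (PostgreSQL syntax): commit a timestamped heartbeat row
  INSERT INTO public.heartbeat (id, committed_at) VALUES (1, now());

  -- On the target (Snowflake syntax), once the change has been applied:
  SELECT DATEDIFF('second', committed_at, current_timestamp) AS end_to_end_latency_seconds
  FROM public.heartbeat
  WHERE id = 1;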
Okay, and in your experience, how does this Latency tend to pop-up for users? Are there different types?
Sure. Latency can come from the Source, from the handling side, or from the Target. The first type is Source Latency: the gap in seconds between when the Source database writes the event to its transaction log and when Replicate captures that change. You can think of it like air travel: if you're sitting in the airport and your plane isn't there yet, there's a delay on the Source side, the plane getting to you, before you can continue your journey. So, for example, let's say you have an Oracle Source database, and the read speeds from Oracle are low, whether because of network speed or just the distance to Oracle; that slowdown comes from the Source side.
Okay, where else can there be Latency?
The next type is Handling Latency; think of this as the processing. Going back to the air travel example, this might be delays in the time to board or to load all the bags, or delays in getting off the ground and going again.
Okay.
On the Replicate side, that's delays with our sorter. If you have many changes building up on disk and it's having a hard time keeping up with those changes, then it gets bogged down in the processing that happens within the Replicate engine.
Okay. Is there another type of Latency?
The last type is Target Latency. This is any delay in getting the data from the processing stage to the Target, and you can think of the overall Target Latency as the Source Latency plus the Handling Latency. Here you can see that in the Performance Trace logging: the Target latency of 2.35 seconds is the sum of the Handling and the Source Latency.
Right. Because ideally that would be almost instant, but when things slow down, that's where Latency comes from. So, I understand you've got a demo set up for us. What type of environment are we going to be looking at?
Sure. This is Replicate 2022.5.0.652.
Okay.
And here we have a task that's configured with a Postgres Source and a Snowflake Target endpoint, with Full Load and CDC enabled for one table called "public.results."
Right. And can we take a quick look at that table so we can have an idea what it is?
Sure. What I've run so far is this Create statement. Not a huge table, simple. On the task, I've run a reload of the table.
Okay.
So, if we go to the Monitor, the full load completed. There are no records in it yet.
Right.
And then when I resume the task, that's when it'll be ready to start processing changes.
Okay. We've seen through PgAdmin that we've got a clean table with a few columns and no records in it, but the full load's already completed.
Right.
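The recording doesn't show the actual Create statement, but as a minimal sketch, a small table like public.results on the PostgreSQL source might look something like this; the column names here are placeholders, not the ones used in the demo.

  CREATE TABLE public.results (
      id         integer PRIMARY KEY,
      label      varchar(50),
      created_at timestamp DEFAULT now()
  );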
Now, it looks like the task is not running, right?
That's right. I can go ahead and resume the task.
Okay. Here we can see that the Change Data Capture is waiting for changes, but right now there's nothing happening. Can we take a look at those task settings?
The first thing: the full load setting is DROP and CREATE table; that's why it was recreated on the Target. But the most important thing is within Change Processing > Change Processing Tuning: we are in Batch Optimized Apply, which is what we recommend. What I've done, though, is adjust the tuning parameters to make the batches very small, so it's having to make a lot of trips back and forth; and I've also shrunk down the Total Transaction Memory, so that when a big group of changes comes in, it sends them immediately to disk to then be processed.
You’ve really choked it down. And this is (I guess) the opposite of best practices, this is like worst case practices, right?
Yeah, worst case; you're really restricting that in-memory capability. When it's doing well, it's able to keep things in memory and move them quickly.
Right. So, these are settings set to keep it out of memory; just constantly ship tiny packages multiple times a second. And this is all set up this way, not because it's a good idea, but so we can see some serious latency, right?
And then the other one is on this Advanced tab: the Max File Size is set to 1 MB, which means those CSV files are being sent to Snowflake one at a time at that size.
Okay. So, it's going to be looking for super small changes on the Source side, sending over small packages often; and on the Target side, it's only going to bring over tiny packages as well, so –
That's right.
Everything is really choked. What kind of changes have you prepared to put pressure on this bottleneck?
And so, I'm going to switch over to PgAdmin for our Source database.
Okay.
Now, we're going to insert 10,000 changes.
All right.
And this will get picked up by Replicate.
So, 10,000 inserts to the Source side. Let's see how Replicate handles that.
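The exact insert statement isn't shown, but one common way to generate a batch of test rows on a PostgreSQL source is generate_series. This sketch assumes the placeholder columns from the earlier example, and the same pattern scales to the later 1 million and 3 million row tests.

  -- Insert 10,000 test rows in a single statement
  INSERT INTO public.results (id, label)
  SELECT g, 'row ' || g
  FROM generate_series(1, 10000) AS g;

  -- For the larger tests, just widen the range, e.g. another million rows:
  -- INSERT INTO public.results (id, label)
  -- SELECT g, 'row ' || g FROM generate_series(10001, 1010000) AS g;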
If I come to Apply Changes, it handled that 10,000 fairly well. So, now I want to bump this up to a million.
Okay. So, now we're going to be adding an extra million rows.
Okay, and we'll see here in a moment –
Those changes should be coming in. There they are.
And we see that it's accumulating on disk at the moment, and then it will start applying those changes. Here we can see this Apply Latency starting to creep up; it's calculating the amount of time that it's now behind in getting that data to the Target. And the Apply Latency, again, is kind of the combination of the Source and the Target Latency.
Okay.
Under this Apply Changes, this is a good way to watch those changes coming through.
It's ticking up a bit.
I also want to switch to the Apply Throughput tab. This will show you the Source and Target throughput. The Source spiked right at the beginning to bring it onto disk; the Target throughput is lower.
So, ideally you would want to see those two graphs between Source and Target to kind of mirror each other and be almost the same?
It depends on the customer's settings.
Yeah?
Potentially with Professional Services involvement, this throughput can be tuned to get it up to what you need.
Okay.
If you have millions and millions of changes coming through, then you might need a beefy server to be able to handle high throughput. So, now I'll switch over to this Apply Latency tab. This Target value is continually creeping up as it's getting farther behind; but you can see Source is staying low, because we didn't have any latency grabbing it from postgres.
Okay. So, if someone has the situation or they suspect there could be some latency in their system; how could someone investigate or start to troubleshoot what's happening?
Sure. If your task is running and you don't want it to stop, go to Tools > Log Management. I've already done it for this task: set the Performance logger to Trace, which provides information about whether the latency is coming from the Source or from the Handling. Toggling this to Trace and clicking OK will save that setting without stopping the task. In some cases, when latency is occurring and you're not getting error messages, you can also turn Source Capture or Target Apply to Trace, and that Trace logging will give more information.
Great. So, that will write more information about the performance in the logs to help understand what's causing the latency. And where are those logs stored?
Attunity > Replicate > Data > Logs.
You've got a task called Squeeze.
Squeeze. This is the most recent log.
Okay. So, when you look at these logs, it's kind of chronological, right? Everything starts from the top, so the most recent information will be down at the bottom? Is that right?
That's right. This performance output is produced every 30 seconds.
Okay.
So, this Handling Latency: 72, 103, 133, 163, ... It's creeping up.
Oh yeah. I see there.
As this has progressed. So, that's where you can see that the processing is where the latency is coming from.
Right, okay. Is it possible to set up some kind of alert when latency gets high like this?
Sure. If I go over to Server, you can set up task-level events on performance and resources. You can say: if latency exceeds 300 seconds, trigger a notification or email; and then clear it when it drops below 200.
Okay, that's pretty cool. So, you can set it up to create and send email alerts to let admins know if some latency hits a threshold?
Correct.
That's really cool. So, if this did hit that threshold; say if you up the inserts to 3 million, what would that look like?
So, over here on the other side of the log messages is this Notifications icon. It's a little bell.
Oh.
You can see, we got the message: "replication task exceeds the defined limits. Current latency 305 seconds."
Okay.
It's notifying here, but it could also be configured to send that email alert out.
Cool. All right, we've seen how the worst practices perform. How can we adjust these more along the lines of best practices? Can you walk us through those settings?
Sure. If we go back to the Designer tab > Task Settings > Change Processing, we change these parameters to open up the pipes so the data flows through this process more quickly. Set the 'Longer than' interval to 16 seconds and 'but less than' to 59 seconds. And then I'm also going to increase this to 2048.
Okay.
That forces an apply batch when the processing memory reaches this amount. Then for the Transaction Offload, we'll set this to 5,000; and this Transaction Duration Exceeds to 60,000. By setting these, we're basically forcing it to use these parameter options and making sure that we're using more memory resources, rather than sending so much to disk.
Okay. Now, how did you come up with these settings specifically?
These are settings we recommend to customers via a best practices document that we share with our Professional Services group, and it's something we typically advise customers on when they have support cases with latency related to Snowflake.
Okay. So, these are specific to Snowflake endpoint?
They are, but they can also be tested and tuned (obviously we'd recommend doing that on a test environment) for other endpoints that have a similar style to Snowflake.
Are there any settings on the Snowflake Target side that have been hurting the performance as well?
Yeah. So, let me go ahead, I'm going to stop the task.
Okay.
Go to the Snowflake endpoint > Advanced tab, and I'm going to bump this Max File Size up to 250 MB. Click Save, close it, and save the task. Then we can do a Run > Resume, and it will pick up where it left off. I'm going to go ahead and do another 1 million.
Okay. So, that should come in as another incoming change after the last 3 million.
And then here: Throughput. There we go, 180; it's doing much better for this Target Throughput versus more like the 3,000 or 7,000 that we were seeing before. And the Source is going to shoot up, because we did this next one. And then, I think it's done with the 3 million, and it's back to the 1 million that we just sent. That's all being applied in memory, and the Latency has gone down to where it's just working with that insert that we did.
All right. So, where can someone find some more information about how to resolve latency issues like this?
On our community page, we also have a Troubleshooting Guide that has a series of questions to be able to narrow down where the latency may be coming from. This performance output is very important for us to see.
Yeah, this is a great checklist to help people identify where the latency is and what might be causing it, so they can resolve the issue themselves or at least gather all the information to get some more help. Thanks!
Thanks so much, Troy. This is definitely an important topic; we see this a lot. And for customers who have downstream consumers of their data, Latency is a big sticking point that people want to resolve quickly.
Thank you everyone. We hope you enjoyed this session; and thank you to Kelly for presenting. We always appreciate getting experts like Kelly to share with us. Here's our legal disclaimer; and thank you once again. Have a great rest of your day.
