STT - Qlik Replicate Best Practices

Last Update:

Jan 20, 2023 7:12:51 AM

Updated By:

Troy_Raney

Created date:

Jan 20, 2023 7:12:51 AM

Environment

  • Qlik Replicate

 

Transcript

Hello and welcome to the January 2023 edition of Techspert Talks. I'm Troy Raney and I'll be your host for today's session. Today's presentation is Replicate Loading Best Practices with our own Michael Litz. Michael, why don't you tell us a little bit about yourself?
Hey Troy. Sure, I'd love to. As a matter of fact, I just got a plaque for my 10th year of service with Qlik in the Support group.
That's very cool.
I started doing this basically when they came up with the Replicate product about 10 years ago.
Okay, great. Today, we're going to be talking about full load versus reload; we're going to be examining a lot of different options for setting up those kinds of tasks; and Mike's going to go through a lot of configuration best practices. So, Mike, let's start with a full load. What kind of details do we need to know about a full load task?
Sure. The full load basically is the very first thing the task will do. So, I'm going to do a brand new task, loading all tables.
And this is going to be a full load with CDC or just the full load?
It'll be a full load with CDC. You get to pick whether you want to do a full load and apply changes. And basically, once we do the full load for the first time, any time the task runs after that you would be doing a reload. And we can either reload the entire task (which is all the tables) or we can reload a single table.
Right.
And then I need to put in my Source database which is SQL Server.
All right, just dragging it over. I love that feature.
Yes, me too. And then I'll drag over my Target; and now I'll do table selection. This is going to go out to my Source, and it'll search my Source for all my tables. In this case, I have about 16 tables; I'm going to select them all.
16, yeah.
I'll save over here.
Are there any other settings we need to do there with like schema or anything like that?
Yes, actually. Thanks for reminding me. I'm going to put a global rule in here, because I want all of my tables to go into this schema. So, I'm going to do a rename schema at the global level that'll affect all the tables.
Okay.
And then I'll say finish, and I save it. Now when I go to do the run, I only have the option of Start Processing. So, I'm going to start processing.
All right.
And you can see that the task has started, and all 16 tables have completed loading. So, I want to bounce out to SQL Server.
And there are the 16 tables. Since this task has ”priority” in the title, how do we look at priority?
I'm going to view my logs, and I like to download my logs.
Okay. Yeah, what are you looking for when you go in here?
Yeah, I'm looking for the load order. So, I'm going to search for "load finished"; we can see the load finished for table number one.
Okay.
And then if I go again, you'll see the next one is ”float test.”
I see the ID=2. So, that's table 2?
Yes, and the important thing to notice with this is: we're loading in the tables in alphabetical order.
All right. So, since they're all set at equal priority, it just took them in alphabetical order?
Right, and you can see that if we go back to the Design mode and look at my selection: there's Customer, Float, My_Cust. So, it takes them in this order.
Okay.
So, what we can do in the task is come in here and change my Customer table; I want to say that it's actually going to be the lowest priority. So, this would be the last table loaded. Then I'll go to what had been my last table loaded and –
Okay.
And I want to set it to be the highest priority. You can see it's pretty granular when you're looking in here.
It's kind of cool how you have 7 levels of priority there.
Yeah, yeah. You know, most people will want to load their biggest tables first.
Right, like if you had a 10-hour load on one table, that would get priority. That's pretty cool.
Yeah. So, now that I've got the priority set, I'll reload this. The significant thing we want to see here is when we view the logs this time. So, I'll do that same find of "load finished," hit find next on that, and you'll notice now that the very first instance of a table loading was T_Cust.
Right, that was the table you changed to have the highest priority?
Yes, and then if we scroll down – yes. So, that was my first one, and now it's my last one. So, that's a quick demo of what the prioritization on table loads does.
Michael, I noticed a warning sign in the log messages window about a LOB. Could you take a look and tell us what that means?
Okay. Well, it does have a LOB field, which stands for large object. There are BLOBs (binary large objects) and CLOBs (character large objects). When you have a table with a LOB and there is no primary key on the Source, the task is going to drop that field automatically. Because if there is no PK, it would have to do a scan of every record on the Source to find the actual record that it needs to load from.
I imagine that could slow things down a lot.
Yeah, it would be untenable.
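To make the scenario concrete, here is a minimal sketch (in SQL Server syntax, with hypothetical names) of the kind of table that triggers this behavior: a LOB column with no primary key declared.

    -- Hypothetical table: it has a LOB column but no primary key.
    -- Without a PK, Replicate cannot locate the exact source row to
    -- fetch the LOB value from, so it drops the LOB field from the task.
    CREATE TABLE dbo.My_Cust (
        c_custkey INT,               -- not declared as a PRIMARY KEY
        c_name    VARCHAR(100),
        My_LOB    VARBINARY(MAX)     -- BLOB-style large object column
    );

    -- Declaring a primary key would let Replicate look up the row directly:
    -- ALTER TABLE dbo.My_Cust ADD CONSTRAINT pk_my_cust PRIMARY KEY (c_custkey);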
So, without that primary key, tables with large objects will just get skipped. Can you show us how to get around that limitation?
I'll come back into the Designer mode. I'm going to go to my Task Settings for my change processing, and this is actually a toggle to enable or disable it. So, I've now removed change processing; this is strictly a full load only task.
Okay.
Now I'm going to save this, and I can reload my Target. Now that we reload the Target, you can see we didn't get any error messages; and in fact, we can go out to SQL Server now.
Okay.
You can see here that we've got my field My_LOB.
And that's that table with large objects?
Yeah; essentially, that's the technique for loading up a table that has LOBs but no primary key.
What about a situation where you've got like a really large table and you're only trying to bring over a certain part of it? Like you need to filter data from just this year?
Okay. Yeah. Let me show you; that would be a Full Load Passthrough Filter. The way we do that is in the Designer mode: we highlight our table, and to bring up the Full Load Passthrough Filter portion of the screen, we have to hold the Ctrl button and then hit Table Settings. Now when we go out to the filters, I have this Expression Builder under the Full Load Passthrough Filter. My filter says to only bring over c_custkey records that are greater than 5.
Okay.
So, now I go into Task Settings, I want to get into my Logging, and I want to go down to my SOURCE_UNLOAD. I move this to Trace. The log will now contain the SQL statement that got sent up to the Source. And if you're debugging a Full Load or a Reload, the other half of that obviously is the TARGET_LOAD.
Okay.
So, you want to set these two if you're looking at what happened during the load.
Okay. That'll give us a little more information on those two actions?
Yes, correct. So, I'll leave this guy set. I'm going to save these changes. I'll come back over and do my reload.
Okay. Looks like that completed.
And we'll go out here. I can refresh this select. We have the 6 records, but here's where it's important: that SOURCE_UNLOAD. So, I'll look at this new log and search now for the "select."
That's that filter you're looking for, right?
Right. Normally, if I hadn't put in a Full Load Filter, all we would send is this, and that would bring every single record back.
Okay.
But because I did the Full Load Passthrough Filter, here's that extra little bit.
That greater than 5?
Yes. It adds the "where greater than 5." And the other thing: if you're doing a complicated filter, you could actually write the where clause outside of the task in native SQL; make sure it gets you the rows you want; and then copy the where clause back into that section we just took a look at.
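As a rough sketch of what shows up in the SOURCE_UNLOAD trace, the unfiltered unload is a plain SELECT, and the passthrough filter is appended as a WHERE clause; the exact statement and column list will vary by source endpoint, and the names below follow the demo.

    -- Without a Full Load Passthrough Filter (brings back every record):
    SELECT c_custkey, c_name /* ...remaining columns... */
    FROM dbo.Customer;

    -- With the passthrough filter "c_custkey > 5" appended by Replicate:
    SELECT c_custkey, c_name /* ...remaining columns... */
    FROM dbo.Customer
    WHERE c_custkey > 5;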
So, what would happen with change data capture to records outside of that filter? Can we go back to the Source in SQL Server?
Yes.
Yeah, like if you change one of those less than 5 records. Okay, you've changed the record that doesn't exist in the Target. So, how does Replicate deal with that?
We captured it, we saw it coming in.
Okay.
If you look at the Aggregates, it looks like we applied the update, but if we look at this message here:
”Source changes that would have no impact were not applied to the Target database.”
So, what we can do is take a look at the SQL Server Target side, and you'll notice that the record with c_custkey 3 is not here.
Yeah.
So, that actually is problematic, because if you have millions of updates and many of them break like that one did, it would slow your tasks down a lot.
So, how can we use the filter but keep it from causing problems with CDC?
If we go back into the task Designer mode; I'm going to stop this task, because I'm going to have to make a configuration change. I want to highlight the table here, hold the Ctrl button, and hit Table Settings. So, I can still see c_custkey is greater than 5.
Okay.
So, what we want to do now is set our Record Selection Condition; this happens during the CDC portion. So, I want to put in here that c_custkey has got to be greater than 5.
Okay. So, that'll apply to CDC (the change data capture). So, it's only going to look for changes greater than 5 now.
Right, and if a change comes in on something less than 5, the task will just ignore it and it won't error.
Okay.
Yes, now we have the two places. And a big difference to notice here: this Full Load Passthrough Filter uses the Source syntax, while on this Record Selection Condition we're actually filtering at the task level; we have to use the internal representation of that field, which is the same name, but it has a $.
Okay. And as I understand it, the Record Selection Condition filter there is filtering on the Replicate side; that's why you're using Replicate syntax. And the Full Load Passthrough Filter that you had to press Ctrl to get is filtering on the Source side, so it's using the Source syntax.
That is correct, absolutely. That's very well put, actually, yes.
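Side by side, the two filters from the demo look like this (a sketch; note where each one is evaluated):

    -- Full Load Passthrough Filter: sent to the Source as-is during the
    -- full load, so it must use the Source database's native SQL syntax:
    c_custkey > 5

    -- Record Selection Condition: evaluated inside Replicate during CDC,
    -- so it uses the internal column reference with the $ prefix:
    $c_custkey > 5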
Okay. That makes sense. Well, using that Passthrough Filter is great for limiting the number of rows a task needs to load, and it helps knowing where those filters are actually applied. I'd like to move on to doing some Inserts, because I know a lot of customers want to be able to add a create date or an updated date in a database. Can you show us that?
Alrighty; and this is really cool, because customers often want that, you know: when was it Inserted in the Target, when was it updated in the Target. So, the way we do that is go into Designer mode.
Okay.
I'll highlight the Customer table, do Table Settings, go to Transform, and Add a Column; I'm going to call this InsertDate. Okay.
I'm going to make it a date time field. Now I'm going to put in a pre-written expression.
Okay. And where can people find that expression, is that in an article perhaps?
Oh yeah. There are articles out there about doing Insert Date, Update Date, and how they work.
Okay.
So, what I've done on this InsertDate is basically say: when the operation is an Insert, then put in the date time now.
Okay.
So, what I want to do is just get rid of this bit for a minute. Just so we can see what happens here.
Okay.
Okay to that. And I'm going to now add another column. This one I'm going to call UpdateDate, and I'm going to change it to be a date time. Once again, I have a cut and paste for this. I'm going to do the same check of the AR_H_OPERATION: if it's an Update, then I'm going to say date time now anytime the record updates. So, it'll constantly be getting updated.
Okay. So, the big difference there is that an Insert is a very specific operation; we're Inserting a new record. And an Update is just any change to an existing record. Is that the difference?
That is correct, yes.
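The expressions behind those two columns follow a common pattern. Replicate transformation expressions are SQLite-flavored, and $AR_H_OPERATION is the built-in header column carrying the operation type; this is a sketch of the pattern, and the exact expressions are in the community articles Mike mentions.

    -- InsertDate column (the simple form from the demo, before the
    -- Target lookup is added back): stamp the time only on an INSERT.
    -- With no ELSE branch, the column stays NULL for other operations.
    CASE WHEN $AR_H_OPERATION = 'INSERT' THEN datetime('now') END

    -- UpdateDate column: stamp the time on every UPDATE.
    CASE WHEN $AR_H_OPERATION = 'UPDATE' THEN datetime('now') END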
Okay.
And I'm going to Save. I'm going to come back and do a Reload of my Target, and then come out to SQL Server. So, now what you can see out here is: I have that InsertDate, and this is what the task put in while I was doing that reload. But the UpdateDate hasn't got anything, because I haven't made any updates.
Okay.
So, what I want to show you now, because I didn't put a condition in on the InsertDate to preserve that date: when I go to update, modifying "Troy" to be "Mike" –
Or ”ike”
"ike's" good enough. Now we go back to our task; and in the Monitor mode, we'll look at the Change Processing; we see we had an update. So, let's go look at what happened out in SQL Server. And you'll notice that there's our record "ike" that I did the change on, and you'll see the UpdateDate got in there, but what happened to my InsertDate? It went away. And the reason that happened is that InsertDate is not in the Source; it's only in the Target.
So, how can we preserve that InsertDate? Was that the bit that you removed?
Yes, that is correct. It's a Target lookup, which will preserve the state. So, let me come over, and I'll do that real quick in the task. So, we're back to the Transform screen; and you can see this is pretty much unconditional, didn't do anything additional. It just said: if it's an Insert, then put the date time now. And what I need to do is modify this to say: else, if it isn't an Insert, do a Target lookup. I go look up, in this schema and that table, that field, using the c_custkey that I'm passing through from the task; and I will be able to get that InsertDate back into the task, so that when it writes it out to the Target, the InsertDate will be there.
But basically that should preserve that date instead of clearing it out next time the record is updated?
Absolutely. And that's pretty important to do; otherwise you've lost your InsertDate. Oh yeah, and I'll make this back to "Troy." We can see in the Monitor mode under Applied Changes that an Update came through. Now when we look at this same column: we changed this to "Troy," but that Target lookup preserved the actual date of the Insert.
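A sketch of the full expression with the Target lookup added back might look like the following. The schema and table names are hypothetical, and the parameter order follows the lookup pattern documented for Replicate's data-enrichment functions; verify target_lookup's exact signature in the Expression Builder's function list for your version.

    -- Sketch only: on INSERT, stamp the current time; otherwise preserve
    -- the existing InsertDate by looking it up in the Target table itself.
    -- 'NO_CACHING', the condition placeholder style, and the argument
    -- order are assumptions to verify against your Replicate version.
    CASE WHEN $AR_H_OPERATION = 'INSERT'
         THEN datetime('now')
         ELSE target_lookup('NO_CACHING', 'MY_SCHEMA', 'Customer',
                            'InsertDate', 'c_custkey = ?', $c_custkey)
    END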
Great.
And the update works. So, that's the little trick for getting an Insert date or an Update date. And even if the task is in CDC and a new record gets Inserted, that date will show up as well.
That's great. Now Mike, I've heard about using Parallel Loading for large tables, but I don't know much about it. Would you explain how those work?
Oh yeah, good question. When we do a load of a Source table, we basically go sequentially; there's one thread making the load. If we go to the Designer mode, I go to my Customer table > Table Settings, and we can come over to the Parallel Load. So, I'm going to use Data Ranges.
Okay.
And I'm going to select my segment column; it needs to be a key column. So, I'm going to select c_custkey, and I'll say okay. Now I'm going to Add a Segment, and what I enter here is the upper limit of the range. So, I'm going to say load up to 3.
Okay.
And then I'll do another segment and say load up to 7; and I'll do another segment and say 9. When I say "okay" here, you're going to see that there were actually three different subsections loading at the same time for this table.
And then we'll need to take a look at the log to see how that performs?
Yes. So, I'll reload this. And if we look at the Full Load > Completed: the table loaded up, and all 9 rows got out there. When I look at this log file and come down –
I see ”loading segments.”
Yes, that's the difference. With that Parallel Load, we start to load that Customer table in segments. In this case, we have multiple segments of the Customer table loading in parallel. You can get the table over there much quicker.
Okay. So, for like a really huge table (millions of rows), this is a good way to increase performance: to load it in sections or segments, all in parallel, instead of doing it from top to bottom? Is that what we're saying?
That's it. That's it in a nutshell.
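Conceptually, the three boundary values from the demo (3, 7, and 9) split the full load into range queries that Replicate can run in parallel, something like the sketch below; the actual statements depend on the source endpoint.

    -- Segment 1: rows up to the first boundary
    SELECT * FROM dbo.Customer WHERE c_custkey <= 3;
    -- Segment 2: between the first and second boundaries
    SELECT * FROM dbo.Customer WHERE c_custkey > 3 AND c_custkey <= 7;
    -- Segment 3: between the second and third boundaries
    SELECT * FROM dbo.Customer WHERE c_custkey > 7 AND c_custkey <= 9;
    -- Final segment: anything above the last boundary
    SELECT * FROM dbo.Customer WHERE c_custkey > 9;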
That's awesome! I know as an admin for Replicate, it's really nice to get notifications. Can we take a look at that and talk about some best ways to set those up?
Oh yeah. No problem. Notifications are set at the Server level.
Okay.
And I'll do a new notification of a task event. The things I can set notifications for are: the task was started, the Full Load started, and the Full Load completed. Let's do that one: Full Load Completed.
Okay.
And there's a default email list, or also the Windows Event log. Most people will put their notifications to the Windows Event log. After it's in the event log, they'll use some sort of a program to scrape the event log. Otherwise, you'd be getting emails for everything that you had a notification for, and you may not want that. So, I'm going to put mine in the Windows Event log.
Yeah.
And this notification has some defaults in here where you can get your Server Name, the Notification Name, the Task Name.
Wow. So, you can really customize it?
Yes, and you can also modify these by putting in some of your own fields, or I could even put in hard-coded values.
I see it gives us a lot of options there. This is pretty cool: "My Message." Nice.
Yeah. A better thing would be doing this customization in your notification: if you do have something scraping the event log, you can start to search for this notification code.
Yeah.
And then, you know, maybe one person only wants this notification to get sent to them; this is a way that you can customize the notification. We can also add all sorts of Internal Fields, like the number of records loaded, that sort of thing.
Very cool.
Yeah. So, let me hit OK on that. And now we can do it for every task, or we can go out and select tasks. I'm going to select InsertDate. So, I'll say next on that, I'll say finish, and I'll save this at the server level. We'll come back out to the task where we made some configurations; we're going to reload this task, and we should be able to see the notification pop up. And that completed pretty quickly. So, at this point I need to go have a look at the Event Viewer.
Does anything pop up in the little notifications window on the right? That tab next to log messages?
Oh, you know what, yes.
And there's your notification code equals 12.
Yep.
Nice.
Yeah, and then it would also be out in the Event Viewer.
That's all right; it's nice to know that it's there, and you can even set it up to send emails if a task fails or hits some sort of performance threshold that you're concerned about. That's a lot of really great options with those notifications. I think those are pretty cool.
Yeah; yeah they definitely are.
Okay. Great. Well, now it's time for Q&A. Please submit your questions through the Q&A panel on the left side of your On24 console. Michael, how about if we just take them from the top?
All right. Perfect.
Okay. First question: is it possible for Qlik Replicate to compress tables on a SQL Target after it updates the tables? If we do manual compression, the updated tables lose that compression. Do you understand that?
Yes, I do; and no, it's not possible to have Replicate compress them.
Okay. Well, that's pretty simple. It seems like an extra step; maybe somebody needs a little more hardware space if they're having to do compression. Okay.
Moving on to the next question: what can affect a table's performance during loads in CDC? I'm trying to understand why some tables are faster than others.
Yes, and that speaks to the dark art of tuning. Basically, it's a pretty complex thing, because every setup is so different, you know: based on network bandwidth, table sizes, how many Transformations you have, lookups, that sort of thing. The first thing I always look for is the number of rows in the table.
Okay.
If you have millions of rows, that's going to be very difficult during the full load, and that's why we looked just now at that last task with the parallel segments, to help speed that up. The other thing is the width of the table. A lot of times, tables may not have that many rows, but they're very wide rows; they have a lot of information in them, and that can affect it. A third thing (as I said earlier) is if you have a lot of Transformations that have to be done during the CDC portion or the full load portion; that can slow things down. One of the big things in CDC that will affect your performance is the volume of transactions. So, if you have a transaction table that's constantly getting updated, then the logs on the Source are going to have a lot of updates, and then we're going to have to process all of those. So, watch the volume of transactions being done on the Source.
Well, that's a lot of good points to highlight for people. I appreciate that.
Next question: how is the metadata handled or how does the metadata affect the performance of a Reload?
So, that's kind of good to know: it's negligible. We read the metadata one time only, when we start up the task, and then we've got it cached in memory. We don't ever really do anything with it again.
Okay. Next question: is Replicate compatible with Databricks on Oracle?
You know, I was looking that up. We do have a good place to go: the supported platforms and endpoints page.
Ooh, this is good to know.
This is where we list what we do support. The rule of thumb pretty much is: if it isn't listed here, then it's not supported. Having said that, I was in here earlier looking down the list, and I could see a bunch of Databricks entries.
I see Databricks, yeah.
Yeah: Azure Databricks, Databricks Delta, Databricks Lakehouse. So, there are Databricks endpoints supported. But we don't support Databricks on Oracle, and Replicate does not even run on the Oracle Cloud. We do run on AWS, Azure, and Google Cloud; Databricks is fine there, but not on Oracle Cloud.
Okay. But this is a really great resource. I'll provide a link to that with the recording of this session.
Okay. Moving on to the next question: what do you do if a task encounters a max row size limit?
Well, there's a couple of things you can do. One of them is to go into your table in the task; we could go in here and take a look at the Table Settings and Transform. If you had too many fields here, or you knew some of these fields were really huge, like maybe there's a LOB that you don't really need; what you can do then is just highlight that field and remove it, so we won't actually send it to the Target. That'll help get you past a max row size limitation on the Target. A lot of people will just say, "oh, give me the whole table," but sometimes people don't need everything in the table. So, you can remove some fields and get under that limit.
Yeah. Next question: will Qlik Replicate work with Oracle OCI Pluggable Databases? This is a very specific question.
Yep; in fact, we do support Oracle OCI Pluggable Databases. Essentially, they're multiple databases contained inside a container database. When you're trying to get to a pluggable database, because it's inside of another container database, the one big thing people don't know to do is: you've got to connect directly to the pluggable database. It's intuitive to say "I'll just connect to the container," but that container could have six or seven pluggables in it. So, you have to specifically connect to the pluggable database. But the good news is: yes, we do support it.
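As a hedged illustration, in Oracle EZConnect terms the difference is which service name you point the endpoint at; the host, port, and service names below are hypothetical.

    -- Intuitive but wrong for Replicate: connecting to the root container
    -- (CDB), which may hold several pluggable databases:
    --   dbhost:1521/cdb_service
    -- What the Replicate endpoint needs: a direct connection to the
    -- pluggable database's own service:
    --   dbhost:1521/pdb1_service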
That's great to know. Okay. Moving on to the next question: what is the best recommended version of Qlik Replicate?
The best recommendation I have for the latest version would be to contact Support. If you already have something, or you're looking to get the latest version, contact us. The latest version right now is 2022.11.286.
Okay.
It's an SP01.
Great.
The other thing is: we have multiple versions that are valid at any one time. So, if you're on 2022.5, you may not want to get 2022.11, because that's an upgrade. In a sense, you're getting to a new…what's the word I'm looking for?
Right, I understand. There's different iterations of each major release.
Yeah, thank you. I just couldn't seem to say that right.
I understand what you mean. There's a major release and different versions of each of those major releases. So, 2022.5 is a little bit more mature at this point than the initial release of 2022.11, I guess.
That is correct. And in fact, that's why we already have an SP01 out there.
Okay.
But yes. And the other thing is: if you're already on a version and you're going to do a minor upgrade, you typically don't have to do as much testing, because it wasn't a big, major change.
Okay. Next question: we're getting a lot of notifications. Is there a way to limit the kinds of notifications that are sent to those indicating data integrity issues when a task fails?
Yeah. There's two ways to do that. One of them would be to set up email rules; so that in your own email, if you're getting email notifications, you can set up rules that will parse the email and move it to different folders or not. We don't have any granularity inside of the task when you set up the notification, but you can do something with an email rule.
Okay.
A lot of times customers don't like that. So, that's why I said earlier, they like to put the notifications to the event log; because there are third-party programs out there that are set up to scan the event log, and it's very easy to program logic into them to do what you're asking: limit what notifications get sent.
That makes sense. It's easier to put everything into the Windows Event log, and then parse out what you want from it and scrape that instead. That's a good tip. Okay.
Next question: after a task stops before finishing, how can you restart the task where it left off?
Okay. So, there are two parts to that. If the task is in the middle of the full load phase; let's say you've loaded 50 out of 150 tables, and you still have the other 100 tables to go. If your task stops then, the only thing you can do is reload it, and it has to go through and reload everything. That's one side of it.
Okay.
The good news is: if you've gotten through your full load and you've stopped your task during your CDC processing, then you can just come right back in and do a Resume, and I can show you that on screen. We've seen it, didn't really talk about it, but that would just be the Resume here. In the middle of the CDC portion of the task, if it stopped and you resume it, it'll pick up right where it was and start applying changes.
Great! We've got some more questions coming in. Let's see: is Qlik Replicate free if you have a Qlik Cloud account?
You know, we have free demos available on the web. So, we can give you a demo to download and play with for free; but when it's time to really use it, you need to contact your account manager and get a license. Typically the licenses are based on the Source and Target endpoints.
Okay. Next question: how can we make changes to the Primary Key field?
Okay. My first answer to that is: ah, don't do it! Changing a Primary Key field is always a complicated issue. We do have a method to deal with it. So, for instance, say you're adding a segment to a PK: you have a single-segment PK, maybe it's ID, and then you're going to make a change on your Source to add a second segment, another field. We do have a way to do that, and you won't even have to reload your task. You'd have to stop the task; manually make the change for your additional segment on the Source; manually change the Target to have that additional segment too; and then do an advanced run. With the advanced run, you would do a Create Missing Tables.
Okay.
Yeah. So, this will refresh the metadata, refresh the key, and the task will just do this all behind the scenes. No data will move; the task will stop when this run is done, and then you can resume your task.
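For the manual schema steps, here is a sketch in SQL Server syntax of adding a second segment to an existing single-column primary key; the table, constraint, and column names are hypothetical, and the same change has to be made on both Source and Target while the task is stopped.

    -- Drop the existing single-segment primary key:
    ALTER TABLE dbo.Orders DROP CONSTRAINT pk_orders;
    -- Re-create it with the additional segment (second column):
    ALTER TABLE dbo.Orders ADD CONSTRAINT pk_orders
        PRIMARY KEY (order_id, region_id);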
Great.
And then the other one which is probably a lot easier is: just do a Reload of the table.
That makes sense as well.
Yeah, unless it's a 9-million-row table that takes two days to load.
All right. So, we're getting towards the end here. Next question is: how can we investigate latency in our tasks?
It's extremely difficult to determine the root cause of latency. There are so many moving factors, like network size and packets dropping; all sorts of things can happen. One of the things we in Support are always recommending: get a Professional Services person to look at latency, because they can take the time, and they have the tools to really analyze all the moving pieces.
I'd also like to let people know about a special Techspert Talks with our own Kelly Hobbs focusing on just that subject, which came out recently. It's called "Troubleshooting Qlik Replicate Latency," and it looks at how latency happens and how you can investigate it further.
Yes, thanks for reminding me. I took a look at that; it's a very good video.
Yeah, we can do a link to that as well. Next question is: which Hardware attribute is more important for Qlik Replicate server performance: more RAM, more processors, or faster processors?
If you have to pick one, it's RAM.
Yeah.
The more memory you have, the more you can keep everything in memory. It prevents writing out to disk into our sorter files, or any other operations that take time. So, RAM is the key, but all three are applicable. Once again, that's the dark art of fine-tuning your hardware. But for sure, everyone agrees: the more RAM you have, the better off you are.
Good to know. All right, last question: how can you adjust the number of segments that can be loaded in parallel to improve performance? And I think you kind of demoed that a bit.
Yes, that was our final demo that we ran through: setting up and tweaking those segments to help you Parallel Load for the full load.
Well, this has been really great, Michael.
Thank you, Troy. I want to thank you very much for giving me the opportunity to demonstrate a couple of the tips and tricks related to full loads; the limitations with full loads; and techniques to get faster full loads, increase your performance by segmenting the load, and/or do passthrough filters on the Source. We have many, many articles in the community that relate to various topics in and around full loads and load techniques. So, I encourage people to go there, search, and find answers through the community.
Okay, great! Thank you everyone. We hope you enjoyed this session. And thank you to Mike for presenting. We appreciate getting experts like Mike to share with us. Here is our legal disclaimer. And thank you once again. Have a great rest of your day.
