STT - Using Kafka with Qlik Replicate

Troy_Raney · Nov 22, 2023 2:08:54 PM

Environment

Qlik Replicate

Transcript

Hello everyone and welcome to the November edition of Techspert Talks. I'm Troy Raney and I'll be your host for today's session. Today's presentation is Using Kafka with Qlik Replicate with our own Swathi Pulagam. Swathi, why don't you tell us a little bit about yourself?
Hi everyone. I'm Swathi, a Senior Support Engineer at Qlik, specializing in Cloud endpoints. I provide support for Qlik Replicate and Qlik Cloud Data Integration, and I have been part of the Qlik team for the past 2 years.
Awesome. Okay, today we're going to be taking a look at the architecture of Kafka; we're going to go through a lot of the Kafka terminology; and Swathi is going to demonstrate how to set up Replicate to work with Kafka as an endpoint; and we're going to take a look at some of the configuration details and even discuss some possible Performance Tuning. Swathi, as I understand it, Kafka is a data source and streaming platform. What's important to understand about Kafka as a Qlik Replicate endpoint?
Yeah, when we create a Replicate task with any source and Kafka as a Target, Replicate will convert all the source records into Messages.
Yeah, so walk me through this diagram here. What are all those terms?
Kafka architecture has four components. Those are: Broker, Zookeeper, Producer and Consumer.
Okay.
The Producer sends Messages to the Broker; the Consumer reads those Messages from the Kafka Brokers; the Kafka Cluster is built from one or more Brokers; and Zookeeper is mainly for Cluster management, failure detection, and recovery, and it stores security information like passwords and certificates.
Okay, so a Broker is basically a Kafka server; and multiple Brokers create a Kafka Cluster. I see the Zookeeper there to manage it, like you mentioned; and where would Qlik Replicate be in this diagram?
Yeah, Replicate acts as a Producer.
Okay, and Replicate as a Producer sends Messages to the Kafka Broker and what are Messages exactly?
A Message is a simple array of bytes; so here, if we take this CSV file, each line of this CSV file is considered as a message.
Okay, so it's sending those arrays of data, and in Kafka terminology one line would be considered a Message?
Exactly. Correct, yeah.
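Editor's sketch of the idea above, not Replicate's actual code: each line of a CSV file becomes one message, and a message is just an array of bytes. The sample rows and the function name `lines_to_messages` are hypothetical.

```python
# Hypothetical CSV content; every line is treated as one message.
csv_text = "1,Alice,2020,50000\n2,Bob,2021,60000\n"


def lines_to_messages(text: str) -> list[bytes]:
    """Encode each non-empty line as a byte array, one message per line."""
    return [line.encode("utf-8") for line in text.splitlines() if line]


messages = lines_to_messages(csv_text)
for message in messages:
    print(len(message), message)
```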
Okay, all right; so, you set up a demo today where we can see how this all applies to Replicate. Can you walk us through the process of setting up Kafka as an endpoint?
Yeah, sure. First, click on Manage Endpoint Connections, then New Endpoint Connection. Give it a name here; under Role, select Target; and under Type, select Kafka.
Okay, and just to clarify, Kafka is only supported as a Target, is that right?
Correct, yeah, in Replicate Kafka is supported as a Target.
All right, and I see some of the Kafka terminology already coming into play. We've got the field for the Broker server; and that's basically the Kafka server. What do we need to enter there to connect to that server?
We have to give the Broker server IP address; 9092 as the port.
And is 9092 the default Port?
Yeah, default port. And we can give more than one.
Okay, and how would you separate the multiple servers?
Yeah, like giving a comma.
Okay, and what are the benefits of that?
For high availability and failure detection; and the data will be distributed across all the Brokers.
Oh, cool.
For the other Brokers, if we are planning to give more than one, then we have to give the port for each Broker server.
Okay, that makes sense.
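Editor's sketch: a comma-separated Broker list with a port on every entry can be validated like this. The function name `parse_brokers` and the IP addresses are hypothetical, and rejecting entries without a port is this sketch's choice, based on the statement that each Broker needs its port.

```python
def parse_brokers(value: str) -> list[tuple[str, int]]:
    """Split 'host:port,host:port' into (host, port) pairs."""
    brokers = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep or not host:
            # Each Broker entry is expected to carry its own port.
            raise ValueError(f"missing port in broker entry: {entry!r}")
        brokers.append((host, int(port)))
    return brokers


print(parse_brokers("10.0.0.1:9092, 10.0.0.2:9092"))
```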
Next comes security. The choice should align with the security configuration on the Broker server.
Okay, so whatever security is set up on the Broker server, that's what needs to be selected here?
Correct, yeah, these are all the authentication options.
Okay.
For the demo, I'll go with None. Next is Message Properties; the format can be JSON or Avro.
Right, so this is the format of those Messages going from Replicate to the Kafka Broker?
Correct, yeah, from Replicate we are sending the message format either in JSON or in Avro format.
And what are the big differences?
Yeah, so JSON is represented in a textual format, whereas Avro is represented in a binary format; and Avro is considered more efficient due to its binary nature.
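Editor's sketch of why a binary encoding tends to be smaller than a textual one. This does not use Avro itself; Python's `struct` module stands in as a simple fixed binary layout, and the record values are hypothetical.

```python
import json
import struct

# Hypothetical record using the column names from the demo table.
record = {"ID": 1, "Name": "Alice", "Year": 2020, "Salary": 50000}

# Textual encoding: field names and digits are all spelled out as characters.
as_json = json.dumps(record).encode("utf-8")

# Binary stand-in (not Avro): int32 ID, uint8 name length, name bytes,
# int16 year, int32 salary. Field names live in a schema, not in the payload.
name = record["Name"].encode("utf-8")
as_binary = struct.pack(
    f"<iB{len(name)}shi",
    record["ID"], len(name), name, record["Year"], record["Salary"],
)

print("JSON bytes:  ", len(as_json))
print("binary bytes:", len(as_binary))
```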
Okay, what about the compression settings?
Yeah, there are two compression options: Snappy and Gzip. We recommend using compression. Gzip compresses data about 30% more than Snappy, but there is a downside as well: because the data is compressed more, reading it will use about double the CPU.
Oh, so it uses twice as much CPU power with Gzip?
Correct, yeah.
Okay, so compression is recommended, but which method depends on the CPU of your system. Which one are you using today?
I will just go with the Snappy.
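Editor's sketch for measuring what compression buys on a sample payload. Only Gzip is shown because Snappy is not in the Python standard library; the sample rows are hypothetical, and the ratio on real data will differ.

```python
import gzip
import json

# Hypothetical payload: 1000 similar rows, which compress well.
rows = [{"ID": i, "Name": f"user{i}", "Year": 2020, "Salary": 50000} for i in range(1000)]
payload = json.dumps(rows).encode("utf-8")

compressed = gzip.compress(payload)
ratio = len(compressed) / len(payload)

# The consumer pays CPU time to get the original bytes back.
restored = gzip.decompress(compressed)

print(f"original={len(payload)} bytes, gzip={len(compressed)} bytes, ratio={ratio:.2f}")
```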
And now we're on to how to write or publish the message. Could you explain what the term Topic means here?
Yeah, it's a unique name for a Kafka stream.
Okay, so in Kafka a data stream is called a Topic?
Correct, yeah. So Specific Topic means publishing the data to a single Topic. If you have 5 tables, all the Messages will be sent to Topic A.
Okay, but those need to be pre-created in Kafka?
Yeah, correct. If we want to browse for it, the Topic should already exist; but if we give a Topic name, then Replicate will check if the Topic exists. If it doesn't, then Replicate will create the Topic.
That's nice that Replicate can go into Kafka and create the Topic if you wanted to write everything to a specific Topic. What if you wanted to use the naming convention already used in the Source?
Separate Topic for each table, right, corresponding to the source table names. Consider we are having 5 tables, and for each table a Topic will be created: Table A then Topic A will be created, Table B - Topic B, Table C - Topic C, so…
Okay, so it just kind of corresponds to the table names in the Source?
Correct.
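Editor's sketch of the two publishing options described above. The function `topic_for` and the `schema.table` naming pattern are assumptions for illustration; check the topic names Replicate actually creates in your environment.

```python
from typing import Optional


def topic_for(schema: str, table: str, specific_topic: Optional[str] = None) -> str:
    """Return the topic a table's messages would go to.

    With a specific topic, every table shares it; otherwise each table
    gets its own topic named after the source schema and table.
    """
    if specific_topic:
        return specific_topic
    return f"{schema}.{table}"


tables = [("dbo", "TableA"), ("dbo", "TableB"), ("dbo", "TableC")]
print([topic_for(s, t, specific_topic="TopicA") for s, t in tables])
print([topic_for(s, t) for s, t in tables])
```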
Awesome. What are the options for partition strategy?
Random and By Message Key. If we select By Message Key, the partition is chosen based on the message key, which is set by this option: either ‘Schema and Table Name’ or ‘Primary Key Columns.’
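Editor's sketch of partitioning by message key: the same key always maps to the same partition. CRC32 is used here only as an illustrative hash; the hash function the Kafka client really uses may differ, and `partition_for` is a hypothetical name.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index in [0, num_partitions)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Key = schema and table name: all rows of one table share a partition.
print(partition_for("dbo.Kafka", 4))

# Key = primary key value: rows spread out, but each row stays put.
for primary_key in ["1", "2", "3"]:
    print(primary_key, partition_for(primary_key, 4))
```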
What would you recommend?
It depends on how they consume the data. The best option is Separate Topic for each table. For this session, we are not going to discuss Metadata Message Publishing, but if we want to write metadata information of the Messages to a schema registry, then we can use this option.
Okay, and any options we need to take a look at on the Advanced tab?
Under Advanced, we have Message Maximum Size Bytes.
Okay, so that's how big the Messages being sent from Replicate to Kafka can be?
Yeah, correct. If we go beyond this value, we could see performance issues like latency increase.
Okay, so increasing that setting could affect performance?
Correct, yeah.
What happens if a message is sent that's larger than the setting here, 10 MB?
If we have a large object, and that large object is bigger than 10 MB (say it is 20 MB), and here we have set the maximum size to 10 MB, then the remainder will get truncated. So, if we are loading more data, then we have to increase this value; but when increasing it, again, you have to make sure that it will not lead to performance issues.
Okay, so it's good to be aware of the maximum Message size set here?
Exactly. I'm doing the Test Connection.
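Editor's sketch for checking a message against the maximum size before relying on the endpoint to handle it. Treating 10 MB as 10 × 1024 × 1024 bytes is an assumption, and `fits_limit` is a hypothetical helper.

```python
# Assumption: 10 MB interpreted as 10 * 1024 * 1024 bytes.
MAX_MESSAGE_BYTES = 10 * 1024 * 1024


def fits_limit(message: bytes, limit: int = MAX_MESSAGE_BYTES) -> bool:
    """True if the message is within the configured maximum size."""
    return len(message) <= limit


def overflow_bytes(message: bytes, limit: int = MAX_MESSAGE_BYTES) -> int:
    """Number of bytes beyond the limit (0 if the message fits)."""
    return max(0, len(message) - limit)


small = b"x" * 100
large = bytes(20 * 1024 * 1024)  # a 20 MB large object, as in the example above
print(fits_limit(small), fits_limit(large), overflow_bytes(large))
```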
All right, great. Now that the Kafka Target endpoint is configured, can you walk us through the process of setting up a task?
Yeah, sure. Click on New Task. I'm selecting SQL Server.
Okay.
And I'll be loading the data to Kafka, and we'll be using the Unidirectional.
All right, data is going one direction.
And selecting Full Load and Apply Changes.
Okay, that's the change data processing?
Correct, yeah. For Kafka, Store Changes is not supported.
Okay, and I love this visual nature of Replicate, how you build a task with Source and Target. And what other settings are important here?
Yeah, I'm going to the Task Settings. We selected Full Load, and under Change Processing we selected Apply Changes. Store Changes is turned off. Under Change Processing Tuning, we have to select Transactional Apply, not Batch Optimized.
The default is Batch Optimized; but for Kafka, it needs to be Transactional Apply?
Correct, yeah, because even if we select Batch Optimized, we can see in the task logs when running the task that Replicate will send the changes in Transactional mode only.
Just so I'm clear on the difference between those two: Batch Apply would build a collection or a set of changes, and send them all at once; but Transactional will just repeatedly do them whenever a change needs to be applied? Is that correct?
Correct, yeah. With Batch Optimized, we'd be sending in batches; with Transactional, we send the changes one by one.
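Editor's conceptual sketch of the difference just described: one-by-one delivery versus grouping changes into batches. This is not how Replicate is implemented internally; the function names and the sample changes are hypothetical.

```python
from typing import Iterable, Iterator


def send_transactional(changes: Iterable[str]) -> Iterator[list[str]]:
    """Yield each change on its own, in order."""
    for change in changes:
        yield [change]


def send_batched(changes: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Collect changes and yield them in groups of up to batch_size."""
    batch: list[str] = []
    for change in changes:
        batch.append(change)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch


changes = ["INSERT 1", "UPDATE 1", "DELETE 2", "INSERT 3", "UPDATE 3"]
print(list(send_transactional(changes)))
print(list(send_batched(changes, batch_size=2)))
```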
Okay, are these default settings okay?
Yeah, yeah. Later on, if we see any performance issues, then we can adjust the Offload Tuning and Batch Tuning; but for now we'll go with the default settings.

Okay, and what else is important to set here?
Yeah, we have to select the table now.
That's actually identifying the table that we want to send?
Yeah, I'm clicking on Table Selection and selecting a table named “Kafka.”
Oh, there it is.
Save this. And under Run dropdown; I'll click on Start Processing. So, now we are doing Full Load.
Can we take a look at the monitor of how that works? It does it automatically, awesome!
Yeah.
All right, so it went from Queued to Loading, and Loading to Completed. And I see down in the bottom, it transferred 13 rows from that database.
Correct.
Awesome. How can we take a look at that data?
We can check the data through the command line, but I'm using the Kafka client tool Offset Explorer.
Okay, so I see by the IP address that's your Kafka Broker.
Correct, yeah. Under the Broker's Topics we can see DBO.Kafka, because we selected Separate Topic for each table.
Right, I remember that setting you set in the task that it's using the –
Separate Topic for each table, right?
Right. Okay.
I will show the data on the data tab; 13 records, so here from 0 to 12.
Okay.
And partition zero. As I did not specify any partitions, by default it will have only one.
And the column Key, what is that from?
It's based on the Primary Key. When we were configuring the Kafka Endpoint, this was the Message Key setting.
Oh okay.
We selected Primary Key columns.
Okay, that's interesting how that works.
I just want to show one more thing. For Full Load, I'll copy this to notepad. This is the message.
Okay, it's a long string.
This is the actual data from the source; the message runs up to here. It has whatever columns are in the table, like ID, Name, Year, Salary; and there will also be header columns. When we do a Full Load, we'll see the Operation as Refresh.
Okay, so Refresh basically means Full Load?
Correct, after the Full Load is completed, if we do Update, Delete or Insert, then the Operation will be Insert, Delete or Update.
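Editor's sketch of reading the Operation header from a JSON message on the consumer side. The envelope shape below (`message`, `data`, `headers`, `operation`) is a simplified assumption for illustration; the user guide documents the real structure. The record values are hypothetical.

```python
import json

# Simplified, assumed envelope: actual messages carry more header fields.
raw_message = json.dumps({
    "message": {
        "data": {"ID": 1, "Name": "Alice", "Year": 2020, "Salary": 50000},
        "headers": {"operation": "REFRESH"},
    }
})


def operation_of(raw: str) -> str:
    """Return the operation header of a JSON message."""
    return json.loads(raw)["message"]["headers"]["operation"]


def is_full_load(raw: str) -> bool:
    """Full Load rows are marked with the Refresh operation."""
    return operation_of(raw).upper() == "REFRESH"


print(operation_of(raw_message), is_full_load(raw_message))
```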
Okay, can we see how Replicate will behave now if there's an update on the source, on the SQL database side?
Yeah, I'm going to update a record in this table.
Okay, so this is what the original Source data actually looks like in SQL?
Correct, yeah, we are having ID, Name, Year, Salary and Test1.
All right.
And now I'm going to update a column. We should be able to see that update in Replicate.
Okay, so we can see highlighted there is Applied Changes, so we know that some updates have been applied?
Yeah, we just sent one update, and if we see the Aggregates; here we can see the update.
Okay.
Let me take this to the notepad. So, I'm going to paste this.
We're looking for the word Operation?
Operation. Operation: it is showing Update. This is the previous data, and this is the new data after the Update. The header column information is explained clearly in the user guide.
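Editor's sketch of comparing the previous and new values carried by an Update message. Passing the two images as plain dictionaries is a simplification; the values and the helper name `changed_columns` are hypothetical.

```python
def changed_columns(before: dict, after: dict) -> dict:
    """Return {column: (old, new)} for every column whose value changed."""
    return {
        column: (before.get(column), new_value)
        for column, new_value in after.items()
        if before.get(column) != new_value
    }


# Hypothetical before/after images using the demo table's columns.
before = {"ID": 1, "Name": "Alice", "Year": 2020, "Salary": 50000, "Test1": "a"}
after = {"ID": 1, "Name": "Alice", "Year": 2020, "Salary": 55000, "Test1": "a"}

print(changed_columns(before, after))
```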
Okay, so documentation about TransactionID or ChangeMask is on help.qlik.com.
Correct, yeah.
TransactionID. Okay, this list of terms and definitions is great because this is how Replicate and Kafka communicate, right?
Yeah.
On that note, can you clarify what librdkafka is?
When Replicate talks to SQL Server, it connects using an ODBC driver.
Okay.
In the same way, when Replicate talks to Kafka, it uses librdkafka. librdkafka is an open-source Kafka client library.
Okay, so librdkafka is the connector that's built in to Replicate to send Messages to Kafka?
Correct, yeah. It is the core component in Replicate to produce the Messages to Kafka.
So if a Replicate admin wants to start doing some Performance Tuning, how does librdkafka come into play?
If there are any issues encountered or if we need to tune Replicate, then we need to understand the available options in librdkafka to play with; and we need to know how to pass the librdkafka parameters in the Replicate endpoint.
Yeah, can you walk us through what that might look like?
Yeah, if we double-click on the Kafka Target Endpoint, Advanced, Internal Parameters, here.
Okay, so this is one of many parameters that are available to adjust?
Correct, yeah.
And where can we find a list of those parameters?
Yeah, here.
Okay, we're on GitHub. What are we looking at?
Correct, this is where we have the global-level configuration properties.
Okay, so this is the list of all those properties you can adjust; so basically, it's the language of how to do fine tuning?
Correct, yeah. So, in Replicate, we have only a limited set of internal parameters, whereas on this page we have many more properties. If it is Global level, we'll be giving RdKafkaProperties; and if it is Topic level, we'll be giving the Topic configuration properties.
It sounds like a great resource; and we'll definitely include the link to this along with the recording of this session. Are there any example properties you want to highlight?
Yeah, so here ‘acks,’ this is a Topic configuration property. When Replicate is sending Messages to the Kafka Topic, it will wait for the acknowledgement from the Broker. If we have more than one Broker, then Replicate will wait for acknowledgement from all the Brokers.
Okay, so if we had multiple Broker servers, Replicate would have to wait for all of them to acknowledge a message being sent, before it could send more; but if we adjust this setting to 1 (as an example), Replicate would only need to wait for 1 of the Brokers to acknowledge the message before continuing?
Exactly, then Replicate can send other Messages. In the internal parameters, as this is a Topic property, we have to give RdKafkaTopicProperties. We'll give it acks=1. Now Replicate will wait to get acknowledgement from any 1 of the Brokers.
Okay, and if you wanted to add additional properties?
Yeah, then just add a semicolon (;) and you can keep on adding.
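Editor's sketch of composing and reading back a semicolon-separated `key=value` string such as the one entered for RdKafkaTopicProperties. The helper names are hypothetical, the second property and its value are only an example, and the exact syntax Replicate accepts should be confirmed in the user guide.

```python
def build_property_string(properties: dict[str, str]) -> str:
    """Join properties as 'key=value' pairs separated by semicolons."""
    return ";".join(f"{key}={value}" for key, value in properties.items())


def parse_property_string(text: str) -> dict[str, str]:
    """Split a 'key=value;key=value' string back into a dictionary."""
    result = {}
    for pair in text.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        result[key.strip()] = value.strip()
    return result


# 'acks=1' comes from the demo; the second entry is an illustrative value only.
topic_properties = build_property_string({"acks": "1", "message.timeout.ms": "300000"})
print(topic_properties)
print(parse_property_string(topic_properties))
```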
Okay, but with Performance Tuning it's a lot of testing and playing with the values to find what's best. What happens if a parameter is adjusted to a level where things go wrong? Could you show us what that might look like?
Yeah, if I take this ResultsWaitMaxTimes, right.
What does that mean? It's waiting for response from Kafka or what?
Correct, by default it waits 500 milliseconds for a response from Kafka. I will just reduce it to 1.
Okay, with only a 1 second acknowledge time, we're expecting this to fail?
Correct, yeah. When we reduce that Max time to 1 second.
Oh, we got an error message, I see.
We can see it failed, stating that it reached a timeout while waiting for acks from Kafka.
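Editor's sketch of why a wait time that is set too low produces a timeout error: the producer gives up before any acknowledgement arrives. A standard-library queue stands in for the Broker's acknowledgements; this is a simulation, not Replicate's or the Kafka client's code.

```python
import queue


def wait_for_ack(acks: "queue.Queue[str]", timeout_ms: int) -> str:
    """Wait up to timeout_ms for an acknowledgement, else raise TimeoutError."""
    try:
        return acks.get(timeout=timeout_ms / 1000)
    except queue.Empty:
        raise TimeoutError("reached timeout while waiting for acks") from None


# An acknowledgement that is already available is returned immediately.
ready: "queue.Queue[str]" = queue.Queue()
ready.put("ack-0")
print(wait_for_ack(ready, timeout_ms=500))

# No acknowledgement arrives within 1 ms, so the wait fails.
try:
    wait_for_ack(queue.Queue(), timeout_ms=1)
except TimeoutError as error:
    print("failed:", error)
```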
Okay, so you set the parameters so low that it failed?
Yeah. Performance Tuning is all about experimenting in order to find the right value for the data which we are trying to load.
Okay, so we're not making any recommendations. We're just showing the tools that people can use.
Exactly. If they like to, all right.
All right. Well, I see a lot of questions already come in; so why don't we jump to Q&A? Please submit your questions to the Q&A panel on the left side of your On24 console. And I'll just take them from the top.
Okay, Swathi, the first question is: is data sent via JSON Messages?
Actually, we have two formats: JSON and Avro. So, if we want to send the data in JSON format, we can select JSON.
Okay, so yeah, we have two options; and you kind of showed that previously: one was more of a text format, that was JSON; and the other one was binary.
Right, binary. Correct, yeah. So here, if we go to the Kafka endpoint, under the Message Properties Format, we can select whichever we want, either JSON or Avro.
Great! Thank you for showing that again.
Okay, next question: any recommendations on Confluent schema registry configuration when loading multiple Source tables into a single Topic, to ensure schema registry version management? Our shop requires managed schema registry version updates, and we would like to know how best to set up Qlik to generate this.
Yeah, Troy. You can use the subject name strategy ‘Schema and Table Name’ when you're loading multiple Source tables into a single Topic; and your Kafka message format should be Avro to use the schema registry.
Okay, that's great. All right, moving on: we're trying to find the optimal LOB size? I guess you mentioned that before, that was large object size, right?
Correct, yeah.
What size LOB would you recommend?
We recommend not exceeding a Message size of 10 MB. If you have many LOB columns, then make sure the combined record size of all the columns isn't more than 10 MB, because if we go to Advanced here, Message Maximum Size Bytes is 10 MB. It's always recommended to go with 10 MB, because R&D have tested and found that 10 MB is the best maximum message size.
Okay, well that's great to understand where it comes from, and that 10 MB is the size that's optimal.
Yeah.
Okay, next question: how to generate an SSL certificate for Confluent Kafka hosted in Azure as a SaaS service? Do you understand that?
Yeah. There are 3 different SSL certificates available for Confluent Kafka SaaS. I can share the links for these too. This one is about connecting self-managed Kafka clients to Confluent Cloud; we can download the certificates and use them.
Okay.
And they can follow this process.
This is another one?
Yeah, like root certificates and everything, intermediate certificates; so they can just download and they can use this.
Awesome. Okay, moving on. Next question: can Replicate use Kafka as a source?
No, Replicate only supports Kafka as a Target Endpoint.
Okay, that's fair. Next question: how can we troubleshoot a ‘reached timeout while waiting for acks’ error?
Like I showed you previously: by giving parameters like RdKafkaTopicProperties or RdKafkaProperties.
Right. In the Advanced settings, Internal Parameters.
And based on the type of error we are getting; we can go to that link, and we can check under the description, and we can take the property; and we can add it in the internal parameter; and we can try that, we can test it out adjusting that setting, yeah.
Okay, next question: what are your recommendations between Gzip versus Snappy?
Gzip compresses more than Snappy. So, Gzip will use the least space and network bandwidth while transferring the data; but while consuming the data, the Consumer will need to spend more resources to decompress it. Gzip is the best compression method if the Consumer has enough CPU to decompress the data quickly, whereas Snappy gives moderate compression and requires fewer resources to consume the data.
So, Gzip if you have the CPU power; otherwise Snappy.
Yeah.
Got it. All right. Can upgrading the Kafka Library affect an existing Kafka Target?
You mean when upgrading Replicate, right? If the question is about the Kafka library, then we cannot manually upgrade librdkafka; it comes with Replicate, so when we upgrade Replicate itself, the librdkafka version will also change.
All right, so whenever Replicate is upgraded, the Kafka library is upgraded automatically, because the librdkafka connector is built-in?
Correct, yeah.
Okay, next question: why does Kafka make a good Target Endpoint with Qlik Replicate?
Replicate has the capability to provide Messages to the Kafka server from various sources, including RDBMS and other available sources within Replicate. This process operates in near real time, ensuring that as soon as the data is loaded in the source, it is promptly replicated to Kafka.
So, everything is real time in Kafka as well; so they work well together?
It's near real time.
Yeah, near real time, of course.
Okay, last question today: any tips for troubleshooting issues setting up the SSL certificate? Someone's getting an ‘ssl.ca.location failed’ error.
Okay, it completely depends on how the SSL is configured on Kafka Cluster side.
Okay.
Replicate is just using the librdkafka client. If the Kafka server is configured with an SSL certificate, then we can provide all the required certificates in a PEM file. To troubleshoot an SSL issue, there are OpenSSL commands for that; I can also provide a couple of links.
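Editor's sketch of a quick sanity check on a PEM file's contents before pointing the endpoint at it. It only counts the BEGIN/END CERTIFICATE markers (useful when root and intermediate certificates are combined in one file); it does not validate the certificates themselves, and `count_pem_certificates` is a hypothetical helper.

```python
BEGIN_MARKER = "-----BEGIN CERTIFICATE-----"
END_MARKER = "-----END CERTIFICATE-----"


def count_pem_certificates(text: str) -> int:
    """Count certificate blocks in PEM text; raise if markers are unbalanced."""
    begins = text.count(BEGIN_MARKER)
    ends = text.count(END_MARKER)
    if begins != ends:
        raise ValueError(f"unbalanced PEM markers: {begins} BEGIN vs {ends} END")
    return begins


# Placeholder bodies; a real file would contain base64-encoded certificate data.
chain = (
    f"{BEGIN_MARKER}\nAAAA\n{END_MARKER}\n"
    f"{BEGIN_MARKER}\nBBBB\n{END_MARKER}\n"
)
print(count_pem_certificates(chain))
```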
Okay, that's great. All right, Swathi; that's all the time we have for questions, but thank you so much for all the time you put in to prepare this demo and to really help explain and clarify how Kafka works with Qlik Replicate. I appreciate it.
Thank you, Troy; and thank you everyone for giving me this opportunity to share about Kafka. If there are any questions you have related to Kafka configuration or Kafka Performance Tuning, you can reach out to us through Community and we'll answer all your questions.
Thank you everyone. We hope you enjoyed this session. And thank you to Swathi for presenting. We always appreciate getting experts like Swathi to share with us. Here's our legal disclaimer. And thank you once again. Have a great rest of your day.