Hello and welcome to the July edition of Techspert Talks. I'm Troy Raney and I'll be your host for today's session. Today's presentation is Log Analysis App for Troubleshooting Qlik Sense Enterprise with Qlik's own Øystein Kolsrud. Øystein, why don't you tell us a little bit about yourself?
Yeah, I am a software architect in R&D in a group called Customer First, and we engage primarily in investigation of customer cases that fall somewhere between R&D and Support. I spend a lot of my time navigating and analyzing logs from Qlik Sense. I quickly realized when I started this work that working with the text files is not a very good way to get an overview of what is happening. So, I started implementing a Qlik Sense app to see what I could do with analyzing logs that way, and that has been a valuable tool for me when understanding what is happening in customer installations of Qlik Sense.
Yeah. So, today we're going to talk about what that Log Analysis App is; how to deploy it; how to navigate it; and Øystein is going to take us through a little bit of a troubleshooting demonstration of how it works. Now, Øystein, how do you start investigating a reported issue?
Yeah, when we engage with customers, logs are the primary source of information for understanding what is happening in their system. And from the QMC it is possible to download a zip file with a set of logs from the most important Services in our system.
Okay, can you show us how to collect those log files?
Absolutely. So, I'm now in the QMC of a small test installation that I use myself. At the bottom here to the left, there is an entry called the Log Collector, and if I click on this one, I get the option to state which date interval to collect the logs from, and some options for what to include. You typically enter a case number or just some description; let's call it "my log set" for now.
Okay, in a set period of time?
Exactly. So, 2 days of logs, and then I just click Collect and Export, and the QMC will then give me this information about the log collection flow. And once it's done, I'll get a little zip file that I can then save to wherever I want; and as a customer you would typically send this to Support.
Okay, and is that collecting logs from all the different Qlik Sense Services? Or what logs is that collecting?
It collects logs not from every single Service but the main ones.
Speaking of those Services, could you quickly outline for us what those main Services are?
Absolutely. So, Qlik Sense has a number of different Services responsible for many different things (like license management and all this), but there are a couple of them that provide the core functionality of the product. We have the Engine Service, which is sort of the spider in the web for everything. This is the Service that takes care of all the calculations, all the heavy computations involved in displaying visualizations, opening apps and all this.
All right.
That is really the main Service of the system, I would say. The others are there to support the Engine Service in one way or another.
Okay.
And the second service, the Repository Service, is the one that is sort of a gateway between all the other Services and the Database. One of the reasons why this is very important is that this is the Service that takes care of all the access privilege calculations.
Right. Okay.
So, when a user comes into the system, there will be a request to the Repository to calculate which streams this user is allowed to see (that's a typical thing) or which apps the user is allowed to see.
Okay, I expect those to be busy. What are the others?
Yeah, the Scheduling Service takes care of scheduling of tasks. So, this is the one that will look at the timing for different tasks and understand when they should be executed; for instance for app reloads, or data extractions, and things like that. And the Proxy Service is the one that takes care of the authentication part. So, this is the first Service that you hit, but it's typically not a big problem. It's the one that will guide you to the right node and ensure that you are logged in correctly and all this.
And the Repository, that's where all of our configuration details and other settings are stored, right?
Exactly. When you go to the QMC, you are primarily communicating with the Repository Service, because that's the one that handles all the underlying information about your installation.
Perfect, once the Log Collector is run, and it's built that little zip file, how do you plug those into the app?
Yeah. So, the next step then is to extract this file that we just downloaded. So, I downloaded this Log Collector zip; I'll extract it to a folder, then take the template app, make a copy of it, and rename it "MyLogs." I open this in Qlik Sense Desktop; and this is now just an empty app with the visualizations in place, but no real data. In the data load editor, we will find a folder connection that I point to the log set that I have just extracted.
I see the MyLog set folder there.
Exactly. So, here's the MyLog set folder; I click Save here. So, when I run Load Data, it will find all the files that are typically useful when analyzing logs.
That's great because the script is already written, the sheets are already prepped to load that exact data set from Qlik Sense, so you can jump into analyzing.
Exactly, it's based on the premise that we get the logs in this format, so that there's a hierarchy of log files to traverse through.
That's all very simple. I appreciate you walking us through that. So, with this app, what are some types of scenarios that are good opportunities to use the Log Analysis App?
Of course. When we work with a customer, there's typically some particular behavior that we need to investigate. One of the first things to do is to try to figure out which Service is the one that is causing problems for this particular user.
Okay. So, how would someone get started?
Yeah, if we know that their system is prone to very poor response times from time to time, then one of the first things I do is to go to a specific sheet that I have here which is called Threads per Node. And what this gives is an indication of the number of worker threads in the Repository processing the incoming HTTP requests to the system.
Okay. So, if someone logs in, the Repository is calculating what to show that user based on what rights they have, and those worker threads are doing those calculations. Is that how that works?
Yes. Exactly. So, whenever I go to QMC or the Hub there's typically some form of request going to the Repository. For instance, if I go to the hub, there will be requests for listing the streams I can see and the apps I can see in those different streams; and all of those will be handled by the Repository as HTTP requests.
Okay. So, can you explain what we're looking at here?
Yes, in the top left corner, we see Available Threads; and this shows the number of those worker threads on the different nodes. Since this is a multi-node system, it will show the threads available on the different Repositories in the system.
And I imagine big spikes could be a bad thing?
Big spikes are definitely a bad thing; and I'll tell you why, and I'll focus in on one of the particular ones that has a spike here. Because what we see here is a behavior where the Repository is behaving completely normally for quite a while; we have this flat line here, and that means the Repository has its threads and they are coping with the incoming requests just fine. But when the line starts to grow here, that means that the Repository needs to create new threads in order to keep up. A graph growing like this indicates that we have a problem managing the load in the system, basically.
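The overload pattern described here (a flat line of worker threads that suddenly starts climbing) is easy to check for programmatically. Here is a minimal Python sketch; the sample data, window size, and growth threshold are illustrative assumptions, not logic from the actual app:

```python
# Flag nodes whose Repository worker-thread count keeps growing, which
# (as described above) suggests the node cannot keep up with incoming
# HTTP requests. Data and threshold are illustrative.

def flag_thread_growth(samples, window=5, threshold=2.0):
    """samples: list of (timestamp, thread_count) in time order.
    Flags a problem when every count in the last `window` samples
    exceeds `threshold` times the baseline (first sample)."""
    if len(samples) < window:
        return False
    baseline = samples[0][1]
    recent = [count for _, count in samples[-window:]]
    return min(recent) > baseline * threshold

# A flat line (healthy) versus a steadily growing line (overloaded):
healthy = [(t, 20) for t in range(10)]
overloaded = [(t, 20 + t * 15) for t in range(10)]

print(flag_thread_growth(healthy))     # flat line: threads are coping
print(flag_thread_growth(overloaded))  # sustained growth: investigate the Repository
```

The real charts in the app show this visually, of course; the sketch only captures the rule of thumb that sustained growth, not a brief spike, is the warning sign.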
Yeah, and I imagine that that cliff, where it spikes up and then drops right down, that looks to me like it crashed.
Yes. Exactly, and this is probably a case of the customer restarting the system; sometimes customers will need to do this in a heavily loaded system to stabilize it again. So, when we get a case like this, we immediately know that the Repository Service is where we want to start investigating; and understanding why they run into this and how we can avoid it.
Right. So, big spikes in the threads per node, that's telling you the Repository Service is having trouble meeting the demand that is put on it?
Exactly.
What would be the next step? How would you dig a little deeper to try and figure out what's creating that demand?
Yeah. So, the next step is to understand what kind of calculations the Repository is doing at this point. To do that, you need to activate a certain type of logging in the system. So, if I go here to the Repository settings, to the logging, there is a log level here called Performance Log Level. If you set this to Debug, then you will get information about those HTTP requests that are being processed by the Repository.
So, it will have more verbosity and give you more detail about the communication that's happening there?
Exactly. So, when that is on, I will typically go to another sheet here which is called Transactions. This gives me an overview of all these requests that are being processed. There's a lot of information here, but in this table, I can see those particular requests. So, here I can see for instance there's a request for on-demand app generation; I can see some basic operations here; we have some shared content access and such; but what I will typically start with are these pie charts here, because they show me the Duration and the Counts for the different operations. In particular, this one gives me the overall time spent on different endpoints in the system.
I see a pretty big piece of the pie there.
Exactly, in some cases this will stand out and it's quite obvious what to start focusing on, because what we see here is an endpoint called QRS/app/hublist/full.
And what does that do?
Yeah, this is actually used together with the second largest one, stream/full; these are the endpoints being called when you go to the hub.
And is that the call that's trying to show you a list of apps you have available to you?
Yes, precisely. What happens when you go to the hub is: there is first a request for collecting all the streams you can see; and once we have that list, for each stream we will send one of these hublist/full requests to understand which apps this user is allowed to see in those streams.
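The duration pie chart referred to here is, conceptually, a sum of request durations grouped by endpoint. A short Python sketch of that aggregation; the entry format and numbers are made up for illustration, and the real Repository performance logs have their own layout:

```python
from collections import defaultdict

# Sum time spent per endpoint across performance-log entries, the way
# the Duration pie chart does. Entry format and values are illustrative.
entries = [
    {"endpoint": "qrs/app/hublist/full", "duration_ms": 12000},
    {"endpoint": "qrs/stream/full",      "duration_ms": 4000},
    {"endpoint": "qrs/app/hublist/full", "duration_ms": 9000},
    {"endpoint": "qrs/about",            "duration_ms": 50},
]

totals = defaultdict(int)
for e in entries:
    totals[e["endpoint"]] += e["duration_ms"]

# The endpoint with the biggest slice is the first place to look.
worst = max(totals, key=totals.get)
print(worst, totals[worst])
```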
So, if you see a big Rest Call here, and this is a pie chart for duration, so this is taking a long time –
Yep.
What would the user be experiencing, just like a spinning wheel?
Yes. When the user goes to the hub, it will take a long, long time for the Hub to open; this typically translates into very poor user performance.
I can imagine why this would raise an issue that people might want to get investigated.
Yep. An interesting thing is that if you're already in an app and working in it, you might not even notice this, because then you are not interacting that much with the Repository; you are working with the Engine Service.
Okay, is there a step further to try and identify exactly what we can do about that?
Yeah. So, now that we know that this is probably related to customers accessing the Hub, I'll typically have a more detailed look at these REST calls. So, if I select only these hublist/full calls and go back to my table here, now I have this filter, so that I see only these ones; then I can for instance order them by time spent on the individual calls, if I sort on duration here.
Whooh! 38 minutes!
Exactly. So, this is a REST call that took 38 minutes to compute, and that is probably due to the system being in a very high load environment. So, it took a long time to get to a point where it could actually produce a result. But I can also get some insight into, for instance, how big the response being sent is.
What's the unit there? Is that bytes?
Yes, or characters actually. These are characters in that response.
So, is that like 55 megabytes?
Yes, exactly, the size of the response here for this request is about 55 megabytes, which is pretty big; and this indicates that there are quite a few apps in this system.
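The filtering and sorting steps just described can be mimicked in a few lines. Field names and values here are illustrative assumptions (the response size column counts characters, roughly one byte each for ASCII payloads):

```python
# Keep only the hublist/full calls, order them by duration, and convert
# the response size from characters to approximate megabytes.
calls = [
    {"endpoint": "qrs/app/hublist/full", "duration_s": 2280, "response_chars": 55_000_000},
    {"endpoint": "qrs/app/hublist/full", "duration_s": 35,   "response_chars": 1_200_000},
    {"endpoint": "qrs/stream/full",      "duration_s": 12,   "response_chars": 90_000},
]

hublist = [c for c in calls if c["endpoint"] == "qrs/app/hublist/full"]
hublist.sort(key=lambda c: c["duration_s"], reverse=True)

top = hublist[0]
print(top["duration_s"] // 60, "minutes")              # the 38-minute call
print(top["response_chars"] / 1_000_000, "MB approx")  # ~55 MB at one byte per character
```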
So, that would be something to highlight to the customer and get clarification on, to ask: Okay, how many apps is this user expected to see? And is that a reasonable number?
Yes, exactly, and I can also see here by looking at the endpoint itself (and this requires some understanding of these endpoints, I guess) that when we go to the hub there are two different types of requests: we have these where we can see a stream ID, which are the requests for which apps the user can see in a particular stream. But then we also have this other one, myspace, which is a list of all the apps that are available in the MyWork stream for this user.
Right, and I see a lot of those in the top of the list are in that MyWork stream.
Exactly. So, that indicates that some users have a very high number of possibly unpublished apps; but we can also see there's a stream here that has a very big response. So, it might be that they have one stream that contains a very high number of apps, for instance, which could also become a problem.
Well, this has been really insightful. We've already learned how to identify some pain points that could be causing some user disruption; and this was an issue related to the Repository Service. You mentioned the Engine Service being heavily involved. Do you have any examples for that?
Absolutely. If the Repository is doing well, then the next thing would be to look at the Engine. I have another sheet here which is called Engine Load; and this gives me an overview of CPU and Memory Usage, and some health checks for the Engine.
On that graph on the top, are all those lines different nodes?
Yes, exactly. So, this again is a multi-node system. The CPU looks quite okay; there's a spike around 12 pm there in the middle, but in general it looks quite okay. Spikes are not necessarily a problem. If, however, I had seen that a certain node is high on CPU for an extended period, then that is certainly something to pick up here.
Okay.
But what is interesting here is spikes in memory. On the right here, we can see the percentage of memory utilized by the Engine Service on the different nodes, and here we can certainly see some spikes that look very problematic.
Yeah, a few drop-offs as well I see?
Exactly. So, we have this one on this node that is short-lived, which seems okay; it's a very high spike, so it would be interesting to look at that one; but this other node goes up to 100 here. This is not a healthy environment; and as you can see there's a very sharp drop, which is probably due to an Engine crash from going out of memory.
Okay, so what would be your next step in identifying what happened here?
Yeah, now we need to understand what the Engine is doing at this point. So, if I go back to the QMC and show the Engine logging, there is a setting here: QIX Performance Log Level. This is typically set to Off or Warning; but if you set this to Info, then you will get some metrics on all the endpoints being processed by the Engine.
Okay, is this something you'd recommend setting to Info just when you're troubleshooting an issue, or have it on all the time?
It is quite verbose. So, in general we would recommend people to set it to Warning if they don't actually need it, and then just turn it on for debugging purposes, or to get a view of what is happening and get some health checks from that.
Okay. So, we set that to QIX Performance Log Level Info, collected the logs, and loaded them into the app, what can we learn?
Exactly. Then we can go to another sheet; there are these QIX Performance sheets, where the first one gives some statistics and the second one gives better insight into the details of the different operations. So, here I now see the different endpoints processed by the Engine; and I can get some information about when each started, when it ended, and its duration; here we can see, for instance, a half-an-hour operation.
Okay. So, that Method column, is that all the processes that the Engine is performing?
Exactly. So, we can see that we have these DoReload requests here, and these are operations that are called when an app is reloaded. We have some pretty big ones here.
Yeah, using a lot of RAM, especially those top two.
Exactly. So, this one takes 400 GB of memory, which is quite big, but sometimes reloading apps can be very memory intensive, so it's not necessarily a problem. If I were to investigate this now, I would check: does this time range correspond with when I'm seeing those spikes that made the Engine crash? Because then we need to think about how to handle this in a different way. Perhaps put it on a separate node, or reduce it, or find a way to reschedule it.
So, it gets the resources that it needs.
Yeah, exactly. So, that's the DoReload part. There are some other requests that are very interesting as well; and that's these GetLayout requests. So, GetLayout is sort of the primary endpoint in the Engine that is used for performing calculations when you open an app and start looking at the sheet. For every visualization in that sheet, there will be one of these GetLayout calls; and that is basically the call that asks the Engine to perform the computations associated with this visualization.
So, that's actually generating all the calculations necessary to show the user a chart that's on a sheet?
Exactly, exactly. In this case, we can see that one of these calls is also very, very big in terms of memory.
And time. I see 4 minutes, and is that 123 GB?
Gigabytes, exactly. So, this is one visualization, in one app, on one sheet, consuming almost 124 GB of memory.
Yeah, that's pretty big.
That is pretty big. And it's something that would probably appear as spikes in memory. First, I would check: does this correlate with what I'm seeing in those memory spikes on the Engine? And if it does, I would certainly go back to the customer and say: Hey, you have some visualization here that you need to investigate and take care of, because this is making your system unstable.
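The correlation check described here — do the heaviest Engine operations line up with the observed memory spikes? — can be sketched as a simple filter. Timestamps, method names, and the spike window below are illustrative, not taken from real QIX performance logs:

```python
from datetime import datetime

# Find Engine operations with high peak memory that fall inside the
# window where the memory spikes were observed. All values illustrative.
ops = [
    {"method": "DoReload",  "peak_gb": 400, "start": "2023-06-23T02:00"},
    {"method": "GetLayout", "peak_gb": 124, "start": "2023-06-23T12:05"},
    {"method": "GetLayout", "peak_gb": 1,   "start": "2023-06-23T12:06"},
]

spike_start = datetime.fromisoformat("2023-06-23T12:00")
spike_end = datetime.fromisoformat("2023-06-23T12:30")

suspects = [o for o in ops
            if o["peak_gb"] > 50
            and spike_start <= datetime.fromisoformat(o["start"]) <= spike_end]
print([o["method"] for o in suspects])  # the operation worth reporting back
```

Here the big DoReload falls outside the spike window, so only the GetLayout call remains a suspect — mirroring the reasoning in the conversation above.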
Okay. So, then that you would basically ask the customer to take a closer look at that visualization; see if there's any calculations in there that might be improved; perhaps they have an IF statement that's –
They can be all kinds of things.
Yeah.
Exactly. What calculation is being performed, that's not something I can see from the logs; but there is certainly some operation there that is very heavy. And sometimes it can be simple things like: they have some dynamically computed dimensions that just need to be moved into the script, rather than performing the calculation during the rendering of the visualization. So, sometimes there are tricks to handle this that make for a much better user experience, and also a much healthier system.
This has been really insightful, but these have been some very specific examples. What if you're taking a general look at the logs to get an idea of what's happening?
Yeah, most of the time I will also end up analyzing the system in a more general way. And in this sheet here, Log Details, I have sort of the raw overview of the different log entries in the system. These different bar charts at the top just show the number of messages per time unit (Day, right now). And then the stacking differs between the different bar charts.
Can we take a closer look at each of these?
Yeah, I can. So, let's start with the right one here. Here I can see the messages stacked per Service. So, if I only want to look at Engine messages, for instance, I can just click on Engine, and I have my log set filtered on that, and I can then investigate my logs for the Engine only, which is a great way to start. Another typical thing I would look at is this one on the left. This shows me all the messages stacked by type of message.
Right.
So, if I just want to look at Errors and Warnings, I can use the filter on that here and see if there's anything interesting in the log files based on that. And I can also now see which Services are actually producing errors. Is it the Engine or the Repository or where are these coming from? Most of the messages here are coming from the Proxy; and that's typically login issues. But if I go to just Errors, I can see that we have the Repository and Engine dominating here. So, then I can do my filtering in that way and see if I can find a pattern to it.
And what is that chart in the center?
The last one is stacked based on nodes. If I now select Errors and I see only one node here, then I know that it's that node that is suffering and something is happening there; and then I might want to focus on that node to see what is happening. Sometimes I can get insights already here when I see spikes of activity. So, here I can see spikes of Errors late in the evening, and then I might want to go and look at what happened around then.
Okay.
For instance, here we have a node that is having a lot of error messages around 10 o'clock.
Okay.
So, what happened at that hour? Then I can go back to my sheet and see that all these messages even came within the same second. So, something happened here at that time that triggered a lot of error messages. It may or may not be of interest to what I'm looking at, but sometimes things will just stand out in these bar charts, which allows me to quickly focus on the areas where things of interest are happening.
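Those stacked bar charts boil down to counting log messages per service, severity, and node. A minimal sketch with made-up entries:

```python
from collections import Counter

# Count messages per service for a given severity, the way the stacked
# bar charts do. Entries are illustrative.
messages = [
    ("Proxy",      "Warning"), ("Proxy",     "Warning"),
    ("Repository", "Error"),   ("Engine",    "Error"),
    ("Repository", "Error"),   ("Scheduler", "Info"),
]

errors_by_service = Counter(svc for svc, level in messages if level == "Error")
print(errors_by_service.most_common())  # which services dominate the errors
```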
If an admin sees an issue like that with some error messages keep popping up, and they're not quite sure what that error message indicates, do you have any suggestions on how they can investigate further?
Well, there's always Community, of course. Many of these messages will have questions on Community. So, if you search there you might find interesting information; but if you see a message appearing regularly and it correlates with some bad behavior in the system otherwise, then of course contacting Qlik Support to get information about it is also a way to go.
So, where can someone find this?
Yes, download the app. It's available in a Support article, along with some information about how to install it and how to use it; all the information needed to get started will be found there. But it does rely on you having Qlik Sense Desktop to run it, because the way we traverse all these log files is something that is only available in Desktop.
Okay, now it's time for Q&A. Please submit your questions to the Q&A panel on the left side of your On24 console. Øystein, why don't we take them from the top? A few have already come in. First question: does this app include a timeline where we're able to get more descriptive error messages?
Well, the log sets are definitely organized by time. So, there are certainly sheets here that give you information about how the messages are organized in time, and I typically look at these bar charts here. These are all drill-down dimensions. If I want to focus here on the 23rd of June, I can click on this, and then I can look at the different hours there, and I can narrow in on the case where I see the information. So, for instance, if I want to look at 8 am, then I can look at it per minute. Here's something interesting: suddenly a spike of activity in a certain minute; that's the kind of outlier that sometimes catches your eye.
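The day-to-hour-to-minute drill-down amounts to re-bucketing the same timestamps at a finer granularity. A sketch with illustrative timestamps:

```python
from collections import Counter
from datetime import datetime

# Re-bucket message timestamps by day, then by minute, to surface spikes.
# Timestamps are illustrative.
timestamps = [
    "2023-06-23T08:14:02", "2023-06-23T08:14:02", "2023-06-23T08:14:03",
    "2023-06-23T09:30:11", "2023-06-24T10:00:00",
]

def bucket(ts_list, fmt):
    """Count messages per time bucket, where fmt sets the granularity."""
    return Counter(datetime.fromisoformat(t).strftime(fmt) for t in ts_list)

print(bucket(timestamps, "%Y-%m-%d"))        # per day
print(bucket(timestamps, "%Y-%m-%d %H:%M"))  # per minute: the 08:14 spike stands out
```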
Right. Great. Next question: is this app supported?
Not officially; but it is something we use in our daily work. So, if there are requests or ideas, then we are happy to take those and take them into consideration.
All right, next question: what monitoring or tenant analytics apps are available for Qlik Sense Cloud?
Well the analysis of logs and such is very different in cloud. In cloud, it’s Qlik that is hosting it all; but there are some apps available for doing analysis like app usage and things like that. So, there are some options there, but it's different from the on-prem version.
Exactly, the whole thing with cloud is not to have to worry about the back-end architecture. So -
Yeah, exactly. That is maintained by us. So, it shifts the responsibility back to us.
Great, thanks. All right, next question, kind of related back to the Log Collector: does the Log Collector collect log files from all of the different nodes in a multi-node environment?
Yes, that's one of the benefits of having it as a feature; it's a one-click, catch-all kind of thing. It will pick up information from all nodes; it will certainly collect that.
Okay, next question: where can we get the latest version of this app?
Well, it's published through a Support article, available here on Community. So, if you search for Qlik Sense Log Analysis, this is one of the hits you'll find; and from there you can download the latest version. There are also some instructions on how to get started and how to use it. So, you should find the information you need there.
Great, and we'll definitely include the link along with the recording of this video once it's all posted. Okay, next question: are there any NPrinting Report Templates available for this app?
Not that I'm aware of. The way we use this at Qlik is primarily to troubleshoot, so we really don't have the need for that type of report; but I could imagine people who would like to use this more for system monitoring purposes might need that. It's not something that I've experimented with or seen, though.
Okay. So, we don't provide them, but it's certainly possible. All right, next question: where can we download Qlik Sense Desktop?
Qlik Sense Desktop is available through the standard Qlik Product Download site, along with the other products that you can download. So, there is this article again on Community, and that gives you some guidance on how to do that, even a little movie that you can look at to see how it works; but there's basically a download page that you can go to, and then you can filter on the different versions of the product there. If I go for Qlik Sense and see if I can find it: under products, Sense Desktop is there, and you can download it there.
Great. Nice seeing that demonstrated. Okay, next question: does the Log Collector pull from the archived logs folder?
Yes, it does; it pulls both from the Archive log folders and from the local log folders on the different nodes as well. That's sometimes important, because sometimes when a customer reproduces a problem, they want to grab the logs immediately and send them to us; but the different local nodes will contain the logs that have not been rotated to the archive logs folder yet. We try to capture everything and collect it, so we get the full picture.
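The collect-everything behavior — archived logs plus the not-yet-rotated local logs — can be illustrated with a simple recursive traversal. The folder names here are assumptions for the sketch, not the installation's real layout:

```python
from pathlib import Path
import tempfile

# Build a toy log tree with an archive folder and a local (unrotated)
# folder, then collect every log file from both so nothing recent is missed.
root = Path(tempfile.mkdtemp())
(root / "ArchivedLogs").mkdir()
(root / "Local").mkdir()
(root / "ArchivedLogs" / "Repository_old.txt").write_text("archived entry")
(root / "Local" / "Repository_current.txt").write_text("fresh entry")

collected = sorted(p.name for p in root.rglob("*.txt"))
print(collected)  # both the archived and the fresh file are picked up
```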
Perfect. We're rolling along. Next question here: is there any kind of logging to track user activity within an app, like what sheet, chart, and field they click on?
Not directly, but what I typically do to track that type of information is use the QIX Performance logs. In this set of data, I can track the different endpoints being used by a certain user. So, this is my test environment, with only me on it, but if I look at, for instance, active users here, I should be able to find... here's me. So, I can click on me here, and then I can see what kind of activity I've been doing through this time frame. So, when a customer complains that they were experiencing, I don't know, slow access to a certain sheet, then if we get this type of logs, I can quickly go to this type of data, filter on that user, and try to identify exactly when this was happening, and why that user was experiencing poor performance.
That's really cool how you can see actually all the processes that are triggered when user activity starts. And it's also ordered by time, and you can filter it by app.
And right now I've sorted by Memory Usage, but I can certainly click here and sort by time, and then I would typically see the flow, as you would see it in the websocket traffic when you're working with Qlik Sense.
That's great. All right, next question: do you have any tips for managing or reducing the size of log files when not actively investigating something?
Not really, but I guess the recommendation we have is to not have Debug logging on if you don't need it, because those logs are usually very verbose and something that can put a strain on the environment in some cases. But when you run the Log Collector, you will always specify a time frame for which logs to collect, so you will get a subset of the information, which is typically much more manageable. So, you've already done that type of filtering by time before you get to this app to start analyzing them.
Great, next question is a little bit lengthy: can this app help us measure performance between the nodes? We're trying to make sense of the following log message: QRS Security Management retrieved privileges in 263 milliseconds from the QRS. Is that the time it takes for the QRS to talk to the Database?
Yeah, it's kind of specific to a particular log message, but I do think I know what this message is about. If I remember correctly, this is a message that is produced by the Engine; and I think it's done in connection with the Engine requesting information from the Repository about what type of objects a user is allowed to see and use. In particular, I think it's associated with these specific endpoints here, like CreateSessionObject, because when this is called, the Engine needs to ask whether it's okay for this user to create the type of object they are trying to create. But it's not directly a request going to the Database. What this is, is a request to the Repository; and then the Repository will have to evaluate the Security Rules associated with this. So, that is what this message says: that it retrieved the privileges from the QRS; the QRS might in turn access the Database to get the information it needs. So, one of those calls might trigger multiple Database requests.
Right, well that's interesting. And I'd recommend, if there's ever an error message or a log message you're not quite sure what it means, just Google it or post the question in a Qlik Community forum. There are a lot of people out there who might have a better understanding of what those messages indicate. Okay, moving on, next question: can this app help identify the source of a performance problem inside of an app, like what object could be causing an issue?
Most definitely! And that's actually one of the things we use it for the most. The sheet I'm looking at here is probably the main go-to page for doing that, because here I can see the activity in the Engine. And I can use that to identify, for instance, if there's a particular visualization that consumes a large amount of RAM; or if there is a specific time when some of these actions took a much longer time than expected. So, I use this a lot for troubleshooting exactly that type of performance problem.
Great. Last couple of questions: do you expect this app to be delivered as part of the product in the future?
There are no plans for that at the moment. It's something we are developing on the side, and it's not following the development flow of the product in general. So, for the foreseeable future at least, this will be available on Community, like we have demoed today.
Sure. I always like to highlight the Ideation page on Qlik Community. If anybody has any product improvement ideas, Product Management definitely takes a look at that input from our customers and partners. So, take a look at Ideation on Community if you'd like to post a suggestion. And our last question today: can you import this app into Qlik Sense Enterprise on Windows, or do you have to use Qlik Sense Desktop?
You can certainly import the app; the problem is loading the data. If you have loaded data and want to share it with someone else, for instance, then you can certainly upload it. But the problem is that this app needs access to the folder where you have all the log files. So, you need to set up a folder connection. If you do that in the on-prem version of Qlik Sense, then you probably need to move this log set to, for instance, the share folder of the Qlik Sense installation; which you can do of course, but it might be an extra step that is sort of tricky for many users. You'll need Admin access in a different way then.
Great. Øystein, thank you so much for showing us this app, and ultimately for developing it in the first place, and for making it available and demonstrating it for us. I think it'll be very valuable for people in the future. So, hopefully it can get some more attention and wider use.
Yeah, thank you. I hope it's something that people can benefit from. I've certainly benefited a lot from it. It takes some time to get familiarized with the logs and how to interpret them, but it's certainly worth the effort if this is something you need to do from time to time.
Okay, great. Thank you everyone. We hope you enjoyed this session. And thank you to Øystein for presenting. We always appreciate getting experts like Øystein to share with us. Here's our legal disclaimer. And thank you once again. Have a great rest of your day.