STT - Monitoring & Observability In Qlik Sense on Windows

Last Update:

Mar 17, 2023 4:07:37 AM

Updated By:

Troy_Raney

Created date:

Mar 17, 2023 4:07:37 AM


Environment

  • Qlik Sense Enterprise on Windows

 

Transcript

Hello and welcome to the March edition of Techspert Talks. I'm Troy Raney, and I'll be your host. Today's session is Monitoring and Observability in Qlik Sense on Windows with our own Mario Petre. Mario, why don't you tell us a little bit about yourself?
Hi everyone. It's good to be back here with everybody on a new performance and traceability topic. I'm a Principal Technical Engineer in the Lund Qlik Support team. I've been with the company for over 8 years. I've been working with the Qlik Sense platform from the early days, and I've always liked all of these performance and scalability topics.
Yeah. We love having you share with us about all the different tips and tricks when it comes to monitoring performance of Qlik Sense. Today we're going to talk about what centralized metric collectors are available; we're going to take a look at a few of those, specifically Butler SOS. Mario is going to do a demo of that; we'll dive into another short demo of Grafana Loki; and we'll talk about what that can do for you; and we'll certainly save lots of time for questions. Now Mario, you did a great Techspert Talk about optimizing and measuring performance; how to establish a Baseline; and the metrics to pay attention to. This is an example of a multi-node Qlik Sense deployment. Could you quickly go over what all these acronyms mean?
Sure. So, real quick: these are naming conventions that we use in the Qlik Sense Client-Managed product architecture. QES is the Engine Service; QPS is the Proxy Service; QRS is the Repository Service; and QSS is the Scheduler Service. Then we have our shared storage and Central Repository Database; those can be set up in different ways in a multi-node environment, especially in large, at-scale environments; and those are the situations where centralized metric gathering becomes even more crucial.
Yeah. And really, it's just a ton of log files, so any way to orchestrate that and make it a little more manageable is certainly attractive. One of the first tools we're going to look at is Butler SOS. What does that look like, or how does that apply to Qlik Sense?
All right. So, Butler SOS comes from one of our dear luminaries at Ptarmigan Labs. It is a Node.js-based service that sits on each and every one of your Windows servers, gathers metrics and signals from Qlik Sense operations as well as the machine itself, and sends these back to a central collector, which can then feed a number of different databases. To give an example, we have log collectors installed on each machine; these then talk back to a central monitoring service, which is the actual Butler SOS service running inside Docker. This will then feed the metrics into a time-series database of your choosing. In my case, I went with InfluxDB; it can also send events to Prometheus and to New Relic; and it can send events to a message-queue broker via MQTT. That can in turn be used with automation platforms to orchestrate actions based on the signals received. So, it's a very interesting utility that opens up a whole bunch of modern techniques that are more cloud-native than on-premise.
Okay. So, basically (so I understand it), Butler is an event log handler: it takes specific metrics from Qlik Sense and sends them to other services that can filter and visualize that data for people?
That's right.
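To make that fan-out concrete for readers: Butler SOS itself is driven by a YAML config with per-destination sections. The sketch below is loose; the key names only approximate the real production.yaml schema, so check the Butler SOS documentation for the exact layout.

```yaml
# Illustrative sketch only -- key names approximate the Butler SOS
# production.yaml schema; consult the Butler SOS docs for the real one.
Butler-SOS:
  # Qlik Sense servers to poll for health metrics (hostnames assumed)
  serversToMonitor:
    servers:
      - host: sense-prod-1.domain.local
        serverName: sense-prod-1
  # UDP listener that receives log events forwarded by the Sense servers
  userEvents:
    udpServerConfig:
      serverHost: 0.0.0.0
      portUserActivityEvents: 9997   # must match the log appender config
  # Destinations: enable one or more
  influxdbConfig:
    enable: true
    host: influxdb
    port: 8086
  newRelic:
    enable: false
  mqttConfig:
    enable: false
```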
All right, and what does the architecture look like for Grafana Loki?
The other example would be Grafana Loki. Here the architecture diagram is a little bit simplified; we have your Qlik Sense site producing traces, logs, and metrics.
Okay.
The collectors in this case would be Promtail and the Grafana Agent itself, running on these machines.
Okay.
Each local collector will gather these metrics and send them to a central Loki container; this will aggregate your metrics, tag them, and label them; and at the very end, we have the visualization layer for querying these data sets, which is the Grafana tool.
Okay. So, with Grafana Loki, where would those services need to be installed?
Looking at a very simple setup: you have your central node and two consumer nodes; you would install a Promtail log monitor-
Okay.
Configured for each of the services that you want to monitor, and then all of these would be communicating to a central Grafana Loki repository.
Okay. So, that's just an extra service you install on each node that sends metrics you specify to Grafana?
That's right. And then in Butler's case, the picture is very similar, but we have a different agent installed on each of the nodes and one more layer in between. We are choosing to store our metrics in InfluxDB, and then query them with Grafana.
So, what does the demo environment look like that you're going to be showing today?
The setup on my end is very simple. It's all virtual machines. We have 3 Windows machines; one is a domain controller as well as the file server hosting the share for the Sense clusters. The Qlik Sense setup is one node for Prod and one for Development, but both are storing data in the same place; and they are both sending metrics to the same virtual machine that is running Docker. And inside Docker, we have a stack for the Butler SOS program; we have another stack running in parallel for Loki; and a couple of utilities like Portainer to manage it all in a nice and easy way.
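As a rough picture of that monitoring VM, here is a hypothetical docker-compose.yml. The image tags, volumes, and port mappings are assumptions for illustration, not taken from the demo environment.

```yaml
# Hypothetical docker-compose.yml for the monitoring VM described above;
# image names/tags and mappings are assumptions, not from the demo.
version: "3.8"
services:
  butler-sos:
    image: ptarmiganlabs/butler-sos:latest
    volumes:
      - ./butler-sos/config:/nodeapp/config   # config path assumed
    ports:
      - "9997:9997/udp"   # log events from the Sense log appenders
  influxdb:
    image: influxdb:1.8
    volumes:
      - influxdb-data:/var/lib/influxdb
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"       # Promtail clients push here
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"       # dashboard UI shown later in the demo
volumes:
  influxdb-data:
```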
Okay. So, this is your Qlik Sense server, and you've already downloaded some files from the Butler SOS setup site in GitHub. So, what should we do with those files?
We will have to place these in a specific path inside our Program Data folders, so that the log listeners can start forwarding events to our Butler SOS instance. And just as an example, let's see what one of those definitions looks like.
Okay. So, you're opening up a Config file that's telling Butler what it needs to be listening to?
Exactly. This Config file definition tells Butler how to listen for events coming from the Scheduler Service; how to differentiate between different columns within our log files; and how to prepare and send those over to Butler. The only thing that we need to customize here is the Butler remote address; in this case, Butler-SOS.domain.local; this is the virtual machine running Docker.
Okay.
And the corresponding listener port for the user events.
Great.
So, once we have this saved, there are four different files that we need to care about in the Program Data folder, corresponding to each of these services; and we'll take the Engine Service as an example. We see that along with the default files and folders, we also have the local Log Config. This one is configured for engine-specific events, but we have the same Config up here: the IP address and the listener port.
So, you need one of these set up in each service that you want to pull metrics from?
Yep, and Butler SOS at the moment works with these 4 Services as well as machine metrics.
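For orientation, here is a heavily trimmed sketch of the shape of one of these log appender files. The appender type name and conversion pattern are illustrative only; treat the XML files shipped with Butler SOS as the source of truth, since filters and patterns differ per service.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Trimmed, illustrative sketch of a Butler SOS log appender file.
       The appender type and pattern below are placeholders; use the
       files from the Butler SOS repository as-is in practice. -->
  <appender name="ButlerSOSAppender" type="log4net.Appender.UdpAppender">
    <!-- Where to send events: the Butler SOS host and its UDP port -->
    <param name="remoteAddress" value="butler-sos.domain.local" />
    <param name="remotePort" value="9997" />
    <layout type="log4net.Layout.PatternLayout">
      <!-- Defines the columns Butler SOS parses out of each event;
           the real pattern is service-specific and much longer -->
      <param name="conversionPattern" value="%message" />
    </layout>
  </appender>
  <root>
    <appender-ref ref="ButlerSOSAppender" />
  </root>
</configuration>
```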

Okay. So, can we now see how all this data gathered by Butler SOS looks in Grafana?
And here we are.
And I see in the URL the Grafana login page uses Port 3000.
That's right. Now I've done a little bit of initial setup, and the maintainer of the Butler SOS project has been kind enough to provide some templates that you can simply copy and paste.
Okay. So, over on the left, you can browse some different dashboards that are available. And are these those templates you're talking about?
Yep, that's right; and this is one that I created myself based on one of those templates from GitHub.
Okay.
And just to see how easy it is, I can simply click Import here.
Okay. So, this is GitHub where you got the templates?
And from the GitHub project, I can go into the docs, Grafana, and here we have the latest version of the dashboard.
Okay.
We can grab it in Raw format; select it all; copy it; and then bring it back into your Grafana dashboard window; paste; and load.
Ah! That is super easy. Thank you for walking us through that.
Yes, it is very easy. And select the data source which is InfluxDB, and click Import; and here we have the dashboard.
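If you would rather script this step than copy and paste, Grafana also exposes an HTTP API for dashboard imports. The hostname, credentials, and file name below are placeholders.

```bash
# Hypothetical alternative to the copy/paste flow: push the dashboard
# through Grafana's HTTP API. The API expects the raw dashboard JSON
# wrapped in an object first, e.g.:
#   { "dashboard": { ...raw dashboard json... }, "overwrite": true }
curl -s -X POST "http://butler-sos.domain.local:3000/api/dashboards/db" \
  -H "Content-Type: application/json" \
  -u admin:admin \
  -d @dashboard-payload.json
```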
Wow, that was quick. I love right off the bat that there's links to the QMC and the Hub and then you can select the time frame you want to look at.
Absolutely. So, this is one of the key strengths of Grafana and time-series based databases: the speed at which you can recall historical records is unparalleled. So, we are looking by default at the last hour, but let's say we want to bring in more data; let's take a look at 24 hours. That was it. It's all loaded.
Okay. So, we're looking at Free RAM, and CPU Load Per Server; and this is all the Main Metrics?
Yes.
What else do we have here?
We also have Applications in Memory; and you will see this going up to 1 and then perhaps zero, as I have reload jobs in the background kicking in. So, this is truly real-time data.
Okay.
You can also see a list of applications in memory.
Okay.
If we scroll down further, you can see a little bit of activity levels per node. So, active users vs total users.
Okay. So, these are open sessions.
The beauty is that, as you can see, the cursor is tracking across multiple objects, and it is very easy to zoom in on a time-frame and get more detail.
So, you just clicked and dragged and it zoomed in on both those charts simultaneously?
That's right; and not just those, but actually all the metrics and objects get refreshed based on that time range.
Okay. That's really cool. I love how visual it is.
Absolutely, absolutely; and they're all drag and drop. You have this button here that puts it in Kiosk mode, so that it can walk through a series of metrics automatically on a public display, for example, if you just want to monitor the health of your platform.
Okay. We've got Sense running; Butler is in between; it's configured to pull specific metrics and take in the health of the server; it sends all that to InfluxDB; and now the interface we're viewing it through is Grafana?
That's correct.
So, how could someone really utilize this in a good way?
One of the main benefits for me is just the visual aspect of it.
Yeah.
Being able to take a bird's-eye view over longer periods of time, take a user report (of slowness, for example), and start to put it in context. Is it slow now? Am I seeing increased CPU usage, or a RAM spike? When this user complained, you can zoom in on their specific timestamp range and start to understand what else was going on across the environment; and of course, always contrast this with historical data. All of these questions are much easier to answer once you can actually browse the data, instead of having to perform this analysis from scratch from raw logs every time you want to answer these types of questions.
Definitely, and nobody wants to comb through raw log files. So, this is a huge advantage over that.
That's right. To give you a simple example, just 1 user logging in, opening 1 app, and making 1 selection produces something on the order of 200 or so log events.
Can we demonstrate that? Like, open up an app and see what happens?
Absolutely. That is a perfect transition into the next system, which is more oriented to raw log file parsing and then aggregating that, again through Grafana as a visual medium, but with Loki as the database storage instead of InfluxDB this time.
Okay. Great! Let's take a look at Grafana for Loki then.
And here we go. A very similar Welcome setup. Now here we have Loki as a data source.
Okay.
And it is listening for the local Loki service instead of Butler on port 3100.
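For anyone provisioning this data source from a file instead of the UI, a minimal sketch in Grafana's standard data source provisioning format might look like this; the hostname is assumed.

```yaml
# Hedged sketch: a Grafana data source provisioning file, e.g.
# /etc/grafana/provisioning/datasources/loki.yaml (hostname assumed).
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100   # the local Loki service on port 3100
    isDefault: true
```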
Okay. How is this set up?
We're on the same server, Win-server-1. This is the central node for the Prod environment.
Okay.
We also have to tell Loki where to look, and what to look for in terms of logs.
And how do we do that?
You have to download the Promtail executable.
Okay. That's the Promtail exe. This is all available on Grafana's GitHub? And I see the Config file there. What needs to be changed in it?
The default ports can conflict with the ones Qlik Sense uses.
Okay. How do we find a port that isn't already in use by Qlik Sense?
Choose one that starts with 91, and if you want to make sure that you don't have any conflicts, you can always check our documentation.
Okay. We’re on Help.Qlik.com looking at ports.
Behind the Broker Service is where most of these will be living.
Okay. And just pick one that's not listed here, like you said, starting with 9100. So, what else in the Config should we be aware of?
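Before the scrape definitions come up, note that the port change just described lands in Promtail's server section. A hedged sketch follows; the 9180 port and the hostname are examples only, so pick any port not listed for your Sense version on help.qlik.com.

```yaml
# Hedged sketch of the Promtail sections discussed above; hostname and
# the 91xx port are examples, not mandates.
server:
  http_listen_port: 9180   # moved off the default to avoid Sense conflicts
  grpc_listen_port: 0      # 0 disables the gRPC listener

positions:
  filename: C:/promtail/positions.yaml   # where Promtail checkpoints read offsets

clients:
  - url: http://butler-sos.domain.local:3100/loki/api/v1/push   # central Loki
```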
So, we have some definitions: we have a job name that we can use to separate between different jobs while consulting the data in Grafana; I am using the Static Config and defining some targets. The target is localhost; that means we are monitoring a local path (this works with UNC paths as well). I am calling it Engine Logs just to differentiate; and the engine log folder happens to contain files with both .log and .txt extensions.
Okay.
With this path definition, we are making sure that we go into the Engine Service log folder, but we expect to see subfolders there. Promtail uses ** for folder-name wildcards, and * for file-name wildcards.
Great. But basically, this is the Config file for all the logs you want to pull, as opposed to Butler, where you had a unique Config for each service. This one does it for everything?
That's correct. So, this will generate a lot more data, which is why the Static configuration for the scraper has to be fleshed out a little bit more to 1) avoid overwhelming the centralized logging destination; and 2) make sure that we only capture those events that are important to us.
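Putting those pieces together, the scrape job just described might look roughly like the sketch below; the path assumes the default Qlik Sense log layout.

```yaml
# Sketch of the engine-logs scrape job described above. ** matches
# subfolders, * matches file names, and brace alternation covers both
# the .log and .txt files found in the Engine log folder.
scrape_configs:
  - job_name: qlik_sense
    static_configs:
      - targets:
          - localhost          # read from local disk; UNC paths work too
        labels:
          job: engine_logs     # separates jobs when querying in Grafana
          host: win-server-1   # the custom tag queried in the demo
          __path__: C:/ProgramData/Qlik/Sense/Log/Engine/**/*.{log,txt}
```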
Right. Let's start exploring some logs using Loki.
Yep. So, we have a seemingly very basic interface, and it will already have selected your default data source; but it can be very powerful.
So, how should you start to build a query?
Let's select the label; these are labels defined within the Loki configuration, which you'll be able to see online. One of the custom tags that we had is the host. So, we can say: show me everything that is coming from Win-Server-1.
All right.
Within the selected time range, run that query; and here we can already see a lot of logs from that time. We can expand the time period to something like 6 hours; it will automatically refresh and show us a little bit more; and as we scroll further down, we will see the actual log contents coming from the server.
Okay.
And some useful information about what's being presented: about 1.5 MB worth of logs; looking at 6 hours, the received logs cover 2 hours 43 minutes of this time range; and we're limiting it to 1,000 records on screen, although you can have many more. And if you remember, Troy, we have only really configured Loki to monitor Engine and Proxy logs.
Okay. So, that's user level activity on an app.
This is still Grafana; so drag to zoom, and the results automatically refresh down here below. But let's say you want to start looking for a specific session ID. We'll copy this session ID; go Line Contains, Case Insensitive; paste in our value; run the query again; and let Loki do the magic.
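The queries being built on screen are LogQL. A hedged reconstruction, with placeholder IDs, might look like this:

```logql
# Everything from one host (the label comes from the Promtail config):
{host="win-server-1"}

# Narrow to one session: "Line contains, case insensitive" compiles to a
# regex line filter. The session ID below is a made-up placeholder.
{host="win-server-1"} |~ `(?i)5e3a9c1f-0000-0000-0000-000000000000`

# The same pattern tracks a single app: paste its app ID from the QMC.
{job="engine_logs"} |~ `(?i)<app-id-from-the-qmc>`
```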
Wow. So, it picks up everything from that session and lays out a timeline of logs for it. What if you wanted to track the activity of a specific app?
Yes, we can go ahead and do that. We'll jump into our QMC; we can come to our apps list and grab the app ID for the What's New app. We want all jobs that come from the Engine, regardless of their server origin; so we search for the What's New app ID and paste it into our trace. Here we go; we see some historical activity for that app ID as well, and we can see all of the events for this particular app ID from different users. How you choose to visualize this is up to you, but the possibilities are endless, especially with proper scraping of the log files; and we'll have some starter templates ready for you when this content goes up online.
What is the tool that you're using to replicate user activity?
The utility is called Selenium IDE. Selenium is a toolkit for web developers that allows us to record user actions and play them back against the same application in different browsers to test compatibility, functionality, response times, etc.
Wow.
So, I have this project defined already on another app; and this is basically going to run through these actions for me.
Okay.
So, it will open an app; it will click around a couple of things.
So, what did you do to create this that we're looking at?
That is the best part of Selenium IDE! It is basically just a browser extension. So, once you click on the browser extension, you can start recording. Let me bring that over here. You will be clicking around, recording, opening things; you can see it's adding actions on the left here; and it will replay those in the exact same order.
That's really cool. So, you basically record a session when you're in an app; and doing things; and then you could play that back to simulate user activity?
Absolutely. This can be used just to test across browser versions, etc., but I just wanted to be able to replay user activity, so that we can look at it in Loki and see those wonderful events streaming in as we interact with applications.
That's great. Before we get to Q&A, is there anything else you wanted to highlight?
Oh, yes! The other major integration in Butler SOS (backtracking a little bit).
Okay.
I've shown InfluxDB. Another output mechanism is New Relic.
Right.
So, this is my New Relic account.
Okay. So, this is an option: instead of InfluxDB, you could send it to something like this?
Yep. And New Relic is an online service. They give you 100 GB of storage for free and up to one user account. If you want to grow your team beyond that, it's a paid service; but for testing things out, it's perfectly fine. And we can take a look at the Query Your Data section; depending on your configuration, Butler will be uploading a whole bunch of interesting metrics.
I see a bunch of Qlik Sense stuff there.
Yep. So, all of these are coming from Butler SOS. We can take a look at the CPU total, for example; we can group them here by server name; and we get a nice little graph with CPU usage. We can use a different metric, for example Memory Allocated for these servers, and we can order it by different tags; these tags come from the configuration file for Butler. It also has dashboarding capabilities; you can see some examples here. If I go to my dashboards, it's looking at the Engine: selections, Free Memory, Total Users, and the number of Engine calls over time.
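The queries behind this view are NRQL. A sketch of the general shape follows; the metric and attribute names are placeholders, since the real ones depend on your Butler SOS version and configuration, so browse the Query Your Data metric list and substitute accordingly.

```sql
-- Hedged NRQL sketches; qs_cpuTotal, qs_memAllocated, and serverName
-- are placeholder names, not confirmed Butler SOS metric names.
FROM Metric SELECT average(qs_cpuTotal) FACET serverName TIMESERIES SINCE 1 hour ago
FROM Metric SELECT average(qs_memAllocated) FACET serverName TIMESERIES SINCE 1 hour ago
```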
And I love how it still has that simultaneous cursor tracking, where if you select something in one graph, it mirrors it in the other.
Exactly. So, we can take a look here at what might have happened during this spike. We can select the spike, and it will show us the corresponding metrics everywhere else.
Awesome. Okay, Mario, now I think we need to move over to Q&A.
Right on.
Okay. Now it's time for Q&A. Please submit your questions in the Q&A panel on the left-hand side of your ON24 platform. We've already got a lot of questions coming in, Mario; so, I'll just read them off from the top. First question: where can I find information about data model and app optimization?
All right. Well actually, we did a Tech Support Thursday on this very topic. I would encourage you to take a look at that first with our good friend Dan Pilla.
Yeah. Optimizing Qlik Sense apps with App Analyzer. That was a great one. All right, we'll definitely include the link to that.
The same techniques would apply for Qlik Sense Client-Managed, and QlikView for that matter.
Okay. Next question: are there any security issues with these kinds of metrics being publicly available?
Right. So, this is an excellent question, because we need to define the visibility for these details. I mentioned a couple of things about the New Relic aspect, because out of everything that I've shown today, New Relic is the only external destination for these metrics. Everything else stays in-house; whether that is your actual physical on-premise hardware or your virtual private cloud somewhere like AWS or GCP, those are still your servers, under your control, with your network security and policies in place. For the New Relic part, the Butler maintainer has included a setting where you can choose to obfuscate the user-specific details in the session data that goes out to New Relic. For everything else, you can choose how to prepare or obfuscate that data. For example, with Loki you could have a configuration parameter that looks at the User ID column of each log line and replaces it with "Redacted." But of course, the more you redact out of these traces, the less useful they become over time. So, I would say that as long as you keep this data internal, there is no risk; the same risk would apply to sharing logs with anybody else. We trade those logs with Qlik Support constantly, and we don't have any security breaches; but use your best judgment. All of these tools allow you a lot of freedom when designing where you store the data, how it is retained, and how it is maintained over time; so asking these questions before you deploy is the best way to design the solution properly. Excellent question. And of course, you wouldn't want to send these details over an open network pipe across network segments; anybody could be listening in, extract sensitive information out of these logs, and try to correlate and put together a mapping of what your users are doing. You are always in control of where this data flows and how it flows there; the different components that I've shown are absolutely customizable.
But again, all these metrics coming from the Qlik Sense server are not actual data from your apps. It's just how they perform: the number of users accessing them, the amount of RAM they take up, that kind of thing, right?
That's right; that's right.
No actual data from the app is accessed.
But we know that different countries have different data protection laws in place; and Germany (for example) wouldn't want any kind of personally identifiable information in web logs.
Yeah. There are User IDs potentially.
Exactly. So, you can choose what to obfuscate; you can choose what to drop out of those events; and you can sanitize that data before it goes to the "online data store." And that online data store (like I said) could be sitting right next to your Windows Server, or it could be sitting somewhere up in the cloud; New Relic, for example, is entirely an online service, and it is pre-configured so that you can choose to obfuscate those details.
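One hedged way to realize that redaction idea on the Loki side is a Promtail "replace" pipeline stage, which masks the matched capture group before a line ever leaves your network. The regex and paths below are illustrative; adjust them to your actual log format.

```yaml
# Illustrative sketch: mask user IDs in Proxy logs before shipping.
# The 'UserId=' pattern is a placeholder for your real log column.
scrape_configs:
  - job_name: proxy_logs
    pipeline_stages:
      - replace:
          expression: 'UserId=(\S+)'   # the capture group gets replaced
          replace: 'REDACTED'
    static_configs:
      - targets:
          - localhost
        labels:
          job: proxy_logs
          __path__: C:/ProgramData/Qlik/Sense/Log/Proxy/**/*.{log,txt}
```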
Okay. Let's move on to the next question: what metrics should we be monitoring to avoid performance stoppage or critical issues?
All of the signals from the Engine and the Proxy would come to mind first; and we have details on these in Qlik Help, so that you know how they are printed in our logs. I would keep track of, first and foremost, machine-level metrics. The Operations Monitor, for example, looks at service-specific resource usage, but not the whole machine, while these metric collectors look at the full resource consumption. Take a look at CPU and RAM usage; your page file usage; your network stack usage; disk usage in general; and then, moving on to the Qlik Sense realm, how many users are logging in. So: sessions created; sessions closed; actions performed; memory bytes added. For the Engine part: how many concurrent users; how many applications open per node. All of these metrics exist already in our log files; you just have to find a way to expose them in these aggregators.
Okay. Great. Kind of related to that last question, is it possible to do something like this with the Operations Monitor App?
That's an excellent question. Compared to the tools that I've shown today, and also in the past with establishing baselines, etc., the Operations Monitor, and the monitoring apps in general, are useful from a Qlik Sense operations standpoint. That's why it's called the Operations Monitor: it tells you what actions have been occurring on the system, and how many errors; but when it comes to troubleshooting the cause of those errors, they are not really useful. You still have to go hunt for which log files contain those signals and try to correlate them in aggregate to understand what happened between all of the services. As a health-check dashboard for operations over time on Qlik Sense, and as an adoption dashboard, I think they're an excellent start; but for the more technically oriented of us who have to care for very large environments, they lose their effectiveness very quickly.
Okay. Next question is: where can we find information about setting up and configuring this?
All right. Everything that I've shown, including how to get started with the basics, is available on the respective websites for Butler SOS, Grafana Loki, and New Relic. Additionally, I will have some instructions on how to set up Portainer as a separate stack, and how to install Webmin.
Great. Yeah. We'll definitely include all of those links along with this recording.
Yep. We'll make sure to drop in all of the configuration examples as they were defined in this exercise, because I think that between the environment diagrams we showed at the beginning of the webinar and all the IP addresses and server names that I've used, it should be fairly easy to follow what goes where.
Great. Next question: are these tools publicly available and free?
Yes, and yes. And that goes across the board, with limits of course. In the case of New Relic, there is a storage limit after which they'll want you to pay, and a user limit after which they'll want you to pay. When it comes to Grafana and Butler, all of these are self-hosted of course, so you would need to pay the machine cost of running them; but otherwise, no. And most of these (going beyond public and free) are also free as in free software: open source.
That's right! Next question: will these kinds of tools work with QlikView as well?
Absolutely. In the case of Grafana Loki, for example, it's just a general log parser; you can have it parse all kinds of logs. In fact, the QlikView Server logs specifically, and the Distribution Service logs, are quite a bit simpler than some of the logs that we have in Qlik Sense; parsing those wouldn't be that big of a deal.
Right. Last question: is it possible to get a metric on response times from apps? We're trying to investigate if a performance issue is Network or Qlik Sense related.
Absolutely, you can do that from a couple of places; and in fact, these different places will help you understand whether this is a network or a compute resource problem. The first thing that I would do is turn the Repository trace logs up to debug. This will give you Repository performance data on how long calls take to execute on the Repository Service. The Repository is in charge of taking in and responding to all of the governance requests that come from Qlik Sense: are you authorized to log in? Are you authorized to see this app? Are you authorized to see this object? And as you hear, as I go deeper and deeper into these actions, we do authorization calculations based on your security rules for pretty much everything that an end user interacts with in Qlik Sense.
Okay.
The Repository trace debug logs give us that performance view. You also want to look at the QIX Engine performance logs; those will tell you the performance of calculating objects on screen for users. Marrying these two details together can, at the very least, give you a sense of how fast Qlik Sense itself is at responding to these requests. You can then compare the experience on the server vs the experience from a client machine. Once you understand how these metrics behave on both, you can start ruling the network aspect in or out. So: Proxy logs, Repository logs, Engine logs; and we have information on all of these online covering which columns matter, what they mean, and how to start building intelligence on top of them.
Great! Mario, thank you very much for this. I'm sure it'll help a lot of people. It's great to see what's out there, and available especially when it comes to trying to get a grasp on all the different metrics that are coming from a multi-node installation.
All right. As a primer, I think it was very fun, but it's only an introduction to what's out there; the tools available go way beyond what we showed today. I do want to encourage Qlik Sense administrators to go out there and explore; see what works for their organizations; and let's have a continued, vibrant discussion about this topic on Community. Thank you all very much.
Okay, great. Thank you everyone. We hope you enjoyed this session, and thank you to Mario for presenting. We always appreciate getting experts like Mario to share with us. Here is our legal disclaimer. And thank you once again; have a great rest of your day.
