
Troubleshooting High RAM Usage in Qlik Sense

Troy_Raney
Digital Support

Last Update:

Dec 1, 2021 5:12:13 AM

Updated By:

Troy_Raney

Created date:

Dec 1, 2021 5:12:13 AM

[Video embedded here]

Transcript

Thanks for joining today's enablement session. Today's topic is how to troubleshoot high RAM usage. My name is Lisa and I work in Qlik Support as a senior technical support engineer. I provide support for several Qlik products, including Qlik Sense, QlikView, and NPrinting, and now we are also starting to provide troubleshooting for SaaS.

Today's agenda: we will talk about the RAM configuration settings in Qlik Sense, then the troubleshooting steps for a high RAM usage issue, and I will give you a typical example that came from one of our clients. The last part is a Q&A session, so if you have any questions you can ask me then.

Before we do any troubleshooting, we need to understand the RAM configuration settings in Qlik Sense. The first thing is the system requirements: these are the basic requirements for CPU cores and memory. Normally all Qlik Sense servers meet these requirements, but keep in mind that as the workload increases, the hardware capacity also needs to be reviewed over time.

The RAM settings in Qlik Sense can be found under Engine in the QMC. There are three relevant configuration settings. One is the app cache time, which tells the engine how long you want it to keep app data in the cache; by default this is 28 minutes, which means that if no one accesses the same app within that time, the app data is deallocated from the cache. The minimum memory usage is 70 percent by default and the maximum is 90 percent. Those are the defaults; I do see many customers change them based on their business requirements. Keep in mind that reaching the maximum memory usage occasionally is normal, but if the server sits at that level consistently, we do need to find out what is causing the high usage all the time, and whether it is healthy or unhealthy for Qlik Sense.

What type of data is in memory? We can break it into two categories: necessary memory and cache. In necessary memory we keep data such as the metadata of the dashboards, the aggregated data set of the app, and each user's session data. In the cache we calculate and keep the aggregated results, so all calculation results end up there. Normally we try to keep this cached data as long as we can, because then the engine can avoid repeating the same calculations, and that is how we keep performance high all the time.

Many customers ask exactly how much RAM an app consumes. The QVF is a compressed file, but once the data is loaded into the engine, the RAM consumed is several times the QVF file size, and for each user session a further share of calculation data is added on top. I will give you an example later on.
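To make these numbers more concrete, here is a minimal Python sketch that translates the working set percentages into absolute thresholds and roughs out an app's RAM footprint. The multipliers are placeholders, not official Qlik figures: the ratio between QVF size on disk and RAM consumption, and the per-user overhead, vary by data model and usage, so treat them as assumptions to calibrate against your own measurements.

```python
# Rough, back-of-the-envelope sizing helper.
# The multipliers below are assumptions/placeholders, not official Qlik figures:
# calibrate them against what you observe in Task Manager / Operations Monitor.

def working_set_bytes(total_ram_gb: float, low_pct: float = 70.0, high_pct: float = 90.0):
    """Translate the QMC working set percentages into absolute thresholds."""
    total = total_ram_gb * 1024**3
    return total * low_pct / 100, total * high_pct / 100

def estimate_app_footprint_gb(qvf_size_gb: float,
                              concurrent_users: int,
                              expansion_factor: float = 5.0,    # assumed QVF-to-RAM expansion
                              per_user_fraction: float = 0.10):  # assumed per-user session/calc overhead
    """Base data set in RAM plus an increment per concurrent user session."""
    base = qvf_size_gb * expansion_factor
    return base * (1 + per_user_fraction * concurrent_users)

if __name__ == "__main__":
    low, high = working_set_bytes(total_ram_gb=256)
    print(f"Working set low/high: {low / 1024**3:.0f} GB / {high / 1024**3:.0f} GB")
    print(f"Estimated footprint for a 2 GB QVF with 50 users: "
          f"{estimate_app_footprint_gb(2, 50):.1f} GB")
```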
So what is the process for the engine to load data into RAM? When a user accesses the hub and opens an app, the data set of the app is loaded into RAM; this is the necessary data. The app metadata is loaded into RAM as well, and the same goes for the user's session data. The user then opens a dashboard and makes some selections, and the engine performs the calculations based on those selections and the front-end expressions and renders the charts. To do that, the engine accesses the app metadata and the data model, does the calculation, keeps the resulting values in the cache, and finally shows the dashboard. That is how the engine loads data into the cache whenever a user accesses an app.

Here is an example. We have one user accessing app A: the darker green is the app data set, the green is the session data, and the light green is the calculation data. Now user two accesses the same app A. You can see the app data set stays the same, so the app size is not doubled in the cache; the session data increases slightly, and the calculation data also increases. How much depends on whether the two users make the same selections: if they make quite different selections, the calculation data increases a lot. With user three you see the same thing: user three made a lot of different selections, so the calculation data grows again. So keep in mind that if more users access the same app but make different selections, and their activity is quite different, the calculation data increases significantly. This is one thing to remember whenever you investigate this type of issue: the majority of the high RAM usage can come from this calculation data, especially if the issue only happens with a specific app. That is one way we can analyze it quite quickly.

There are many different tools we can use to monitor RAM usage. Typically we use Windows Resource Monitor to understand which process consumes the RAM and how much. We also want to see the CPU, because the CPU tells us whether the engine is working hard on calculations or task reloads. Then we have the Telemetry Dashboard, which is another app that gives you a lot of different visualizations about your apps and their objects and how much they consume. And we have the Operations Monitor app, which I use very often; it is basically a must-use tool when I analyze this type of issue. It gives a lot of different visualizations from different angles: we can analyze the concurrent users, concurrent apps, and concurrent tasks, and what exactly was happening at a specific date and time period, so we can really use it to find the story. There are also a lot of other resources; I have listed the main ones, including a PDF that gives very detailed information about exactly how the engine works with memory. Once you get my PDF you will have all these links, and it is also available on Community.
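Windows Resource Monitor and Task Manager are the tools mentioned above; if you prefer to capture the same numbers from a script, for example to log them over time, a sketch along these lines works. It assumes psutil is installed and that the Qlik Sense processes carry recognizable names such as Engine.exe; verify the actual process names on your own node.

```python
# Snapshot RAM usage of Qlik-related processes, similar to what Task Manager
# and Resource Monitor show. Requires: pip install psutil
import psutil

# Process name hints are assumptions based on a typical Qlik Sense install;
# check the actual names on your node.
QLIK_PROCESS_HINTS = ("engine", "repository", "proxy", "scheduler", "printing", "qlik")

def qlik_process_memory():
    rows = []
    for proc in psutil.process_iter(attrs=["pid", "name", "memory_info"]):
        name = (proc.info["name"] or "").lower()
        mem = proc.info["memory_info"]
        if mem is None:
            continue  # access denied for this process; skip it
        if any(hint in name for hint in QLIK_PROCESS_HINTS):
            rows.append((proc.info["name"], proc.info["pid"], mem.rss / 1024**3))
    return sorted(rows, key=lambda r: r[2], reverse=True)

if __name__ == "__main__":
    vm = psutil.virtual_memory()
    print(f"Server RAM in use: {vm.percent:.0f}% of {vm.total / 1024**3:.0f} GB")
    for name, pid, gb in qlik_process_memory():
        print(f"{name:<35} pid={pid:<8} {gb:6.2f} GB")
```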
So how exactly do we troubleshoot high RAM usage? When customers complain about this, the symptoms are normally of three types. First, they may complain about consistently high RAM usage, no matter when; it can last for a few hours or even longer, which is a strange situation. Another symptom is around apps: whenever a user opens a certain app, or a specific sheet, it takes quite a long time. That looks simple, but it is actually not, because we need to find out how many tasks are reloading at that time and how many users are accessing the hub at that time, so again there is a bigger story behind it. The third one is about tasks: when RAM usage is high, tasks can take much longer to complete, and we need to find out why.

There are many possible root causes. It could be related to the system architecture design, which then needs to be reviewed. The Qlik Sense server may not be a dedicated server, so we need to find out what other software consumes resources on the same machine. It could be caused by too many concurrent tasks, or by tasks that reload live data every five or ten minutes. It may be caused by very poorly designed apps, or by an app data set that is quite large without the customer being aware of it, because over time the data set just keeps growing. There could also be many concurrent users accessing the same app. And there is the possibility that the hardware needs to be upgraded: I do see many customers who are not aware that their hardware needs an upgrade because the workload keeps increasing, even though it is quite obvious from the data I collect that the CPU and RAM of the Qlik Sense server can no longer accommodate the workload.

To find the real root cause we need to do a lot of data collection. First we look at the overall architecture from a high level. We need to know the Qlik Sense version and the system diagram: how many Qlik Sense servers they have and where the servers are located, in the cloud, on virtual machines, or maybe physical boxes in the client's office. Are they using a network load balancer device in front of Qlik Sense? And what services are enabled on each node? I will open the Windows services on each Qlik Sense node to see exactly which services are enabled, such as the repository service, proxy service, engine service, scheduler service, and printing service, because from the enabled services we know what that node is designed to do. We also note the Windows environment: how many cores and how much total RAM.
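As part of that data collection, a quick way to record which Qlik services exist and are running on each node is a script like the following. It is a sketch that assumes a Windows node with psutil installed and that the Qlik Sense service display names contain the word "Qlik" (they normally do); adjust the filter if your services are named differently.

```python
# List Windows services on a node and pick out the Qlik Sense ones.
# Requires Windows and: pip install psutil
import psutil

def qlik_services():
    for svc in psutil.win_service_iter():
        display = svc.display_name()
        # Filtering on "Qlik" in the display name is an assumption; adjust if needed.
        if "qlik" in display.lower():
            yield display, svc.status(), svc.start_type()

if __name__ == "__main__":
    for display, status, start_type in sorted(qlik_services()):
        print(f"{display:<45} status={status:<10} start={start_type}")
```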
Next we look at the Qlik Sense settings in the QMC. I always do a web conference to confirm these, because there is a lot of information and I have found that the client often cannot provide all of it at once, so it is always good to look at their QMC together. Is it a single node or multi-node site? What is each node's purpose? If a node is for Production, there is no development work on it, so no ad hoc app reloads from developers. If it is for Development, there are lots of app reloads, and we find out how many developers they have and whether end users work on the same environment; normally a development environment still has quite a lot of end users with hub access. Both means all types of users can access the same node. Then, how many of each service type are enabled on each node: do they have one dedicated scheduler node, or more, so the workload gets balanced? How many consumer nodes are there, and is a node used only for authentication or for hub operation as well? This is all very important.

Then the logging settings, for both the engine and the repository. I do see customers who are not even aware that they have debug logging turned on, and then the performance is quite low, so it is quite necessary to check the logging settings too. Then the RAM settings, as mentioned earlier.

Which virtual proxy is used? The virtual proxy defines the authentication method and, very importantly, lets us identify the user group: for instance developers, end users, external users, internal users. A virtual proxy might be only for developers, or only for external users, so when the problem happens we know which group of users was accessing the Qlik Sense server. That is very good information for us. Then the load balancing: do they have an external load balancer to redirect the workload to different Qlik Sense servers, and have they set up Qlik Sense load balancing? I have found customers who have multiple nodes but never actually set up the load balancing, so all the workload still lands on one particular node. That is good to know.

We also need to find out what exactly consumed the high RAM, and whether the Qlik Sense server is a dedicated server. Normally the way I find out is to open the Windows services and sort by the running status to see what software is running at that time. I quite often see that the customer also has a SQL Server database, or other Qlik products like QlikView and NPrinting, installed and running on the same machine, which is not something we want. I also open Task Manager to confirm exactly how much RAM the Qlik Sense processes consume; that is important to know as well. And the antivirus: is any antivirus software actively scanning?

Then we need to locate the real issue the customer is complaining about. Is there any change on the Qlik Sense server: have they applied any Windows updates, done a Qlik Sense upgrade, or applied any Qlik Sense patch? We always need to take notes on what exactly happened on the server. Then the pattern of the issue. This, I have found, is quite difficult for the customer to pin down, but we always encourage them to collect as much information as they can, and in the end we still use the Operations Monitor app to find more; we have a very powerful tool to work it all out, but this is first-hand information we expect the customer to know and provide as well. Why does the issue only happen on specific nodes, or on all nodes? Does it happen at a specific time, at peak time, non-peak time, or a particular time of day? I see some customers who have allocated a large number of concurrent reload tasks in the morning, which is also the peak time for business users accessing the hub; immediately we know this is a bad design, and the first step is for them to review whether those tasks really need to reload in the morning. Based on the time pattern we then go to the Operations Monitor app to find more of the story, and we also get information from the logs.
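To illustrate the "too many reloads clustered at peak hours" pattern, here is a small sketch that buckets reload task start times per hour and flags the busy hours. The task list, peak window, and warning threshold are made up for the example; in practice you would export the schedule from the QMC or the Operations Monitor app and pick thresholds that fit your environment.

```python
# Count how many reload tasks start in each hour of the day and flag the
# hours that overlap the business peak. The schedule below is hypothetical.
from collections import Counter

# (task name, scheduled start hour in 24h format) -- example data only
TASK_SCHEDULE = [
    ("Sales daily", 8), ("Finance daily", 8), ("HR daily", 8),
    ("Inventory", 9), ("Sales EU", 9), ("Sales US", 9), ("Marketing", 9),
    ("Logistics", 13), ("Ops monitor", 23),
]

BUSINESS_PEAK_HOURS = range(8, 18)   # assumption: hub usage peaks 08:00-18:00
CONCURRENCY_WARNING = 3              # assumption: flag hours with 3+ reload starts

def busy_hours(schedule):
    per_hour = Counter(hour for _, hour in schedule)
    for hour, count in sorted(per_hour.items()):
        peak = "PEAK" if hour in BUSINESS_PEAK_HOURS else "off-peak"
        flag = "  <-- consider moving some reloads" if (
            count >= CONCURRENCY_WARNING and hour in BUSINESS_PEAK_HOURS) else ""
        print(f"{hour:02d}:00  {count} reload(s) starting  [{peak}]{flag}")

if __name__ == "__main__":
    busy_hours(TASK_SCHEDULE)
```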
Why does the issue happen when opening a specific app? Then we need to find out the data size of that app, how big it is, and how many concurrent users open it. Does it happen when a specific task reloads? Then we also need to find out how many other tasks are running at the same time and how many users are online accessing apps, because all of these facts affect performance. Again it is important to know the user group, which, as I mentioned, we can quite easily find out from the virtual proxy. Then the data source type and location: this is related to task reloads, because sometimes the network connection also adds time to the reload, which keeps the task running longer, the engine stays busy, and the RAM does not get released. All of these things we need to keep in mind.

With all this information we need to understand the totals: how many users in total and how many concurrent users at peak time, or at the specific time when the issue happens; how many tasks in total and how many concurrent tasks when the problem happens; and the total and concurrent apps when the issue happened. The reason I look at these total numbers when I open the QMC is that I want to understand the workload. When the workload keeps increasing, even if the issue has only just started happening, it could happen more often later on, and if we do not fix it, it will get worse. So again, it could be that the hardware needs an upgrade or the architecture needs a review. Also get the exact time frame. This is very important: when customers report the issue they often do not mention the time frame, and without it we cannot look at the logs, so we do not know what exactly happened at that time. The time frame is the indication for us when we look at the logs and the Operations Monitor app.

So now I have collected all this data. Everything I have mentioned is what I collect whenever I get this type of issue, because if I miss something the investigation can go down the wrong path. That is why I put a lot of slides into the data collection; it is the first step to make sure we are on the right track.

Now I have listed a few scenarios where this high RAM usage happens. In the first scenario, the customer complains that high RAM usage happens when they access a specific app, but it may not be about the app at all; there can be a couple of root causes behind it. First we find out whether this happens at any time or at a specific time, so we know the time pattern. We find out whether any other Windows software is running, whether any other task reloads are running and how many, and whether there are any quite large tasks, so we look at the task reload duration, the maximum duration, and the average duration; all of this can be found in the Operations Monitor app. Then how many users access the same app, which we can also find in the Operations Monitor app, the app data size, and the user type, whether internal, external, public, or all, because the user type tells us the number of users; and then, based on that, we work out the total concurrent users.

One root cause is that a single node is overloaded by app access and task reloads at the same time: the app is quite large, a lot of end users access it, and at the same time many tasks are also scheduled to reload. The suggestion would then be to spread the task reloads out to non-peak hours.
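The Operations Monitor app reports concurrent users directly, but if all you have is raw session start and end times (for example from proxy session logs), peak concurrency can be derived with a simple sweep over the events. The input format and timestamps below are invented for the example.

```python
# Derive peak concurrent sessions from (start, end) pairs.
# The session list is hypothetical example data.
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"
SESSIONS = [
    ("2021-11-15 08:05", "2021-11-15 09:10"),
    ("2021-11-15 08:30", "2021-11-15 10:00"),
    ("2021-11-15 08:45", "2021-11-15 09:00"),
    ("2021-11-15 13:00", "2021-11-15 13:40"),
]

def peak_concurrency(sessions):
    events = []
    for start, end in sessions:
        events.append((datetime.strptime(start, FMT), +1))
        events.append((datetime.strptime(end, FMT), -1))
    # Process ends before starts at the same timestamp so a back-to-back
    # session is not double counted.
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    peak_time = None
    for when, delta in events:
        current += delta
        if current > peak:
            peak, peak_time = current, when
    return peak, peak_time

if __name__ == "__main__":
    peak, when = peak_concurrency(SESSIONS)
    print(f"Peak concurrent sessions: {peak} at {when}")
```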
Recently I got a case with exactly this issue: they had a very large number of reload tasks scheduled in a narrow window, a very busy task schedule right in the peak hours, so the first step is that they have to review the task schedule. If a single node has lots of end users accessing a quite large app, many other apps being accessed at the same time, and a lot of task reloads as well, we can suggest that they either add a scheduler node or possibly more consumer nodes. We can also suggest setting up Qlik Sense load balancing across all the consumer nodes. And we should find out whether there is development work happening on the same node; if so, maybe they can have a dedicated node for developers to work on. In the QMC you can see the reload tasks and the number of concurrent tasks. As an example, this is a multi-node environment, quite a basic one: two consumer nodes, two scheduler nodes, a dedicated database, and a dedicated share folder. And this is the Qlik Sense load balancing; as I said, I often see that customers have multiple consumer nodes but did not set up load balancing, which means all the users are actually still accessing one node only.

Another reason can be the app design: a quite poor data model design. We always suggest minimizing the front-end expressions, because these are calculated by the engine when the user accesses the app and it renders the charts on the dashboard, and that definitely slows things down. As much as possible, finish those calculations during the data load: if the expression calculations can be defined in the data load script, that helps the app performance. Also keep the data set reasonable for the target users. I do see customers with a very large app that contains data for every department in the company, finance, HR, and all the others, accumulated over the years, so the app holds every type of data in the company and just keeps growing without the customer being aware of it. So I ask the question: do users from one department normally access the other departments' data, and are they even allowed to? The answer is definitely no. So I suggest they separate the large app into smaller apps, for example one for finance, one for HR, and so on. That gives a very good improvement, and the customer was quite happy with it.

We also need to educate the customer on how to design the app, and we have very good information and tools on how to optimize apps. I already mentioned the Telemetry Dashboard, and you can find a lot of information about it on Community. Here I also want to bring up a document from Qlik Help that can help you a lot. It gives a lot of information about how to optimize your data model performance and your sheet performance: for instance, in the data model, how you should handle synthetic keys, whether they should be removed or are still necessary to keep, what type of data model to use, snowflake or star schema, and how you should segment your QVD files. For the sheets it also lists many functions: which functions can help you improve performance and which ones you may want to minimize. I strongly recommend reviewing it; it is very good information.
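The actual change happens in the app's load script and expressions, but the underlying idea, doing a calculation once at load time instead of once per user interaction, can be illustrated with a small Python/pandas analogy. This is only an analogy under assumed data, not Qlik script, and the timings will differ per machine.

```python
# Analogy only: "front-end expression" = aggregate on every interaction,
# "load script"  = aggregate once at load time and reuse the result.
# Requires: pip install pandas
import time
import pandas as pd

df = pd.DataFrame({
    "department": ["Finance", "HR", "Sales"] * 200_000,
    "amount": range(600_000),
})

def front_end_style(interactions):
    # Re-aggregate the full fact table for every user interaction.
    return [df.groupby("department")["amount"].sum() for _ in interactions]

def load_script_style(interactions):
    # Aggregate once up front, then just reuse the small result table.
    pre_aggregated = df.groupby("department")["amount"].sum()
    return [pre_aggregated for _ in interactions]

if __name__ == "__main__":
    interactions = range(20)
    for fn in (front_end_style, load_script_style):
        t0 = time.perf_counter()
        fn(interactions)
        print(f"{fn.__name__}: {time.perf_counter() - t0:.2f} s")
```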
Scenario two is about task reloads: the issue happens when tasks reload. The root cause can be a limited number of CPU cores combined with a large number of concurrent tasks. One option is to increase the number of CPU cores; with more cores we can process more tasks. Another is to reduce the concurrent tasks: spread the task reloads out to different times and leave some time gaps between the concurrent task reloads, because the engine needs a little bit of time to release the task data from memory. These changes are normally very helpful. The second reason can be a limited number of scheduler nodes, for instance only one single node doing both hub operation and task reloads; as the users, tasks, and apps keep increasing, the workload increases, and then we definitely have to review the Qlik Sense servers and the design and see whether a scheduler node needs to be added, which definitely helps. Also try to minimize live data loads. Of course there are business requirements, but it is worth asking whether the reload really has to run every five minutes, and reviewing the number of tasks; basically, balance it against the hardware resources.

The third scenario is about increasing workload, which I have mentioned a few times. The customer is not aware that their hardware has reached its limit and cannot accommodate the increased workload, both app access and task reloads. The best approach is to get help from a partner to review the system architecture and see whether the current architecture can meet the requirements, and if not, to increase the consumer or scheduler nodes, or maybe move to a separate, dedicated database server; it may also simply be time to upgrade the hardware.

There are still many other scenarios, I have only mentioned the typical ones, but scenario four is an example I wanted to walk through. Some customers complain that Qlik Sense is not releasing memory. In this example, and I am sorry the picture is not very clear, the RAM usage consistently stayed at a high percentage, and they wanted to know why the memory was not released. From the Operations Monitor app we could see that over a period of many hours the concurrent apps were zero or one and the concurrent users were zero, one, or two, so there was very little user access.

So the question is why, and again I needed to collect all the data. The Qlik Sense site had one central node and four rim nodes. On all four rim nodes the scheduler service was disabled, which means no task reloads happen on those rim nodes, and there was one scheduler node. The central node is actually a master scheduler, so it does not do task reloads either; all task reloads run on the dedicated scheduler node only. I also noted how much RAM each server has and checked the minimum and maximum memory usage settings.
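To show what "reduce the concurrent tasks and leave gaps between them" can look like in practice, here is a sketch that staggers reload start times under a maximum-concurrency cap and a minimum gap. The cap, gap, task names, and durations are placeholders; pick values that fit your hardware and business windows.

```python
# Stagger reload tasks so no more than MAX_PARALLEL run at once and each
# start is separated by at least MIN_GAP. Durations and limits are examples.
from datetime import datetime, timedelta
import heapq

MAX_PARALLEL = 2                     # assumption: cap on parallel reloads
MIN_GAP = timedelta(minutes=5)       # assumption: breathing room for the engine

# (task name, expected duration in minutes) -- hypothetical tasks
TASKS = [("Sales daily", 30), ("Finance daily", 20), ("HR daily", 10),
         ("Inventory", 45), ("Marketing", 15)]

def stagger(tasks, window_start):
    running = []                     # min-heap of end times for active slots
    next_allowed = window_start
    plan = []
    for name, minutes in tasks:
        start = next_allowed
        # Wait for a free slot if we are already at the concurrency cap.
        if len(running) >= MAX_PARALLEL:
            earliest_end = heapq.heappop(running)
            start = max(start, earliest_end)
        heapq.heappush(running, start + timedelta(minutes=minutes))
        plan.append((name, start))
        next_allowed = start + MIN_GAP
    return plan

if __name__ == "__main__":
    for name, start in stagger(TASKS, datetime(2021, 11, 15, 22, 0)):
        print(f"{start:%H:%M}  start reload: {name}")
```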
They told me the maximum memory usage is their default of ninety percent, and so far I do not think ninety percent is wrong, but we still needed to find out why the RAM stayed high. All the servers are in AWS, in the cloud, which is a good setup; I am not commenting on the system architecture here. Two of the rim nodes were always sitting at a high RAM usage, and those two rim nodes are among the four where the scheduler service is disabled, which means only hub access was consuming that memory. That was my first impression.

So I kept collecting data. They have four consumer nodes for hub operation, because the scheduler service is disabled on them. Two of those consumer nodes are for authorized users and the other two are for public, anonymous users, and the problem only happened on the two nodes that serve the public users. This we found through the Operations Monitor app. Then I looked at their RAM settings: the app cache time was set to two hours, which tells the engine to keep the app data in the cache for two hours; only if no user accesses the same app within that time is the app data released from memory. So we needed to confirm whether there was any two-hour window with no user accessing the app. I also noted their minimum and maximum memory usage settings.

Let me bring up the Operations Monitor app and walk through it a little, because this is the tool I always use; without it I would not be able to find all this information from the logs. I start with the CPU usage; in their case the CPU was quite low, which means there were not many calculations and not many task reloads running, and I could confirm that with the other data. Then I look at the RAM usage, the total reloads and the average reload duration (a longer duration suggests a larger reload), the maximum reload duration, the maximum concurrent users, and the maximum concurrent apps. That gives me a good overview of the recent activity. Then I can go through each app, and I may select the dates on which the customer had the issue; for instance here I picked November and selected a few days, based on the dates when the customer had the problem. I am just bringing up some data as an example.

This is exactly the slide in my presentation: you can see the RAM usage across different time periods, the concurrent reloads and concurrent users, the trend of the RAM and CPU usage, and again the total reloads. In their case it had nothing to do with tasks, so we can skip the task sheets. The sessions are what we need to focus on in this example: we need to find out how many users access the apps and how many sessions there are (one user can create several sessions), how many concurrent apps there are, and what those apps are, quite large apps or small ones. If I suspect a particular app, and in my own environment I also have users quite actively accessing apps, I want to find out which apps have the high session counts. When I select the top two apps I can clearly see many users accessing them around the clock, because this is our internal server. Based on my investigation of the customer's data, I found that among their top apps only two main apps had the highest session counts, and that there were continuous user sessions hour after hour: anonymous users were accessing the same two apps for hours at a time. In that case the app data never gets released, because the app cache time was set to two hours and there was always a session within that window.
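The reasoning above, that the app data is only released if no session touches the app within the cache window, can be checked directly against session timestamps. This sketch uses invented timestamps and the two-hour cache time from the example, and it only looks at gaps between session end times, which is a simplification; it finds the points at which the engine would be expected to deallocate the app data.

```python
# Find gaps between sessions that exceed the app cache time, i.e. the points
# at which the engine would deallocate the app data. Timestamps are examples;
# the two-hour cache time matches the scenario described above.
from datetime import datetime, timedelta

APP_CACHE_TIME = timedelta(hours=2)
FMT = "%Y-%m-%d %H:%M"

# Session end times for one app over a couple of days -- hypothetical data.
SESSION_ENDS = [
    "2021-11-20 07:55", "2021-11-20 10:30", "2021-11-20 11:10",
    "2021-11-20 23:40", "2021-11-21 08:15",   # overnight gap
]

def release_points(session_ends, cache_time=APP_CACHE_TIME):
    stamps = sorted(datetime.strptime(s, FMT) for s in session_ends)
    for previous, current in zip(stamps, stamps[1:]):
        gap = current - previous
        if gap > cache_time:
            yield previous + cache_time, gap

if __name__ == "__main__":
    for released_at, gap in release_points(SESSION_ENDS):
        print(f"App data released around {released_at:%Y-%m-%d %H:%M} "
              f"(idle gap of {gap})")
```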
From a business perspective I would say this is a good thing, because a lot of public users are accessing their apps, and it also explains why we want to keep this data in the cache: that is how the engine keeps performance high, by sharing the already calculated data among all the end users. This is a slide I captured showing those two apps, and you can see that public users are accessing them for hours on end. I checked seven days of data, and only on one Sunday was there a gap of about six or seven hours with no sessions; we also confirmed that within that gap the app data was released from memory. So from what we can see, this is quite a healthy environment and the behavior is normal. If they do not want to keep the calculated data, there is the option to clear the cache from the engine, and there is an article describing how to do that, but otherwise this is very healthy. So for high RAM usage we always need to find out what caused it and whether it is normal or a problem.

I think that is everything I wanted to deliver today. If you'd like more information, take advantage of the expertise of peers, product experts, community MVPs, and technical support engineers by asking a question in a Qlik product forum. Hiding in plain sight is the search tool; this engine allows you to search Qlik knowledge base articles, Qlik Community forums, help.qlik.com, the Qlik Gallery, multiple Qlik YouTube channels, and more, all from one place. There is also the Support space, and we recommend you subscribe to the Support Updates blog. Thanks for watching.

Comments
Manish_Kumar_
Creator

Hey @Troy_Raney
By any chance, can you provide the slides being presented in this video?
That way they can be kept as notes for future reference.

 

Sonja_Bauernfeind
Digital Support

Hello @Manish_Kumar_ 

Let me see if I can get them for you 🙂

All the best,
Sonja 

justalkak
Partner - Contributor III

Hi @Sonja_Bauernfeind ,

If possible, I would also appreciate receiving the slides. Thank you.

 

 

Sonja_Bauernfeind
Digital Support

Hello @justalkak and @Manish_Kumar_ 

I've gotten in touch with our team and we are unfortunately unable to share the slides from this presentation with you at this point.

All the best,
Sonja 
