maxim1500
Partner - Creator

Qlik Sense not using all CPU cores

Hi,

We recently started doing performance testing before our final deployment. We ran some stress tests on a large application with 300M+ rows in the fact table. It is running on a dual E5-2667 v4 3.2 GHz 8-core / 128 GB RAM server, in a dedicated VMware environment (no other VMs).

We simulated up to 50 users working concurrently on the server and realized that the hardware was not large enough for such a project: response times ran up to 6 minutes. In the following chart, users are added every 30 seconds.

[Chart: response times as users are added (50users.png)]

But when looking at the CPU, it barely reaches 65%. Some of the CPU cores are never used.

[Screenshot: per-core CPU usage (CPU.JPG)]

[Screenshot: qlik.jpg]

Memory is not an issue in this case; we barely use half of it. Disk and network don't show much usage. We feel the bottleneck is either the memory speed or the CPU. Any idea why some cores are left idle?

Basically, we will need to run 300-500M rows in a relatively complex application with up to 200 concurrent users. After load, memory usage is around 40 GB. The QVDs on disk are 26.3 GB. Any idea of the hardware required for such a setup?

Thanks!

1 Solution

Accepted Solutions
Not applicable

Hi, Maxime.

How quickly you ramp up users will definitely affect the result.

So will how you test, and the total setup. Not many users would accept the response times in your results. The question is: what is the performance with only one or a few users?

Longer response times than with a few users are the result of queueing. In general, Qlik uses resources as fast as the environment can deliver them. This holds true as long as the application design isn't preventing it.

The spec for the server looks like a good setup, but there are still things to investigate when it comes to hardware. Virtualization is a layer between Sense and the hardware, and that might have something to do with what you are seeing. There are many good reasons for choosing virtualization, but raw performance is normally not high on that list. It might have no impact in most cases, but it can have a huge impact in certain situations.

Tuning the physical hardware, the VM and Windows is essential. Power settings, in both the BIOS and Windows, are among the key settings.
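
As a quick sanity check you can read the active power plan from a script before every test run. A minimal sketch in Python, assuming the server runs Windows and you can shell out to the standard powercfg tool:

```python
# Sketch: check the active Windows power plan before a test run.
# "powercfg /getactivescheme" is a standard Windows command; switching plans
# with "/setactive" usually needs administrative rights.
import subprocess

result = subprocess.run(
    ["powercfg", "/getactivescheme"],
    capture_output=True, text=True, check=True
)
print(result.stdout.strip())

# If this is not the "High performance" plan, consider switching before testing:
# subprocess.run(["powercfg", "/setactive", "SCHEME_MIN"], check=True)  # SCHEME_MIN = High performance
```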

Physical memory configuration is also very important: evenly distributed (called hemisphere mode by some vendors) and all of the same clock speed. I've seen performance degrade by more than 30% because of an unbalanced memory configuration. The intention was that 64 extra GB must be better, but with that physical layout the effect was the reverse.

The size of the document, once loaded, is not really very interesting. The important measure is how much it grows with clicks. The more complex your application is, both data model and visualizations, the more RAM is needed for storing the states and results.

There's a limit to how many users the repository can create and start sessions for per time unit. Trying to ramp up at a higher speed will only extend the already present queues.

Are you sure that the load client is properly handling the load it has to generate?

A test is testing more than just the QIX Engine.

Talking about concurrent users is unfortunately too vague; what they do is the key question. I prefer to focus on the number of clicks instead. It is very common to generate too many clicks per user AND ramp up the number of users unrealistically quickly. Creating a test that will break the server is easy, but it doesn't add value. Too many users too quickly, against a document not yet loaded into memory, will start with a queue of incoming requests and normally never recover. From a calculation perspective, it doesn't matter whether it is user 1 or user 2 who makes the click, as long as section access isn't giving them different data sets. Other things come into play with many users, but once a session is properly established their clicks can be seen as equal.
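
To make that concrete, here is a rough back-of-the-envelope sketch (Python, with made-up think times) that translates "concurrent users" into the click rate the engine actually has to serve:

```python
# Back-of-the-envelope only: translate "concurrent users" into a click rate,
# which is the load the engine actually sees. Think times are made-up examples.
def clicks_per_minute(concurrent_users: int, think_time_s: float) -> float:
    """Each user makes one selection, then 'thinks' for think_time_s seconds."""
    return concurrent_users * 60.0 / think_time_s

print(clicks_per_minute(50, 30))   # 100.0  - already a heavy, steady load
print(clicks_per_minute(200, 60))  # 200.0  - what 200 'normal' users might generate
print(clicks_per_minute(50, 2))    # 1500.0 - an unrealistic stress scenario
```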

The only thing in your charts that is not "expected" is the three cores that don't use CPU at the same rate as the others. The QIX Engine will try to use all cores, if allowed: if all cores are checked for affinity in the QMC, it will try to use them all. Being greedy is how it tries to complete a task as quickly as possible, but having to wait for other parts of the environment will prevent it from performing.
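
If you want to verify from the OS side that the engine really is allowed to use every core, a small sketch with Python's psutil can do it. I'm assuming here that the engine process is named Engine.exe; adjust for your setup:

```python
# Sketch: verify from the OS side that the QIX Engine may run on all logical cores.
# Assumes the engine process is named Engine.exe and that psutil is installed.
import psutil

total = psutil.cpu_count(logical=True)  # logical cores visible to the OS / VM
for proc in psutil.process_iter(["name"]):
    if proc.info["name"] and proc.info["name"].lower() == "engine.exe":
        allowed = proc.cpu_affinity()   # logical cores this process is allowed to use
        print(f"PID {proc.pid}: affinity {len(allowed)}/{total} cores -> {allowed}")
```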

Observing how the different processes consume resources is valuable. Note that a process can show high CPU but still be waiting for external resources. This can be observed as kernel time in Task Manager and is part of the XML template included in the Scalability Tools.
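
The same library can sample user versus kernel time for the process, if you prefer a script over Task Manager. Again only a sketch, with the process name assumed to be Engine.exe:

```python
# Sketch: sample user vs kernel (system) CPU time for the engine process over 5 seconds.
# A large kernel share while responses stall often points at waits outside the engine.
import time
import psutil

engine = next((p for p in psutil.process_iter(["name"])
               if p.info["name"] and p.info["name"].lower() == "engine.exe"), None)
if engine is None:
    raise SystemExit("Engine.exe not found - adjust the process name for your setup")

before = engine.cpu_times()
time.sleep(5)  # sample window
after = engine.cpu_times()

print(f"user: {after.user - before.user:.2f}s, "
      f"kernel: {after.system - before.system:.2f}s over a 5 s window")
```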

Performance is the "total" of the Performance Triangle, where Application, Usage Pattern and Environment affect each other. The most important corner is Application (number of documents, application design, data model and visualizations), but the other two are by no means unimportant.

Application design is very important, especially when the data volume or the number of concurrent users increases. All objects pass through a phase that is single-threaded. The duration of this phase depends entirely on the fields involved, where they "live" in the data model, and their uniqueness.

If an application doesn't perform for one user, then many users might only have the benefit of the shared cache. That's why I always start by testing one user, to observe the resources consumed and the response times that gives. Then I slowly ramp up users to observe the pattern. Running many tests with too many users, at the response times observed in your chart, causes more harm (creating uncertainty, fear and doubt) than it gives valuable information.
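
In pseudo-code-ish Python, the methodology looks roughly like this. measure_click() is a hypothetical placeholder for whatever your load client does; the step sizes and threshold are only examples:

```python
# Sketch of the methodology, not a finished load test: establish a single-user baseline,
# then add users in small steps and stop as soon as response times drift away from it.
# measure_click() is a hypothetical hook into whatever load client you use.
import statistics

def measure_click() -> float:
    """Hypothetical: perform one scripted selection and return its response time in seconds."""
    raise NotImplementedError("wire this to your load tool")

def median_response(samples: int = 20) -> float:
    return statistics.median(measure_click() for _ in range(samples))

baseline = median_response()
for users in [1, 2, 5, 10, 20, 30, 40, 50]:  # gentle ramp, one step at a time
    # ... start `users` concurrent scripted sessions here ...
    step = median_response()
    print(f"{users:3d} users: {step:.2f}s (baseline {baseline:.2f}s)")
    if step > 3 * baseline:  # example threshold for "no longer healthy"
        print("Response times have left the healthy range - stop and investigate first.")
        break
```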

Choose a testing methodology where the test exercises the intended components and isn't throttled by other things.

The product is a well-oiled just-in-time calculation engine, but both hardware and application design can prevent it from running as smoothly as it is capable of. Small external disturbances might have more impact than is often realized.

So-called stress tests often cause more noise than insight, so a way forward is to exclude as many external dependencies as possible until the scope and method are laser-focused. Even then, a stress test is, in my opinion, not very useful. The most valuable tests are the ones leading up to the larger tests.

I hope this inspires you to approach your testing in a slightly different way.

Note that 200 concurrently active users with a complex application with several 100k rows will consume resources, quite a lot of them.

/lars


10 Replies
Gysbert_Wassenaar

Honestly, there's no way of telling without seeing what kind of charts the app has and which expressions are used to calculate the measures.


talk is cheap, supply exceeds demand

maxim1500
Partner - Creator
Author

Thank you Lars! Very complete! This will definitely help! I will look at this with our IT Team and see where we go from there. Thanks!

rbecher
MVP

Hi Lars,

I'm facing massive performance issues with Sense v3.1 and larger data sets where CPU usage is very very low.

I looked for core affinity in the QMC ("If all cores are checked for affinity in QMC") but cannot find it. Any hint on where I can set this behavior?

- Ralf

Astrato.io Head of R&D
Not applicable

Hi, Ralph.

This doesn't sound like something configurable. I'm guessing it's the data model in combination with the fields and measures used in one or more visualizations.

There have been previous situations where this might have been caused by bugs, but those are very, very few and I have not heard about any lately.

HIC's blog post The Calculation Engine is my "map" to what is going on.

If you are in a situation where response times are long and CPU% is low, then I expect you are probably "stuck" in the first calculation phase for a visualization.

The engine will try to use as many resources in as short a time period as possible, but some activities are not always easy, or even possible, to make multi-threaded without causing other issues.

It will take a long time if the fields used in your viz are located "far" apart and/or pass through many tables with many unique values as keys. Gathering all possible permutations of the values can take a long time and it is unfortunately single-threaded.

This is why count distinct was previously believed to be single-threaded: the real culprit was the phase preceding the actual calculation. Count distinct often counts one dimension over another dimension, passing through the fact table.

The solution is to help the engine assemble all the combinations by changing the data model.
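
To illustrate the principle only (this is pandas, not Qlik script): if the distinct combinations are pre-assembled once, the repeated per-click work becomes a count on a much smaller table instead of a scan through the fact table.

```python
# Toy pandas illustration of the principle only (not Qlik script): counting distinct
# customers per product must traverse the fact table, so pre-assembling the distinct
# (product, customer) combinations once makes the repeated per-click work much cheaper.
import pandas as pd

fact = pd.DataFrame({
    "ProductKey":  [1, 1, 2, 2, 2, 3],
    "CustomerKey": [10, 10, 10, 20, 30, 20],
    "Amount":      [5, 7, 3, 9, 1, 4],
})

# Expensive path: every evaluation scans the whole fact table.
per_click = fact.groupby("ProductKey")["CustomerKey"].nunique()

# Cheaper path: build the combination table once (in Qlik, during the load script),
# then the per-click work is a plain count on a much smaller table.
combos = fact[["ProductKey", "CustomerKey"]].drop_duplicates()
precomputed = combos.groupby("ProductKey")["CustomerKey"].count()

print(per_click.equals(precomputed))  # True - same answer, far less work per click
```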

Any information in the logs that might be connected to the long wait?

Trust the fact that the behavior of the engine is quite consistent and predictable; when it deviates, it is almost always app design that is causing the abnormal behavior.


Hope this helps you figure out what is going on.

Best regards

/lars

rbecher
MVP
MVP

Hi Lars,

thanks for your answer!

Believe me, it's not the app design nor the data model. In most cases there is only one table involved, no calculated dimensions, and nothing other than a single straight count() or sum() measure on a field. No chart has more than 2 dimensions or more than one measure. Everything is very, very simplified now (I've done my job optimizing an existing app). But still, especially on the very first load, CPU usage (16 cores) is only about 3-5%, and charts get rendered in sequence, not in parallel. I have no idea...

- Ralf

Astrato.io Head of R&D
Not applicable

Hi, Ralf.

I apologize for the incorrect spelling of your name in my previous response.

The "one by one chart" does not sound "normal" or expected.

Do you get the same behavior if you create QVDs from this app and test the same dimensions & expressions in QlikView?

Or an earlier version of Sense? Or another machine?

A good thing is that it seems to cache, at least...

What is the maximum number of value permutations in any of the visualizations?

My technique for separating symptoms from root causes is to minimize the number of things in play.

Load the data model resident and create sheets with only one of the items on each, observe them one by one, and keep an eye on Task Manager. Note that the cache might fool you, so restart the Engine quite frequently, or use the method of all vs. all-but-one, then all-but-another. You will get a better view with restarts, but it takes time.

Identifying the visualizations that cause the most "damage" is important, both for diagnosing and for finding a solution and validating that it worked.

1/16 = 6.25%, so based on that it seems to be only one core. Are you seeing one core at 100%, or are many cores moving a little in Task Manager?

One core normally does indicate app design; the other pattern indicates other things, including hardware, BIOS settings and configuration.
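
A simple way to tell the two patterns apart is to sample per-core CPU for a few seconds, for example with Python's psutil (just a sketch):

```python
# Sketch: sample per-core CPU for a few seconds to tell the two patterns apart.
# One core pinned near 100% usually points at the single-threaded phase (app design);
# all cores barely moving usually points elsewhere (VM, BIOS/power settings, waiting on I/O).
import psutil

for i in range(5):
    per_core = psutil.cpu_percent(interval=1, percpu=True)
    busy = sum(1 for c in per_core if c > 80)
    print(f"sample {i + 1}: max core {max(per_core):.0f}%, cores above 80%: {busy}")
```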

Also check the logs for AAALR messages. Seeing those is a clear indicator that, even with a perfect data model and few fields, the data volume and content are extending this single-threaded phase.

/lars

rbecher
MVP
MVP

Hi Lars,

thanks, I will gather more information. AAALR could be the case here. Is there a setting for Qlik Sense like the DisableNewRowApplicator=0 setting in the QlikView Server Settings.ini file?

- Ralf

Astrato.io Head of R&D
Not applicable

Yes, to my knowledge this is QIX, so it applies to both QVS and the Engine.

So it might be app design after all... the uniqueness of dimensions is part of the "app design" concept.

Finding permutations is the single-threaded phase, and many distinct dimension values are not what the Engine, as configured by this setting, is optimized for. Every "optimization" has a corner case it might not be as efficient for, so changing the setting might cause performance degradation elsewhere. It is a server-wide setting, so it can't be tuned per app or viz.

What worries me is that you are seeing charts being rendered one by one, where I would expect them to be rendered as each one finishes, and with many objects some should arrive at the same time.

If you find the "slowest" object and exclude it, I wonder if that will allow for more concurrency.

I vaguely remember a very early internal Sense build that would keep small objects from being shown if a very large one was taking time. They would all arrive at the same time once the large one was done. I have not seen or heard of this behavior since... but still.

Let me know the outcome.

Good luck!

/lars