cwolf
Creator III

Bad performance from QV 12.30 / 12.40, very bad performance on HPE Gen10

A few weeks ago, we moved our QV server cluster to new hardware, and what we got was a dramatic performance loss, especially when selecting in list boxes.

Since this loss of performance was also present on the desktop, I did some tests with it.

The result is very frustrating.

Not only is the new hardware a problem, but QV versions from 12.30 onwards also perform significantly worse than version 12.20.

The test environment:

Old Server:
HP ProLiant DL380 Gen9
CPU: 2x Intel Xeon E5-2667 v3 (3200 MHz, 8 Cores)
RAM: 16x 32GB DDR4-2133 LRDIMM
OS: Windows Server 2012 R2 Standard

New Server:
HPE ProLiant DL380 Gen10
CPU: 2x Intel Xeon Gold 6254 (3100 MHz, 18 Cores)
RAM: 12x 64GB DDR4-2933 LRDIMM
OS: Windows Server 2016 Standard

The application I used has a working set of 21GB after opening and reaches a peak of 33GB during the test.

For the test, I use a macro that simulates the behavior of a user.

The test consists of 2 parts. First, a sheet with a dashboard (38 charts) is activated. Then a selection is made in 6 fields, one after another.
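The two-phase measurement described above can be sketched outside QlikView as a small timing harness. This is only an illustration, not the macro that was actually used: `run_test`, `timed`, `fake_dashboard` and `fake_selections` are hypothetical names, and the user actions are replaced by short sleeps. The total is timed around the whole run, so it can come out slightly higher than the sum of the two phases.

```python
import time


def timed(action):
    """Run `action` and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    action()
    return time.perf_counter() - start


def run_test(activate_dashboard, selections):
    """Time the dashboard phase, then each selection in order."""
    results = {}

    def whole_run():
        results["Dashboard"] = timed(activate_dashboard)
        results["Selections"] = sum(timed(select) for select in selections)

    # Timed around both phases, so it also includes any overhead between them.
    results["Total"] = timed(whole_run)
    return results


# Hypothetical stand-ins for the real user actions: each one just sleeps.
def fake_dashboard():
    time.sleep(0.02)


fake_selections = [lambda: time.sleep(0.01) for _ in range(6)]

for phase, seconds in run_test(fake_dashboard, fake_selections).items():
    print(f"{phase}: {seconds:.3f} s")
```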

The results (all values in seconds):

HP ProLiant DL380 Gen9:

              12.20 SR10   12.40 SR3
Dashboard         58.264      81.594
Selections        45.885     132.595
Total            107.119     240.83

 

HPE ProLiant DL380 Gen10:

              12.20 SR10   12.40 SR3
Dashboard        127.255     168.244
Selections        90.418     181.604
Total            222.189     365.444

 

With version 12.30 SR3 I got the same bad results as with 12.40.

Version 12.40 takes about twice as long as 12.20, and moving to the Gen10 makes things worse again by a factor of up to two!!!
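The ratios behind this summary can be checked directly from the totals in the two tables. The sketch below only does that arithmetic; the dictionary layout and the `slowdown` helper are illustrative names, not part of any QlikView tooling.

```python
# Totals in seconds, copied from the two result tables above.
totals = {
    ("Gen9", "12.20 SR10"): 107.119,
    ("Gen9", "12.40 SR3"): 240.83,
    ("Gen10", "12.20 SR10"): 222.189,
    ("Gen10", "12.40 SR3"): 365.444,
}


def slowdown(slow, fast):
    """Return how many times longer `slow` takes than `fast`."""
    return totals[slow] / totals[fast]


# Effect of the QlikView version on the same hardware.
for server in ("Gen9", "Gen10"):
    ratio = slowdown((server, "12.40 SR3"), (server, "12.20 SR10"))
    print(f"{server}: 12.40 SR3 takes {ratio:.2f}x as long as 12.20 SR10")

# Effect of the hardware on the same QlikView version.
for version in ("12.20 SR10", "12.40 SR3"):
    ratio = slowdown(("Gen10", version), ("Gen9", version))
    print(f"{version}: Gen10 takes {ratio:.2f}x as long as Gen9")
```

The version change costs a factor of about 2.2 on the Gen9 and about 1.6 on the Gen10. The hardware change costs a factor of about 2.1 with 12.20 and about 1.5 with 12.40.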

Our NPrinting server is also a Gen10 server. We use NPrinting only with local qvw files. 78 jobs are scheduled daily, all at the same time. With QV Desktop 12.20 SR10 installed, NPrinting needs about 35 minutes to finish all jobs. With version 12.40 SR3 installed, it needs more than 4 hours!!!

Only the Publisher has a very good performance on Gen10.

???

- Christian

3 Replies
Frederic_De_Ranter

Hi Christian,

The degradation you are seeing between QV 12.20 and 12.40 is not normal, and I would highly recommend that you open a support case for this.

I recently tested a server with 2 Intel Xeon Gold 6254 and compared it to a server with 2 Intel Xeon E5-2687W v3 (which has a few more cores than yours, but a lower clock speed), and I can say that in most test cases the new CPU performs much better. In one test case, where I test an app that is specifically designed to have a lot of single-threaded operations, the older CPU is the better performer. Generally, for single-threaded operations, servers with fewer cores (and higher turbo boost clock speeds) will perform better.
To figure out how the sheet in your app uses the CPU, open Task Manager on the server, switch to the Performance tab and change the view to logical processors. Then you can watch what happens when you open the sheet: are all cores working all the time, or are there moments (before the sheet is fully displayed) when only 1 or 2 cores are working at 100%?
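If you would rather log the per-core load than watch the graphs, the same check can be automated. The sketch below assumes you already have per-core utilization snapshots as lists of percentages (for example collected with a tool such as psutil's `cpu_percent(percpu=True)`); the function name, the thresholds and the sample data are all made up for illustration.

```python
def single_threaded_phases(samples, busy=90.0, max_busy_cores=2):
    """Return indices of samples where only a few cores are busy.

    `samples` is a list of per-core utilization snapshots (percent),
    one inner list per sampling interval.
    """
    phases = []
    for index, per_core in enumerate(samples):
        busy_cores = sum(1 for load in per_core if load >= busy)
        # Idle snapshots (no busy core at all) are not counted.
        if 1 <= busy_cores <= max_busy_cores:
            phases.append(index)
    return phases


# Hypothetical samples for a 4-core machine, one snapshot per second.
samples = [
    [95.0, 92.0, 97.0, 91.0],  # all cores busy: parallel calculation
    [100.0, 3.0, 5.0, 2.0],    # one core busy: single-threaded phase
    [99.0, 96.0, 4.0, 1.0],    # two cores busy: mostly serial
    [2.0, 1.0, 3.0, 2.0],      # idle: sheet fully rendered
]

print(single_threaded_phases(samples))  # prints [1, 2]
```

Many flagged intervals while a sheet is being calculated would point to single-threaded work, where extra cores do not help.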

Some more things you could try (I took a quick look at your BIOS settings file):

  • I was wondering why you didn't enable hyperthreading. In most test cases it gives a performance gain (except for very single-threaded app designs).

  • For the NUMA settings, HPE made it slightly confusing with 2 separate settings: Node_Interleaving and NUMA_Group_Size_Optimization.
    For turning NUMA completely off: Node_Interleaving: enabled ||  NUMA_Group_Size_Optimization: Flat
    For turning NUMA on: Node_Interleaving: disabled ||  NUMA_Group_Size_Optimization: Clustered

    I would recommend trying with NUMA completely off first, since in general that gives the best performance. Of course, there are also test cases where NUMA on actually performs better.

  • Change the power settings so that the CPU can enter deeper C-states (lower power), since this lets individual cores reach higher turbo clock speeds. The turbo boost clock is limited by the power dissipation of the CPU, so if some cores can go down in clock speed, others can go higher. Example of the settings:
    Minimum Processor Idle Power Core C-state: C6
    Minimum Processor Idle Power Package C-state: Package C6 (retention) State

  • Another test is to disable some cores in the BIOS. I've seen that with single-threaded calculations, disabling cores can give much better performance.

Best regards,
Frederic

cwolf
Creator III
Author

Hello Frederic,

thanks a lot for your response. I have done some tests again on HPE Gen10.


1. Basically, all cores work when calculating the dashboard: one is permanently at 100% and the others are between 60% and 100% with 12.20 and between 90% and 100% with 12.40.

2. With hyperthreading, performance decreases by about another 30%.

3. Node Interleaving / NUMA. My point of view:
The parameter NUMA_Group_Size_Optimization is only meaningful if there is more than one NUMA group. The parameter Sub-NUMA_Clustering is decisive here. The DL380 Gen10 has 2 CPUs and each of them has 2 memory controllers. If all 36 cores are enabled, the following applies:

  • Node_Interleaving = enabled => NUMA completely off
  • Node_Interleaving = disabled & Sub-NUMA_Clustering = disabled => NUMA on: 1 NUMA group with 2 NUMA nodes (per CPU)
  • Node_Interleaving = disabled & Sub-NUMA_Clustering = enabled => NUMA on: 2 NUMA groups (per CPU) each with 2 NUMA nodes (per memory controller) and only for this case the parameter NUMA_Group_Size_Optimization is important.
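The three cases above can be written down as a small lookup. This is only a sketch of the reading given in this post, not something taken from HPE documentation; `NumaLayout` and `numa_layout` are hypothetical names, and the BIOS switches are modelled as plain booleans.

```python
from typing import NamedTuple


class NumaLayout(NamedTuple):
    numa_enabled: bool
    group_size_optimization_matters: bool
    description: str


def numa_layout(node_interleaving: bool, sub_numa_clustering: bool) -> NumaLayout:
    """Map the two BIOS switches to the layout described in the list above."""
    if node_interleaving:
        # Interleaving across nodes hides the NUMA topology entirely.
        return NumaLayout(False, False, "NUMA completely off")
    if not sub_numa_clustering:
        return NumaLayout(True, False, "1 NUMA group with 2 NUMA nodes (per CPU)")
    return NumaLayout(
        True,
        True,
        "2 NUMA groups (per CPU) each with 2 NUMA nodes (per memory controller)",
    )


for interleaving in (True, False):
    for clustering in (False, True):
        layout = numa_layout(interleaving, clustering)
        print(f"interleaving={interleaving}, sub-NUMA={clustering}: {layout.description}")
```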

However, for the tests with QlikView Desktop, NUMA brings only a minimal improvement. With QlikView Server and NUMA on, performance fluctuates strongly: dashboards range from very fast to very slow, and selections from OK to extremely bad.

4. Your 3rd point concerns the CPU feature "Core Boosting". The Xeon Gold 6254 does not support this feature. I can save power by setting C-States, but I don't get more performance.

5. Only switching off cores really brings a performance gain:

              12.20 SR10   12.20 SR10   12.40 SR3   12.40 SR3
              36 Cores     18 Cores     36 Cores    18 Cores
Dashboard      127.255       87.496      168.244     115.837
Selections      90.418       63.372      181.604     145.088
Total          222.189      154.306      365.444     302.194
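To put numbers on the gain from switching off cores, the totals from this table can be compared directly. The sketch below only does the arithmetic; `results` and `time_saved_percent` are illustrative names.

```python
# Total times in seconds on the Gen10, keyed by version and enabled cores.
results = {
    "12.20 SR10": {36: 222.189, 18: 154.306},
    "12.40 SR3": {36: 365.444, 18: 302.194},
}


def time_saved_percent(before, after):
    """Return the reduction in run time as a percentage of `before`."""
    return (before - after) / before * 100


for version, by_cores in results.items():
    saved = time_saved_percent(by_cores[36], by_cores[18])
    print(f"{version}: 18 cores are {saved:.1f}% faster than 36 cores")

# Gap between the two versions once half the cores are disabled.
ratio = results["12.40 SR3"][18] / results["12.20 SR10"][18]
print(f"With 18 cores, 12.40 SR3 takes {ratio:.2f}x as long as 12.20 SR10")
```

Halving the cores cuts the total time by roughly 31% with 12.20 and roughly 17% with 12.40, while the gap between the two versions stays close to a factor of 2.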

 

Does this mean that it makes no sense to use a CPU with more than 10 cores in an HPE Gen10?!


Anyway, the performance of the Gen9 cannot be achieved on the Gen10. No matter what I set, 12.40 is always slower than 12.20 by a factor of 2.

Best regards
Christian

Frederic_De_Ranter

Hi Christian,

About NUMA: we've seen from our own and our customers' experience that NUMA_Group_Size_Optimization really does need to be set correctly, even when Node_Interleaving = enabled & Sub-NUMA_Clustering = disabled.
Maybe HPE has updated their BIOS to change this, since I have not tested it lately.

The Xeon Gold 6254 does support core clock boosting (as do all second generation Xeon Gold scalable processors). (https://ark.intel.com/content/www/us/en/ark/products/192451/intel-xeon-gold-6254-processor-24-75m-ca...) How long and how far the clock of a single core can be boosted depends heavily on the amount of power the CPU can dissipate. To simplify, we can say that a CPU can use at most its defined TDP (for your CPU that is 200W). If some cores can be lowered in clock speed and thus dissipate less power, others will be able to boost longer. But of course, if your CPU utilization shows that all cores are in use (and the CPU thus already reaches its maximum power dissipation), no cores will be lowered in clock speed and there is no extra room to boost more.

Disabling cores has a similar effect: since fewer cores are dissipating power, the remaining cores have more room to boost.
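A deliberately crude way to picture this is to split the TDP evenly over the active cores. This is a toy model under strong assumptions (it ignores uncore power, per-core limits and thermals, and it is not how the actual turbo algorithm works); the function name is made up, and only the 200W figure and the 18-core count come from this thread.

```python
TDP_WATTS = 200.0  # TDP quoted above for the Xeon Gold 6254


def power_budget_per_core(active_cores, tdp_watts=TDP_WATTS):
    """Return the average power available per active core, in watts.

    Toy model: the whole TDP is shared equally by the active cores.
    """
    if active_cores < 1:
        raise ValueError("at least one core must be active")
    return tdp_watts / active_cores


for cores in (18, 9, 4, 1):
    budget = power_budget_per_core(cores)
    print(f"{cores:2d} active cores: {budget:6.1f} W per core")
```

With all 18 cores of one CPU busy, each core gets about 11 W in this model. With 9 active cores the share doubles to about 22 W, which is the extra room to boost.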

If you doubt the performance of your Gen10 server, I recommend running the HW BM package for QV 12, which you can find here on Community: https://community.qlik.com/t5/Qlik-Scalability/QlikView-12-Hardware-Benchmarking-Package/gpm-p/14786.... You can then send in your results and we will send you back a report comparing them with a similarly sized server.

As stated in my previous message, I have tested this CPU in a server from another vendor with the HW Benchmarking package and it is performing according to its capacity. E.g. for a heavy load test with the Hardware Benchmarking application:

 

CPU                 Settings   Average response time (ms)   Average CPU utilization
Xeon Gold 6254      HT off     3190                         84%
Xeon Gold 6254      HT on      2363                         89%
Xeon E5-2687W v3    HT off     7887                         94%
Xeon E5-2687W v3    HT on      5289                         93%
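The relative differences in these benchmark figures are easy to work out. The sketch below only does the arithmetic on the four average response times listed above; the dictionary layout and helper name are illustrative.

```python
# Average response times in milliseconds from the benchmark results above.
response_ms = {
    ("Xeon Gold 6254", "HT off"): 3190,
    ("Xeon Gold 6254", "HT on"): 2363,
    ("Xeon E5-2687W v3", "HT off"): 7887,
    ("Xeon E5-2687W v3", "HT on"): 5289,
}


def ht_gain_percent(cpu):
    """Return how much hyperthreading lowers the response time, in percent."""
    off = response_ms[(cpu, "HT off")]
    on = response_ms[(cpu, "HT on")]
    return (off - on) / off * 100


for cpu in ("Xeon Gold 6254", "Xeon E5-2687W v3"):
    print(f"{cpu}: hyperthreading lowers response time by {ht_gain_percent(cpu):.1f}%")

for setting in ("HT off", "HT on"):
    ratio = response_ms[("Xeon E5-2687W v3", setting)] / response_ms[("Xeon Gold 6254", setting)]
    print(f"{setting}: the E5-2687W v3 takes {ratio:.2f}x as long as the Gold 6254")
```

In this heavy load test, hyperthreading lowers the response time by about 26% on the Gold 6254 and about 33% on the E5-2687W v3, and the Gold 6254 responds more than twice as fast in both settings.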

 

Of course, the data model and the application design are also very important when it comes to getting optimal performance. I would recommend that you open a support ticket so that all aspects of this issue can be looked at.

Best regards,
Frederic