We have lots of experience with large servers. We have QV apps that take over 80GB just to open. When you get into servers of this size, running the very large QlikView apps they host, and you want peak performance, then everything about your processors, memory and memory bus becomes an issue.
For example, as you increase the DIMMs per channel (4 channels per processor), the memory controllers cannot operate at peak memory speed. There's no way around this. So 1600MHz memory can drop down to 1333MHz or even 1066MHz. At that point the peak memory throughput advertised for your platform has just dropped by up to 33%.
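The down-clocking arithmetic above is easy to sketch. The speed steps below are the example values from this post, not a vendor spec; check your platform's memory population guide for real numbers:

```python
# Illustrative: DDR3 down-clocking as DIMMs per channel (DPC) increase.
# SPEED_BY_DPC is assumed population behavior, not a datasheet value.
RATED_MTS = 1600  # advertised memory speed in MT/s

SPEED_BY_DPC = {1: 1600, 2: 1333, 3: 1066}

for dpc, mts in sorted(SPEED_BY_DPC.items()):
    loss_pct = (1 - mts / RATED_MTS) * 100
    print(f"{dpc} DPC: {mts} MT/s -> {loss_pct:.0f}% below rated speed")
```

Fully populated (3 DPC at 1066 MT/s) is where the 33% figure comes from.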
But if you attempt to avoid this by reducing your DIMMs per channel and using larger stick sizes, then your channel throughput drops by 25-50%. Memory controllers interleave data across multiple sticks to make simultaneous access faster. Reduce the sticks and you increase the number of requests needed to move the same amount of data.
You cannot improve the situation by increasing the number of processors, because inter-processor communication in a non-NUMA-aware product like QlikView will rapidly kill any memory performance improvement. 4 processors is the real-world max for QlikView until it becomes NUMA aware, which I've never heard of on the roadmap.
One factor you didn't mention is CAS latency. A poor choice of memory latency can mean a 33% decrease in output from the memory sticks, and you can't make that up somewhere else.
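CAS latency is quoted in clock cycles, so what actually matters is the latency in nanoseconds: a lower CL at a lower clock can beat a higher CL at a higher clock. A quick conversion sketch (the CL/speed pairings are assumed examples, not recommendations):

```python
# Effective CAS latency (ns) = CL cycles * 2000 / data rate (MT/s);
# the factor of 2000 is because the DDR clock in MHz is half the MT/s figure.
def cas_ns(cl_cycles, mts):
    return cl_cycles * 2000 / mts

print(f"CL9  @ 1333 MT/s = {cas_ns(9, 1333):.2f} ns")   # lower clock...
print(f"CL11 @ 1600 MT/s = {cas_ns(11, 1600):.2f} ns")  # ...can still win
```

So comparing sticks on MT/s alone, without checking CL, can leave real latency on the table.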
So overall, you can very easily start with an 8.0 GT/s QPI with up to 32GB/s of bandwidth and reduce effective throughput to 10GB/s BEFORE YOU EVEN TAKE THE HIT from QlikView's lack of support for NUMA architecture.
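The key point is that these penalties stack multiplicatively, not additively. A sketch using the figures quoted in this thread (the exact factors are illustrative assumptions, not measurements):

```python
# Illustrative stacking of bandwidth penalties from this thread.
peak_gbs = 32.0
penalties = {
    "down-clock 1600 -> 1066 MT/s": 1066 / 1600,  # ~33% loss
    "poor CAS latency choice":      0.67,         # ~33% loss claimed above
    "reduced interleaving":         0.70,         # assumed, in the 25-50% range
}
effective = peak_gbs
for name, factor in penalties.items():
    effective *= factor
    print(f"after {name}: {effective:.1f} GB/s")
```

Three individually survivable choices multiply out to roughly 10 GB/s of the advertised 32 GB/s.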
Hopefully this gives you an idea of the subtle complexity of large-scale systems. If you want peak performance under these conditions, you really need to work with a high-quality vendor that understands all of these tradeoffs and has intimate knowledge of their hardware. Buying whatever IBM, Dell and HP are selling on their websites will not get you to peak performance.
If the discussion is narrowed to whether there are bottlenecks for the same server with 128 GB, 512 GB or even 1 TB of RAM attached, the answer is no. We have used several 512 GB RAM servers with very good performance.
Modern server hardware is very good at eliminating performance bottlenecks that might occur when doubling or even quadrupling the RAM, through faster QPIs etc.
If your QlikView deployment (environment, application and usage pattern) is performing well and the CPU is not continuously saturated, I would expect adding more RAM to allow for more applications and a larger cache, with no need to worry about bottlenecks.
If you need to change servers when adding RAM, then it's a different conversation, as each individual server has its benefits and drawbacks and RAM is just one such parameter.
Hampus von Post
Hampus and Jay,
yours are both very insightful comments in their own way. Thanks!
I understand the trade-off between connectivity and sheer processor power and RAM size (for cache).
We are focused on having as much RAM available for the dataset as possible.
With all the data already in RAM the analytics workflow can be very lean. The current reference point for performance is very low, as the legacy (non-QV) solution runs off a totally overloaded SQL database. Both design and execution of charts take tens of minutes if not hours, and extending the dataset as an end user is virtually impossible. So reaching even 50% of QlikView's theoretical maximum should be good enough if it brings all the other advantages.
I have our HP vendor's recommended setup for a 1TB server: an HP ProLiant DL580 G7 with 4 sockets of 10 cores each. The platform is also on a QlikTech whitelist mentioned in a parallel thread: http://community.qlik.com/message/399492#399492
From what I gather from your posts - this is probably the best option we can do without ourselves becoming hardware experts first.
Do you see any flaw in my reasoning?
Yes those are valid benefits, but for the Intel E5-4600 product family there is another, more severe, bottleneck.
If you take a look at the Intel documentation, page 14 (Figure 1-2):
or view this image to see the architectural overview:
The bottleneck is that with the four-socket (CPU) E5 architecture, not all CPUs have a direct QPI link to every other CPU. If the RAM needed is not local, then there is a 33% chance that, in order to access it, the CPU has to tunnel through another CPU to reach the one it is not directly connected to. This saturates the QPI links faster and introduces RAM access latency.
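The 33% figure falls straight out of the link topology: each socket has direct QPI links to only two of the other three sockets, so one of the three possible remote sockets needs a two-hop path. A small sketch (the ring topology is my reading of the Intel figure referenced above):

```python
from itertools import permutations

# Assumed E5-4600 4-socket QPI topology: each socket links directly to
# two neighbours, forming a ring 0-1-2-3-0.
links = {(0, 1), (1, 2), (2, 3), (3, 0)}

def direct(a, b):
    return (a, b) in links or (b, a) in links

remote = list(permutations(range(4), 2))         # all (requester, owner) pairs
two_hop = [p for p in remote if not direct(*p)]  # must tunnel via a third CPU
print(f"{len(two_hop)} of {len(remote)} remote accesses need two hops "
      f"({100 * len(two_hop) // len(remote)}%)")
```

With memory spread evenly, every socket has exactly one "far" partner, hence 1-in-3 of remote accesses pay the extra hop.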
This is not the case with either the E5-2600 or the E7-4800. The latter can be seen below, where all CPUs are directly connected to all other CPUs (albeit at lower QPI speed).
From the E7 documentation in section 2.1.1 and following image:
"Each Intel Xeon Processor E7-8800/4800/2800 Product Families are connected to every other Intel Xeon Processor E7-8800/4800/2800 Product Families socket using three of the Intel QuickPath …"
In short then, yes the higher clock frequency and QPI link speed would be beneficial but here the fact that all CPUs are not directly connected to all other CPUs is a much bigger drawback than the benefit of clock/QPI.
That's why the E5-2600 and E7-4800 are recommended. QlikView will obviously still work with the E5-4600, but tests confirm that performance is less than with the corresponding E5-2600/E7-4800.
If you go with the E5-2600 v3 CPUs:
- Do you look for a higher base clock speed or a higher number of cores to balance performance vs. concurrent users?
- DDR4 at its highest speed (2133MHz) can be configured up to 512GB on these CPUs. Would performance be better with a lower amount of memory or at the max 512GB? For example, would 384GB of RAM at the same speed give better performance than 512GB (meaning a smaller amount is better for higher performance)?
- If you are looking at running 1 big application (20GB x 3 = 60GB in memory) plus many small ones (about 7 to 8, from 0.2GB to 1GB in size) with 30 to 50 concurrent users, would the above setup work?
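For the last question, a back-of-envelope RAM budget helps frame it. This uses the x3 in-memory multiplier from this thread plus a per-user session overhead; the 5%-of-footprint-per-user figure is my assumption, not a QlikTech number — their scalability guides give tuned values:

```python
# Rough RAM sizing sketch. Assumptions (mine): x3 in-memory multiplier
# per app, ~5% of the biggest app's in-memory footprint per concurrent user.
def ram_needed_gb(app_sizes_gb, users, multiplier=3.0, per_user_frac=0.05):
    base = sum(s * multiplier for s in app_sizes_gb)       # resident apps
    sessions = users * per_user_frac * max(app_sizes_gb) * multiplier
    return base + sessions

apps = [20.0] + [0.5] * 8   # one 20 GB app + eight ~0.5 GB apps
need = ram_needed_gb(apps, users=50)
print(f"~{need:.0f} GB estimated vs 512 GB installed")
```

Under these assumptions the scenario fits in 512GB with headroom for cache, but the per-user factor dominates, so it is worth validating against your own usage pattern.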
I have now installed the server with 1TB of RAM and loaded the data. The biggest tables in the star schema are 1.7Bln transactions and 0.7Bln customers. (Just FYI, autonumberhash256 is used to convert the common customer_id column (the join key) to a number when loading.)
The task monitor shows the following CPU utilization during runtime: parallel tasks are distributed to all cores and CPU utilization is 100%. For the tasks that apparently cannot be distributed, execution time has grown accordingly. So a simple filter selection on one table takes about a minute: roughly 20% of this time goes to parallel tasks and 80% to a single core.
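That 20/80 split puts a hard Amdahl's-law ceiling on what more cores could buy: with only 20% of wall time in parallel work, even infinitely many cores cap the overall speedup at 1/0.8 = 1.25x. A quick check:

```python
# Amdahl's law applied to the measured split: ~20% of wall time in
# parallel work, ~80% on a single core.
def speedup(parallel_frac, parallel_boost):
    serial = 1 - parallel_frac
    return 1 / (serial + parallel_frac / parallel_boost)

for boost in (2, 4, 1e12):  # 2x cores, 4x cores, effectively infinite cores
    print(f"parallel part {boost:g}x faster -> overall {speedup(0.2, boost):.2f}x")
```

This suggests the next win is a faster single core (or reducing the serial portion), not more sockets.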
So thanks again for the answers in this thread! We were able to make a decision and move on. The performance does not seem to have scaled linearly, but it is not a total disaster either. In fact there is no other known system that can deliver the same performance for this use case.
Maybe we will upgrade the system to 2TB of RAM. I wonder if we should choose a different processor...
I am also looking for a new server for our QV installation. Currently we are using an IBM HX5 blade with the X5 memory expansion, equipped with 226GB RAM. For the new server my initial aim is 512GB RAM, with the possibility to extend to at least 1TB. Now some vendors are proposing LRDIMMs (Load-Reduced DIMM Technical Brief - LRDIMMs | Kingston) for memory. Does anyone have any experience with LRDIMMs in a QV environment? Are there any performance risks?