Anonymous
Not applicable

Large amount of data (120 million rows) ... anyone got experience?

Most of the reports we build contain a couple of tables, with some tables reaching 12 million rows.

But now we have a request to build reports on data with 120 million rows and a maximum of 275 columns ...

Has anyone got experience with large data volumes in QlikView? How does QlikView handle them (loading time, user experience)?

rgrds Anita

14 Replies
Not applicable
Author

Anita, loading time will increase and so will response time. That was a joke, but with the given information that's about all I can say. In the past we have built historical applications with "old" data so that we reached a good level of loading and response times.

matt_crowther
Luminary Alumni

Anita,

As the previous comment states, it's difficult to provide specifics due to the myriad of scenarios that large datasets can throw up, so I'll recommend a few generalisations that are really 'best practices' for most QV applications - you just don't NEED them normally, whereas with larger models they're crucial.

Firstly, attached are several documents that cover the subject of optimization and scalability, which should give you the general principles needed to tune any app you create.

From my own experience I'd point out the following:

- A large amount of data is going to be... large; there's no getting away from that. We can only optimise and reduce so far, so any hardware will have to be specced accordingly.

- Use autonumber() around your key fields, as these take up a lot of memory by their unique nature (see the first sketch after this list).

- Bear in mind QlikView's compression method ('column compression'): 1 million rows of 'Salutation' (Mr, Mrs, etc.) will take up far less memory than, say, 1 million rows of telephone numbers, because QlikView replaces each unique value it finds with a token, and the fewer tokens there are, the less memory is used. Therefore look to split fields like phone numbers into two columns - that way you get good compression on the area code at least (again, see the first sketch after this list).

- Ensure the document is set to be 'Preloaded' into the server's memory after each reload.

- Don't be afraid to limit access to chart objects until a certain minimum selection level has been reached, as objects only take up memory when 'drawn' - i.e. in the object's 'Calculation Condition' box use something like GetPossibleCount(My_Field) < 100000 (see the second sketch after this list).

- You should use some form of multi-tiered architecture whereby one QVW extracts the data and generates QVDs for the subsequent dashboards to load from quickly and easily; this stops, say, 3 similar dashboards that each take 3 hrs to load from hitting your source DB in your 7 hr reload window - have the 'mule' do it once in 3 hrs, then have the dashboards load from the QVDs in 3 x 10 mins (see the third sketch after this list).

- Once you get really large you may need to look at clustering.
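
To make the key and compression points concrete, here is a minimal load-script sketch. The table and field names (Orders, CustomerID, OrderID, PhoneNumber) are placeholders, and the 4-digit area-code split is only an assumption about the numbering format:

// Hypothetical fact load - table, fields and path are placeholders.
Orders:
LOAD
    // AutoNumber() swaps the wide composite key for a small sequential integer,
    // which costs far less memory than the raw concatenated string.
    AutoNumber(CustomerID & '|' & OrderID)  AS %OrderKey,
    // Split the phone number so the low-cardinality area code compresses well;
    // only the subscriber part stays near-unique.
    Left(PhoneNumber, 4)                    AS PhoneAreaCode,
    Mid(PhoneNumber, 5)                     AS PhoneLocalNumber,
    OrderDate,
    Amount
FROM [..\Source\Orders.qvd] (qvd);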
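
The calculation condition mentioned above is just an expression entered in the chart's 'Calculation Condition' box (General tab), for example with a hypothetical field:

GetPossibleCount(CustomerID) < 100000

The object is then only calculated once the user has narrowed the selection down far enough.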
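
And a minimal two-tier sketch: one extractor QVW (the 'mule') hits the source once and writes a QVD, and each dashboard then does a fast load from that QVD. The connection name, table and paths are placeholders:

// Tier 1: extractor QVW - the only document that touches the source DB.
ODBC CONNECT TO SourceDB;

Facts:
SQL SELECT * FROM dbo.BigFactTable;

STORE Facts INTO [..\QVD\BigFactTable.qvd] (qvd);
DROP TABLE Facts;

// Tier 2: each dashboard QVW loads from the QVD instead of the DB.
// Keep it a plain LOAD * (a WHERE Exists() on one field also stays optimized)
// so it runs in minutes rather than hours.
Facts:
LOAD * FROM [..\QVD\BigFactTable.qvd] (qvd);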

Hope all that helps - let me know how you get on. Good luck,

Matt - Visual Analytics Ltd

Not applicable
Author

Very good contribution, Anita.

Not applicable
Author

Hi,

We worked with one of the top retail clients and tested a load of 1.2 billion rows. It took 8 minutes to load, and the server had 256 GB of RAM.

Thanks,

Selva

gerhard_jakubec
Contributor III

Loading approx. 80,000,000 fact rows on a virtual machine with 8 GB of RAM creates a swap file of 4 GB (still growing); the load has already been running for 2 hours (still running) ... I am afraid we lack a substantial amount of RAM, but I have no idea how much will suffice for our needs 😞

Anonymous
Not applicable
Author

Hi Matthew,

Thanks for your contribution to a very important topic. I still do not get these three points:

- Ensure the document is set to be 'Preloaded' into the server's memory after each reload.

I am not aware of how to do this?

- Don't be afraid to limit access to chart objects until a certain minimum selection level has been reached, as objects only take up memory when 'drawn' - i.e. in the object's 'Calculation Condition' box use something like GetPossibleCount(My_Field) < 100000.

I do not know how to do this either?

- You should use some form of multi-tiered architecture whereby one QVW extracts the data and generates QVDs for the subsequent dashboards to load from quickly and easily; this stops, say, 3 similar dashboards that each take 3 hrs to load from hitting your source DB in your 7 hr reload window - have the 'mule' do it once in 3 hrs, then have the dashboards load from the QVDs in 3 x 10 mins.

1. Are you suggesting using three separate QVD generators for the separate dashboards?

Can you explain it once again in other words?


Anant

whiteline
Master II

Hi.

A rough estimate gives about ~0.5 GB per numeric field. I guess you also have some strings and IDs, so it's not surprising.

Try halving the data until it loads without swapping.
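
As a rough sanity check: 80,000,000 rows at up to 8 bytes per value is on the order of 0.5 GB per numeric field before QlikView's symbol-table compression. To find the size that still fits in RAM, you could reload only a slice of the facts, for example (table name and path are placeholders):

// Load only the first N fact rows to see where swapping starts;
// halve N until the reload completes without touching the page file.
Facts:
First 40000000
LOAD *
FROM [..\QVD\Facts.qvd] (qvd);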

Colin-Albert

Gerhard - RAM is not expensive. 8 GB is a small amount of RAM for QlikView, especially when dealing with large data volumes. You need to allocate more RAM to your virtual machine. Also, how many CPU cores have you allocated? The load process is multi-threaded.

gerhard_jakubec
Contributor III

Hi Colin,

thanks for your feedback. Currently our virtual server has 8 GB RAM and 2 CPU cores @ 2.93 GHz allocated. The QlikView load script finally succeeded after 2:30 hours, which is acceptable for an overnight reload of the production environment. Average CPU load is low most of this time. The QVW file size after reload is ~300 MB (disk storage). I'm afraid AccessPoint performance will be poor until I get more resources allocated. Also, to check data quality I need to reload all data after each major change in the dev environment. We are in the middle of a major IT restructuring and therefore won't get any budget for upgrades until we complete this migration.

Meanwhile my only option is to rethink my current data model:

Most of my data volume does not originate from our source systems but is derived within my QlikView load script, so I wonder if I could somehow optimize / restructure my QlikView data model into a more condensed one.

I have to provide daily numbers for employee engagements. Each row has fields for entry and leave date. In order to report the current number of employees, I do an IntervalMatch to create one fact record per day between entry and leave date. This is why my QV load script pumps up my initial x0,000 fact rows (1 per employee) by the number of days each person is employed (on average 1,000+). This approach is also suggested for dealing with SCD2 dimensions by Garcia & Harmsen (QlikView 11 for Developers, p. 427). My KPI is kpi_employees = count(distinct employee_business_key).

Of course I could just as well calculate the number of current employees for any given date if I created only two fact records per employee (one linked to the entry date and one to the leave date). But I'd also have to create one extra fact row to properly link each changed SCD2 attribute, which definitely increases script and data model complexity and will be hard for end users to understand.
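
In script terms, the two-records-per-employee variant described above would look roughly like this (field names and the QVD path are hypothetical):

// One +1 row at the entry date and one -1 row the day after the leave date,
// instead of one fact row per employed day.
EmployeeEvents:
LOAD
    employee_business_key,
    entry_date       AS EventDate,
    1                AS HeadcountChange
FROM [..\QVD\Employees.qvd] (qvd);

Concatenate (EmployeeEvents)
LOAD
    employee_business_key,
    leave_date + 1   AS EventDate,
    -1               AS HeadcountChange
FROM [..\QVD\Employees.qvd] (qvd)
WHERE Len(leave_date) > 0;   // only employees who have actually left

Headcount over time would then be an accumulation, e.g. RangeSum(Above(Sum(HeadcountChange), 0, RowNo())) in a chart with EventDate as the dimension - though, as noted, linking changed SCD2 attributes cleanly needs extra rows, so it is a trade-off between model size and complexity.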

Appreciate any suggestions / your experience.