Hi there!
I have a huge fact table (300M rows) and one small mapping table (50K rows), both in CSV. Simple scenario: I need to map RegionID from the mapping table and then aggregate the fact data, grouping by three fields.
I ran the script below on a Qlik Sense server (192 GB RAM; 12 cores). The script used up ALL the memory (for the Engine.exe process), started using the swap file extensively, and after 4.5 hours (!!!) I terminated it because I had lost my temper.
While waiting, I loaded the same tables into SQL Server 2016 (32 GB RAM; 4 cores at a lower clock speed), built 3 indexes (1 hour) and ran a simple query that JOINs the fact table to the mapping table and then does a GROUP BY (50 min).
Total execution time: 1h 50min.
What's wrong with you, Qlik?
More seriously:
Any directions on how to debug this issue?
I did a simple ApplyMap and GROUP BY; why in hell did it eat so much memory?
Maybe I just should NOT use Qlik for ETL tasks when I have SQL Server available?
Thank you in advance.
My script is below:
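(For reference, Map_RegionID is loaded first as a mapping table, roughly like this; the map file name here is just a placeholder:)

Map_RegionID:
MAPPING LOAD
    @1 AS id,
    @2 AS RegionSomeMappedID
FROM [lib://map.csv];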
FactCSV:
LOAD
    @1 AS StoreId,
    @2 AS GoodsId,
    @3 AS SellingSum,
    @4 AS PurchaseSum,
    ApplyMap('Map_RegionID', @5) AS RegionSomeMappedID // Map_RegionID above is loaded as a mapping table (id, RegionSomeMappedID)
FROM [lib://fact.csv];

AggregatedFact:
LOAD
    StoreId, GoodsId, RegionSomeMappedID,
    sum(SellingSum) as SellingSum,
    sum(PurchaseSum) as PurchaseSum
RESIDENT FactCSV
GROUP BY StoreId, GoodsId, RegionSomeMappedID;
Looks like all 5 fields in your AggregatedFact table are also in your FactCSV table.
So this will be attempting to generate a 5-field synthetic key, which, depending on your actual data, could end up being huge, gobble up your RAM and take ages.
Maybe change your AggregatedFact load to something like this:
AggregatedFact:
LOAD
    StoreId,
    GoodsId as AGGRGoodsId,
    RegionSomeMappedID as AGGRRegionSomeMappedID,
    sum(SellingSum) as AGGRSellingSum,
    sum(PurchaseSum) as AGGRPurchaseSum
RESIDENT FactCSV
GROUP BY StoreId, GoodsId, RegionSomeMappedID;
Even if this loads in a reasonable time I doubt it will be your end game, as the only association between the tables will be on StoreId. I expect you may need a composite key to associate your 2 tables.
Or maybe if you only need the aggregations you could simply drop the FactCSV table at the end.
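To illustrate both ideas, here is a rough, untested sketch (field names follow your script; %AggrKey is just a naming convention of mine, not anything your data requires):

// Option A (composite key): aggregate onto a single key field, and add the same
// key expression to FactCSV's own LOAD so the two tables associate on %AggrKey only.
AggregatedFact:
LOAD
    StoreId & '|' & GoodsId & '|' & RegionSomeMappedID as %AggrKey,
    sum(SellingSum) as AGGRSellingSum,
    sum(PurchaseSum) as AGGRPurchaseSum
RESIDENT FactCSV
GROUP BY StoreId, GoodsId, RegionSomeMappedID;

// Option B (aggregates only): keep the renamed fields as in my example above
// and simply remove the detail table once the aggregation is done.
DROP TABLE FactCSV;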
Thank you, Bill.
Your point was right. The aggregation time is now on par with SQL Server.
One more thing is bothering me... My SQL Server runs on much weaker hardware than Qlik, yet the aggregation time is on par. Does Qlik use parallelization for these tasks?
Yup, in Qlik script a Group By will run single-threaded, whereas an RDBMS will often multithread it.
Group By aggregations will always be heavy on machine resources, be it Qlik or a database, and one has to make a considered choice of which machine's resources to hammer. If it is a production database, then taking load off it can often be preferred, as otherwise end-user OLTP performance can suffer; plus, if you have per-core RDBMS licensing, the financial difference can be huge.
Some years back I gave a presentation to the board and said I estimated that offloading such workload from the Oracle database servers to the Qlik server would save them approximately GBP 5 million over five years. That was quite a claim, which the CFO picked up on, asking the CIO what he thought of it. The CIO responded that it was probably in the right order of magnitude. The board approved the Qlik project and has not regretted the decision.
That said, one of the strengths of Qlik only becomes apparent when data is loaded at the transactional level and not pre-aggregated, as then the end users can dynamically slice 'n' dice in the dashboards and select dimensions interactively, in ways that quite likely would never have been thought of while specifying which aggregations are needed.
Doing this with 300 million fact rows would need a well-designed and tuned data model and front-end GUI. But I have a Qlik app with over a billion fact rows, and it performs adequately and the end users love it. It does have a fair stash of RAM and processor cores for Qlik, though.
Thank you, Bill, for your answer, really helpful!
Money always matters :)
My real problem is that the underlying transaction fact table is way bigger than 300M rows; that is actually just a monthly increment. I have serious doubts that raw transactions should be put into Qlik even for one year in this case; the amount of memory needed for such a cluster would be very expensive. Hence, I have to aggregate the data.
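Roughly, what I have in mind for each monthly run is something like this (just a sketch; the QVD name is a placeholder, and the first run would need a guard because the history QVD does not exist yet):

// Aggregate the current monthly increment, same pattern as above
// (a month field could be added to the GROUP BY if month-level slicing is needed)
MonthlyAggregate:
LOAD
    StoreId, GoodsId, RegionSomeMappedID,
    sum(SellingSum) as SellingSum,
    sum(PurchaseSum) as PurchaseSum
RESIDENT FactCSV
GROUP BY StoreId, GoodsId, RegionSomeMappedID;

// Append the previously aggregated months and store the result back
CONCATENATE (MonthlyAggregate)
LOAD * FROM [lib://AggregatedHistory.qvd] (qvd);

STORE MonthlyAggregate INTO [lib://AggregatedHistory.qvd] (qvd);
DROP TABLE FactCSV;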
OK, so Group By will run single-threaded. What about the formulas in measure calculations when a user makes a request in the app? Do they run single- or multi-threaded?