I have a load script that is attempting to create a summary table from detailed data. The memory usage seems to explode exponentially, (and gets excessive above 1 million rows), but there is only one resultant table (i.e. no joins causing cartesian products). Anyone have any ideas why this is the case?
Here is the basic script (many of the numeric column have been removed to abbreviate the script):
Detail: First 500000 LOAD Item_Key, Store_Key, Time_Key, Sales, Costs, : 42 more measures : FROM [2011 Sales Weekly.qvd] (qvd);
inner join LOAD Time_Key, Billing_Year, Billing_PP, Billing_Period FROM [Dim Time.qvd] (qvd) where exists( Time_Key );
Inner Join LOAD Item_Key, Vendor_Key FROM [Dim Item.qvd] (qvd) where exists( Item_Key );
Summary: Load Item_Key as Item_Key, Store_Key as Store_Key, Vendor_Key as Vendor_Key, Billing_Year as Billing_Year, Billing_PP as Billing_PP, Billing_Period as Billing_Period, Sum( Sales ) as Sales, Sum( Costs ) as Costs, : 42 more measures : Resident Detail Group By Item_Key, Store_Key, Vendor_Key, Billing_Year, Billing_PP, Billing_Period;
Drop Table Detail;
STORE Summary into [2011 Sales By Period.qvd] (qvd);
I'm not seeing the cause of the problem. You might need a noconcatenate when loading the summary table if it has the same fields as the detail table. In that case, QlikView would concatenate them together, and then when you drop detail, drop the whole thing. But I don't see how that would cause a performance problem, just a results problem.
What was even more strange is that when I removed the Exists clause on the joined tables, it would return a cartesian product, which makes no sense given that it is an inner join to begin with.
For a while there, I was having all kinds of problems in that I set the First command to 10 and it still would blow up memory. Really strange...I am wondering if the number of columns that I am summing is causing QV to lose it's head.
One other thing I do differently, in case it matters - I always explicitly specify which table I'm joining to rather than letting QlikView figure it out for me. I want to KNOW that I'm joining to the table I intended, not just hope. So I'd be saying INNER JOIN (Detail) everywhere. I can't think how that would make any difference, though, as the table to inner join to seems obvious (it is, after all, the only table that exists).
50 or so columns isn't a lot of columns, so I can't imagine that's the problem either.
Thanks for the reply. As it turns out, it seems that there is something going on in the way QV handles Group Bys. As I add more columns with which to calculate a sum on, it adds a tremendous amount of memory overhead.
Grouping 11 million rows over 8 key fields down to 8,245,570 records I see the following pattern:
Summing 2 Metrics: 3.783GB
Summing 3 Metrics: 4.133GB
Summing 4 Metrics: 5.004GB
Summing 5 Metrics: 5.372GB
With a caveat on the last one because I was running low on memory.
Anyway, you can see that each incremental metric adds a tremendous amount of memory. Once the operation is complete, however, it quiesces down to 429mb (with 5 metrics).
With the 42 metrics that I need to add, it would take approximately 22GB of RAM just to create a summary table, and no telling what would happen as I add the remaining detail records, which will be probably another 20million rows.
Anyone have any other ideas for creating summary tables?
Re: Eploding memory when creating summary table. Why?
Wow. Either something strange is going on, or QlikView is using a very memory-inefficient algorithm. Not that QlikView would work strictly procedurally on the problem, but I'd have expected something like sorting the data by your group by fields, then going down the list summing up. It could just sort pointers to the original table, and then drop the pointers as it builds your summary records. It doesn't seem like it should take much memory at all. Maybe there was some memory vs. performance trade off they made, since what I mention would only be moderate performance? Still...
I'd get in touch with QlikTech. That just doesn't seem right.
To work around it if you can't wait to hear back from them, you could do the sums yourself. There's probably a more efficient way, but something like this maybe:
LEFT JOIN (Detail) LOAD key fields ,if(key1<>previous(key1)
or key2<>previous(key2) etc.,1) as StartFlag RESIDENT Detail ORDER BY key fields ; LEFT JOIN (Detail) LOAD key fields ,previous(StartFlag) as EndFlag RESIDENT Detail ORDER BY key fields descending ; Summary: NOCONCATENATE LOAD key fields ,EndFlag ,Sales + if(StartFlag,0,peek(Sales)) as Sales ,Cost + if(StartFlag,0,peek(Cost)) as Cost ,etc. RESIDENT Detail ; INNER JOIN (Summary) LOAD 1 as EndFlag AUTOGENERATE 1 ; DROP TABLE Detail; DROP FIELD EndFlag;