Not applicable

How to reduce memory allocation for grouping data load in a script

Initially I had this:


AdPlayed:
LOAD
StationId
,AdId
,EST_DateTime
,0 as Predicted
,sum(AdsPlayedAmount) as AdsPlayedAmount
group by
StationId
,AdId
,EST_DateTime
;
Load
iStationID as StationId
,AdID as AdId
,floor(Date(DateDue)*24) / 24 as EST_DateTime // round DateDue down to the start of the hour
,ImpsPlayed as AdsPlayedAmount
;
SQL SELECT [iStationID]
,[DateDue]
,[AdID]
,[ImpsPlayed]
FROM [AdPlayList] (nolock) apl
where apl.DateDue >= '$(vStartDate)'
and apl.DateDue < '$(vEndDate)';


And I got the following script statistics:

187.7 million records in the result table.
Maximum memory allocation during the script run was about 50 GB.
Execution time was 1901 sec, or ~31 min.

I thought that maybe it happened because of the stacked LOAD statements, so I decided to split them. I ended up with this:


Temp:
Load
iStationID as StationId
,AdID as AdId
,floor(Date(DateDue)*24) / 24 as EST_DateTime
,ImpsPlayed as AdsPlayedAmount
;
SQL SELECT [iStationID]
,[DateDue]
,[AdID]
,[ImpsPlayed]
FROM [AdPlayList] (nolock) apl
where apl.DateDue >= '$(vStartDate)'
and apl.DateDue < '$(vEndDate)';

AdPlayed:
LOAD
StationId
,AdId
,EST_DateTime
,0 as Predicted
,sum(AdsPlayedAmount) as AdsPlayedAmount
Resident Temp
group by
StationId
,AdId
,EST_DateTime
;

DROP Table Temp;


The initial LOAD gave me 329.3 million records, 3.5 GB in memory, and 1139 sec (~19 min) of script run.

After that the second LOAD statement started. During the grouping part of the statement, memory grew to 55 GB. After the aggregation part of the script, memory allocation was about 59.6 GB.

At the end of the second script, after the DROP statement, I got:

187.7 million records in the result table;
1.7 GB in memory;
3734 sec (~62 min).

As you can see, the first version (stacked LOADs) is roughly twice as fast and requires less memory. But I'm curious about the huge memory allocation for the grouping: 3.5 GB of initial data turned into 59.6 - 3.5 = 56.1 GB of memory during the GROUP BY, and after all that I got just a 1.7 GB result table!

For reference, I have about 18,000 unique StationIds, about 300,000 unique AdIds, and the period is 400 days back, i.e. ~9,600 unique EST_DateTime values.

The question is: what does QV do with memory for the grouping function? And is it possible to reduce the memory allocation for such statements?

12 Replies
Not applicable
Author

Hi,

Why don't you move the GROUP BY to SQL? It would surely save QV resources, and transferring lots of raw data over a slow network is not nice for the DB either.

-Alex
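For example, something along these lines. This is only a rough sketch assuming the source is SQL Server; the DueHour alias is made up, and DATEADD/DATEDIFF is just one way to truncate to the hour on the database side:

// the aggregation now happens in the database, so QV only receives the grouped rows
AdPlayed:
LOAD
iStationID as StationId
,AdID as AdId
,DueHour as EST_DateTime
,0 as Predicted
,AdsPlayedAmount
;
SQL SELECT apl.[iStationID]
,apl.[AdID]
,DATEADD(hour, DATEDIFF(hour, 0, apl.[DateDue]), 0) as DueHour
,SUM(apl.[ImpsPlayed]) as AdsPlayedAmount
FROM [AdPlayList] apl WITH (NOLOCK)
WHERE apl.DateDue >= '$(vStartDate)'
and apl.DateDue < '$(vEndDate)'
GROUP BY apl.[iStationID]
,apl.[AdID]
,DATEADD(hour, DATEDIFF(hour, 0, apl.[DateDue]), 0);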

Not applicable
Author

That is a DB solution; I'm looking for a QV approach, because the QV advice is to load data into QV and do all the calculations there afterwards. And this is a pretty simple grouping function, yet it takes so much memory. It's bizarre, isn't it?

johnw
Champion III

It does seem strange to me. I wish I had something to suggest.

Not applicable
Author

Sometimes there can be 500 million rows in a data warehouse table and you need it grouped down to 50,000 rows. That's a good example of when to do the processing in the database instead of insisting on doing it in QV.

-Alex

Not applicable
Author

Alex,

It's not a solution, because sometimes there is no database at all, for example when you load data from flat files. But this behavior is strange.

Not applicable
Author

I removed AdId from the script and it became smooth. It looks like QV creates a separate set of data for each unique combination from the GROUP BY clause by duplicating records, then applies the grouping function to each set, and only after that frees the memory.

But from a performance-optimization standpoint, and because of the in-memory implementation, I think it would be more robust to build a unique index over the GROUP BY fields and run the grouping function against that. On the other hand, maybe it would be the index that ate the memory.
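A rough sketch of what that could look like in script, building a single composite key with autonumber() and grouping by that one key. The Keyed table and GroupKey field are made up for illustration, and whether this actually lowers the peak is exactly the open question:

Keyed:
LOAD
autonumber(StationId & '|' & AdId & '|' & EST_DateTime) as GroupKey // one small integer per unique combination
,StationId
,AdId
,EST_DateTime
,AdsPlayedAmount
Resident Temp;

AdPlayed:
LOAD
GroupKey
,only(StationId) as StationId // only() returns the single distinct value within each group
,only(AdId) as AdId
,only(EST_DateTime) as EST_DateTime
,0 as Predicted
,sum(AdsPlayedAmount) as AdsPlayedAmount
Resident Keyed
group by GroupKey;

DROP Table Keyed;
DROP Field GroupKey;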

johnw
Champion III

Since it's in the load script, I'm not sure to what extent the in-memory data structures resemble their eventual form. For instance, it doesn't create synthetic keys until the very end. There are surely other things it doesn't do until later.

If it were a regular database, I'd expect it to load in all of the rows, sort them by the group by fields, then scan through the resulting temporary table, adding up the data as it goes. I'd expect this to at most use twice the space of the original table. In other words, since you can load the table in 3.5 GB of memory, it's hard for me to imagine how the peak would be more than 7.0 GB of memory. Well, a little more, since it's storing the grouped results, but it could drop the original results as it completes each group, so the extra should be negligible. But for argument, let's say you're grouping on a unique ID, and it doesn't drop the original data. In that case, I could see it peaking at 10.5 GB of data.

So obviously it isn't behaving that way. You say removing the AdId made it smooth, by which I assume you mean it drastically reduced the peak memory required? So you had about 330 million records, and 190 million groups, and AdId was what was giving you so many groups? Still, I can't see how it would consume very much memory. Again, I wouldn't expect it to go over 10 GB, even if poorly coded.

How about this: what happens if you just load in all the raw data (3.5 GB, right?) and then do the aggregation in a chart by using those fields as dimensions? Eh, it'll probably blow up. But if not, I'd be curious whether it consumes the same 50-60 GB of memory, or less, or more. What I'm curious about is whether the group by in the script works the same way as the group by in a chart. It might provide some hint about what's going on.
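For concreteness, the chart I have in mind is nothing more than a straight table over the raw data with the three fields as dimensions and a single expression:

// Dimensions: StationId, AdId, EST_DateTime
// Expression:
Sum(AdsPlayedAmount)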

It's really making no sense to me, and I hate it when I can't make sense of something.

Maybe I should run some script tests. If I could see when and how much the memory increases with the group by, maybe it would be a clue as to how QlikView is doing the work.

johnw
Champion III

OK, the problem is easily reproduced with a much smaller example. I don't have a lot of data points to support this, but equal increases in the number of groups appear to produce approximately equal increases in additional peak memory required over the storage for the base table. In my case, every million groups added about 200 MB of memory.

These additions also started at the memory for the base table, not at double as would be suggested by the method I mentioned before. So that's even further evidence that QlikView is not using that method. For a small number of groups, it's using a very small amount of memory, much more efficient than duplicating the full table. But for a large number of groups, it's using a huge amount of memory, much less efficient than duplicating the full table.

The group by also appears to follow a very specific, three-phase pattern. In the first phase, no rows are being loaded into the final table, and memory use is increasing quickly and linearly over time. In the second phase, still no rows are being loaded into the final table, memory use remains flat at the peak, and a fair amount of time passes, perhaps approximately equal to the first phase. In the third phase, memory spikes a little higher than before as the rows are actually loaded into the final table.

Let's see, if I had data stored like QlikView and I had to code an algorithm to efficiently process SMALL numbers of groups, how would I do it? Well, in my case I'm only grouping by a single field. So maybe I'd move down through the values of that field, using my normal relational processing to return all the rows for that value, then sum those rows up, insert it into my final table, and move on to the next value.
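In script terms, that one-group-at-a-time idea would look roughly like the loop below, written against the GroupID/Value test table from the script further down. It is purely an illustration of the idea, not a claim about how QlikView works internally, and it would be far too slow to actually run for millions of groups.

// Hypothetical: process one group at a time, so the working storage for each group can be freed immediately
FOR Each vG in FieldValueList('GroupID')
B:
LOAD
GroupID
,sum(Value) as Value2
Resident A
WHERE GroupID = $(vG)
group by GroupID; // each pass only ever sees a single group; the one-row results auto-concatenate into B
NEXT vG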

But the observations are inconsistent with that. That shouldn't use much memory at all, since each group is processed independently, and then the working storage for it could be dropped. Also, it should result in a one phase memory profile instead of a three phase memory profile, as each group gets processed completely before moving on to the next.

For fun, I'm trying the experiment I mentioned earlier - making a chart do my "group by". Hah, no problem. It ran much faster, and had a much lower memory peak than doing the equivalent during the load. Charts are almost definitely handling it differently than the load. That seems peculiar, though perhaps it does have something to do with not having the full in-memory model until the end of the load. Even the chart used more peak memory than I think it should, though.

OK, the first phase of a script group by is linear increases in memory. That sounds like loading some sort of internal data structure. The second phase is flat, which sounds like just computation being done on that data structure. Finally, we spike as we load the real data, during which probably both the internal data structure and the final table exist at the same time. That's my theory, anyway.

If that is roughly what's going on, the internal data structure isn't actually holding the full table of data, as then it wouldn't use so little memory for few groups and so much memory for many groups. It uses approximately the same amount of memory per group, even though the number of rows per group is decreasing with the number of groups since I used the same size table each time. So the memory used has nothing to do with the number of rows in that group. Well, how many bytes per group? In my example, about 200 bytes of memory per group, regardless of size of group. Hmmm, does the size of a row in the underlying table affect it? That doesn't appear to affect it. What about having twice as many rows in the underlying table? That doesn't appear to affect it.

Well, this hypothetical data structure would have to have a key for the group, so at least the group value or a pointer to it. Since memory doesn't go up during the hypothetical calculation phase, we'll assume it's reserving space for the calculation. But in my example, that doesn't seem like more than a small fraction of the 200 bytes (well, 180 bytes). So what are all the other bytes? What data would help us to do the calculation? The obvious thing to me is some way of pointing at the "rows" of data, but we obviously can't be storing the rows or even pointers to the individual rows, or the size required would go up with the number of rows in each group.

No idea. I am not currently thinking of any useful internal data structure consistent with what I've observed. Perhaps I'm off track, and there's no such structure. Or perhaps I'm just not thinking of what it would include.

My script, in case it's useful. I was mostly fiddling with the mod number and the autogenerate number.

A:
LOAD
mod(recno(),3000000) as GroupID // ~3 million distinct group values
,1 as Value
AUTOGENERATE 5000000 // 5 million generated rows
;

B:
LOAD
GroupID
,sum(Value) as Value2
RESIDENT A
GROUP BY GroupID
;

I'm giving up for now. Maybe someone else can think of something, or maybe a developer can comment.

Not applicable
Author

This is exactly the same picture I have. And the more fields there are in the GROUP BY, the more memory the grouping needs. I'm wary of making this assumption, but it looks like QV creates duplicate sets of data for each field in the GROUP BY and then, on top of those sets, works out the unique grouping combinations.