Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi all,
I'm using the GROUP BY statment to load and aggregate 300M recs and it takes huge amount of time to load the data.
Will appreciate your advice on how to reduce the load time.
the script is as follows:
sessions:
load
id
,
"occurrence_date"
,
date
(floor(occurrence_date)) as Date,
day
(floor(occurrence_date)) as e_day,
year
(floor(occurrence_date)) as e_year,
month
(floor(occurrence_date)) as e_month,
week
(floor(occurrence_date)) as e_week,
hour
(occurrence_date) as e_hour,
"session_id"
,
"user_id"
FROM sessions.qvd(qvd);
left
join
load
"event_id"
as id,
"param_name"
,
"param_value"
,
if
(param_name='merchantName',param_value) as merchantName,
if
(param_name='dlsource',param_value) as dlsource_for_the_session,
if
(param_name='CD_CTID' ,param_value) as dl_Sub_source_for_the_session,
if
(param_name='category_id',param_value) as Category_ID,
if
(param_name='source_id',param_value) as Source_ID
FROM
event_params.qvd(qvd);
aggr:
load
session_id
,
count
(distinct if (param_name<>'slideUp' and action='initial search', id)) as Total_searches,
count
(distinct if (action='merchant click',id)) as Total_merchant_click,
count
(if (action='initial search' and param_name='before' and param_value<>'2',user_id)) as Total_users_initiated_and_slideup,
count
(if (action='initial search' and param_name='slideUp' and param_value='0',param_name)) as System_searches_no_slideUps,
count
(if (action='initial search' and param_name='slideUp' and param_value='1',param_name)) as Slideup_searchs
Resident
sessions group by session_id;
action,David,
well, GROUP BY and LEFT JOIN are both pretty heavy operations, especially when applied to some 300M rows - that's a very large database. No matter what, it's going to take time. Here are some of the techniques you might consider:
1. THink of implementing incremental load, to reduce the number of rows that need ot be processed every night.
2. You have many IF() formulas, and those are extremely slow. Typically we neglect the impact of IF() on the load script, but with 300M rows you can't afford it. Consider replacing your IF() formulas with multiple loads using WHERE. For example:
LOAD
session_id, count (distinct if (param_name<>'slideUp' and action='initial search', id)) as Total_searches,
count (distinct if (param_name<>'slideUp' and action='initial search', id)) as Total_searches,
...
will be much slower than the following:
I realize that you might have to run several separate loads like this, but my feeling that it will be faster this way than evaluating multiple IF conditions over 300M rows, each row by row.
I believe that your load will go much faster if you can get rid of your IF() conditions.
cheers,
LOAD
session_id,
RESIDENT ...
WHERE param_name<>'slideUp' and action='initial search'
In the future when posting code, try either cutting and pasting to notepad first, and then to the forum from notepad. Or go into settings -> user preferences -> editor, and remove the checkmark next to "Copy as RTF".
On to the question itself.
First, when loading from a QVD, it is best if you can do an optimized load. Things like date(floor(occurrence_date)) and hour(occurrence_date)should already have been done when creating the QVD itself.
As for day, week, month and year, I'd create a separate calendar table for those. There's no reason to redo that sort of logic for every single row in your table, even during QVD creation.
If I'm interpreting your event_params load correctly, you're both creating a separate row for each parameter AND a separate field for each parameter. Shouldn't you have one or the other? And why, at least if you have separate rows, would you left join onto the sessions table, since that will duplicate every row of the sessions table for every parameter you have. I'd just keep the table separate and/or remove the param_name in favor of specific fields for the parameters.
Down in the group by, if you had separate fields for the parameters of interest, and only one row for each ID instead of multiple, the group by would run more quickly.
But I'm probably totally misunderstanding everything. It's kind of hard to read.
this whole "code" keyword doesn't work with me here, but I hope you can get the point, despite some glitches in presentation...