jwaligora
Creator II

Incremental loading and merging of data from multiple connections

Hi,

I'm looking for some data handling strategy feedback.

Scenario:

I have regional data (same format) sitting on 5 separate servers. Each region has 1 month of log data, made available in Hive. The data volume is huge, so I need to:

(1) transform/aggregate it during load and

(2) store the aggregated content for up to a year.

Current Status:

I have a working incremental load script (using a qvd file) for one region (one ODBC connection).

Challenge:

Because loads from each region can fail independently, I would like to keep the regional data in separate qvd files, so that each can be corrected/updated on the subsequent incremental load execution. This means that for EACH connection/region I have to track start/end dates for both the qvd file and the current Hive load.

...I'm assuming I would have to edit [date] variable names so they're different for each connection e.g. vHiveLoadStartDateRegionA, vHiveLoadEndDateRegionA, vHiveLoadStartDateRegionB, vHiveLoadEndDateRegionB, etc. (I understand QV does not have a method of restricting variable scope).

Question:

What's the best way to handle this?

Should I have 5 copies of the same connection script but each with different connection, file, and variable names?

Should I apply some sort of a loop, where the connection, file, and variable names are auto-generated on each iteration?

Regardless of the strategy, what's the best way to merge the regional data for QV visualization, once incremental loads are done?

Thanks,

J.

1 Solution

Accepted Solutions
Anonymous
Not applicable

How about:

  • Input parameter of Region to a single qvw

  • Spreadsheet with 1 row per Region
    • 1st column is Region
    • 2nd column is the ODBC connection string for the region
    • ... any other required columns

  • In qvw read in spreadsheet row where Region = parameter region
    • This will produce just one row
    • Put the ODBC connection string into a variable
    • Dollar-expand this variable to execute it and make the connection (see the script sketch after this list)

  • Create a Publisher Task for each Region
    • Passing in the Region as a parameter
    • Run all these Publisher Tasks in parallel


  • Have your dashboard reload Task dependent on all the Region Tasks having run successfully
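
A minimal sketch of that flow in QlikView script, assuming the region arrives in a variable vRegion (set by the task) and a Regions.xlsx holding Region and ConnectionString columns (all names here are illustrative):

    // Read only the control row for this region.
    RegionConfig:
    LOAD Region,
         ConnectionString
    FROM Regions.xlsx (ooxml, embedded labels, table is Sheet1)
    WHERE Region = '$(vRegion)';

    // One row expected; pull the connection string into a variable.
    LET vConnect = Peek('ConnectionString', 0, 'RegionConfig');
    DROP TABLE RegionConfig;

    // Dollar-expansion executes the stored statement, e.g.
    // ODBC CONNECT TO HiveRegionA (XUserId is ..., XPassword is ...);
    $(vConnect);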


13 Replies
Anonymous
Not applicable

I create discrete qvds for each region, each with a column for [Region]. It is done in a loop using a spreadsheet holding the Region, Connection String, etc.

Then in the dashboard qvw I concatenate them all together.
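
A rough sketch of that dashboard-side merge, assuming the regional files follow a log_<Region>.qvd naming pattern (names are illustrative). Because the tables share one structure, the successive loads auto-concatenate:

    // One row per region; in practice read this from the spreadsheet.
    Regions:
    LOAD * INLINE [
    Region
    A
    B
    C
    D
    E
    ];

    // Load every regional QVD; identical structures auto-concatenate.
    FOR i = 0 TO NoOfRows('Regions') - 1
        LET vRegion = Peek('Region', $(i), 'Regions');
        Logs:
        LOAD *
        FROM log_$(vRegion).qvd (qvd);
    NEXT i

    DROP TABLE Regions;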

rwunderlich
Partner Ambassador/MVP

Use a loop as you are thinking. You can maintain the Region and Connection string in a table, inline or external as Bill suggested.
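
A skeleton of such a loop, assuming a Regions control table with Region and ConnectionString fields; the SQL table and column names are hypothetical:

    FOR i = 0 TO NoOfRows('Regions') - 1
        LET vRegion  = Peek('Region', $(i), 'Regions');
        LET vConnect = Peek('ConnectionString', $(i), 'Regions');

        $(vConnect);   // runs e.g. ODBC CONNECT TO HiveRegionA (...);

        // Incremental slice since this region's QVD high value
        // (vQvdHighDate derived from the QVD itself; see below).
        Data:
        LOAD *;
        SQL SELECT log_date, metric, value
        FROM logs
        WHERE log_date > '$(vQvdHighDate)';

        // Merge with what is already on disk, then store back
        // (skip the concatenate on the very first run).
        CONCATENATE (Data)
        LOAD * FROM log_$(vRegion).qvd (qvd);

        STORE Data INTO log_$(vRegion).qvd (qvd);
        DROP TABLE Data;
        DISCONNECT;
    NEXT i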

However, I wouldn't try to keep track of the start/end range with variables, as the variables live in the QVW and that doesn't reliably match the QVD. Instead, get the start/end by looking at the QVD each time. Here's a fast-performing method to manage that:

Super Fast Method to Retrieve QVD High Value | Qlikview Cookbook
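
The gist of that technique, sketched for an illustrative log_date field: load just the one field from the QVD (which keeps the load optimized), then aggregate over its distinct symbol values instead of every row:

    TempDates:
    LOAD log_date
    FROM log_$(vRegion).qvd (qvd);   // single-field load stays optimized

    // FieldValue() walks the distinct values of log_date, so the
    // Max() scans far fewer rows than the full QVD.
    MaxDate:
    LOAD Max(FieldValue('log_date', RecNo())) AS MaxLogDate
    AUTOGENERATE FieldValueCount('log_date');

    LET vQvdHighDate = Peek('MaxLogDate', 0, 'MaxDate');
    DROP TABLES TempDates, MaxDate;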

jwaligora
Creator II
Author

Thanks bill.markham & rwunderlich,

I will commence spreadsheet / temp table experimentation tomorrow.

I expect variable management will be the biggest headache...so I'll be back here crying for help in no time!

For some added context, I attached my single-connection code.

J.

maxgro
MVP

If I understand correctly, you have your source data on 5 different servers (Hive).

You can reduce the load window by starting 5 parallel tasks; the last task (a 6th) will synchronize (wait for all the regional tasks) and merge the qvds for the visualization.

jwaligora
Creator II
Author

Running the jobs asynchronously would be preferred, but based on what I can find in the QV docs, an open connection is automatically disconnected when a new 'connect' statement is encountered (from the help doc on the 'disconnect' keyword). Is there a way to force parallel execution with multiple connections open simultaneously?

maxgro
MVP

You're right if you're thinking of a single .qvw.

But in a server environment you can use 5 parallel tasks (5 different .qvw files loading data in parallel).

Anonymous
Not applicable

You can run tasks [aka jobs] in parallel, assuming you have enough server grunt to handle the load.

You could still have a single qvw for the loads with a parameter for Region, so it picks up the connection string etc. from your connection details spreadsheet/table for the row for that Region.

For keeping track of start/end dates I store them all in a qvd, concatenating extra rows on when needed. This qvd can then be read and variables set to the values stored in it.
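
A sketch of that bookkeeping, assuming a LoadHistory.qvd with Region, StartDate and EndDate columns (names illustrative; the first run would need to create the file):

    // Append this run's window to the history.
    LoadHistory:
    LOAD Region, StartDate, EndDate
    FROM LoadHistory.qvd (qvd);

    CONCATENATE (LoadHistory)
    LOAD '$(vRegion)'    AS Region,
         '$(vStartDate)' AS StartDate,
         '$(vEndDate)'   AS EndDate
    AUTOGENERATE 1;

    STORE LoadHistory INTO LoadHistory.qvd (qvd);
    DROP TABLE LoadHistory;

    // Later: read the most recent window for a region into variables.
    LastRun:
    LOAD StartDate, EndDate
    FROM LoadHistory.qvd (qvd)
    WHERE Region = '$(vRegion)';

    LET vPrevStart = Peek('StartDate', -1, 'LastRun');   // -1 = last row
    LET vPrevEnd   = Peek('EndDate', -1, 'LastRun');
    DROP TABLE LastRun;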

jwaligora
Creator II
Author

I think I can wrap my head around the multi-qvw implementation maxgro suggested (assuming I can set task dependencies in QV when scheduling the final/aggregate qvw), but I haven't quite figured out how a parameterized solution will help overcome the sequential nature of ODBC connect statements. Any chance you can manufacture some high-level pseudo-code to demo the flow?

(Keep in mind this is my 2nd QV dashboard, with the 1st one dating back about 2 years, so I may at times be missing the obvious).

Anonymous
Not applicable

Is there just a single ODBC connection for each of your Regions?

And is this different for each Region?