Discussion Board for collaboration on QlikView Scripting.
I'm looking for some data handling strategy feedback.
I have regional data (same format) sitting on 5 separate servers. Each region has 1 month of log data, made available in Hive. The data volume is huge, so I need to:
(1) transform/aggregate it during load and
(2) store the aggregated content for up to a year.
I have a working incremental load script (using a qvd file) for one region (one ODBC connection).
Because loads from each region can fail independently, I would like to keep the regional data in separate qvd files, so that each can be corrected/updated on the subsequent incremental load execution. This means that for EACH connection/region I have to track start/end dates for both the qvd file and the current Hive load.
...I'm assuming I would have to edit [date] variable names so they're different for each connection e.g. vHiveLoadStartDateRegionA, vHiveLoadEndDateRegionA, vHiveLoadStartDateRegionB, vHiveLoadEndDateRegionB, etc. (I understand QV does not have a method of restricting variable scope).
What's the best way to handle this?
Should I have 5 copies of the same connection script but each with different connection, file, and variable names?
Should I apply some sort of loop, where the connection, file, and variable names are auto-generated on each iteration?
Regardless of the strategy, what's the best way to merge the regional data for QV visualization, once incremental loads are done?
How about:
I create a discrete qvd for each region, each with a [Region] column. The load runs in a loop driven by a spreadsheet holding the Region, connection string, etc.
Then in the dashboard qvw I concatenate them all together.
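Since the regional qvds share the same layout, the dashboard can pull them all in with a single wildcard load, which auto-concatenates files with identical field lists. A minimal sketch, assuming a hypothetical naming convention like Logs_RegionA.qvd:

```qlikview
// All regional QVDs match the pattern Logs_Region*.qvd (placeholder names).
// Identical field lists mean QlikView concatenates them automatically.
Logs:
LOAD * FROM Logs_Region*.qvd (qvd);
```

Because each file already carries its [Region] column, no extra tagging is needed at this stage.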
Use a loop as you are thinking. You can maintain the Region and connection string in a table, inline or external as Bill suggested.
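The loop might look something like this; the control table, DSN names, field names, and the incremental WHERE clause are all placeholders to be adapted:

```qlikview
// Control table; could equally be loaded from a spreadsheet.
Regions:
LOAD * INLINE [
Region, ConnString
RegionA, HiveRegionA
RegionB, HiveRegionB
];

FOR i = 0 TO NoOfRows('Regions') - 1
  LET vRegion     = Peek('Region', $(i), 'Regions');
  LET vConnString = Peek('ConnString', $(i), 'Regions');

  // Each CONNECT implicitly closes the previous connection.
  ODBC CONNECT TO [$(vConnString)];

  Data:
  LOAD '$(vRegion)' AS Region, *;
  SQL SELECT * FROM logs;   // add your incremental date filter here

  STORE Data INTO Logs_$(vRegion).qvd (qvd);
  DROP TABLE Data;
NEXT i
```

One qvd per region falls out naturally, and a failed iteration only affects that region's file.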
However, I wouldn't try to keep track of start/end range with variables, as the variables are in the QVW and that doesn't reliably match the QVD. Instead get the start/end by looking at the QVD each time. Here's a fast performing method to manage that:
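One fast-performing approach (a sketch with placeholder field and file names): load only the date field from the qvd, which keeps the read optimized, then take the max from the resident table.

```qlikview
// Loading a single field with no transformations keeps the QVD read optimized.
tmpDates:
LOAD LogDate FROM Logs_RegionA.qvd (qvd);

MaxDate:
LOAD Max(LogDate) AS MaxLogDate RESIDENT tmpDates;
DROP TABLE tmpDates;

// The variable now reflects what is actually in the QVD, not a stale value.
LET vQvdEndDate = Peek('MaxLogDate', 0, 'MaxDate');
DROP TABLE MaxDate;
```

This way the incremental window is always derived from the file itself, so a failed or partial load self-corrects on the next run.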
If I understand correctly, you have your source data on 5 different servers (Hive).
You can reduce the load window by starting 5 parallel tasks; the last (6th) task will synchronize (wait for all the regional tasks) and merge the qvds for visualization.
Running the jobs asynchronously would be preferred, but based on what I can find in the QV docs, an open connection is automatically disconnected when a new 'connect' statement is encountered (from the help doc on the 'disconnect' keyword). Is there a way to force parallel execution with multiple connections open simultaneously?
You can run tasks [aka jobs] in parallel, assuming you have enough server grunt to handle the load.
You could still have a single qvw for the loads with a Region parameter, so it picks up the connection string etc. from the row for that Region in your connection-details spreadsheet/table.
For keeping track of start / end dates I store them all in a qvd, concatenating on extra rows when needed. This qvd can then be read and variables set to the values stored in it.
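A rough sketch of that tracking qvd, with hypothetical names throughout (the INLINE row stands in for the range you would build from variables after a successful load; the first run would create the file instead of loading it):

```qlikview
// Append the latest load range for a region to the history QVD.
LoadHistory:
LOAD Region, StartDate, EndDate FROM LoadHistory.qvd (qvd);

CONCATENATE (LoadHistory)
LOAD * INLINE [
Region, StartDate, EndDate
RegionA, 2015-06-01, 2015-06-30
];

STORE LoadHistory INTO LoadHistory.qvd (qvd);

// Later, recover the last EndDate for a region into a variable.
tmp:
LOAD Max(EndDate) AS LastEnd RESIDENT LoadHistory
WHERE Region = 'RegionA';
LET vHiveLoadStartDate = Peek('LastEnd', 0, 'tmp');
DROP TABLES tmp, LoadHistory;
```

Keeping the history in a qvd rather than in qvw variables means the state survives even if a reload is run from a fresh copy of the document.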
I think I can wrap my head around the multi-qvw implementation maxgro suggested (assuming I can set task dependencies in QV when scheduling the final/aggregate qvw), but I haven't quite figured out how a parameterized solution will help overcome the sequential nature of ODBC connect statements. Any chance you can manufacture some high-level pseudocode to demo the flow?
(Keep in mind this is my 2nd QV dashboard, with the 1st one dating back about 2 years, so I may at times be missing the obvious).