Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi Qlikers
The days of incremental loads are gone, there is a new kid on the block and its name is CSV's, whoa hold on a minute "how so ?" you ask.
I hope that your sitting comfortably, if so then I shall begin.
Twas a rainy Monday morning and this particular 35M record QVD finally snapped and to rebuild it from scratch took just over three and a half hours, which is not too bad I suppose, but we were not happy with that and with the demand for the latest up to date data we had a problem to solve.
We tried building chunked 1M record QVD's which still took three hours to build, so next we tried converting those chunked QVD's into CSV's and this is where the magic happend, when we imported those CSV's into Qlikview it took six minutes, hmmmm that's quick.
So a quick brainstorming session to find a way of expanding on the CSV approach, we came up with the following
So now we have made a process where we add the table's we wish to build into an XLS file, and with a few clever enhancements to the process we can now rebuild all of our QVD's in less than an hour.
We experimented with with (RSYNC, FTP, SCP) for transporting CSV files, but found that SCP offered compression so we used it.
The team wish for this procedure to be called Batch Load, and it will be called always be called so, from this day forth.
Thanks to the team - Dave, Oke and Rich. Your Awesome.
I did not know what else to post it as ![]()
Very thoughtful, informative, and innovative dave. I would definitely like to learn more about the db ~>php ~> csv process.
When I had to extract large tables in the past (personal experience for splitting QVDs is around the > 2GB range. After that, unless the server is very robust with plenty of RAM/CPU, there becomes a noticeable amount of response lapse).
So let's say the last time I had to do an extraction of this sort, we have an incremental process against a 35GB table holding 3 years of transactions, and the newly incremental transactions could update any part of the QVD range (not just the most recent time-period).
Throughput was limited by ODBC pipe and how many connections the server would actually pull data through (even if you launched more). In that last environment, up to 3 concurrent connections were allowed data flow. (For example, launching 10 concurrent instances of QlikView did not pull data 10x faster, but launching 3x instances of QV did pull data faster than launching 1x instance. After adding more than 3 simultaneous processes, you saw the transfer speeds slow down evenly across all threads).
Partition was done by transaction date month, QVinstance #1 was assigned to go after 2013, QVinstance #2 was assigned 2012, and QVinstance#3 was assigned 2011. The limitation on the extraction speed wasn't a restriction from the part of QlikView, because if I had been allowed 36 ODBC connections and been able to pull 1 month per thread, drawdown times surely would've been much faster. A standalone month was anywhere from 1 to 2GB of data and took from 7 to 10 mins, so x12 months (x3 threads) the whole process was in about under 2 hours. That was only done once for the initial QVD repository seeding. Then intra-daily the incremental record drawn from the RDMS set applied against the QVD repository took under 7 minutes to apply full incremental reload.
So I guess what I would like to know about the DB ~> php ~> csv process is what type of connection is being established? How does it compare against ODBC/OLE throughput? If it is much faster than ODBC, then why is ODBC established as the standard for data draw, and if it is similar to ODBC then if ODBC were allowed to establish as many concurrent connections as the php process, could we potentially get similar speed from ODBC to QV process?
Really appreciate you breaking this down for us and again, any specifics are truly enlightening. Seems like you guys are trying to push the envelope. Of course I am talking about commercially non-sensitive stuff, obviously somewhere in google or secret land they are probably chomping terra-bezel-bytes of data, but for ranges that are pertinent to us, which are large tables of enterprise caliber QlikView clients.
Hi Evan
i will be back to work on Tuesday and will build a more detailed flow diagram of the db > php > scp process for you.
the devs involved have done an awesome job, there are some really nice touches they've done, which you will really like, the summary diagrams are just that summaries, I will break them down even more.
Hi Qlikers
As promised I will go into the server side scripting aspect with a little bit more of an explanation on how we are using the CSV Batch Load process.
PREPARATION
First we will need to analyse our DB table.
We decided that the best and most efficient way of breaking a small/medium/large DB table into manageable chunks was to use an auto-incremental ID field and if one doesn't exist then we would create one.
Now we look at the total record count for our table, lets say we have a 35M record customer table and we want to break the table down by 20K records at a time.
35M / 20K = 1750 chunks, each chunk will be a csv file.
We will create 1750 records in our csv_ranges config table, thus each chunk will have a corresponding entry in the table:
1, users_0-20000, 0, 20000, 10, 0000-00-00 00:00:00
1, users_20001-40000, 20001, 40000, 10, 0000-00-00 00:00:00
1, users_40001-60000, 40001, 60000, 10, 0000-00-00 00:00:00
1, users_60001-80000, 60001, 80000, 10, 0000-00-00 00:00:00
...... etc
We also need to create an entry into the csv_query config table with the DB details for the table and we also add a simple SELECT query "SELECT * FROM users ", that the script will use later.
For our process to constantly run and update the CSV files we created the following DB tables that our script will loop over and extract information etc. I'm just going to list some key fields and not the full schema for each table.
csv_history
Each time a chunk is created or modified it gets added to this table, we can then analyise it to work out how often a chunk get updated, this will enable us to automatically update the refresh value for the chunk in the csv_ranges table.
table_id
chunk_name
timestamp
refresh_value
csv_query
This hold's a basic query string and table connection details.
table_id
table_name
db_name
auto-incremental ID field name
dns
query
enable
csv_ranges
We build multiple ranges for each table that's in the csv_query table.
table_id
chunk_name
min_range
max_range
refresh_value
last_updated
locked_by
md5
I'm not going to explain the reasons why we choose this way of doing things as we could chat all day about it and I just want to give you the juicy details.
One of the problems we encountered was with multiple processes updating the same chunk, so we added a locked_by process field, this eliminates that problem.
OK thats the preparation taken care of.
THE SERVER SIDE SCRIPT
This is a relatively straight forward process and I'm just going to talk about the logic involved not the code, also I'm going to use our 35M record users table as an example, but we loop over the csv_query table which has 100+ tables in it.
We run this script six times on three different backend servers, this gives us a grand total of eighteen processes to create/update the CSVs, the beauty of this process is its fully scalable if we add another 100 tables to chunk, then we just add more backend servers.
FOREACH: Extract our users table config from the csv_query table.
FOREACH: Extract the ranges from the csv_ranges table to loop over.
Has the current chunk for users_0-20000 been created yet.
Lock chunk in the csv_ranges table.
Build query for range: SELECT * FROM users WHERE {auto-incID} >= {min_range} AND {auto-incID} <= {max_range};
MD5 results from query and set last_updated timestamp in csv_ranges for current range.
Create CSV file users_0-20000.csv.
SCP CSV file to Qlikview server.
Add entry in the csv_history table.
If the current time is greater than (last_updated + refresh_value) for the users_0-20000 chunk
Lock chunk in the csv_ranges table.
Build query for range: SELECT * FROM users WHERE {auto-incID} >= {min_range} AND {auto-incID} <= {max_range};
MD5 results from query.
If the current MD5 hash is different from new MD5 hash
Replace CSV file users_0-20000.csv.
SCP CSV file to Qlikview server.
Add entry in the csv_history table.
THE QLIKVIEW SERVER
Create a QVW to import CSVs and store as QVDs, its not quite as simple as that, but you get the picture.
+++++++++++
Thats it I'm afraid, I will tidy up all my posts on this subject and add it to the resource library.
If you have any questions on that matter then please ask, or if you want me to video capture the process then the more likes I get will persuade me.
Just a final word on the speed of the process, I rebuilt the 35M users table yesterday and looking at the created time of the first and last QVDs on the Qlikview Server was 10 minutes total creation time, and thats with adding new fields and calculations to the QVW load script .... SWEET AS A NUT ![]()