Re: Incremental Loads are dead, long live the CSV - Page 2 - Qlik Community

Anonymous · ‎2014-08-15

Hi Qlikers

The days of incremental loads are gone, there is a new kid on the block and its name is CSV's, whoa hold on a minute "how so ?" you ask.

I hope that your sitting comfortably, if so then I shall begin.

Twas a rainy Monday morning and this particular 35M record QVD finally snapped and to rebuild it from scratch took just over three and a half hours, which is not too bad I suppose, but we were not happy with that and with the demand for the latest up to date data we had a problem to solve.

We tried building chunked 1M record QVD's which still took three hours to build, so next we tried converting those chunked QVD's into CSV's and this is where the magic happend, when we imported those CSV's into Qlikview it took six minutes, hmmmm that's quick.

So a quick brainstorming session to find a way of expanding on the CSV approach, we came up with the following

Run a PHP script that pulls data from the Database in small 10K chunks then save it as a CSV.
Pick ranges based on a primary key and append to filename
- users_0-10000.csv
- users_10001-20000.csv
- users_20001-30000.csv
Run the script on multiple instances - this took six minutes to build all CSV's.
RSYNC the CSV's up to the Qlikview server.
Run QVS script to build QVD's from the CSV's - this takes less than a minute to complete.

So now we have made a process where we add the table's we wish to build into an XLS file, and with a few clever enhancements to the process we can now rebuild all of our QVD's in less than an hour.

We experimented with with (RSYNC, FTP, SCP) for transporting CSV files, but found that SCP offered compression so we used it.

The team wish for this procedure to be called Batch Load, and it will be called always be called so, from this day forth.

Thanks to the team - Dave, Oke and Rich. Your Awesome.

Anonymous · ‎2014-08-21

I did not know what else to post it as

evan_kurowski · ‎2014-08-21

Very thoughtful, informative, and innovative dave. I would definitely like to learn more about the db ~>php ~> csv process.

When I had to extract large tables in the past (personal experience for splitting QVDs is around the > 2GB range. After that, unless the server is very robust with plenty of RAM/CPU, there becomes a noticeable amount of response lapse).

So let's say the last time I had to do an extraction of this sort, we have an incremental process against a 35GB table holding 3 years of transactions, and the newly incremental transactions could update any part of the QVD range (not just the most recent time-period).

Throughput was limited by ODBC pipe and how many connections the server would actually pull data through (even if you launched more). In that last environment, up to 3 concurrent connections were allowed data flow. (For example, launching 10 concurrent instances of QlikView did not pull data 10x faster, but launching 3x instances of QV did pull data faster than launching 1x instance. After adding more than 3 simultaneous processes, you saw the transfer speeds slow down evenly across all threads).

Partition was done by transaction date month, QVinstance #1 was assigned to go after 2013, QVinstance #2 was assigned 2012, and QVinstance#3 was assigned 2011. The limitation on the extraction speed wasn't a restriction from the part of QlikView, because if I had been allowed 36 ODBC connections and been able to pull 1 month per thread, drawdown times surely would've been much faster. A standalone month was anywhere from 1 to 2GB of data and took from 7 to 10 mins, so x12 months (x3 threads) the whole process was in about under 2 hours. That was only done once for the initial QVD repository seeding. Then intra-daily the incremental record drawn from the RDMS set applied against the QVD repository took under 7 minutes to apply full incremental reload.

So I guess what I would like to know about the DB ~> php ~> csv process is what type of connection is being established? How does it compare against ODBC/OLE throughput? If it is much faster than ODBC, then why is ODBC established as the standard for data draw, and if it is similar to ODBC then if ODBC were allowed to establish as many concurrent connections as the php process, could we potentially get similar speed from ODBC to QV process?

Really appreciate you breaking this down for us and again, any specifics are truly enlightening. Seems like you guys are trying to push the envelope. Of course I am talking about commercially non-sensitive stuff, obviously somewhere in google or secret land they are probably chomping terra-bezel-bytes of data, but for ranges that are pertinent to us, which are large tables of enterprise caliber QlikView clients.

Anonymous · ‎2014-08-23

Hi Evan

i will be back to work on Tuesday and will build a more detailed flow diagram of the db > php > scp process for you.

the devs involved have done an awesome job, there are some really nice touches they've done, which you will really like, the summary diagrams are just that summaries, I will break them down even more.

Anonymous · ‎2014-08-27

Hi Qlikers

As promised I will go into the server side scripting aspect with a little bit more of an explanation on how we are using the CSV Batch Load process.

PREPARATION

First we will need to analyse our DB table.

We decided that the best and most efficient way of breaking a small/medium/large DB table into manageable chunks was to use an auto-incremental ID field and if one doesn't exist then we would create one.

Now we look at the total record count for our table, lets say we have a 35M record customer table and we want to break the table down by 20K records at a time.

35M / 20K = 1750 chunks, each chunk will be a csv file.

We will create 1750 records in our csv_ranges config table, thus each chunk will have a corresponding entry in the table:

1, users_0-20000, 0, 20000, 10, 0000-00-00 00:00:00

1, users_20001-40000, 20001, 40000, 10, 0000-00-00 00:00:00

1, users_40001-60000, 40001, 60000, 10, 0000-00-00 00:00:00

1, users_60001-80000, 60001, 80000, 10, 0000-00-00 00:00:00

...... etc

We also need to create an entry into the csv_query config table with the DB details for the table and we also add a simple SELECT query "SELECT * FROM users ", that the script will use later.

For our process to constantly run and update the CSV files we created the following DB tables that our script will loop over and extract information etc. I'm just going to list some key fields and not the full schema for each table.

csv_history

Each time a chunk is created or modified it gets added to this table, we can then analyise it to work out how often a chunk get updated, this will enable us to automatically update the refresh value for the chunk in the csv_ranges table.

table_id

chunk_name

timestamp

refresh_value

csv_query

This hold's a basic query string and table connection details.

table_id

table_name

db_name

auto-incremental ID field name

dns

query

enable

csv_ranges

We build multiple ranges for each table that's in the csv_query table.

table_id

chunk_name

min_range

max_range

refresh_value

last_updated

locked_by

md5

I'm not going to explain the reasons why we choose this way of doing things as we could chat all day about it and I just want to give you the juicy details.

One of the problems we encountered was with multiple processes updating the same chunk, so we added a locked_by process field, this eliminates that problem.

OK thats the preparation taken care of.

THE SERVER SIDE SCRIPT

This is a relatively straight forward process and I'm just going to talk about the logic involved not the code, also I'm going to use our 35M record users table as an example, but we loop over the csv_query table which has 100+ tables in it.

We run this script six times on three different backend servers, this gives us a grand total of eighteen processes to create/update the CSVs, the beauty of this process is its fully scalable if we add another 100 tables to chunk, then we just add more backend servers.

FOREACH: Extract our users table config from the csv_query table.

FOREACH: Extract the ranges from the csv_ranges table to loop over.

Has the current chunk for users_0-20000 been created yet.

Lock chunk in the csv_ranges table.

Build query for range: SELECT * FROM users WHERE {auto-incID} >= {min_range} AND {auto-incID} <= {max_range};

MD5 results from query and set last_updated timestamp in csv_ranges for current range.

Create CSV file users_0-20000.csv.

SCP CSV file to Qlikview server.

Add entry in the csv_history table.

If the current time is greater than (last_updated + refresh_value) for the users_0-20000 chunk

Lock chunk in the csv_ranges table.

Build query for range: SELECT * FROM users WHERE {auto-incID} >= {min_range} AND {auto-incID} <= {max_range};

MD5 results from query.

If the current MD5 hash is different from new MD5 hash

Replace CSV file users_0-20000.csv.

SCP CSV file to Qlikview server.

Add entry in the csv_history table.

THE QLIKVIEW SERVER

Create a QVW to import CSVs and store as QVDs, its not quite as simple as that, but you get the picture.

+++++++++++

Thats it I'm afraid, I will tidy up all my posts on this subject and add it to the resource library.

If you have any questions on that matter then please ask, or if you want me to video capture the process then the more likes I get will persuade me.

Just a final word on the speed of the process, I rebuilt the 35M users table yesterday and looking at the created time of the first and last QVDs on the Qlikview Server was 10 minutes total creation time, and thats with adding new fields and calculations to the QVW load script .... SWEET AS A NUT

Incremental Loads are dead, long live the CSV

QlikView App Dev