Incremental Loads are dead, long live the CSV

Anonymous — Fri, 15 Aug 2014 17:23:43 GMT

Hi Qlikers

The days of incremental loads are gone. There is a new kid on the block and its name is CSV. Whoa, hold on a minute, "how so?" you ask.

I hope that you're sitting comfortably; if so, then I shall begin.

'Twas a rainy Monday morning when this particular 35M record QVD finally snapped. Rebuilding it from scratch took just over three and a half hours, which is not too bad I suppose, but we were not happy with that, and with the demand for up-to-date data we had a problem to solve.

We tried building chunked 1M record QVDs, which still took three hours to build. Next we tried converting those chunked QVDs into CSVs, and this is where the magic happened: when we imported those CSVs into QlikView it took six minutes. Hmmmm, that's quick.

So after a quick brainstorming session to find a way of expanding on the CSV approach, we came up with the following:

Run a PHP script that pulls data from the database in small 10K chunks and saves each chunk as a CSV.
Pick ranges based on a primary key and append them to the filename:
- users_0-10000.csv
- users_10001-20000.csv
- users_20001-30000.csv
Run the script as multiple instances - this took six minutes to build all the CSVs.
RSYNC the CSVs up to the QlikView server.
Run a QVS script to build QVDs from the CSVs - this takes less than a minute to complete.
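
As a rough illustration of the first two steps, here is a minimal sketch in Python (our production script is PHP, and this is not that script). It assumes a DB-API style connection; the test data uses SQLite as a stand-in for the real database, and the function name `export_chunk` and its parameters are made up for this example.

```python
import csv
import sqlite3
from pathlib import Path


def export_chunk(conn, table, id_col, lo, hi, out_dir):
    """Write rows with lo <= id_col <= hi to <table>_<lo>-<hi>.csv and return the path."""
    # Table and column names cannot be bound as parameters, so they are
    # interpolated; in this sketch they are assumed to come from trusted config.
    sql = f"SELECT * FROM {table} WHERE {id_col} BETWEEN ? AND ? ORDER BY {id_col}"
    cur = conn.execute(sql, (lo, hi))
    path = Path(out_dir) / f"{table}_{lo}-{hi}.csv"
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)  # one CSV line per database row
    return path
```

Because every chunk is addressed only by its key range, any number of workers can call a function like this in parallel on different ranges without stepping on each other's files.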

So now we have a process where we add the tables we wish to build to an XLS file, and with a few clever enhancements to the process we can now rebuild all of our QVDs in less than an hour.

We experimented with RSYNC, FTP and SCP for transporting the CSV files, but found that SCP offered compression, so we used it.

The team wish for this procedure to be called Batch Load, and it will always be called so, from this day forth.

Thanks to the team - Dave, Oke and Rich. You're awesome.

Re: Incremental Loads are dead, long live the CSV

ecolomer — Fri, 15 Aug 2014 18:52:49 GMT

sounds good, we will try !!!

good job

Re: Incremental Loads are dead, long live the CSV

Anonymous — Fri, 15 Aug 2014 18:54:57 GMT

Spread the word, it's a truly simple but effective concept.

Re: Incremental Loads are dead, long live the CSV

evan_kurowski — Fri, 15 Aug 2014 21:12:59 GMT

Great post dave,

I've seen other approaches to this where they partition large QVDs. That way you aren't always rebuilding or incrementally refreshing the full breadth of one enormous QVD, just the pertinent portions (and once you climb above the 1GB mark, the file handling times start to get noticeable). The UI applications are then fed an array of QVD slices vs. one giant glob. If databases are going to partition, why can't we? *hi-five*

Re: Incremental Loads are dead, long live the CSV

tobias_klett — Fri, 15 Aug 2014 21:48:54 GMT

Hi Dave,
thanks for pointing that out.
We have used CSVs to extract data from SAP before. It was a lot faster than via the SAP-Connector.

Regards

Tobias

Re: Incremental Loads are dead, long live the CSV

Anonymous — Fri, 15 Aug 2014 21:58:45 GMT

Next week we will post more details on the process. It's more complicated than my summary, and it's beautifully executed.

The CSVs are a mirror of the DB, and if an older record gets updated then that CSV chunk also gets updated. The chunk is then uploaded to the QV server, QV refreshes the corresponding QVD, job done.

There are still some clever gems hidden inside that we will share with the community, which you're really going to enjoy.

Re: Incremental Loads are dead, long live the CSV

evan_kurowski — Fri, 15 Aug 2014 22:15:34 GMT

This sounds like a big breakthrough dave,

Btw, how much faster is the transfer rate from RDBMS to CSV than the throughput of the ODBC or OLE connection?

I can't tell exactly from your description, but it sounds like getting the full dataset into CSV is much faster than ODBC throughput and you can dump the full extent of your large table from database to flat-file in 6 minutes.

And then it seems you have the .CSV load into QlikView at a speed that rivals load times of an optimized QVD series, this process is getting 35M rows of .CSV data into QVDs in under a minute? (Is this a "normal" size table? It's not just one column, is it?)

I gotta admit, I'm somewhat tempted to start telling all your friends & coworkers behind your back that you've been taking crazy pills, without having either seen and/or understood your idea, because... even in the event this preposterous sorcery works, and let's say you really did create something radically innovative and awesome. All that means is that now I'm the "former" developer of the "world's fastest incremental process" and well... where's the upside in that?

But... you, you've got charisma and enthusiasm, and that is just irresistible. Incremental loads have been around for a long time, but... turn me from a skeptic to a believer. More details please!!

Re: Incremental Loads are dead, long live the CSV

Anonymous — Sat, 16 Aug 2014 04:13:32 GMT

Evan

I promise to post a very detailed, understandable description of the whole process on Monday. This is so awesome that when you read how it works it will blow your mind, and we will provide you with actual benchmarks.

To see these files updating and rebuilding on the fly is something beautiful. We all know roughly how long it takes to build a 35M row, 40 column database table from an ODBC connection directly into QV, and how painfully slow it is, but chunking the contents of the DB table into small 10K shards outside QV, through a PHP script that we can run and scale, is blazingly fast.

Our senior dev thought it would take at least half an hour to run through, but when it built our most difficult and problematic table in six minutes his jaw dropped, we've really been on a roller coaster ride on this project and we want to show and tell the community of the fruits of our labour.

I'm really excited about this and wanted to give you all a quick taster, and follow it up with the main course, yum yum.

Re: Incremental Loads are dead, long live the CSV

Anonymous — Thu, 21 Aug 2014 17:01:06 GMT

Hi All

Sorry for the delay in my response, but we have all been very busy.

Anyway, trying to get the team to document anything is a difficult task, but they are aware the process needs documenting and are currently working on it.

I wanted to keep my end of the bargain up, so I've created some basic flow diagrams for you all to ponder over.

Diagram 1:

A standard incremental load diagram

Import current QVD and get maxID.
QlikView QVW connects to ODBC, which in turn connects to MySQL.
Extract latest data and append to loaded QVD.
Store as updated QVD.
QVD ready to be used in multiple reports.

This process can start to get laggy as the QVD grows in size, and it's a resource hog.

If we have to add a new column or rebuild the complete QVD, the whole process is slow: for one particular QVD it took 3.5 hours.

Diagram 2:

CSV Batch Load method diagram:

Server-side script connects to the database and extracts a small chunk of records from the DB.
Save the small chunk of data as a CSV file.
This is the great part: we can scale up the server-side script and run more instances of it on multiple servers. Currently we're running 6 scripts on each of 3 servers, giving us 18 server-side processes extracting chunks from the DB, which is blazingly fast.
SCP copies the data to the QlikView server.
QVW file loads all the CSVs with a wildcard.
We process the CSVs and make any changes to the chunk of data.
Store data as QVDs.

The process to extract 35 million rows from the DB and save them as CSVs took 4 minutes, then it takes 6.5 minutes for QlikView to build all of the QVDs.

We have noticed that pulling over 1,500 QVDs into a report is slow, so we still extract small chunks from the DB, but our server-side script now builds larger CSVs. This has reduced the total number of CSVs to 69, which is much quicker to pull into our reports.

Also, because we process the data in step 6, we can pull already-optimised data into reports.

Yay, no more incremental loads.

Re: Incremental Loads are dead, long live the CSV

martynlloyd — Thu, 21 Aug 2014 17:03:53 GMT

Very nice, but why post as a question?

Re: Incremental Loads are dead, long live the CSV

Anonymous — Thu, 21 Aug 2014 17:07:40 GMT

I did not know what else to post it as

Re: Incremental Loads are dead, long live the CSV

evan_kurowski — Fri, 22 Aug 2014 00:48:54 GMT

Very thoughtful, informative, and innovative dave. I would definitely like to learn more about the db ~>php ~> csv process.

I've had to extract large tables in the past. In my experience the point to start splitting QVDs is around the 2GB mark; beyond that, unless the server is very robust with plenty of RAM/CPU, there is a noticeable lag in response.

So let's say the last time I had to do an extraction of this sort, we have an incremental process against a 35GB table holding 3 years of transactions, and the newly incremental transactions could update any part of the QVD range (not just the most recent time-period).

Throughput was limited by ODBC pipe and how many connections the server would actually pull data through (even if you launched more). In that last environment, up to 3 concurrent connections were allowed data flow. (For example, launching 10 concurrent instances of QlikView did not pull data 10x faster, but launching 3x instances of QV did pull data faster than launching 1x instance. After adding more than 3 simultaneous processes, you saw the transfer speeds slow down evenly across all threads).

Partition was done by transaction date month: QVinstance #1 was assigned to go after 2013, QVinstance #2 was assigned 2012, and QVinstance #3 was assigned 2011. The limitation on the extraction speed wasn't a restriction on the part of QlikView, because if I had been allowed 36 ODBC connections and been able to pull 1 month per thread, drawdown times surely would've been much faster. A standalone month was anywhere from 1 to 2GB of data and took from 7 to 10 mins, so with 12 months per thread across 3 threads the whole process took a little under 2 hours. That was only done once for the initial QVD repository seeding. Then intra-daily, the incremental records drawn from the RDBMS and applied against the QVD repository took under 7 minutes for a full incremental reload.

So I guess what I would like to know about the DB ~> php ~> csv process is what type of connection is being established? How does it compare against ODBC/OLE throughput? If it is much faster than ODBC, then why is ODBC established as the standard for data draw, and if it is similar to ODBC then if ODBC were allowed to establish as many concurrent connections as the php process, could we potentially get similar speed from ODBC to QV process?

Really appreciate you breaking this down for us and again, any specifics are truly enlightening. Seems like you guys are trying to push the envelope. Of course I am talking about commercially non-sensitive stuff, obviously somewhere in google or secret land they are probably chomping terra-bezel-bytes of data, but for ranges that are pertinent to us, which are large tables of enterprise caliber QlikView clients.

Re: Incremental Loads are dead, long live the CSV

Anonymous — Sat, 23 Aug 2014 09:26:37 GMT

Hi Evan

I will be back at work on Tuesday and will build a more detailed flow diagram of the db > php > scp process for you.

The devs involved have done an awesome job, and there are some really nice touches in there which you will really like. The summary diagrams are just that, summaries; I will break them down even more.

Re: Incremental Loads are dead, long live the CSV

Anonymous — Wed, 27 Aug 2014 12:59:37 GMT

Hi Qlikers

As promised I will go into the server side scripting aspect with a little bit more of an explanation on how we are using the CSV Batch Load process.

PREPARATION

First we will need to analyse our DB table.

We decided that the best and most efficient way of breaking a small/medium/large DB table into manageable chunks was to use an auto-incremental ID field and if one doesn't exist then we would create one.

Now we look at the total record count for our table. Let's say we have a 35M record customer table and we want to break the table down by 20K records at a time.

35M / 20K = 1750 chunks, each chunk will be a csv file.

We will create 1750 records in our csv_ranges config table, thus each chunk will have a corresponding entry in the table:

1, users_0-20000, 0, 20000, 10, 0000-00-00 00:00:00

1, users_20001-40000, 20001, 40000, 10, 0000-00-00 00:00:00

1, users_40001-60000, 40001, 60000, 10, 0000-00-00 00:00:00

1, users_60001-80000, 60001, 80000, 10, 0000-00-00 00:00:00

...... etc
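
For anyone who wants to generate these rows rather than type them, here is a minimal Python sketch (not our actual tooling). The function name `build_ranges` is made up for this example, and it assumes the ranges should cover IDs up to the table's maximum ID; with gaps in the ID sequence, use the maximum ID rather than the row count. The unit of the refresh value (10 in the rows above) is left as-is.

```python
import math


def build_ranges(table, max_id, chunk_size, table_id=1, refresh_value=10):
    """Return csv_ranges-style rows covering IDs 0..max_id in fixed-size chunks."""
    rows = []
    for i in range(math.ceil(max_id / chunk_size)):
        # First chunk starts at 0, later chunks start one past the previous max,
        # matching the users_0-20000, users_20001-40000, ... naming above.
        lo = 0 if i == 0 else i * chunk_size + 1
        hi = (i + 1) * chunk_size
        rows.append({
            "table_id": table_id,
            "chunk_name": f"{table}_{lo}-{hi}",
            "min_range": lo,
            "max_range": hi,
            "refresh_value": refresh_value,
            "last_updated": "0000-00-00 00:00:00",  # never refreshed yet
        })
    return rows


ranges = build_ranges("users", 35_000_000, 20_000)
print(len(ranges))  # 1750
```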

We also need to create an entry into the csv_query config table with the DB details for the table and we also add a simple SELECT query "SELECT * FROM users ", that the script will use later.

For our process to constantly run and update the CSV files we created the following DB tables that our script will loop over and extract information etc. I'm just going to list some key fields and not the full schema for each table.

csv_history

Each time a chunk is created or modified it gets added to this table. We can then analyse it to work out how often a chunk gets updated, which will enable us to automatically update the refresh value for the chunk in the csv_ranges table.

table_id

chunk_name

timestamp

refresh_value
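
One possible way to turn the history into a refresh value, purely as a hedged sketch: take the median gap between a chunk's recorded updates. This is an assumption for illustration, not a description of what our script does; the function name `suggest_refresh_minutes`, the use of minutes as the unit, and the default of 10 are all hypothetical.

```python
from datetime import datetime
from statistics import median


def suggest_refresh_minutes(timestamps, default=10):
    """Suggest a refresh interval from the gaps between past chunk updates."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return default  # not enough history to estimate anything
    gaps = [(b - a).total_seconds() / 60 for a, b in zip(ordered, ordered[1:])]
    # Median rather than mean, so one long quiet spell does not dominate.
    return max(1, round(median(gaps)))
```

A chunk of old records that almost never changes would drift towards a long interval, while the newest chunks would be checked often.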

csv_query

This holds a basic query string and table connection details.

table_id

table_name

db_name

auto-incremental ID field name

dns

query

enable

csv_ranges

We build multiple ranges for each table that's in the csv_query table.

table_id

chunk_name

min_range

max_range

refresh_value

last_updated

locked_by

md5

I'm not going to explain the reasons why we chose this way of doing things, as we could chat all day about it and I just want to give you the juicy details.

One of the problems we encountered was multiple processes updating the same chunk, so we added a locked_by field that records which process holds the chunk. This eliminates that problem.

OK, that's the preparation taken care of.
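
A common way to implement this kind of lock, shown here as a minimal Python sketch rather than our actual code, is a conditional UPDATE that only succeeds if nobody else holds the chunk. It assumes locked_by is NULL when the chunk is free; SQLite stands in for the real database, and the function names are made up for this example.

```python
import sqlite3


def try_lock_chunk(conn, chunk_name, worker_id):
    """Claim a chunk for one worker; return True only if the claim succeeded."""
    cur = conn.execute(
        "UPDATE csv_ranges SET locked_by = ? "
        "WHERE chunk_name = ? AND locked_by IS NULL",
        (worker_id, chunk_name),
    )
    conn.commit()
    # The database applies the UPDATE atomically, so at most one worker
    # ever sees a row count of 1 for a given free chunk.
    return cur.rowcount == 1


def unlock_chunk(conn, chunk_name, worker_id):
    """Release a chunk, but only if this worker is the one holding it."""
    conn.execute(
        "UPDATE csv_ranges SET locked_by = NULL "
        "WHERE chunk_name = ? AND locked_by = ?",
        (chunk_name, worker_id),
    )
    conn.commit()
```

A worker that gets False back simply moves on to the next range instead of waiting.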

THE SERVER SIDE SCRIPT

This is a relatively straightforward process, and I'm just going to talk about the logic involved, not the code. I'm going to use our 35M record users table as an example, but in practice we loop over the csv_query table, which has 100+ tables in it.

We run this script six times on each of three different backend servers, which gives us a grand total of eighteen processes to create/update the CSVs. The beauty of this process is that it's fully scalable: if we add another 100 tables to chunk, we just add more backend servers.

FOREACH table: Extract its config (here, our users table) from the csv_query table.

    FOREACH range: Extract the ranges from the csv_ranges table to loop over.

        IF the current chunk (e.g. users_0-20000) has not been created yet:

            Lock chunk in the csv_ranges table.

            Build query for range: SELECT * FROM users WHERE {auto-incID} >= {min_range} AND {auto-incID} <= {max_range};

            MD5 results from query and set last_updated timestamp in csv_ranges for current range.

            Create CSV file users_0-20000.csv.

            SCP CSV file to QlikView server.

            Add entry in the csv_history table.

        ELSE IF the current time is greater than (last_updated + refresh_value) for the users_0-20000 chunk:

            Lock chunk in the csv_ranges table.

            Build query for range: SELECT * FROM users WHERE {auto-incID} >= {min_range} AND {auto-incID} <= {max_range};

            MD5 results from query.

            IF the stored MD5 hash is different from the new MD5 hash:

                Replace CSV file users_0-20000.csv.

                SCP CSV file to QlikView server.

                Add entry in the csv_history table.
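
The create/refresh decision above can be sketched in Python as follows. This is an illustration under stated assumptions, not our PHP script: a chunk counts as "not created yet" when its stored md5 is empty, refresh_value is treated as minutes, and `fetch_rows` is a hypothetical callable that runs the range query. Locking, writing the CSV, SCP and the csv_history insert are left out to keep it short.

```python
import hashlib
from datetime import timedelta


def rows_md5(rows):
    """Hash the query result so an unchanged chunk can be detected cheaply."""
    digest = hashlib.md5()
    for row in rows:
        digest.update(repr(tuple(row)).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def process_chunk(chunk, fetch_rows, now):
    """Return 'create', 'replace' or 'skip' and update the chunk record in place."""
    if chunk["md5"] is None:
        action = "create"  # chunk has never been built
    elif now > chunk["last_updated"] + timedelta(minutes=chunk["refresh_value"]):
        action = "replace"  # due for a check; only replaced if the data changed
    else:
        return "skip"  # not due yet, so the database is not queried at all

    new_md5 = rows_md5(fetch_rows(chunk["min_range"], chunk["max_range"]))
    if action == "replace" and new_md5 == chunk["md5"]:
        return "skip"  # data unchanged: keep the existing CSV, nothing to upload
    chunk["md5"] = new_md5
    chunk["last_updated"] = now
    return action
```

The MD5 comparison is what keeps the upload and the QVD rebuild cheap: only chunks whose contents actually changed produce a new CSV.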

THE QLIKVIEW SERVER

Create a QVW to import the CSVs and store them as QVDs. It's not quite as simple as that, but you get the picture.

+++++++++++

That's it, I'm afraid. I will tidy up all my posts on this subject and add them to the resource library.

If you have any questions on the matter then please ask, and if you want me to video capture the process, the more likes I get the more persuaded I will be.

Just a final word on the speed of the process: I rebuilt the 35M users table yesterday, and going by the created times of the first and last QVDs on the QlikView server, total creation time was 10 minutes, and that's with adding new fields and calculations to the QVW load script .... SWEET AS A NUT