Not applicable

Can I chunk the load of data from the database?

Hello,

I'm connecting to a SQL Server (version 2008) database that is quite large.

To avoid performance problems in the database while the script is running, my client wants to divide the loading of data into smaller chunks.

For example, we would like to load the product table in chunks of 10,000 products.

The output is stored in a QVD file.

This is what we have come up with so far:

set varMaxCustomerId = 0;

do while (RowNo() >= 0)

tSubscriptionMatrix:
SQL SELECT TOP 10000
    CustomerId,
    IsActive,
    LastUpdated,
    SubscriptionChannelId,
    SubscriptionMediaId
FROM cdon.product.tSubscriptionMatrix WITH(NOLOCK)
WHERE CustomerId > $(varMaxCustomerId)
GROUP BY
    CustomerId,
    IsActive,
    LastUpdated,
    SubscriptionChannelId,
    SubscriptionMediaId;

tSubscriptionMatrix2:
Load
    max(CustomerId) as MaxCustomerId
Resident tSubscriptionMatrix;

set varMaxCustomerId = peek('MaxCustomerId', 0, 'tSubscriptionMatrix2');

store * from tSubscriptionMatrix into C:\Tmp6\cdon.product.tSubscriptionMatrix.QVD (QVD);

DROP Table tSubscriptionMatrix;

loop

The problem at the moment seems to be the row WHERE CustomerId > $(varMaxCustomerId), which is not accepted by SQL Server.

Don't focus only on the solution above, i.e. we are open to all ideas that can move us forward.

What we want to achieve is a solution that reduces the load on the database / is as "light" on it as possible.

Is this possible?

br

Martin

10 Replies
fosuzuki
Partner - Specialist III

Hi Martin,

Have you considered an incremental load strategy? The first reload will be heavier since it will get all rows, but on subsequent reloads you would only get from the DB the rows created/modified since the last reload...

If you need more details on how to do that, feel free to ask. I just don't want to try to explain something that you might already know...
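
In case it helps, here is a minimal sketch of the pattern, using the names from your script. It assumes CustomerId uniquely identifies a row, that LastUpdated marks when a row was created or modified, and that a previous full load has already produced the QVD, so treat it as a starting point rather than a drop-in solution:

// Assumes CustomerId uniquely identifies a row and LastUpdated marks changes.
tLastReload:
LOAD Max(LastUpdated) as MaxLastUpdated
FROM C:\Tmp6\cdon.product.tSubscriptionMatrix.QVD (qvd);

LET vLastReload = Peek('MaxLastUpdated', 0, 'tLastReload');
DROP TABLE tLastReload;

tSubscriptionMatrix:
SQL SELECT CustomerId, IsActive, LastUpdated, SubscriptionChannelId, SubscriptionMediaId
FROM cdon.product.tSubscriptionMatrix WITH(NOLOCK)
WHERE LastUpdated > '$(vLastReload)';    // only new/changed rows hit the database; check the timestamp format SQL Server expects

Concatenate (tSubscriptionMatrix)        // add back the unchanged rows from the existing QVD
LOAD * FROM C:\Tmp6\cdon.product.tSubscriptionMatrix.QVD (qvd)
WHERE NOT Exists(CustomerId);

STORE tSubscriptionMatrix INTO C:\Tmp6\cdon.product.tSubscriptionMatrix.QVD (qvd);
DROP TABLE tSubscriptionMatrix;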

Regards,

Fernando

Not applicable
Author

Hello,

Using incremental load is the plan, but like you say the first load will be heavier, and I'm looking for ways to make it a little less heavy on the DB. The plan is to make a full load every month and then use incremental loads in between.

Has anyone come across this problem?

Surely there are a lot of DBAs or system administrators who do not like the idea of giving an external program access to the database, so this has got to be a fairly common issue.

br

Martin

johnw
Champion III

We've "chunked" one of our loads by loading one year per chunk, with a daily incremental load and a monthly full load.  Similar to your example, it just loads in a loop, passing a start and end date into the SQL for each by using variables.  This is combined with incremental loading.  Here's the most relevant section of the SQL:

AND JHJHD BETWEEN (99999999 - $(histEnd)) AND (99999999 - $(histStart)) // JHJHD is in 9's complement
AND JHRCD >= $(LAST_RELOAD_DATE1)

So the first line describes the chunk, and the second describes the increment.

Now, what exactly do you MEAN by the load on the database?  If we're only talking about the time required to satisfy the query, don't bother with chunking.  A row must either be read or not.  Either you have an index that lets you get at the rows you want efficiently, or you don't.  If you're talking about memory usage, that may depend on your DBMS.  I don't know how SQL Server 2008 will satisfy your query.  If it builds the entire returned table in memory before passing it to you, chunking could reduce memory requirements.

For us, chunking only helps our QlikView server load, not our database management system load.  The database load is actually increased, particularly so on the monthly full loads, where it uses repeated index scans rather than just grabbing the entire table with a table scan.  For us, the most efficient way to hit the database would be to NOT chunk the data.

If the SQL doesn't like WHERE CustomerId > $(varMaxCustomerId), then you have some problem unrelated to chunking.  For instance, maybe your CustomerId is alphabetic, and so is supposed to be in single quotes.  Whatever is in the variable should be inserted literally into the SQL.  It just needs to be something the SQL can evaluate properly.
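
For what it's worth, here's a rough sketch of how I'd structure the loop so that the variable really does contain a plain number when the SQL is built.  Note the Let (rather than Set) when assigning from Peek, the ORDER BY so that TOP 10000 returns a well-defined "next" chunk, and the exit condition when a chunk comes back empty.  I don't use SQL Server, so treat the SQL side as untested, and the per-chunk QVD naming is just made up:

Let varMaxCustomerId = 0;
Let vRows = 1;                                // dummy value so the loop starts

Do While vRows > 0

  tChunk:
  SQL SELECT TOP 10000
      CustomerId, IsActive, LastUpdated, SubscriptionChannelId, SubscriptionMediaId
  FROM cdon.product.tSubscriptionMatrix WITH(NOLOCK)
  WHERE CustomerId > $(varMaxCustomerId)
  ORDER BY CustomerId;

  Let vRows = NoOfRows('tChunk');             // null/0 when nothing came back, which ends the loop

  If vRows > 0 Then
    Let varMaxCustomerId = Peek('CustomerId', -1, 'tChunk');    // highest id in this chunk
    Store tChunk into C:\Tmp6\tSubscriptionMatrix_$(varMaxCustomerId).qvd (qvd);
    Drop Table tChunk;
  End If

Loop

If you want everything in a single QVD instead, concatenate each chunk into an accumulating table inside the loop and store that table once after the loop.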

llauses243
Creator III

Hi,

This is my suggestion, please see the image attached.

Good luck, Luis

sam.horton

It sounds like you are running a query on a live transaction database. 

If that is true, bottom line, you will have to stop doing that.  Physically.

Here are three strategies, you choose the one you prefer.

Option 1: Schedule an overnight dump via stored proc to a text file, or daily text files, and read the text (see the sketch after these options).  The same thing is possible using MDB format, if there is not too much data.  (If you have Informatica or another ETL tool, use that instead of SQL SPs.)

Option 2: IFF you have backups running, set up a production job on a second server to restore a backup image of the data, then load from the second server.

Option 3: Backup the DB, restore it on another server, then put production stored procedures in place to incrementally add rows from the live system to the copy.  Then load from that server.
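
For option 1, the QlikView side then becomes a plain flat-file load.  A rough sketch, assuming a comma-delimited dump with a header row (the file name, path and format here are invented, so match them to whatever the stored proc actually writes out):

LET vToday = Date(Today(), 'YYYYMMDD');

tSubscriptionMatrix:
LOAD CustomerId,
     IsActive,
     LastUpdated,
     SubscriptionChannelId,
     SubscriptionMediaId
FROM C:\Dumps\tSubscriptionMatrix_$(vToday).csv
(txt, utf8, embedded labels, delimiter is ',');

STORE tSubscriptionMatrix INTO C:\Tmp6\cdon.product.tSubscriptionMatrix.QVD (qvd);
DROP TABLE tSubscriptionMatrix;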

As for what you're attempting, don't. 

You'll get caught, sooner or later, with a lock or lagging unclosed connection.  Skipping the techie details, my guess is that the transaction system is constantly creating new row / customer IDs, in which case you'll really cause problems if you try to use that key in a query.

johnw
Champion III
Champion III

sam.horton wrote:

It sounds like you are running a query on a live transaction database. 

If that is true, bottom line, you will have to stop doing that.  Physically.

...

As for what you're attempting, don't.

I don't use SQL Server, so perhaps I'm just being naive, but that seems a strange assertion.

We run queries all the time against our live transactional data in DB2, both from QlikView and within the transactional system itself.  I would certainly hope that most database management systems assume that you might want to read data as well as write it.  In DB2, you avoid most locking by specifying FOR FETCH ONLY.  It looks like the WITH(NOLOCK) is the same concept for SQL Server.  As for unclosed connections, use DISCONNECT in your script when you no longer need the connection, plus QlikView will disconnect automatically at the end of the script if you haven't done it manually.
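
In script terms that's just something like the following (the DSN name is made up):

ODBC CONNECT TO [MyTransactionalDSN];    // made-up DSN name

tSubscriptionMatrix:
SQL SELECT CustomerId, IsActive, LastUpdated, SubscriptionChannelId, SubscriptionMediaId
FROM cdon.product.tSubscriptionMatrix WITH(NOLOCK);

DISCONNECT;                              // release the connection as soon as the database work is done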

Our transactional system is constantly creating new rows/keys in the table we're doing the chunking on.  As I recall, we scan the primary key (clustering index) during the retrieval.  This doesn't cause any serious problems.  We do have an issue where the clustering index isn't in the natural order that the data gets written over time, and so recent data is highly disorganized, and that's the very data we're retrieving in the incremental load.  Performance would be faster if we had a dedicated index in an appropriate order, but the DBA and I agree that this would make little overall difference, and isn't worth the overhead.

How does "an overnight dump via stored proc to a text file" save database resources?  Is there something about the stored procedure that is more efficient in SQL Server?  Perhaps naively, it seems to me that either way you're running SQL against the database.  In OUR environment, I believe that doing it directly in QlikView is actually BETTER.  That's specific to our hardware and licensing agreements, but in our case, the ODBC connection from QlikView is allowed to run on a separate and faster physical processor than the main ones our mainframe uses, and so it doesn't compete for CPU resources with the transactional system (even if it still competes for other resources).

If options 2 or 3 are practical, I agree they'll reduce load on the transactional database.  But if the production environment is so sensitive that running an indexed query against a small portion of the transactional database once per day will cause a problem, in my opinion the production environment needs a serious overhaul to be made more robust.  And if it isn't an indexed query against a small portion of the transactional database, it seems likely that they could make some basic improvements to the incremental load process (e.g., add an index) to significantly reduce the load on the transactional database.

sam.horton

Good question.

Not knowing anything about the physical environment of the original poster, those are my recommendations to get the solution stable and in production over time in SQL Server.

As for DB2: it, Oracle, and TD are different, very, very different, from SQL Server.

/end battle scar result post

johnw
Champion III

Heh.  Fair enough.  I guess I assumed more similarities than actually exist.

Not applicable
Author

My "keep things simple" approach would be as follows:

1. Create a duplicate of the Production database from the most recent scheduled backup

2. Load the initial state of your QVD file from this duplicate

3. Use native QlikView incremental load from the Production database from then on

A number of posts have mentioned indexes.  I'm stating the obvious, but you MUST have an index on the field that you are using for the incremental load.
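
For example (hypothetical index name, and assuming LastUpdated is the field driving the incremental load), the DBA would create something along the lines of:

CREATE INDEX IX_tSubscriptionMatrix_LastUpdated
ON cdon.product.tSubscriptionMatrix (LastUpdated);

Without it, each incremental load is likely to scan the whole table on the production side.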