Not applicable

Two huge tables with 4 fields in common

I have two big tables, and the database is in SQL Server 2005.

Table1 has about 25,000,000 rows, where each row is a unique invoice, and Table2 has 50,000,000 rows, where each row is an item from an invoice (e.g. 3 rows for an invoice if the customer bought 3 different products).

When I try to concatenate the fields that I need to create a key, I get a logical memory error during the load. I've tried everything: concatenating all the fields into a key, changing the data field types, synthetic keys, etc.

I discovered that if I use varchar(10) or less it works, but anything above that doesn't work. To get the right key I need a varchar bigger than 10.

The fields that are common are:

  • Date (smalldatetime in SQL; used date(daystart(timestamp(date,'hh:mm:ss'))) in QV)
  • store_id
  • pos_id
  • invoice_id

The problem is that invoice_id isn’t unique for each store, so I had to consider pos_id. I still got some errors in the results, and I discovered that the database has some cases where invoice_id isn’t unique for the same store_id and the same pos_id, depending on the period (don’t ask me why), so I had to consider the date too. So my key became very big.

I tried the concatenation in several ways, like:

In QlikView: date & store_id & pos_id & invoice_id

In the SQL SELECT: cast(date as varchar) + cast(store_id as varchar(3)) + cast(pos_id as varchar(3)) + cast(invoice_id as varchar(6))
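
Just to illustrate the kind of key I'm after, here is the QlikView expression written with an explicit separator so the parts can't run together (the '|' character and the key name are only examples):

date & '|' & store_id & '|' & pos_id & '|' & invoice_id as invoice_key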


The most important table for me is Table2, but I need the time information from Table1. I was able to get it using a join on the four fields, so a new column with the time information came up in Table2:

Table2:
LOAD date,
     store_id,
     pos_id,
     invoice_id,
     product_id,
     product_description,
     quantity,
     price,
     revenue

Table1:
LEFT JOIN (Table2)
LOAD date,
     time,
     invoice_id,
     store_id,
     pos_id


The thing is, I need to identify the unique invoices, and if I end up with just one table, I need to be able to count distinct invoices.

I already tried to create a field in Table1 with all four fields concatenated, and after that I did the same process as for the time field, but it didn't work (memory problem when it reaches a certain number of rows loaded).

It seems that when a table has a lot of rows, it cannot also have a field with very long values.

Can anyone help me with a solution please?

1 Solution

Accepted Solutions
johnw
Champion III

Right, a 32-bit program can only use 2 GB of RAM, regardless of how much RAM you actually have.  There's a setting that will give you 3 GB, but when I tried it myself, it made my computer unbootable and it took hours of tech help to recover.

I can see it being hard to impossible to load 66 million rows of data into 2 GB of memory or 3 GB for that matter.  You may need a 64-bit machine and 64-bit QlikView if you really need that much data to be accessible at once.


17 Replies
johnw
Champion III

Well, you can count distinct invoices with count(distinct invoice_id).  If one table isn't the solution to your problem, what happens if you leave the tables as they were?  By that I mean, what happens if you leave the date, store_id, pos_id and invoice_id fields alone and on their respective tables?  I would expect a 4-field synthetic key to be formed, but if it fits in your memory and gives the right answers, there's nothing specifically wrong with this synthetic key.
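
Something like this is all I mean, as a sketch (the SQL SELECTs are guesses, since the real sources weren't posted):

Table1:
LOAD date, time, invoice_id, store_id, pos_id;
SQL SELECT date, time, invoice_id, store_id, pos_id FROM Table1;

Table2:
LOAD date, store_id, pos_id, invoice_id,
     product_id, product_description, quantity, price, revenue;
SQL SELECT date, store_id, pos_id, invoice_id,
     product_id, product_description, quantity, price, revenue FROM Table2;

With the four fields named identically in both tables, QlikView builds the 4-field synthetic key on its own.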

nagaiank
Specialist III

Neither concatenating nor joining is a sound approach to handling header-detail type files from a database design point of view. Table1 is a header file with one row for each unique invoice (probably defined by a group of fields such as invoice_id, store_id, pos_id, date, etc.). Table2 is a detail file, which may contain one or more rows for each unique invoice. These tables should be linked (in the QlikView sense) by the unique invoice if you need to report using data from both tables.

As advised by John, keeping the files separate but linked seems to be a good approach, perhaps the only correct one.

If you are concerned about the synthetic keys, you may eliminate them by using a combined key (e.g. with the Autonumber function) and renaming some fields in one of the tables.
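
For example, something along these lines (the key name %InvoiceID, the '|' separator and the renamed field names are only illustrative):

Invoices:
LOAD autonumber(date & '|' & store_id & '|' & pos_id & '|' & invoice_id) as %InvoiceID,
     date, time, invoice_id, store_id, pos_id;
SQL SELECT date, time, invoice_id, store_id, pos_id FROM Table1;

InvoiceItems:
LOAD autonumber(date & '|' & store_id & '|' & pos_id & '|' & invoice_id) as %InvoiceID,
     date       as item_date,        // renamed so only %InvoiceID links the tables
     store_id   as item_store_id,
     pos_id     as item_pos_id,
     invoice_id as item_invoice_id,
     product_id, product_description, quantity, price, revenue;
SQL SELECT date, store_id, pos_id, invoice_id,
     product_id, product_description, quantity, price, revenue FROM Table2;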

Not applicable
Author

Thanks for the replies, guys, but I've tried working with one table, with two tables, and with a synthetic key... in all scenarios I couldn't load all the records because of the memory problem.

As I said, I was able to solve the first challenge after a lot of tries, joining the two tables and bringing the time from Table1 into Table2. I loaded all the records without a problem and it worked. But I wasn't able to count distinct invoices.

I even tried to create a new column in the original SQL database for Table1 with a unique value (1 to N) and load this field, but I still got the same problem. I think the number of distinct values is too big (25,000,000).

Using the autonumber function gets me to the same place as described above, and another problem with autonumber is that I'll be using incremental loads for all of this in the future.

Anonymous
Not applicable
Author

Have you tried creating a mapping table from Table1 using autonumber(date,store,pos,Invoice) and then using applymap when loading Table2 to get the time?
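
Roughly like this, I mean (a sketch only; the '|' separator and the exact concatenation are just one way to build the key):

TimeMap:
MAPPING LOAD
     autonumber(date & '|' & store_id & '|' & pos_id & '|' & invoice_id) as MapKey,
     time;
SQL SELECT date, store_id, pos_id, invoice_id, time FROM Table1;

Table2:
LOAD applymap('TimeMap',
              autonumber(date & '|' & store_id & '|' & pos_id & '|' & invoice_id),
              null()) as time,
     date, store_id, pos_id, invoice_id,
     product_id, product_description, quantity, price, revenue;
SQL SELECT date, store_id, pos_id, invoice_id,
     product_id, product_description, quantity, price, revenue FROM Table2;

Henrik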

Not applicable
Author

No, but I tried it just now and it didn't work, resulting in the same memory error: "OUT OF VIRTUAL AND/OR LOGICAL MEMORY, allocating 2MB".

Maybe instead of concatenating 4 fields and creating one unique key, I could try concatenating just 2 of them into key1 and the other 2 into key2, and use two keys?

I'm starting to think that I cannot load all the data. How can I know the limit? I think the problem is a big field with 1 to 25,000,000 distinct values and a big table with more than 40,000,000 rows.

magavi_framsteg
Partner - Creator III

Hi Cassiano.Reis.

I have experience with VERY large clients, my company is a QlikTech Elite Partner.

Tables of 300m+ rows are common.

When working with large sets of data, the data model is *VERY* important.

You must NEVER have synthetic keys or loops.

Even joining two big tables, like invoice header and invoice rows, which are both usually massive, takes tremendous amounts of RAM; we're talking gigabytes here. And if it's joining on four fields (synthetic keys), it would never finish.


So my tip is:

InvoiceRow:
LOAD
     date & '~' & store_id & '~' & pos_id & '~' & invoice_id as [%InvoiceKey],
     date as InvoiceRowDate,
     store_id,
     pos_id,
     invoice_id,
     product_id,
     product_description,
     quantity,
     price,
     revenue

InvoiceHead:
LOAD
     date & '~' & store_id & '~' & pos_id & '~' & invoice_id as [%InvoiceKey],
     date as InvoiceHeadDate,
     time
     //invoice_id,
     //store_id,
     //pos_id

And if you absolutely need to join the tables, join on ONE key, in this case [%InvoiceKey]; joining on four fields will make you run out of memory.
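
If you do end up joining them anyway, the join itself would look something like this (using the names from my script above; the RESIDENT load and the DROP are just one way to do it):

LEFT JOIN (InvoiceRow)
LOAD [%InvoiceKey],
     time
RESIDENT InvoiceHead;

DROP TABLE InvoiceHead;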

johnw
Champion III

Magnus Åvitsland wrote:

...

When working with large sets of data, the data model is *VERY* important.

You must NEVER have synthetic keys or loops.

Even joining two big tables, like invoice header and invoice rows, which are both usually massive, takes tremendous amounts of RAM; we're talking gigabytes here. And if it's joining on four fields (synthetic keys), it would never finish.

...

And if you absolutely need to join the tables, join on ONE key, in this case [%InvoiceKey]; joining on four fields will make you run out of memory.

Perhaps I should explain further why I suggest leaving the synthetic key in place.

To the best of my knowledge, synthetic keys use slightly less memory and perform slightly faster than composite keys of the same structure.  To the best of my knowledge, that's because synthetic keys and composite keys are almost identical internally in QlikView, with probably a tiny bit less overhead for synthetic keys since you can't actually reference the table or key value directly.

That would indicate that synthetic keys are merely no worse than composite keys.  However, in my experience, they're significantly better because of load time.  While some of us hypothesize that there may be a case where a synthetic key loads more slowly, all examples I've ever seen show a significantly faster load time for the synthetic key than the equivalent composite key.

So let's take your example.  The composite key you suggest is not quite a 1:1 replacement for the synthetic key since it isn't given a separate table, but it is functionally identical and I think it should give similar performance results.  Certainly if we got some performance improvement out of moving it to one of the associated tables I'd be surprised, and it would be something I'd want to know.  So it should do just fine for our purposes.

I've randomized data and stuck it into QVDs for our sources, and I make sure I don't get an optimized load so that we're comparing apples to apples.  I will compare performance loading the tables raw (and forming a 4-field synthetic key) vs. doing what you suggested with a composite key.  I'm using 2 million rows for the invoice headers and 8 million rows for invoice lines.  I'd use more, but I'm on an old computer and it'd start swapping to disk or run out of memory.  If you theorize that the ratios of performance measurements change in some way with ten or fifty times the volume, someone else may need to run the same test to see (application attached).  The chart mentioned is a sum of revenue by product (I've moved Revenue to the invoice line table so that both tables and all rows are used for this chart).  The chart time is measured by turning off caching, putting it on its own tab, switching to that tab when the load is complete and the CPU has gone back to 0, and checking the CalcTime.  I ran multiple trials.  Testing was done with 32-bit version 10 SR2.

key type    peak RAM  final RAM  load time  chart time  QVW size
synthetic   760000 K  424000 K   0:37       812 MS      149000 K
synthetic   740000 K  424000 K   0:39       828 MS      149000 K
synthetic   760000 K  424000 K   0:38       812 MS      149000 K
composite   821000 K  493000 K   1:01       797 MS      163000 K
composite   819000 K  493000 K   0:59       797 MS      163000 K
composite   821000 K  495000 K   0:59       797 MS      163000 K

Some of these numbers did surprise me a little.  I did expect the composite key to use more memory, but only slightly, almost negligibly.  Instead, the synthetic key is significantly better in this regard.  That further cements my recommendation to leave the synthetic key in place in this case, since memory is apparently the critical resource for this load, and is conserved by using a synthetic key.  Second, I expected chart times to be nearly identical.  They are, but there IS a tiny but consistent edge to the composite key in these measurements.  I'm going to guess that that's because it isn't an exact 1:1 replacement for the synthetic key, so there's one less link to follow, but that IS an advantage if it's real and not just noise in the data.

I've attached the application I used for this testing if someone wishes to duplicate or contradict my results on a different computer.

For more information on synthetic keys, I recommend this thread:

http://community.qlik.com/thread/10279?start=30&tstart=0

Edit:  We typically autonumber() a composite key like this, so I tried that for comparison:

key type    peak RAM  final RAM  load time  chart time  QVW size
synthetic   760000 K  424000 K   0:37       812 MS      149000 K
synthetic   740000 K  424000 K   0:39       828 MS      149000 K
synthetic   760000 K  424000 K   0:38       812 MS      149000 K
composite   821000 K  493000 K   1:01       797 MS      163000 K
composite   819000 K  493000 K   0:59       797 MS      163000 K
composite   821000 K  495000 K   0:59       797 MS      163000 K
autonumber  897000 K  405000 K   1:42       781 MS      146000 K
autonumber  899000 K  407000 K   1:42       813 MS      146000 K
autonumber  898000 K  407000 K   1:42       828 MS      146000 K

The load slows down a lot and uses more peak RAM, so this is probably a poor solution in this case.  However, the final RAM and file size usage is lower than the synthetic key.  That may be useful in some cases.

magavi_framsteg
Partner - Creator III

Hi John.

Thanks for your very interesting reply.

I can't resist participating in the analysis of this challenging and interesting problem.

In fact, I am testing your app right now using QV Desktop v9 SR7 x64.

The good thing is I have massive servers at my client.

I made some modifications to the script; it is attached to this post:

testSyntheticKey7.qvw

What I did was pre-create the composite key, because this is usually what you do, especially when the QVD is loaded by more than one file.

But for the load-time tests, I create the composite key in the load script; it only exists in the QVD for quicker modelling and calculations with the QV Optimizer.

The specs are:

- Physical Windows Server 2008 R2 x64

- 128 GB RAM

- Intel Xeon E5530 @ 2.4 GHz

- 2 CPUs x 4 cores = 8 cores in total

I will send my results later today, or tomorrow.

Can you please explain how you did all your measurements, like peak RAM, final RAM, etc.?

Thanks.

My results so far:

InvoiceHead with 2m rows

InvoiceRow with 8m rows

key type    peak RAM  final RAM  load time  chart time  QVW size
synthetic   xxxxxx K  456872 K   1:20       343 MS      149000 K
synthetic   xxxxxx K  452296 K   1:21       344 MS      149000 K
synthetic   xxxxxx K  452020 K   1:16       312 MS      149000 K

composite   xxxxxx K  509164 K   5:27       312 MS      163000 K
composite   xxxxxx K  511024 K   5:26       296 MS      163000 K
composite   xxxxxx K  511504 K   5:28       328 MS      163000 K

autonumber  xxxxxx K  442524 K   5:39       296 MS      117000 K
autonumber  xxxxxx K  484360 K   5:23       328 MS      117000 K
autonumber  xxxxxx K  485336 K   5:25       296 MS      117000 K

InvoiceHead with 20m rows

InvoiceRow with 80m rows

Optimized load. Composite keys pre-created in QVDs, otherwise the test would take too long.

But since it's so much data, the load times are long enough for a (rather) scientific test anyway.

key type    peak RAM   final RAM   load time  chart time  QVW size
synthetic   xxxxxx Kb  5847468 Kb  5:29       2995 ms     1173000 Kb
synthetic   xxxxxx Kb  5847504 Kb  5:58       2918 ms     1173000 Kb
synthetic   xxxxxx Kb  5850380 Kb  6:01       2917 ms     1173000 Kb

composite   xxxxxx Kb  7352208 Kb  3:23       2901 ms     1185451 Kb
composite   xxxxxx Kb  7353248 Kb  3:37       2964 ms     1185451 Kb
composite   xxxxxx Kb  7353180 Kb  3:48       3026 ms     1185451 Kb

autonumber  xxxxxx Kb  5658888 Kb  3:10       2964 ms     1012265 Kb
autonumber  xxxxxx Kb  5655144 Kb  3:19       2948 ms     1012265 Kb
autonumber  xxxxxx Kb  5659204 Kb  3:02       2932 ms     1012265 Kb

Conclusion:

I should have run another test with InvoiceHead at 50,000,000 rows and InvoiceLine at 200,000,000 rows, but there was simply no time for that; these tests took almost the whole day.

I cannot directly compare the two sets of results, because in the 2m/8m test the key fields were calculated on the fly by the load script.

But for the second test I absolutely needed to pre-create them in the QVD to get the optimized load, else I would have needed some automated test tools to run them.

Usually composite keys are in fact pre-created in the QVD layer (data layer) so that the application layer can load all QVDs optimized.

In terms of load time I think composite keys have the edge.

Also when it comes to RAM-usage, composite keys use slightly less.

Synthetic keys, though, seem to be more effective, with faster calculations.

So I would have to re-run the first test.

No evidence yet.

Uploaded testSyntheticKey7.qvw

johnw
Champion III

Thank you for taking this on.  I'm VERY curious how things change or stay the same with larger volumes of data, and as we make the test a bit more realistic.

I measured peak RAM by having Windows Task Manager open and watching the memory usage of the QlikView process.  I recorded the highest number I saw.  I measured final RAM the same way, looking at memory usage of the process once it completed.  Load time I calculated from the start and end timestamps in the log.  Chart time I already mentioned, and QVW size should be obvious.

I agree that in the real world, you'd create the composite keys during creation of the QVDs, so I probably should have  done that as well.  However, that time is taken SOMEWHERE, so in that case, I would have needed to measure the difference in time for creating the QVDs with and without the composite keys and added that in.  It seemed simpler and hopefully fair to create the composite keys during the load, and just make sure both loads were unoptimized so I wasn't getting an artificial advantage.  With the more realistic approach, you'd want two sets of QVDs, one with the composite key and one without. 

What are you doing for the autonumber tests?  Does the QVD contain autonumbered keys?  In many cases, that would be unrealistic, as QVDs are often created by different QVWs.  However, if we knew that autonumbered keys were better (I believe they are if they can be done), it would be a simple matter to do both the invoice header QVD and the invoice row QVD in the same QVW so that the autonumber results were consistent.  Hmmm, this could also present problems for incremental loads, as we'd likely be using in practice.  Most of mine load from the database first, then from the QVD.  I think that would make it hard to autonumber.  We could use a hash128() perhaps, but I wouldn't.  Even though in practice you'd never have a collision, it is POSSIBLE to have a collision, and I don't like opening that door.  Still, the chances are so low that we might as well be afraid of cosmic rays causing load problems as well.  Or maybe I calculated the chances with hash256().  I can't remember.  Probably hash128() is a practical solution for autonumbered keys when we're using incremental loads.
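
Something like this in the QVD-generating QVW is what I have in mind for the hash approach (a sketch only; the field, table and file names are just for illustration):

InvoiceHead:
LOAD hash128(date, store_id, pos_id, invoice_id) as %InvoiceKey,
     date, time, invoice_id, store_id, pos_id;
SQL SELECT date, time, invoice_id, store_id, pos_id FROM Table1;

STORE InvoiceHead INTO InvoiceHead.qvd (qvd);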

Hmmm, speaking of incremental loads, that might mean your load times are more fair than I thought, and don't need to include the time to create the composite key.  Most of our large QVDs like this are created incrementally.  Our largest is perhaps twenty million rows.  But we certainly don't load twenty million rows from the database every time we create the QVD.  It's an incremental load, and we probably only load tens of thousands of rows from the database each time.  That time is negligible compared to the time to create a synthetic key across tens of millions of rows every time you load FROM the QVD.

Another good reason to have composite keys is the frequency with which we use exists(key) when loading from QVDs.  In many cases, having it available will speed things up regardless of whether you're ultimately building synthetic keys or not.  And if you've already built composite keys, why build a synthetic key to do the same thing?
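
As an illustration of that last point, with the key already sitting in both QVDs, the loads can stay optimized and still be filtered (again, names are just for illustration):

InvoiceHead:
LOAD %InvoiceKey, date, time, invoice_id, store_id, pos_id
FROM InvoiceHead.qvd (qvd);

InvoiceRow:
LOAD %InvoiceKey, product_id, product_description, quantity, price, revenue
FROM InvoiceRow.qvd (qvd)
WHERE exists(%InvoiceKey);   // single-field exists() keeps the QVD load optimized

And because only %InvoiceKey is shared between the two tables, no synthetic key gets built on top of it.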

I feel like I'm rambling, but hopefully it's good rambling.

Edit: I plan to modify my test to create a set of QVDs with a hash128() key.  I won't count the extra time required to create the key because we are assumed to be using incremental loads.  With that QVD in place, I expect the hash128() key approach to beat the synthetic key approach in all of the measures we're making except for peak RAM.  That is, of course, the limitation that spawned this thread, so I may still recommend synthetic keys in this particular example.  But where that isn't an issue, I expect to be backing off from my position on synthetic keys.  I don't know if I'll be able to get to this today, unfortunately.  I have several tasks that I'm behind on.