Not applicable

Two huge tables with 4 fields in common

I have two big tables and the database is in SQL Server 2005.

Table1 has about 25,000,000 rows, where each row is a unique invoice, and Table2 has 50,000,000 rows, where each row is an item from an invoice (e.g. 3 rows for an invoice if the customer bought 3 different products).

When I try to concatenate the fields that I need to create a key, I get an out of logical memory error during the load. I've tried everything: concatenating all the fields into a key, changing the data types, relying on the synthetic key, etc.

I discovered that if I use varchar(10) or less it works, but anything above that doesn't work. To get the right key I need a varchar longer than 10.

The fields that are common are:

  • date (smalldatetime in SQL; used date(daystart(timestamp(date,'hh:mm:ss'))) in QV)
  • store_id
  • pos_id
  • invoice_id

The problem is that invoice_id isn't unique for each store, so I had to consider pos_id. I still got some errors in the results, and I discovered that the database has some cases where invoice_id isn't unique even for the same store_id and the same pos_id, depending on the period (don't ask me why), so I had to consider the date too. So my key became very big.

I tried the concatenation in several ways, for example:

In QlikView: date & store_id & pos_id & invoice_id

In the SQL SELECT: cast(date as varchar) + cast(store_id as varchar(3)) + cast(pos_id as varchar(3)) + cast(invoice_id as varchar(6))
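
Written out as a load script, the attempt looks roughly like this (the FROM / SQL SELECT part is left out here, and invoice_key is just the name I give the concatenated field):

Table1:
LOAD
    date & store_id & pos_id & invoice_id AS invoice_key,
    date,
    time,
    store_id,
    pos_id,
    invoice_id

It is this concatenated field that triggers the memory error once the values get long.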


The most important table for me is Table2, but I need the time information from Table1. I was able to get it using a join on the four fields, so a new column with the time information came up in Table2:

Table2:
LOAD date,
     store_id,
     pos_id,
     invoice_id,
     product_id,
     product_description,
     quantity,
     price,
     revenue

LEFT JOIN (Table2)
Table1:
LOAD date,
     time,
     invoice_id,
     store_id,
     pos_id


The thing is, I need to identify the unique invoices, and if I end up with just one table I need to be able to count distinct invoices.
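
What I want to end up with is a single key field (say invoice_key) so that the chart expression can simply be:

Count(DISTINCT invoice_key)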

I already tried to create a field in Table1 with all four fields concatenated, and after that I followed the same process as for the time field, but it didn't work (memory problem when the load reaches a certain number of rows).

It seems that when a table has a lot of rows, it cannot also have a field holding very long values.

Can anyone help me with a solution please?

17 Replies
johnw
Champion III

Well, here are my results:

key type    peak RAM   final RAM   load time   chart time   QVW size
synthetic    896 MB     568 MB      0:30        797 ms       115 MB
synthetic    893 MB     576 MB      0:31        828 ms
synthetic    868 MB     569 MB      0:31        844 ms
synthetic    887 MB     569 MB      0:32        797 ms
synthetic    864 MB     569 MB      0:31        859 ms
hash128     1104 MB     740 MB      0:11        797 ms       146 MB
hash128     1082 MB     733 MB      0:37        781 ms
hash128     1065 MB     733 MB      0:16        765 ms
hash128     1063 MB     730 MB      0:17        813 ms
hash128     1096 MB     733 MB      0:17        782 ms

I think this is significantly overstating the memory requirements of the hash128 key.  These rows are very short, so a 16-byte key actually represents a pretty big chunk of memory, which probably explains why the RAM usage is so much higher.  I usually have more and bigger fields in real tables, so the hash128 key would be a much smaller piece of the table.

I don't know why the hash128 load times varied so much.  Nothing else seemed to be going on on my machine.

I didn't keep retesting QVW size, because in theory it should be exactly the same every time as it's storing exactly the same information every time.  I don't remember any deviations from that yesterday.  It also takes a while to store a file of this size.

So I made some right predictions, and some wrong.  I might as well have flipped a coin.

Not applicable
Author

This is becoming very interesting!

Magnus, in one of my earlier attempts I did exactly what you said, but the same error appears.

I think it's because of my RAM limitations, because I'm loading:

  • InvoiceHead with 24m rows
  • InvoiceRow with 42m rows

With the following specs:

  • Windows 7 32-bit
  • 4 GB RAM
  • Intel Core 2 Duo P8600 @ 2.4 GHz

It seems that on a 32-bit system a single program can use only 2 GB of RAM.

johnw
Champion III

Right, a 32-bit program can only use 2 GB of RAM, regardless of how much RAM you actually have.  There's a setting that will give you 3 GB, but when I tried it myself, it made my computer unbootable and it took hours of tech help to recover.

I can see it being hard, if not impossible, to load 66 million rows of data into 2 GB of memory, or 3 GB for that matter.  You may need a 64-bit machine and 64-bit QlikView if you really need that much data to be accessible at once.

johnw
Champion III

Following up on my comment about the danger of using hash128 to make keys, I found an old post where I calculated that for 10 million random keys, the chance of having a hash collision was 0 to at least 25 decimal places.  My conclusion:

"I wouldn't solve the 'generated keys across multiple scripts' problem in this way. But mostly it's because it offends me mathematically, not because there would EVER be a problem in practice."

http://community.qlik.com/message/64902
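
As a rough sanity check using the birthday approximation (and assuming the 128-bit hash values are effectively uniform and independent): for n = 10,000,000 keys the collision probability is about n(n-1)/2 / 2^128 ≈ (10^7)^2 / 2^129 ≈ 1.5 × 10^-25, i.e. zero to roughly 24-25 decimal places, which is the same order of magnitude as the figure quoted above.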

magavi_framsteg
Partner - Creator III

Hello again.

I was unable to continue testing today.
This will just be a quick reply; unfortunately I won't be able to catch up on this until August 15th (holiday).

At that date I will load Head with 100m rows and Lines with 400m rows.
That would be quite a nice test.

The biggest machine has 1 TB RAM and Xeon 7550s with 4 CPUs, each with 8 cores = 32 cores in total - BAD ASS I tell you =))))

John, I agree autonumber (with or without hash) usually is not possible.
What you can do, if possible, at the client's DW or DB or whatever they have, is create a mapping table:

compositeKey                key
date+invoice+pos+store      1
date+invoice+pos+store      2
date+invoice+pos+store      3
date+invoice+pos+store      ..n

And have your DW-views (preferred) join this key table.

That way incremental loads are possible, and you get a very small and effective key.

This I've found very useful.
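
A rough sketch of what the QlikView side might then look like (untested; the dw.v_invoice_head / dw.v_invoice_row view names and the invoice_key field are placeholders for whatever the DW actually exposes):

InvoiceHead:
LOAD invoice_key,
     date,
     time;
SQL SELECT invoice_key, date, time
FROM dw.v_invoice_head;

InvoiceRow:
LOAD invoice_key,
     product_id,
     quantity,
     price,
     revenue;
SQL SELECT invoice_key, product_id, quantity, price, revenue
FROM dw.v_invoice_row;

The two tables then associate on the single small invoice_key field, and counting invoices becomes Count(DISTINCT invoice_key).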

Also, I did the measurements the same way you did, but I couldn't figure out peak RAM.
But I guess recording performance counters or just looking at Task Manager is good enough.

Cassiano:

If you absolutely need all invoice rows and you run out of memory, is that during the load or during the evaluation of your chart?

In the latter case you could always enable "show conditions" and/or "calculation conditions".

That way you can force the user to select at least ONE YearMonth or one store or whatever.
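
For example, a calculation condition along these lines (just an illustration, using field names from this thread) would stop the chart from calculating until the user has narrowed things down:

GetSelectedCount(YearMonth) >= 1 or GetSelectedCount(store_id) >= 1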

Well, this was a very fun end of week before my summer holiday =))

Thanks to you all, and especially to John W, for giving us this nasty, fun, interesting problem.

Not applicable
Author

Just to end this discussion: my problem is actually the amount of RAM.

During the load process I left Task Manager open and watched the amount of RAM that QlikView was consuming. It reached almost 1,850,000 KB before the error message. It seems that I need more RAM and a 64-bit machine, as John mentioned.

If anybody has this kind of problem in the future, I advise doing the test above.

I think all this discussion will be helpful to anyone looking for information about the options that exist for joining large tables when more than one field is needed for the key.

Thank you John and Magnus, and everyone who tried to help! I really appreciate it.

magavi_framsteg
Partner - Creator III

Always glad to be of service.

Kind regards

Magnus Åvitsland

BI Consultant

Framsteg Business Intelligence Corp.

dominicmander
Partner - Creator

Just out of interest ...

John's tests seem to suggest that peak RAM could be as much as twice final RAM in calculating your key.

Given this fact, might there be potential in working around your hardware limitation by creating a loop that loads, for example, only a month's worth of data at a time, calculates your key using a hash (assuming you're willing to get over the mathematical offence), stores the result into a prefixed or suffixed QVD, and then drops the table before going around the loop again to load the next month, and so on?

This way you could calculate the keys you need in small chunks, and then load and concatenate the series of QVDs to build your data model separately, without needing to calculate your key all in one go (there's a rough sketch of what I mean below).

I haven't considered whether the final model would still be too big for your hardware to handle, but this way you might overcome the peak RAM overhead from calculating your key, and with a little luck that might be enough to get you on your way.
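
Something along these lines, purely as an untested sketch - the month list, the invoice_month filter column and the file names are placeholders rather than anything from Cassiano's actual database:

FOR Each vMonth in '201001', '201002', '201003'    // placeholder month list

    InvoiceRows:
    LOAD
        Hash128(date, store_id, pos_id, invoice_id) AS invoice_key,
        date, store_id, pos_id, invoice_id,
        product_id, quantity, price, revenue;
    SQL SELECT date, store_id, pos_id, invoice_id,
        product_id, quantity, price, revenue
    FROM Table2
    -- invoice_month is a placeholder for whatever lets you restrict to one month
    WHERE invoice_month = '$(vMonth)';

    STORE InvoiceRows INTO [InvoiceRows_$(vMonth).qvd] (qvd);
    DROP TABLE InvoiceRows;

NEXT vMonth

// afterwards the chunks can be read back and will auto-concatenate,
// since every QVD has the same field list
InvoiceRows:
LOAD * FROM [InvoiceRows_*.qvd] (qvd);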

Of course, if, as Magnus suggests, you are able to get a key in the DW/DB, that would also solve the problem of the key calculation overhead.