17 Replies Latest reply: Sep 19, 2011 11:11 AM by Dominic Mander

    Two huge tables with 4 fields in common

      I have two big tables, and the database is in SQL Server 2005.

       

      Table1 has about 25.000.000 rows, where each row is a unique invoice, and Table2 has 50.000.000 rows, where each row is an item from an invoice (e.g. 3 rows for an invoice if the customer bought 3 different products).

       

      When I try to concatenate the fields that I need to create a key, I get a logical memory error during the load. I've tried everything: concatenating all the fields and creating a key, changing the data field type, using a synthetic key, etc.

       

      I discovered that if I use varchar(10) or less it works, but anything above that doesn't work. To get the right key I need a varchar bigger than 10.

       

      The fields that are common are:

      • Date (smalldatetime in SQL, used date(daystart(timestamp(date,'hh:mm:ss'))) in QV)
      • store_id
      • pos_id
      • invoice_id

       

      The problem is that invoice_id isn't unique for each store, so I had to consider pos_id. I still got some errors in the results, and I discovered that the database has some cases where invoice_id isn't unique even for the same store_id and the same pos_id, depending on the period (don't ask me why), so I had to consider the date too. So my key became very big.

       

      I tried the concatenate function in several ways like:

       

      In QlikView: date & store_id & pos_id & invoice_id

       

      In the SQL SELECT: cast(date as varchar) + cast(store_id as varchar(3)) + cast(pos_id as varchar(3)) + cast(invoice_id as varchar(6))


      The most important table for me is Table2, but I need the time information from Table1. I was able to get it using a join on the four fields, so a new column with the time information came up in Table2:

       

      Table2:
      LOAD date,
           store_id,
           pos_id,
           invoice_id,
           product_id,
           product_description,
           quantity,
           price,
           revenue
      FROM ...;   // invoice line source

      LEFT JOIN (Table2)   // Table1, the invoice header, joined in to bring over the time field
      LOAD date,
           time,
           invoice_id,
           store_id,
           pos_id
      FROM ...;

      The thing is, I need to identify unique invoices, and if I end up with just one table I still need to be able to count distinct invoices.

       

      I already tried to create a field in Table1 with all four fields concatenated, and after that did the same process as for the time field, but it didn't work (memory problem when it reaches a certain number of loaded rows).

       

      It seems that when a table has a lot of rows, it can't also hold a field with very large values.

       

      Can anyone help me with a solution please?

        • Re: Two huge tables with 4 fields in common
          John Witherspoon

          Well, you can count distinct invoices with count(distinct invoice_id).  If one table isn't the solution to your problem, what happens if you leave the tables as they were?  By that I mean what happens if you leave the date field, store_id field, pos_id field and invoice_id alone and on their respective tables?  I would expect a 4-field synthetic key to be formed, but if it fits in your memory and gives the right answers, there's nothing specifically wrong with this synthetic key.
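          Something like this, just as a sketch (field names taken from your post, sources omitted since you haven't shown them):

          Table1:
          LOAD date,
               time,
               invoice_id,
               store_id,
               pos_id
          FROM ...;   // invoice header source

          Table2:
          LOAD date,
               store_id,
               pos_id,
               invoice_id,
               product_id,
               product_description,
               quantity,
               price,
               revenue
          FROM ...;   // invoice line source

          QlikView should then build the 4-field synthetic key on date, store_id, pos_id and invoice_id between the two tables automatically.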

            • Two huge tables with 4 fields in common

              Thanks for the replies, guys, but I've tried working with one table, working with two tables, and with a synthetic key... in all the scenarios I couldn't load all the records because of the memory problem.

               

              As I said, after a lot of tries I was able to handle the first challenge, joining the two tables and getting the time from Table1 into Table2. I loaded all the records without a problem and it worked. But I wasn't able to count distinct invoices.

               

              I even tried to create a new column in the original SQL database for Table1 with a unique value (1 to N) and load that field, but I still got the same problem. I think the number of distinct values is too big (25.000.000).

               

              Using the autonumber function would get me to the same place I described above, and another problem with autonumber is that I'll be using incremental loads for all of this in the future.

                • Two huge tables with 4 fields in common
                  Henrik Matz

                  Have you tried creating a map table of Table1 using autonumber(date, store, pos, invoice) and then using applymap when loading Table2 to get the time?

                  Henrik
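                  Roughly like this (only a sketch; the sources and exact field names are guesses based on your post):

                  TimeMap:
                  MAPPING LOAD
                      autonumber(date & '|' & store_id & '|' & pos_id & '|' & invoice_id, 'inv') as %InvKey,
                      time;
                  SQL SELECT date, store_id, pos_id, invoice_id, time
                  FROM Table1;   // your invoice header table

                  Table2:
                  LOAD *,
                      applymap('TimeMap',
                               autonumber(date & '|' & store_id & '|' & pos_id & '|' & invoice_id, 'inv')) as time;
                  SQL SELECT date, store_id, pos_id, invoice_id, product_id,
                         product_description, quantity, price, revenue
                  FROM Table2;   // your invoice line table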

                    • Two huge tables with 4 fields in common

                      No, but I've just tried it and it didn't work; I got the same memory error: "OUT OF VIRTUAL AND/OR LOGICAL MEMORY, allocating 2MB".

                       

                      Maybe instead of concatenating 4 fields into one unique key I can try concatenating just 2 of them into key1 and the other 2 into key2, and use two keys?

                       

                      I'm starting to think that I can't load all the data. How can I know the limit? I think the problem is a big field with values from 1 to 25.000.000 and a big table with more than 40.000.000 rows.

                • Two huge tables with 4 fields in common
                  Nagaian Krishnamoorthy

                  Neither concatenating nor joining is a sound approach for handling header/detail tables from a database design point of view. Table1 is a header table with one row for each unique invoice (probably defined by a group of fields such as invoice_id, store_id, pos_id, date, etc.). Table2 is a detail table which may contain one or more rows for each unique invoice. These tables should be linked (in the QlikView sense) by the unique invoice if you need to report using data from both tables.

                   

                  As advised by John, keeping the tables separate but linked seems to be the best, and perhaps the only correct, approach.

                   

                  If you are concerned about the synthetic keys, you can eliminate them by using a combined key (e.g. with the Autonumber function) and renaming some fields in one of the tables.
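                  For example (a rough sketch only, reusing the field names from earlier in the thread):

                  Invoices:
                  LOAD autonumber(date & '|' & store_id & '|' & pos_id & '|' & invoice_id) as %InvoiceKey,
                       date, time, store_id, pos_id, invoice_id
                  FROM ...;   // Table1 (header)

                  InvoiceLines:
                  LOAD autonumber(date & '|' & store_id & '|' & pos_id & '|' & invoice_id) as %InvoiceKey,
                       date       as line_date,         // renamed so that only %InvoiceKey links the tables
                       store_id   as line_store_id,
                       pos_id     as line_pos_id,
                       invoice_id as line_invoice_id,
                       product_id, product_description, quantity, price, revenue
                  FROM ...;   // Table2 (detail)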

                  • Re: Two huge tables with 4 fields in common
                    Magnus ÅVITSLAND

                    Hi Cassiano.Reis.

                     

                     I have experience with VERY large clients; my company is a QlikTech Elite Partner.

                    Tables of 300m+ rows are common.

                     

                    When working with large sets of data, the data model is *VERY* important.

                     You must NEVER have synthetic keys or loops.

                     

                     Even joining two big tables, like invoice headers and invoice rows, which are both usually massive tables, takes tremendous amounts of RAM; we're talking gigs here. And if it's joining on four fields (synthetic keys) it would never finish.


                    So my tip is:

                     

                     InvoiceHead:
                     LOAD
                     date & '~' & store_id & '~' & pos_id & '~' & invoice_id as [%InvoiceKey],
                     date as InvoiceHeadDate,
                     time
                     //invoice_id,
                     //store_id,
                     //pos_id
                     FROM ...;   // invoice header source (your Table1)

                     InvoiceRow:
                     LOAD
                     date & '~' & store_id & '~' & pos_id & '~' & invoice_id as [%InvoiceKey],
                     date as InvoiceRowDate,
                     store_id,
                     pos_id,
                     invoice_id,
                     product_id,
                     product_description,
                     quantity,
                     price,
                     revenue
                     FROM ...;   // invoice line source (your Table2)


                     And if you absolutely need to join the tables, join on ONE key, in this case [%InvoiceKey]; joining on four fields will make you run out of memory.
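                     Something like this (just a sketch, assuming the two loads above):

                     LEFT JOIN (InvoiceRow)
                     LOAD [%InvoiceKey],
                          time
                     RESIDENT InvoiceHead;

                     DROP TABLE InvoiceHead;   // if you really do want to end up with a single table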

                      • Re: Two huge tables with 4 fields in common
                        John Witherspoon

                        Magnus Åvitsland wrote:

                        ...

                        When working with large sets of data, the data model is *VERY* important.

                        You must NEVER have synthetic keys or loops.

                         

                        Even joining two big tables, like invoice headers and invoice rows, which are both usually massive tables, takes tremendous amounts of RAM; we're talking gigs here. And if it's joining on four fields (synthetic keys) it would never finish.

                        ...

                        And if you absolutely need to join the tables, join on ONE key, in this case [%InvoiceKey]; joining on four fields will make you run out of memory.

                         

                        Perhaps I should explain further why I suggest leaving the synthetic key in place.

                         

                        To the best of my knowledge, synthetic keys use slightly less memory and perform slightly faster than composite keys of the same structure.  To the best of my knowledge, that's because synthetic keys and composite keys are almost identical internally in QlikView, with probably a tiny bit less overhead for synthetic keys since you can't actually reference the table or key value directly.

                         

                        That would indicate that synthetic keys are merely no worse than composite keys.  However, in my experience, they're significantly better because of load time.  While some of us hypothesize that there may be a case where a synthetic key loads more slowly, all examples I've ever seen show a significantly faster load time for the synthetic key than the equivalent composite key.

                         

                        So let's take your example.  The composite key you suggest is not quite a 1:1 replacement for the synthetic key since it isn't given a separate table, but it is functionally identical and I think it should give similar performance results.  Certainly if we got some performance improvement out of moving it to one of the associated tables I'd be surprised, and it would be something I'd want to know.  So it should do just fine for our purposes.

                         

                        I've randomized data and stuck it into QVDs for our sources, and I make sure I don't get an optimized load so that we're comparing apples to apples.  I will compare performance loading the tables raw (and forming a 4-field synthetic key) vs. doing what you suggested with a composite key.  I'm using 2 million rows for the invoice headers and 8 million rows for invoice lines.  I'd use more, but I'm on an old computer and it'd start swapping to disk or run out of memory.  If you theorize that the ratios of performance measurements change in some way with ten or fifty times the volume, someone else may need to run the same test to see (application attached).  The chart mentioned is a sum of revenue by product (I've moved Revenue to the invoice line table so that both tables and all rows are used for this chart).  The chart time is measured by turning off caching, putting it on its own tab, switching to that tab when the load is complete and the CPU has gone back to 0, and checking the CalcTime.  I ran multiple trials.  Testing was done with 32-bit version 10 SR2.

                         

                        key type    peak RAM  final RAM  load time  chart time  QVW size
                        synthetic   760000 K  424000 K   0:37       812 MS      149000 K
                        synthetic   740000 K  424000 K   0:39       828 MS      149000 K
                        synthetic   760000 K  424000 K   0:38       812 MS      149000 K
                        composite   821000 K  493000 K   1:01       797 MS      163000 K
                        composite   819000 K  493000 K   0:59       797 MS      163000 K
                        composite   821000 K  495000 K   0:59       797 MS      163000 K

                         

                        Some of these numbers did surprise me a little.  I did expect the composite key to use more memory, but only slightly, almost negligibly.  Instead, the synthetic key is significantly better in this regard.  That further cements my recommendation to leave the synthetic key in place in this case, since memory is apparently the critical resource for this load, and is conserved by using a synthetic key.  Second, I expected chart times to be nearly identical.  They are, but there IS a tiny but consistent edge to the composite key in these measurements.  I'm going to guess that that's because it isn't an exact 1:1 replacement for the synthetic key, so there's one less link to follow, but that IS an advantage if it's real and not just noise in the data.

                         

                        I've attached the application I used for this testing if someone wishes to duplicate or contradict my results on a different computer.

                         

                        For more information on synthetic keys, I recommend this thread:

                         

                        http://community.qlik.com/thread/10279?start=30&tstart=0

                         

                        Edit:  We typically autonumber() a composite key like this, so I tried that for comparison:

                         

                        key type    peak RAM  final RAM  load time  chart time  QVW size
                        synthetic   760000 K  424000 K   0:37       812 MS      149000 K
                        synthetic   740000 K  424000 K   0:39       828 MS      149000 K
                        synthetic   760000 K  424000 K   0:38       812 MS      149000 K
                        composite   821000 K  493000 K   1:01       797 MS      163000 K
                        composite   819000 K  493000 K   0:59       797 MS      163000 K
                        composite   821000 K  495000 K   0:59       797 MS      163000 K
                        autonumber  897000 K  405000 K   1:42       781 MS      146000 K
                        autonumber  899000 K  407000 K   1:42       813 MS      146000 K
                        autonumber  898000 K  407000 K   1:42       828 MS      146000 K

                         

                        The load slows down a lot and uses more peak RAM, so this is probably a poor solution in this case. However, the final RAM and file size are lower than with the synthetic key. That may be useful in some cases.

                          • Re: Two huge tables with 4 fields in common
                            Magnus ÅVITSLAND

                            Hi John.

                             

                            Thanks for your very interesting reply.

                            I can't resist participating in the analysis of this challenging and interesting problem.

                            In fact, I am testing your app right now using QV Desktop v9 SR7 x64.

                            The good thing is I have massive servers at my client.

                             

                            I made some modifications to the script; it is attached to this post:

                            testSyntheticKey7.qvw

                            What I did was pre-create the composite key, because that is usually what you do, especially when the QVD is loaded by more than one file.

                            But for the load-time measurements I create the composite key in the load script; it only exists in the QVD for quicker modelling and calculations with the QV Optimizer.

                             

                            The specs are:

                            - Physical Windows Server 2008 R2 x64

                            - 128 GB RAM

                            - Intel Xeon E5530 @ 2.4 GHz

                            - 2 CPUs x 4 cores = 8 cores in total

                             

                            I will send my results later today, or tomorrow.

                            Can you please explain how you did all your measurements, like peak RAM, final RAM, etc.?

                            Thanks.

                             

                            My results so far:

                             

                            InvoiceHead with 2m rows

                            InvoiceRow with 8m rows

                             

                            key type    peak RAM  final RAM  load time  chart time  QVW size
                            synthetic   xxxxxx K  456872 K   1:20       343 MS      149000 K
                            synthetic   xxxxxx K  452296 K   1:21       344 MS      149000 K
                            synthetic   xxxxxx K  452020 K   1:16       312 MS      149000 K

                            composite   xxxxxx K  509164 K   5:27       312 MS      163000 K
                            composite   xxxxxx K  511024 K   5:26       296 MS      163000 K
                            composite   xxxxxx K  511504 K   5:28       328 MS      163000 K

                            autonumber  xxxxxx K  442524 K   5:39       296 MS      117000 K
                            autonumber  xxxxxx K  484360 K   5:23       328 MS      117000 K
                            autonumber  xxxxxx K  485336 K   5:25       296 MS      117000 K

                             

                             

                             

                             

                            InvoiceHead with 20m rows

                            InvoiceRow with 80m rows

                            Optimized load, with the composite keys pre-created in the QVDs; otherwise the test would take too long.

                            But since it's so much data, the load times are long enough for a (rather) scientific test anyway.

                             

                            key type    peak RAM   final RAM   load time  chart time  QVW size
                            synthetic   xxxxxx KB  5847468 KB  5:29       2995 MS     1173000 KB
                            synthetic   xxxxxx KB  5847504 KB  5:58       2918 MS     1173000 KB
                            synthetic   xxxxxx KB  5850380 KB  6:01       2917 MS     1173000 KB

                            composite   xxxxxx KB  7352208 KB  3:23       2901 MS     1185451 KB
                            composite   xxxxxx KB  7353248 KB  3:37       2964 MS     1185451 KB
                            composite   xxxxxx KB  7353180 KB  3:48       3026 MS     1185451 KB

                            autonumber  xxxxxx KB  5658888 KB  3:10       2964 MS     1012265 KB
                            autonumber  xxxxxx KB  5655144 KB  3:19       2948 MS     1012265 KB
                            autonumber  xxxxxx KB  5659204 KB  3:02       2932 MS     1012265 KB

                             

                            Conclusion:

                            I should have run another test with InvoiceHead at 50 000 000 rows and InvoiceLine at 200 000 000 rows, but there was simply no time for that; these tests took almost the whole day.

                             

                            I can't directly compare the two sets of results, because in the first test (2m and 8m rows) the key fields were calculated on the fly by the load script, while for the second test I absolutely needed to pre-create them in the QVD to get the optimized load; otherwise I would have needed some automated test tools to run it.

                             

                            Usually composite keys are in fact pre-created in the QVD layer (data layer) so that the application layer can load all QVDs optimized.

                            In terms of load time I think composite keys have the edge.

                            Also, when it comes to RAM usage, composite keys use slightly less.

                            Synthetic keys, though, seem to be more efficient, with faster chart calculations.

                             

                            So I would have to re-run the first test.

                            No evidence yet.

                             

                            Uploaded testSyntheticKey7.qvw

                              • Re: Two huge tables with 4 fields in common
                                John Witherspoon

                                Thank you for taking this on.  I'm VERY curious how things change or stay the same with larger volumes of data, and as we make the test a bit more realistic.

                                 

                                I measured peak RAM by having Windows Task Manager open and watching the memory usage of the QlikView process.  I recorded the highest number I saw.  I measured final RAM the same way, looking at memory usage of the process once it completed.  Load time I calculated from the start and end timestamps in the log.  Chart time I already mentioned, and QVW size should be obvious.

                                 

                                I agree that in the real world, you'd create the composite keys during creation of the QVDs, so I probably should have  done that as well.  However, that time is taken SOMEWHERE, so in that case, I would have needed to measure the difference in time for creating the QVDs with and without the composite keys and added that in.  It seemed simpler and hopefully fair to create the composite keys during the load, and just make sure both loads were unoptimized so I wasn't getting an artificial advantage.  With the more realistic approach, you'd want two sets of QVDs, one with the composite key and one without. 

                                 

                                What are you doing for the autonumber tests?  Does the QVD contain autonumbered keys?  In many cases, that would be unrealistic, as QVDs are often created by different QVWs.  However, if we knew that autonumbered keys were better (I believe they are if they can be done), it would be a simple matter to do both the invoice header QVD and the invoice row QVD in the same QVW so that the autonumber results were consistent.  Hmmm, this could also present problems for incremental loads, as we'd likely be using in practice.  Most of mine load from the database first, then from the QVD.  I think that would make it hard to autonumber.  We could use a hash128() perhaps, but I wouldn't.  Even though in practice you'd never have a collision, it is POSSIBLE to have a collision, and I don't like opening that door.  Still, the chances are so low that we might as well be afraid of cosmic rays causing load problems as well.  Or maybe I calculated the chances with hash256().  I can't remember.  Probably hash128() is a practical solution for autonumbered keys when we're using incremental loads.
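                                To be concrete about what I mean, the incremental QVD build might look roughly like this (a sketch only; the source, the file name and the vLastReloadDate variable are made up):

                                InvoiceHead:
                                LOAD *,
                                    hash128(date, store_id, pos_id, invoice_id) as %InvoiceKey;
                                SQL SELECT date, time, invoice_id, store_id, pos_id
                                FROM Table1
                                WHERE date >= '$(vLastReloadDate)';   // only the new rows

                                Concatenate (InvoiceHead)
                                LOAD * FROM InvoiceHead.qvd (qvd)     // skip on the very first run, before the QVD exists
                                WHERE not exists(%InvoiceKey);        // keep the old rows we didn't just reload

                                STORE InvoiceHead INTO InvoiceHead.qvd (qvd);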

                                 

                                Hmmm, speaking of incremental loads, that might mean your load times are more fair than I thought, and don't need to include the time to create the composite key.  Most of our large QVDs like this are created incrementally.  Our largest is perhaps twenty million rows.  But we certainly don't load twenty million rows from the database every time we create the QVD.  It's an incremental load, and we probably only load tens of thousands of rows from the database each time.  That time is negligible compared to the time to create a synthetic key across tens of millions of rows every time you load FROM the QVD.

                                 

                                Another good reason to have composite keys is the frequency with which we use exists(key) when loading from QVDs.  In many cases, having it available will speed things up regardless of whether you're ultimately building synthetic keys or not.  And if you've already built composite keys, why build a synthetic key to do the same thing?
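                                For example, something like this (sketch only, assuming both QVDs carry the %InvoiceKey field from the build above):

                                InvoiceHead:
                                LOAD * FROM InvoiceHead.qvd (qvd);

                                InvoiceRow:
                                LOAD * FROM InvoiceRow.qvd (qvd)
                                WHERE exists(%InvoiceKey);            // only rows whose header was loaded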

                                 

                                I feel like I'm rambling, but hopefully it's good rambling.

                                 

                                Edit: I plan to modify my test to create a set of QVDs with a hash128() key.  I won't count the extra time required to create the key because we are assumed to be using incremental loads.  With that QVD in place, I expect the hash128() key approach to beat the synthetic key approach in all of the measures we're making except for peak RAM.  That is, of course, the limitation that spawned this thread, so I may still recommend synthetic keys in this particular example.  But where that isn't an issue, I expect to be backing off from my position on synthetic keys.  I don't know if I'll be able to get to this today, unfortunately.  I have several tasks that I'm behind on.

                                  • Re: Two huge tables with 4 fields in common
                                    John Witherspoon

                                    Well, here are my results:

                                     

                                    key type   peak RAM  final RAM  load time  chart time  QVW size
                                    synthetic   896 MB   568 MB     0:30       797 MS      115 MB
                                    synthetic   893 MB   576 MB     0:31       828 MS     
                                    synthetic   868 MB   569 MB     0:31       844 MS
                                    synthetic   887 MB   569 MB     0:32       797 MS
                                    synthetic   864 MB   569 MB     0:31       859 MS
                                    hash128    1104 MB   740 MB     0:11       797 MS      146 MB
                                    hash128    1082 MB   733 MB     0:37       781 MS
                                    hash128    1065 MB   733 MB     0:16       765 MS
                                    hash128    1063 MB   730 MB     0:17       813 MS
                                    hash128    1096 MB   733 MB     0:17       782 MS

                                     

                                    I think this is significantly overstating the memory requirements of the hash128 key.  These rows are very short, so a 16-byte key actually represents a pretty big chunk of memory, which probably explains why the RAM usage is so much higher.  I usually have more and bigger fields in real tables, so the hash128 key would be a much smaller piece of the table.

                                     

                                    I don't know why the hash128 load times varied so much.  Nothing else seemed to be going on on my machine.

                                     

                                    I didn't keep retesting QVW size, because in theory it should be exactly the same every time as it's storing exactly the same information every time.  I don't remember any deviations from that yesterday.  It also takes a while to store a file of this size.

                                     

                                    So I made some right predictions, and some wrong.  I might as well have flipped a coin.

                                    • Re: Two huge tables with 4 fields in common
                                      Magnus ÅVITSLAND

                                      Hello again.

                                       

                                      I was unable to continue testing today.
                                      This will just be a quick reply; I won't be able to catch up on this until August 15th, unfortunately (holiday).

                                      On that date I will load Head with 100m rows and Lines with 400m rows.
                                      That would be quite a nice test.

                                      The biggest machine has 1 TB RAM and Xeon 7550s: 4 CPUs, each with 8 cores = 32 cores in total - BAD ASS I tell you =))))

                                       

                                      John, I agree that autonumber (with or without hash) usually is not possible.
                                      What you can do, if it's possible in the client's DW or DB or whatever they have, is create a mapping table:

                                       

                                      compositeKey               key
                                      date+invoice+pos+store     1
                                      date+invoice+pos+store     2
                                      date+invoice+pos+store     3
                                      date+invoice+pos+store     ..n

                                       

                                      And have your DW-views (preferred) join this key table.

                                      That way incremental loads are possible, and you get a very small and effective key.

                                      This I've found very useful.
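                                      On the QlikView side it then becomes something like this (sketch only; the view and column names are just examples):

                                      InvoiceHead:
                                      LOAD *;
                                      SQL SELECT key, date, time, store_id, pos_id, invoice_id
                                      FROM dw.v_invoice_head;    // view already joined to the key table

                                      InvoiceRow:
                                      LOAD *;
                                      SQL SELECT key, product_id, product_description, quantity, price, revenue
                                      FROM dw.v_invoice_row;     // exposes the same surrogate key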

                                       

                                      Also, I did the measurements the same way you did, but I couldn't figure out peak RAM.
                                      But I guess recording performance counters or just watching Task Manager is good enough.

                                       

                                      Cassiano:

                                      If you absolutely need all the invoice rows and you run out of memory, is that during the load or during evaluation of your chart?

                                      In the latter case you could always enable "show conditions" and/or "calculation conditions".

                                      That way you can force the user to select at least ONE YearMonth or one store or whatever.
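                                      A typical calculation condition would be something along the lines of (YearMonth here stands for whatever calendar field you have):

                                      GetSelectedCount(YearMonth) >= 1 or GetSelectedCount(store_id) >= 1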

                                       

                                      Well, this was a very fun end of week before my summer holiday =))

                                      Thanks to you all, and especially John W, for giving us this nasty, fun, interesting problem.

                                        • Two huge tables with 4 fields in common

                                          Just to end this discussion: my problem is actually the amount of RAM.

                                           

                                          During the load process I left Task Manager open and watched the amount of RAM that QlikView was consuming. It reached almost 1.850.000 before the error message. It seems that I need more RAM and a 64-bit machine, as John mentioned.

                                           

                                          If anybody has this kind of problem in the future, I advise doing the test above.

                                           

                                          I think all this discussion will be helpful to anyone looking for information about the options that exist for joining large tables when more than one field is needed for the key.

                                           

                                          Thank you John and Magnus, and everyone who tried to help! I really appreciate it.

                                            • Two huge tables with 4 fields in common
                                              Magnus ÅVITSLAND

                                              Always glad to be of service.

                                               

                                               

                                              Kind regards

                                               

                                              Magnus Åvitsland

                                              BI Consultant

                                              Framsteg Business Intelligence Corp.

                                              • Two huge tables with 4 fields in common
                                                Dominic Mander

                                                Just out of interest ...

                                                 

                                                John's tests seem to suggest that peak RAM could be as much as twice final RAM when calculating your key.

                                                 

                                                Given that, might there be potential in working around your hardware limitation in this case by creating a loop that loads, for example, only a month's worth of data at a time, calculates your key using a hash (assuming you're willing to get over the mathematical offence), stores it into a prefixed or suffixed QVD, and then drops the table before going around the loop again to load the next month, and so on?

                                                 

                                                This way you could calculate the keys you need in small chunks, and then load and concatenate the series of QVDs created to build your data model separately, without needing to calculate your key all in one go.
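                                                A rough sketch of the kind of loop I mean (the variable names, the 36-month range and the source are only illustrative):

                                                FOR vMonth = 0 to 35
                                                    LET vStart = date(monthstart(addmonths(today(), -$(vMonth))), 'YYYY-MM-DD');
                                                    LET vEnd   = date(monthend(addmonths(today(), -$(vMonth))), 'YYYY-MM-DD');

                                                    Lines:
                                                    LOAD *,
                                                        hash128(date, store_id, pos_id, invoice_id) as %InvoiceKey;
                                                    SQL SELECT date, store_id, pos_id, invoice_id, product_id,
                                                           product_description, quantity, price, revenue
                                                    FROM Table2
                                                    WHERE date >= '$(vStart)' AND date <= '$(vEnd)';

                                                    STORE Lines INTO [Lines_$(vMonth).qvd] (qvd);   // one QVD per month
                                                    DROP TABLE Lines;
                                                NEXT vMonth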

                                                 

                                                I haven't considered whether the final model would still be too big for your hardware to handle, but this way you might overcome the peak RAM overhead from calculating your key, and with a little luck that might be enough to get you on your way.

                                                 

                                                Of course if, as Magnus suggests, you are able to get a key in the DW/DB that would also solve the problem of the key calculation overhead.

                                      • Two huge tables with 4 fields in common

                                        This is becoming very interesting!

                                         

                                        Magnus, in one of my earlier attempts I did exactly as you said, but the same error appeared.

                                         

                                        I think it's because of my RAM limitations, because I'm running:

                                         

                                        • InvoiceHead with 24m rows
                                        • InvoiceRow with 42m rows

                                         

                                        With the following specs:

                                        • Windows 7 32-bit
                                        • 4 GB RAM
                                        • Intel Core 2 Duo P8600 @ 2.4 GHz

                                         

                                         

                                        It seems that on a 32-bit system a single program can use only 2 GB of RAM.

                                          • Two huge tables with 4 fields in common
                                            John Witherspoon

                                            Right, a 32-bit program can only use 2 GB of RAM, regardless of how much RAM you actually have.  There's a setting that will give you 3 GB, but when I tried it myself, it made my computer unbootable and it took hours of tech help to recover.

                                             

                                            I can see it being hard to impossible to load 66 million rows of data into 2 GB of memory or 3 GB for that matter.  You may need a 64-bit machine and 64-bit QlikView if you really need that much data to be accessible at once.