30 Replies Latest reply: Jan 25, 2017 12:58 PM by s Walter RSS

    Load data without duplicate records - multiple columns

    s Walter

      Hi,

       

      i want to load tables, but without duplicate records. The value in column Product + Customer must be unique. It should be load always the record with the highes SaleNumber.

       

      How can i do it?

       

      Example - the red marked records shouldn´t load:

      2017-01-20_16h57_13.jpg

       

      Regards,

      sam


        • Re: Load data without duplicate records - multiple columns
          Sunny Talwar

          May be this:

           

          Table:

          LOAD Date,

              Product,

              SaleNumber,

              Customer

          FROM

          [..\..\Downloads\mehrere-spalten-vergleichen.xlsx]

          (ooxml, embedded labels, table is Tabelle1);

           

          Right Join (Table)

          LOAD Product,

            Customer,

            Max(SaleNumber) as SaleNumber

          Resident Table

          Group By Product, Customer;

            • Re: Load data without duplicate records - multiple columns
              s Walter

              Hi,

               

              i will try it, thanks.

               

              Is it possible to do that by only 1 load command, directly in the 1st step? Because i have a csv file and its very big (about 500 MB). I want to load less data from this csv. In your solution i would load all data.

               

              Is it correct, that if i load this csv without duplicates like the description in my first thread here. that Qlik Sense need less time to load? Need less memory?

               

              Regards,

              sam

                • Re: Load data without duplicate records - multiple columns
                  Sunny Talwar

                  If the data is sorted in this manner

                   

                  1st by Product,

                  2nd by Customer

                  3rd by SaleNumber descending order

                   

                  then you can do it using a single load... else I don't think you will be able to do this another way

                  • Re: Load data without duplicate records - multiple columns
                    Petter Skjolden

                    You can do it in a single LOAD:

                     

                    LOAD
                        Customer,
                        Product,
                        Date(FirstSortedValue( Date, -SaleNumber)) AS Date,
                        Max(SaleNumber) AS SaleNumber
                    FROM 
                      [lib://DATA/mehrere-spalten-vergleichen.xlsx]
                      (ooxml, embedded labels, table is Tabelle1)
                    GROUP BY
                      Product, Customer
                    ;
                    

                     

                    It does not depend on a certain sort order from the source data.

                      • Re: Load data without duplicate records - multiple columns
                        s Walter

                        Thanks a lot!

                         

                        Why do you do this? Is it important?

                         

                        Date(FirstSortedValue( Date, -SaleNumber)) AS Date,
                        

                         

                        Because, i try it without this line...

                         

                        LOAD
                            Product,
                            Customer,
                            //Date(FirstSortedValue( Date, -SaleNumber)) AS Date,
                            Max(SaleNumber) AS SaleNumber      
                        FROM [lib://concatenate/mehrere-spalten-vergleichen.xlsx]
                        (ooxml, embedded labels, table is Tabelle1)
                        Group By Product, Customer
                        ;
                        

                         

                        ... and get the correct result. But perhaps it was a coincidence?!

                         

                        If i try this code, too:

                         

                        LOAD
                            Product,
                            Customer,
                            //Date(FirstSortedValue( Date, -SaleNumber)) AS Date,
                            //Max(SaleNumber) AS SaleNumber      
                            SaleNumber
                        FROM [lib://concatenate/mehrere-spalten-vergleichen.xlsx]
                        (ooxml, embedded labels, table is Tabelle1)
                        Group By Product, Customer
                        ;
                        

                         

                        and get an error:

                         

                        2017-01-20_20h56_58.jpg

                        But don´t really understand why.

                         

                        Regards,

                        sam

                          • Re: Load data without duplicate records - multiple columns
                            Sunny Talwar

                            You only need Group By clause when you are aggregating....

                             

                            LOAD 

                                Product, 

                                Customer, 

                                //Date(FirstSortedValue( Date, -SaleNumber)) AS Date,

                                //Max(SaleNumber) AS SaleNumber       

                                SaleNumber 

                            FROM [lib://concatenate/mehrere-spalten-vergleichen.xlsx] 

                            (ooxml, embedded labels, table is Tabelle1) 

                            Group By Product, Customer;


                            Since you commented out both your aggregations, you don't need Group By any more


                            LOAD 

                                Product, 

                                Customer, 

                                //Date(FirstSortedValue( Date, -SaleNumber)) AS Date,

                                //Max(SaleNumber) AS SaleNumber       

                                SaleNumber 

                            FROM [lib://concatenate/mehrere-spalten-vergleichen.xlsx] 

                            (ooxml, embedded labels, table is Tabelle1);

                              • Re: Load data without duplicate records - multiple columns
                                s Walter

                                Sorry, but i got the next problem and question.

                                 

                                Now i try to create some fields in the load script and use "Group by" for these fields and get an error.

                                 

                                LOAD
                                    Date,
                                    Product,
                                    Customer,
                                    Max(SaleNumber) AS SaleNumber,     
                                    month (Date) as sMonth,
                                    SubstringCount(Customer, ' ')+1 as countProductWords
                                FROM [lib://concatenate/mehrere-spalten-vergleichen.xlsx]
                                (ooxml, embedded labels, table is Tabelle1)
                                Group By sMonth, countProductWords;
                                


                                2017-01-20_21h22_05.jpg


                                Is it possible to fix it?


                                regards,

                                sam

                                  • Re: Load data without duplicate records - multiple columns
                                    Sunny Talwar

                                    All your non-aggregated fields need to be in Group By statements or use FirstSortedValue here:

                                     

                                    LOAD 

                                        Date

                                        Product, 

                                        Customer, 

                                        Max(SaleNumber) AS SaleNumber,      

                                        month (Date) as sMonth, 

                                        SubstringCount(Customer, ' ')+1 as countProductWords 

                                    FROM [lib://concatenate/mehrere-spalten-vergleichen.xlsx] 

                                    (ooxml, embedded labels, table is Tabelle1) 

                                    Group By Date, Product, Customer;


                                    The newly created fields using existing fields doesn't have to go into Group By. For Example

                                    sMonth is created using Date, so Date goes in group by, but sMonth doesn't. Same is true for countProductWords

                                      • Re: Load data without duplicate records - multiple columns
                                        s Walter

                                        Mhh, my target now is to load tables, but without duplicate records. The value in column sMonth + countProductWords must be unique. It should be load always the record with the highes SaleNumber.

                                         

                                        it´is not possible, if i create this fields in the load script?

                                         

                                        Regards,

                                        sam

                                        • Re: Load data without duplicate records - multiple columns
                                          s Walter
                                          LOAD 
                                              Date,
                                              Product,
                                              Customer,
                                              Max(SaleNumber) AS SaleNumber,     
                                              month (Date) as sMonth,
                                              SubstringCount(Customer, ' ')+1 as countProductWords
                                          FROM [lib://concatenate/mehrere-spalten-vergleichen.xlsx]
                                          (ooxml, embedded labels, table is Tabelle1)
                                          //Group By Date, Product, Customer; 
                                          Group By sMonth, countProductWords;
                                          
                                          

                                          The newly created fields using existing fields doesn't have to go into Group By. For Example

                                          sMonth is created using Date, so Date goes in group by, but sMonth doesn't. Same is true for countProductWords


                                          Mhh, my target now is to load tables, but without duplicate records. The value in column sMonth + countProductWords must be unique. It should be load always the record with the highes SaleNumber.

                                           

                                          So, how can i get unique records, if i can´t use the created fields sMonth + countProductWords into Group By?

                                           

                                          Regards,

                                          sam

                                            • Re: Load data without duplicate records - multiple columns
                                              Sunny Talwar

                                              May be do it in the preceding load

                                               

                                              LOAD sMonth,

                                                        countProductWords

                                                        Max(SaleNumber) as SaleNumber

                                              Group By sMonth, countProductWords;

                                              LOAD

                                                  Date,

                                                  Product,

                                                  Customer,

                                                  month (Date) as sMonth,

                                                  SubstringCount(Customer, ' ')+1 as countProductWords

                                              FROM [lib://concatenate/mehrere-spalten-vergleichen.xlsx]

                                              (ooxml, embedded labels, table is Tabelle1);

                                               

                                              Now you next question will be that I lost Product and Customer and Date.... the problem is that you are only sharing bits and pieces of the information and since we don't know what the final goal is we might not be able to help you in one go.... it might help to explain what is the final requirement from this script? Do you just need sMonth and countProductWords or do you also need other fields from your original file? Do you still need to do this in one load? Avoid even preceding load?

                                                • Re: Load data without duplicate records - multiple columns
                                                  s Walter

                                                  Hi Sunny,

                                                   

                                                  yes, you are right, sorry. But i don´t always know all my questions before, because i test, test, test and test something and then i got the next idea, problem, error = question.

                                                   

                                                  So i try to describe the situation and final goal:

                                                   

                                                  The Table and data wich i use here are only test data, because it is easier to use and explain. Actually I have some csv files and they are very big (>300 MB) and have a lot of duplicates. I want to load the files from a directory. With the time new files are added and all csv files should allways load.

                                                   

                                                  I don´t want load all data from theses csv files in Qlik Sense, because to need less time to load and less memory. So i want to load only unique records and check the values of different columns for duplicates to load less data without duplicates.

                                                   

                                                  But in the csv are not all columns wich i need to check for duplicates. So i must first create the columns in the load script.

                                                   

                                                  In this example this is a csv:

                                                  2017-01-21_20h45_26.jpg

                                                  And in the load script i need this table with 2 created fields: sMonth and countProductWords

                                                  so that i can check the values from column Product+sMonth+contProductWords and get always the record with the highest SaleNumber. Then i can load from the csv less data in Qlik Sense.

                                                   

                                                  2017-01-21_20h54_13.jpg

                                                  I hope I could describe my goal completely.

                                                   

                                                  Regards,

                                                  sam

                                                    • Re: Load data without duplicate records - multiple columns
                                                      Sunny Talwar

                                                      I think you need to try this:

                                                       

                                                      Table:

                                                      LOAD Customer,

                                                                Product,

                                                                sMonth,

                                                                countProductWords,

                                                                Date(FirstSortedValue(Date, -SaleNumber)) as Date

                                                                Max(SaleNumber) as SaleNumber

                                                      Group By sMonth, countProductWords, Product, Customer;

                                                      LOAD

                                                          Date,

                                                          Product,

                                                          Customer,

                                                          month (Date) as sMonth,

                                                          SubstringCount(Customer, ' ')+1 as countProductWords

                                                      FROM [lib://concatenate/mehrere-spalten-vergleichen.xlsx]

                                                      (ooxml, embedded labels, table is Tabelle1);

                                                       

                                                      This will give you all the fields you need and it will group by sMonth, Product, Customer and countProductWords. In case you don't want customer, you can try this

                                                       

                                                      Table:

                                                      LOAD Product,

                                                                sMonth,

                                                                countProductWords,

                                                                Date(FirstSortedValue(Date, -SaleNumber)) as Date

                                                                Max(SaleNumber) as SaleNumber

                                                      Group By sMonth, countProductWords, Product;

                                                      LOAD

                                                          Date,

                                                          Product,

                                                          Customer,

                                                          month (Date) as sMonth,

                                                          SubstringCount(Customer, ' ')+1 as countProductWords

                                                      FROM [lib://concatenate/mehrere-spalten-vergleichen.xlsx]

                                                      (ooxml, embedded labels, table is Tabelle1);

                                                       

                                                      I hope this gives you what you are looking for....

                                                       

                                                      Best,

                                                      Sunny

                                                        • Re: Load data without duplicate records - multiple columns
                                                          s Walter

                                                          In case you don't want customer, you can try this

                                                           

                                                          Table:

                                                          LOAD Product,

                                                                    sMonth,

                                                                    countProductWords,

                                                                    Date(FirstSortedValue(Date, -SaleNumber)) as Date

                                                                    Max(SaleNumber) as SaleNumber

                                                          Group By sMonth, countProductWords, Product;

                                                          LOAD

                                                              Date,

                                                              Product,

                                                              Customer,

                                                              month (Date) as sMonth,

                                                              SubstringCount(Customer, ' ')+1 as countProductWords

                                                          FROM [lib://concatenate/mehrere-spalten-vergleichen.xlsx]

                                                          (ooxml, embedded labels, table is Tabelle1);

                                                           

                                                          I want to load all columns from the csv, but check duplicates based on sMonth, countProductWords, Product. So i try this. But now i cant´t find the field "Customer" under the fields.

                                                           

                                                          Regards,

                                                          Sam

                                                            • Re: Load data without duplicate records - multiple columns
                                                              Sunny Talwar

                                                              Did you not look at my response completely? I did give you two script... did you try this?

                                                               

                                                              Table:

                                                              LOAD Customer,

                                                                        Product,

                                                                        sMonth,

                                                                        countProductWords,

                                                                        Date(FirstSortedValue(Date, -SaleNumber)) as Date

                                                                        Max(SaleNumber) as SaleNumber

                                                              Group By sMonth, countProductWords, Product, Customer;

                                                              LOAD

                                                                  Date,

                                                                  Product,

                                                                  Customer,

                                                                  month (Date) as sMonth,

                                                                  SubstringCount(Customer, ' ')+1 as countProductWords

                                                              FROM [lib://concatenate/mehrere-spalten-vergleichen.xlsx]

                                                              (ooxml, embedded labels, table is Tabelle1);

                                                                • Re: Load data without duplicate records - multiple columns
                                                                  s Walter

                                                                  I read it completly. You wrote

                                                                  In case you don't want customer, you can try this

                                                                  I don´t want to check the field customer for duplicates. Only for Product+sMonth+contProductWords

                                                                   

                                                                  And in the load script i need this table with 2 created fields: sMonth and countProductWords

                                                                  so that i can check the values from column Product+sMonth+contProductWords and get always the record with the highest SaleNumber. Then i can load from the csv less data in Qlik Sense.

                                                                    • Re: Load data without duplicate records - multiple columns
                                                                      Sunny Talwar

                                                                      May be this then....

                                                                       

                                                                      Table:

                                                                      LOAD FirstSortedValue(Customer, -SaleNumber)) as Customer,

                                                                                Product,

                                                                                sMonth,

                                                                                countProductWords,

                                                                                Date(FirstSortedValue(Date, -SaleNumber)) as Date

                                                                                Max(SaleNumber) as SaleNumber

                                                                      Group By sMonth, countProductWords, Product, Customer;

                                                                      LOAD

                                                                          Date,

                                                                          Product,

                                                                          Customer,

                                                                          month (Date) as sMonth,

                                                                          SubstringCount(Customer, ' ')+1 as countProductWords

                                                                      FROM [lib://concatenate/mehrere-spalten-vergleichen.xlsx]

                                                                      (ooxml, embedded labels, table is Tabelle1);

                                                                        • Re: Load data without duplicate records - multiple columns
                                                                          s Walter

                                                                          Hi Sunny,

                                                                           

                                                                          thanks a lot for your help and patience ;-) Now it works.

                                                                           

                                                                          I attched an example and would like to summarize again what the script makes for me and other user´s.

                                                                           

                                                                          The script load Table "mehrere-spalten-vergleichen.xlsx" and add 2 new columns: tempMonth, countProductWords.

                                                                           

                                                                          It checks the values from column Product+Customer+tempMonth for duplicates and load only records with unique values here. It loads always the record with the highest value for SaleNumber from duplicate records.

                                                                           

                                                                          This table is loaded: mehrere-spalten-vergleichen.xlsx

                                                                          2017-01-23_10h31_05.jpg

                                                                          This is a section from the table after load in Qlik Sense:

                                                                           

                                                                          2017-01-23_10h35_47.jpg

                                                                          1. 2 added columns

                                                                          2. Values check for duplicates (Group By Product, Customer, tempMonth;)

                                                                          3. Check for the highest value for each duplicate record (Max(SaleNumber) as SaleNumber)

                                                                           

                                                                          I hope I have described everything correctly and it also helps others.

                                                                           

                                                                          Regards,

                                                                          Sam

                                                                • Re: Load data without duplicate records - multiple columns
                                                                  s Walter

                                                                  I have a problem, i want to load always 1 record from duplicates for Product, Customer, tempMonth.

                                                                   

                                                                  Now in the source table are 2 records with the max(SaleNumber) wich is the same:

                                                                   

                                                                  2017-01-25_17h36_34.jpg

                                                                  If i load...

                                                                   

                                                                  // Each field wich want to load and not Group by must be aggregated
                                                                  // Each field wich want to load and Group by must be not aggregated
                                                                  tblSales:
                                                                  LOAD
                                                                    FirstSortedValue(ID, -SaleNumber) as ID,
                                                                      Date(FirstSortedValue(sDate, -SaleNumber)) as sDate,
                                                                      Product,
                                                                      // To get the records with the highest value
                                                                      Max(SaleNumber) as SaleNumber,
                                                                      Customer,
                                                                      // create field:
                                                                      tempMonth,
                                                                      // create field:
                                                                      FirstSortedValue(countProductWords, -SaleNumber) as countProductWords
                                                                  // Fields to check together for duplicates
                                                                  Group By Product, Customer, tempMonth;
                                                                  LOAD
                                                                    ID,
                                                                      sDate,
                                                                      Product,
                                                                      SaleNumber,
                                                                      Customer,
                                                                      month (sDate) as tempMonth,
                                                                      SubstringCount(Customer, ' ')+1 as countProductWords
                                                                  FROM [lib://sources/mehrere-spalten-vergleichen.xlsx]
                                                                  (ooxml, embedded labels, table is Tabelle1);
                                                                  

                                                                   

                                                                  ... i got this:

                                                                   

                                                                  2017-01-25_17h34_58.jpg

                                                                   

                                                                  What can i do, to get only 1 of these bot records?

                                                                   

                                                                  2017-01-25_17h36_341.jpg

                                                                   

                                                                  Regards,

                                                                  Sam

                                                      • Re: Load data without duplicate records - multiple columns
                                                        Petter Skjolden

                                                        I thought you wanted the corresponding date together with the sales figure. That is why I included this line because that is what I saw as a specification in your sample table.


                                                        Date(FirstSortedValue( Date, -SaleNumber)) AS Date

                                                         

                                                        FirstSortedValue will pick the Date that is ranked by SaleNumber (highest) since ranking goes from lower to higher I have to invert the SaleNumber by inverting the sign (putting a minus in front does the trick).

                                                         

                                                        This will guarantee that you get the corresponding date - but obviously that is not what you want in your resulting table as you made clear. Sunny T explained how the grouping list of fields has to follow the non-aggregated fields in what you want to get as result fields.

                                                • Re: Load data without duplicate records - multiple columns
                                                  Robert Hutchings

                                                  Hi

                                                   

                                                  This question has been answered

                                                   

                                                  But Ive done this as follows (where I want to eliminate not combine duplicate lines)

                                                   

                                                  Do an initial load of DATA: (order by only works with a resident load)

                                                   

                                                  DATA:

                                                  LOAD

                                                  *

                                                  FROM ETC ;

                                                   

                                                  then do a second resident load by ordering as appropriate

                                                  something like

                                                   

                                                  DATA2:

                                                  // LEVEL 2

                                                  load

                                                  *

                                                  WHERE NOT EXISTS (temp1),

                                                  //level 1

                                                  load

                                                  *  ,

                                                  Product & Customer AS temp1

                                                  RESIDENT DATA

                                                  ORDER BY SaleNumber desc ;  // order by descending. Default  ascending


                                                  drop table DATA ;