13 Replies Latest reply: Aug 14, 2013 8:00 AM by Peter Schulz RSS

    Remove Duplicates

    Hardik Gandhi

      Hi,

       

      I have a table which holds two key fields: MemberID and CustomerID.

       

      Before moving forward I want to check for any duplicates in table with a Key combination of MemberID + CustomerID. (Both reside in the same table)

       

      Any thoughts?

       

      Thanks,

      H

        • Re: Remove Duplicates
          Jason Michaelides

          Try aggregating the resident table and joining to itself:

           

          Data:

          LOAD

             MemberID,

             CustomerID,

             1 AS Count

          From.....;

           

          JOIN (Data)

           

          LOAD

             MemberID,

             CustomerID,

             Sum(Count) AS Freq

          Resident Data

          Group by MemberID,CustomerID;

           

          That will show you the frequency of each row.

           

          Hope this helps,

           

          Jason

            • Re: Remove Duplicates
              Hardik Gandhi

              I want to flag the duplicates and remove them from the table.

                • Re: Remove Duplicates
                  Jason Michaelides

                  I assume you want to leave one of each duplicate record behind? Will all other fields also be identical? If not, how do you decide which record to keep?

                    • Re: Remove Duplicates
                      Hardik Gandhi

                      Yes all other fields would be identical. So I can keep any one record of the two or more.

                      • Re: Remove Duplicates

                        Hi. I want to do the same thing but all other fields are not identical. For example, if i have 2 customer id's, i only want to take that ID which has a higher "worth". Here "worth" is just a field name. Any help?

                        Thanks

                          • Re: Remove Duplicates
                            Peter Schulz

                            take a look at attached qvw...

                             

                            Script;

                             

                            TMP:

                            LOAD * INLINE [

                                ID, worth

                                1, 1

                                1, 2

                                2, 1

                                2, 2

                                3, 1

                                4, 1

                                4, 2

                                4, 3

                                4, 4

                            ];

                             

                             

                            TEST:

                            Noconcatenate Load

                            ID, max(worth) as worth

                             

                             

                            RESIDENT TMP GROUP by ID;

                             

                             

                            DROP TABLE TMP;

                             

                            I think Group by clause just works with data from resident table.

                             

                            Hope it helps.

                             

                            Regards

                             

                            - PS -

                              • Re: Remove Duplicates

                                Thank you very much. Really helpful. I only have the personal edition so can't view your file. Please could you copy and paste the script into this discussion. Also, I have other variables not just id and worth, where do I put these variables in?

                                Thanks

                                  • Re: Remove Duplicates
                                    Peter Schulz

                                    the script is pasted in my last post...?!

                                     

                                    With variables you mean fields(columns)? With more fields you get an error I think.

                                     

                                    I can imagine 2 possible ways to get the data for a full table(perhaps someone knows a better solution):

                                     

                                    1. Do as described an then do a join with the origin table to get the rest of the fields.But I'm not that confident with joins so I would do it like this:

                                     

                                    2. In short: Create a flag field in the origin table which fields should be loaded(load = 1 or 0)

                                     

                                    Create a merged field of ID and worth, like "ID|worth" while loading TEST(and origin table) and create a Mapping table with that data like:

                                    RELEVANT_ID_MAP:

                                    MAPPING LOAD

                                    ID|worth,1 as relevant

                                    RESIDENT TEST;

                                     

                                    Than load data(fields) from origin table(TMP) with:

                                     

                                    sometemptable:

                                    ApplyMap('RELEVANT_ID_MAP',ID|worth,'0') AS load,

                                     

                                    And finally load the data in a table with ... RESIDENT sometemptable where load=1;

                                     

                                    Look at it more as rough guide not a step by step tutorial...Sorry, I have not the time at the moment. Perhaps later I'll create a sample.

                                     

                                    I admit it looks a bit complicated but if you are used to it is really quick and for me it the safest way. Better than joining and not 100% sure what happens to my data, ending up in a mess.

                                     

                                    And finally I think there is sure somebody (more advanced) who can do this "better"/easier. But at the moment this works for me pretty well.

                                     

                                    regards,

                                     

                                    - PS -