3 Replies Latest reply: Nov 20, 2017 8:55 AM by Victor GREFFET RSS

    Remove duplicates

    Victor GREFFET

      Hi community,

       

      I think I have a consequent challenge...

       

      I work on big database extracted on Excel file.

      I discovered the LOAD DISTINCT --> remove my duplications. But, this time it's different.

       

      After removing my duplications with LOAD distinct, I thought I wouldn't find duplication in my field Ref (The field i'd like to be my unique key). But surprise after looking at one reference, I still have duplications because the 'line' looks like this :

       

      RefField 1Field 2Field 3Field 4Field 5......Field n
      Reference 1datadata
      Reference 1datadata
      Reference 1

      data

      datadatadatadata
      Reference 1data
      Reference 1datadata

       

      And I'd like to keep the less empty line (here the yellow one).

      Is there an awesome manipulation for that ?

       

      Thank you

        • Re: Remove duplicates
          Gysbert Wassenaar

          Nope. Only unawesome ones. Like for each record checking each field if it contains a value and keep count of the number of fields in the record that have a value. Then keep only the records with the largest counts per Ref value. And then keep only the first records per Ref with the largest counts since there could still be multiple records with the same count.

            • Re: Remove duplicates
              Victor GREFFET

              Thank you for your reply,

               

              So the unawesome ones are the only solution...

              after reviewing my database, i can simplify my problem :

               

              Refdate 1date 2
              Reference 1date
              Reference 1date
              Reference 1date
              Reference 1datedate
              Reference 1date
              Reference 1date

               

               

              I'd like to keep the line in Yellow and remove other ones, but if there isn't data in the date 2 field keep the first record. All that for all my References... How would you do ?

               

              Thanks.

                • Re: Remove duplicates
                  Victor GREFFET

                  Hi !

                  for people who follow the stream,

                  i found a solution for the case above (It works for date type only):

                  What I did :

                   

                  DATA:

                  LOAD Distinct

                      Reference,

                      Concat(date1) as dateone,

                      Concat(date2 as datetwo

                  FROM [lib://test/Microsoft Excel Worksheet.xlsx]

                  (ooxml, embedded labels, table is Feuil1)

                  Group by [Reference];

                   

                  left join DATA:

                  Load

                  [Reference],

                  if(len(trim(dateone))>10,date(right(dateone,10)),date(dateone)) as date1,

                  if(len(trim(datetwo))>10,date(right(datetwo,10)),date(datetwo)) as date2

                  Resident DATA;

                   

                  Drop field dateone, datetwo from DATA;

                   

                  store DATA into [lib://test/test.csv] (txt, delimiter is ';');