Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
victor_greffet
Partner - Contributor III
Partner - Contributor III

Remove duplicates

Hi community,

I think I have a consequent challenge...

I work on big database extracted on Excel file.

I discovered the LOAD DISTINCT --> remove my duplications. But, this time it's different.

After removing my duplications with LOAD distinct, I thought I wouldn't find duplication in my field Ref (The field i'd like to be my unique key). But surprise after looking at one reference, I still have duplications because the 'line' looks like this :

RefField 1Field 2Field 3Field 4Field 5......Field n
Reference 1datadata
Reference 1datadata
Reference 1

data

datadatadatadata
Reference 1data
Reference 1datadata

And I'd like to keep the less empty line (here the yellow one).

Is there an awesome manipulation for that ?

Thank you

1 Solution

Accepted Solutions
victor_greffet
Partner - Contributor III
Partner - Contributor III
Author

Hi !

for people who follow the stream,

i found a solution for the case above (It works for date type only):

What I did :

DATA:

LOAD Distinct

    Reference,

    Concat(date1) as dateone,

    Concat(date2 as datetwo

FROM [lib://test/Microsoft Excel Worksheet.xlsx]

(ooxml, embedded labels, table is Feuil1)

Group by [Reference];

left join DATA:

Load

[Reference],

if(len(trim(dateone))>10,date(right(dateone,10)),date(dateone)) as date1,

if(len(trim(datetwo))>10,date(right(datetwo,10)),date(datetwo)) as date2

Resident DATA;

Drop field dateone, datetwo from DATA;

store DATA into [lib://test/test.csv] (txt, delimiter is ';');

View solution in original post

3 Replies
Gysbert_Wassenaar

Nope. Only unawesome ones. Like for each record checking each field if it contains a value and keep count of the number of fields in the record that have a value. Then keep only the records with the largest counts per Ref value. And then keep only the first records per Ref with the largest counts since there could still be multiple records with the same count.


talk is cheap, supply exceeds demand
victor_greffet
Partner - Contributor III
Partner - Contributor III
Author

Thank you for your reply,

So the unawesome ones are the only solution...

after reviewing my database, i can simplify my problem :

Refdate 1date 2
Reference 1date
Reference 1date
Reference 1date
Reference 1datedate
Reference 1date
Reference 1date

I'd like to keep the line in Yellow and remove other ones, but if there isn't data in the date 2 field keep the first record. All that for all my References... How would you do ?

Thanks.

victor_greffet
Partner - Contributor III
Partner - Contributor III
Author

Hi !

for people who follow the stream,

i found a solution for the case above (It works for date type only):

What I did :

DATA:

LOAD Distinct

    Reference,

    Concat(date1) as dateone,

    Concat(date2 as datetwo

FROM [lib://test/Microsoft Excel Worksheet.xlsx]

(ooxml, embedded labels, table is Feuil1)

Group by [Reference];

left join DATA:

Load

[Reference],

if(len(trim(dateone))>10,date(right(dateone,10)),date(dateone)) as date1,

if(len(trim(datetwo))>10,date(right(datetwo,10)),date(datetwo)) as date2

Resident DATA;

Drop field dateone, datetwo from DATA;

store DATA into [lib://test/test.csv] (txt, delimiter is ';');