Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi community,
I think I have a consequent challenge...
I work on big database extracted on Excel file.
I discovered the LOAD DISTINCT --> remove my duplications. But, this time it's different.
After removing my duplications with LOAD distinct, I thought I wouldn't find duplication in my field Ref (The field i'd like to be my unique key). But surprise after looking at one reference, I still have duplications because the 'line' looks like this :
Ref | Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | ...... | Field n |
---|---|---|---|---|---|---|---|
Reference 1 | data | data | |||||
Reference 1 | data | data | |||||
Reference 1 | data | data | data | data | data | ||
Reference 1 | data | ||||||
Reference 1 | data | data |
And I'd like to keep the less empty line (here the yellow one).
Is there an awesome manipulation for that ?
Thank you
Hi !
for people who follow the stream,
i found a solution for the case above (It works for date type only):
What I did :
DATA:
LOAD Distinct
Reference,
Concat(date1) as dateone,
Concat(date2 as datetwo
FROM [lib://test/Microsoft Excel Worksheet.xlsx]
(ooxml, embedded labels, table is Feuil1)
Group by [Reference];
left join DATA:
Load
[Reference],
if(len(trim(dateone))>10,date(right(dateone,10)),date(dateone)) as date1,
if(len(trim(datetwo))>10,date(right(datetwo,10)),date(datetwo)) as date2
Resident DATA;
Drop field dateone, datetwo from DATA;
store DATA into [lib://test/test.csv] (txt, delimiter is ';');
Nope. Only unawesome ones. Like for each record checking each field if it contains a value and keep count of the number of fields in the record that have a value. Then keep only the records with the largest counts per Ref value. And then keep only the first records per Ref with the largest counts since there could still be multiple records with the same count.
Thank you for your reply,
So the unawesome ones are the only solution...
after reviewing my database, i can simplify my problem :
Ref | date 1 | date 2 |
---|---|---|
Reference 1 | date | |
Reference 1 | date | |
Reference 1 | date | |
Reference 1 | date | date |
Reference 1 | date | |
Reference 1 | date |
I'd like to keep the line in Yellow and remove other ones, but if there isn't data in the date 2 field keep the first record. All that for all my References... How would you do ?
Thanks.
Hi !
for people who follow the stream,
i found a solution for the case above (It works for date type only):
What I did :
DATA:
LOAD Distinct
Reference,
Concat(date1) as dateone,
Concat(date2 as datetwo
FROM [lib://test/Microsoft Excel Worksheet.xlsx]
(ooxml, embedded labels, table is Feuil1)
Group by [Reference];
left join DATA:
Load
[Reference],
if(len(trim(dateone))>10,date(right(dateone,10)),date(dateone)) as date1,
if(len(trim(datetwo))>10,date(right(datetwo,10)),date(datetwo)) as date2
Resident DATA;
Drop field dateone, datetwo from DATA;
store DATA into [lib://test/test.csv] (txt, delimiter is ';');