Hello,
As part of a upstream data cleansing exercise, I am checking the records we are looking for duplicates.
Whilst I am aware of Load Distinct, I am not aware of its opposite
Does it exist?
In the meantime I am using the following script.
Load1:
LOAD
recno() as Recno,
*
FROM (qvd);
NoConcatenate
Load2:
Load *
where DuplicateFlag=1;
Load Distinct
*,
If(Recno=1, 0, if((Recno - previous(Recno))=1, 0,1)) as DuplicateFlag
Resident Load1
order by Recno;
Drop table Load1;
If a more elegant way exists then I am open to suggestions.
Many thanks,
Colin
Fields = ;
n = NoOfFields('TABLE1');
FOR i=1 TO n
Fields = Fields & '[' & FieldName(i,'TABLE1') & ']' & If( i=n , '', '& | &' );
NEXT
n=;i=;
.......
.......
AutoNumber( $(Fields) ) AS T1#
.....
.....
.....
Fields=;
You might get better performance and less memory usage by using AutoNumberHash256 instead of AutoNumber:
AutoNumberHash256( Field1, Field2, Field3 .... )
Change the code above from using a '& | & ' to using a comma:
Fields = Fields & '[' & FieldName(i,'TABLE1') & ']' & If( i=n , '', ',' );
try to create 2 tables then experiment with where exists and where not exists.
refer to this link: Where not Exists and Exits
best wishes.
This is an example on how to do it with WHERE Exists:
T1:
LOAD
AutoNumber(X & '|' & Y) AS T1#
FROM [lib://DATA/duplicates.xlsx]
(ooxml, embedded labels, table is TABLE1);
T2:
LOAD
X,
Y
FROM [lib://DATA/duplicates.xlsx]
(ooxml, embedded labels, table is TABLE2)
WHERE
Exists(T1#,AutoNumber(X & '|' & Y));
DROP TABLE T1;
Thank you both.
The only drawback is my dataset has a lot of fields in it.
Is there a quick way to automatically include all the field names inside the Autonumber?
he only method I can think of is to:
A For loop to create a new table with two columns - the original names and the names without spaces
Put all the names without spaces into a variable using concat
Put that variable into the autonumber when creating the new table for the where exists purposes
Applymap back the original names. The duplicates are in a final table
Is just seems like a lot of hard work.
Many thanks,
Colin
Fields = ;
n = NoOfFields('TABLE1');
FOR i=1 TO n
Fields = Fields & '[' & FieldName(i,'TABLE1') & ']' & If( i=n , '', '& | &' );
NEXT
n=;i=;
.......
.......
AutoNumber( $(Fields) ) AS T1#
.....
.....
.....
Fields=;
You might get better performance and less memory usage by using AutoNumberHash256 instead of AutoNumber:
AutoNumberHash256( Field1, Field2, Field3 .... )
Change the code above from using a '& | & ' to using a comma:
Fields = Fields & '[' & FieldName(i,'TABLE1') & ']' & If( i=n , '', ',' );
Petter,
Thank you very much for your expertise.
That has solved this conundrum nicely.