Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
colinodonnel
Creator II
Creator II

Load not distinct i.e duplicates only

Hello,

As part of a upstream data cleansing exercise, I am checking the records we are looking for duplicates.

Whilst I am aware of Load Distinct, I am not aware of its opposite

Does it exist?

In the meantime I am using the following script.

Load1:
LOAD
recno() as Recno,
*
FROM (qvd);

NoConcatenate
Load2:
Load *
where DuplicateFlag=1;

Load  Distinct
*,
If(Recno=1, 0, if((Recno - previous(Recno))=1, 0,1))  as DuplicateFlag
Resident Load1
order by Recno;

Drop table Load1;

If a more elegant way exists then I am open to suggestions.

Many thanks,

Colin

1 Solution

Accepted Solutions
petter
Partner - Champion III
Partner - Champion III

Fields = ;

n = NoOfFields('TABLE1');

FOR i=1 TO n

   Fields = Fields & '[' & FieldName(i,'TABLE1') & ']' & If( i=n , '', '& | &' );

NEXT

n=;i=;

.......

.......

        AutoNumber( $(Fields) ) AS T1#


.....

.....

.....


Fields=;

  

You might get better performance and less memory usage by using AutoNumberHash256 instead of AutoNumber:

       AutoNumberHash256( Field1, Field2, Field3 .... )

Change the code above from using a '& | & ' to using a comma:

   Fields = Fields & '[' & FieldName(i,'TABLE1') & ']' & If( i=n , '', ',' );

View solution in original post

5 Replies
wade12
Partner - Creator II
Partner - Creator II

try to create 2 tables then experiment with where exists and where not exists.

refer to this link: Where not Exists and Exits

best wishes.

petter
Partner - Champion III
Partner - Champion III

This is an example on how to do it with WHERE Exists:

T1:

LOAD

    AutoNumber(X & '|' & Y) AS T1#

FROM [lib://DATA/duplicates.xlsx]

(ooxml, embedded labels, table is TABLE1);

T2:

LOAD

    X,

    Y

FROM [lib://DATA/duplicates.xlsx]

(ooxml, embedded labels, table is TABLE2)

WHERE

  Exists(T1#,AutoNumber(X & '|' & Y));

 

DROP TABLE T1;



colinodonnel
Creator II
Creator II
Author

Thank you both.

The only drawback is my dataset has a lot of fields in it.

Is there a quick way to automatically include all the field names inside the Autonumber?

he only method I can think of is to:

A For loop to create a new table with two columns - the original names and the names without spaces

Put all the names without spaces into a variable using concat

Put that variable into the autonumber when creating the new table for the where exists purposes

Applymap back the original names. The duplicates are in a final table

Is just seems like a lot of hard work.

Many thanks,

Colin

petter
Partner - Champion III
Partner - Champion III

Fields = ;

n = NoOfFields('TABLE1');

FOR i=1 TO n

   Fields = Fields & '[' & FieldName(i,'TABLE1') & ']' & If( i=n , '', '& | &' );

NEXT

n=;i=;

.......

.......

        AutoNumber( $(Fields) ) AS T1#


.....

.....

.....


Fields=;

  

You might get better performance and less memory usage by using AutoNumberHash256 instead of AutoNumber:

       AutoNumberHash256( Field1, Field2, Field3 .... )

Change the code above from using a '& | & ' to using a comma:

   Fields = Fields & '[' & FieldName(i,'TABLE1') & ']' & If( i=n , '', ',' );

colinodonnel
Creator II
Creator II
Author

Petter,

Thank you very much for your expertise.

That has solved this conundrum nicely.