Skip to main content
Announcements
July 15, NEW Customer Portal: Initial launch will improve how you submit Support Cases. IMPORTANT DETAILS
cancel
Showing results for 
Search instead for 
Did you mean: 
Not applicable

Remove duplicate data in field

Hi There,

Been battling with this issue for a while and hoping someone can help.  I have the below script I'm using to pull data from a spreadsheet.  Within this data, there's a "Cache_Reference_Number.  What I'm trying to do is ignore rows where the Cache Reference number is duplicated.

MYDATA:

LOAD Cache_Id,

     Cache_Crt_Date,

     Cache_Broker_Code,

     Cache_Reference_Number,

     Cache_Incoming

    

From   

C:\Users\Nathan\Documents\ExternalData.xlsx

(ooxml, embedded labels, table is Sheet1);

For example, the below are some Cache Reference Numbers from the file.  I'd want to ignore all the "18312102" rows except for the last one.

18312100
18312102
18312151
18312102
18312102
18312103

Any help on this would be appreciated

1 Solution

Accepted Solutions
swuehl
MVP
MVP

If you say 'except for the last one' you are referring to the input order of MYDATA or maybe the Cache_Crt_Date, right?

Either way, you should be able to resident load your data with your records ordered descending, and use a where not exists() clause to only load the 'last' reference number. Something along these lines:

INPUT:

LOAD  Cache_Reference_Number as  Cache_Reference_Number_Full,

           RecNo() as ID

INLINE [

Cache_Reference_Number

18312100

18312102

18312151

18312102

18312102

18312103

];

RESULT:

LOAD  Cache_Reference_Number_Full as  Cache_Reference_Number,

            ID

Resident INPUT

          where not exists( Cache_Reference_Number,  Cache_Reference_Number_Full)

          order by ID desc ;

drop table INPUT;

View solution in original post

2 Replies
swuehl
MVP
MVP

If you say 'except for the last one' you are referring to the input order of MYDATA or maybe the Cache_Crt_Date, right?

Either way, you should be able to resident load your data with your records ordered descending, and use a where not exists() clause to only load the 'last' reference number. Something along these lines:

INPUT:

LOAD  Cache_Reference_Number as  Cache_Reference_Number_Full,

           RecNo() as ID

INLINE [

Cache_Reference_Number

18312100

18312102

18312151

18312102

18312102

18312103

];

RESULT:

LOAD  Cache_Reference_Number_Full as  Cache_Reference_Number,

            ID

Resident INPUT

          where not exists( Cache_Reference_Number,  Cache_Reference_Number_Full)

          order by ID desc ;

drop table INPUT;

Not applicable
Author

Hey there, thanks so much for the response.  I ended up doing the below, ordering the Cache_Reference_Number and then using Peek, but I think I prefer your implementation.  Thanks for the help.

NoConcatenate

MYNEWDATA:

Load *

Resident MYDATA Order by Cache_Reference_Number;

Drop table MYDATA;

NoConcatenate

DuplicatesRemoved:

Load *

Resident MYNEWDATA

Where Peek(Cache_Reference_Number)<>Cache_Reference_Number;

drop table MYNEWDATA;