Anonymous
Not applicable

Remove files that have the same content

Hello ! 

 

I'd like to remove all files that have the same content, keeping only one copy of each.

In the end, every remaining file should have different content.

My file name format is fileName_timestamp.csv.

For example:

My directory looks like this:

- fileName_t1m3st4mp.csv

- fileName_0th3rt1m3st4mp.csv

- fileName_4n0th3rt1m3st4mp.csv

 

The content of my files looks like this:

 

fileName_t1m3st4mp.csv

This is a content

fileName_0th3rt1m3st4mp.csv

This is a content

fileName_4n0th3rt1m3st4mp.csv

This is a different content

 

When I run the job:

fileName_0th3rt1m3st4mp.csv should be deleted

 

Now my directory should only contain:

- fileName_t1m3st4mp.csv

- fileName_4n0th3rt1m3st4mp.csv

 

I'm using Talend ESB 7.

 

If you have any suggestions, please share them!

 

Thanks !

1 Solution

Accepted Solutions
akumar2301
Specialist II

0683p000009M1tF.jpg

It worked and removed the duplicate files. Give it a try.


8 Replies
akumar2301
Specialist II

Try tFileList, tMemorizeRows, tFileCompare and tFileDelete.

 

I'm not sure whether these are part of ESB.

Anonymous
Not applicable
Author

Thanks for your response!

Those components are indeed in ESB.

 

I need to compare each file with all the others, and I'm not sure how to do that with a tFileCompare component since it only allows one input.

Can you walk me through your thinking?

 

Best regards,

 

akumar2301
Specialist II

You are right, with tFileCompare you might have some issues.

1) You actually need to get the checksum of each file using tFileProperties (MD5 option).

2) Find the files that have the same checksum and delete the duplicates.

tFileList --> tFileProperties (MD5 option) --> tFileOutput

OnSubjobOk

tFileInput --> tUniqRow (get duplicate file names based on checksum) --> tFlowToIterate --> tFileDelete

This should work.
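
For readers who want to see the logic outside of Talend, the two steps above can be sketched in plain Python. This is only an illustration of the checksum idea, not the Talend job itself: the function names `md5_of_file` and `remove_duplicate_files`, the `*.csv` pattern and the `dry_run` flag are all assumptions made for this example, and the copy that survives is simply the first one in file-name order.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def md5_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the MD5 hex digest of the file's bytes (the name is ignored)."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def remove_duplicate_files(
    directory: str, pattern: str = "*.csv", dry_run: bool = True
) -> List[Path]:
    """Keep the first file seen per checksum; delete (or only report) the rest."""
    seen: Dict[str, Path] = {}
    duplicates: List[Path] = []
    # Sorting by name makes the choice of the surviving copy deterministic.
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        checksum = md5_of_file(path)
        if checksum in seen:
            duplicates.append(path)
            if not dry_run:
                path.unlink()  # hypothetical stand-in for tFileDelete
        else:
            seen[checksum] = path
    return duplicates
```

Calling `remove_duplicate_files("/path/to/dir")` with the default `dry_run=True` only lists the files that would be deleted, which is a reasonable first run before passing `dry_run=False`.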
Anonymous
Not applicable
Author

Here you're mainly checking the file name, not the actual content.

 

I think I found something. I can log the content and the file name independently, but I can't find a way to get both of them at the same time.

My goal here is to get an output that contains every file name together with its content (fileName;fileContent).

I guess I'll be able to use a tUniqRow to check for duplicate content once I've figured this out.

0683p000009M1tA.png
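
The (fileName;fileContent) idea described in this post can be sketched in Python as follows. This is a rough illustration and not a Talend job: the helper names `read_name_content_rows` and `first_name_per_content`, the `*.csv` pattern and the UTF-8 encoding are assumptions made for the example. It pairs each file name with its full content, then keeps the first name seen for each distinct content, which is roughly what a unique-rows step keyed on the content column would do.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def read_name_content_rows(directory: str, pattern: str = "*.csv") -> List[Tuple[str, str]]:
    """Build (fileName, fileContent) rows, one per matching file, in name order."""
    rows = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            # Assumes the files are UTF-8 text; adjust the encoding if not.
            rows.append((path.name, path.read_text(encoding="utf-8")))
    return rows


def first_name_per_content(rows: List[Tuple[str, str]]) -> List[str]:
    """Return the names to keep: the first file seen for each distinct content."""
    kept: Dict[str, str] = {}
    for name, content in rows:
        # setdefault leaves the first name in place when the content repeats.
        kept.setdefault(content, name)
    return list(kept.values())
```

Holding whole file contents in memory is fine for small CSV files, but for large ones comparing a checksum of the content is usually the lighter option.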

akumar2301
Specialist II

tFileProperties computes the checksum from the file content, not from the file name.
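
The same point can be checked outside of Talend with a small Python sketch: an MD5 digest is computed from the bytes of a file, so two files with different names and identical content get the same checksum. The helper names `file_md5`, `is_true_duplicate` and `demo` are made up for this illustration, and the extra byte-for-byte comparison is an optional safeguard added here, not something the thread's job does.

```python
import filecmp
import hashlib
import tempfile
from pathlib import Path
from typing import Tuple


def file_md5(path: Path) -> str:
    """MD5 hex digest of the file's bytes; the file name plays no part."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def is_true_duplicate(first: Path, second: Path) -> bool:
    """True when the checksums match and a byte-for-byte comparison agrees."""
    if file_md5(first) != file_md5(second):
        return False
    # shallow=False makes filecmp compare the contents, not just os.stat() data.
    return filecmp.cmp(first, second, shallow=False)


def demo() -> Tuple[bool, bool]:
    """Recreate the three example files from the question in a temp directory."""
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "fileName_t1m3st4mp.csv"
        second = Path(tmp) / "fileName_0th3rt1m3st4mp.csv"
        third = Path(tmp) / "fileName_4n0th3rt1m3st4mp.csv"
        first.write_text("This is a content")
        second.write_text("This is a content")
        third.write_text("This is a different content")
        return is_true_duplicate(first, second), is_true_duplicate(first, third)
```

Here `demo()` compares the first file against the other two, so the first result covers the identical pair and the second covers the pair with different content.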
akumar2301
Specialist II

0683p000009M1tF.jpg

It worked and removed the duplicate files. Give it a try.

akumar2301
Specialist II

Did it solve your problem?
Anonymous
Not applicable
Author

Which component did you rename to "selectMD5Option"?
I'll try that.