Hi Community,
I have many CSV files spread across several directories, and some file names appear in more than one directory. I want to read each file name only once: if a name is duplicated, only one of the copies should be read.
Example:
D:\test\a\ abc.csv, 123.csv, yud.csv
D:\test\b\ rd.csv, xy.csv
D:\test\ abc.csv, fty.csv
In the example above, abc.csv exists in two locations. I want to read only one of those two copies.
Please help.
Thanks,
Sravanth
You need to store the names of the files you have already processed. Where you store them (memory, a file, or a database) depends on whether or not you want this de-duplication to persist across runs of your Job.
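For the in-memory case, the idea can be sketched in plain Java as below. This is only an illustration, not a Talend component configuration: the class name `UniqueCsvFinder` and the method `findUnique` are made up for the example, and it assumes that two files count as duplicates when their names match regardless of case. The copy that is kept is simply the first one in sorted path order.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical helper: keeps one path per distinct CSV file name.
public class UniqueCsvFinder {

    /** Walks root recursively and returns one path per distinct CSV file name. */
    public static List<Path> findUnique(Path root) throws IOException {
        // Key: lower-cased file name (assumption: names are compared ignoring case).
        Map<String, Path> firstByName = new LinkedHashMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> nameOf(p).endsWith(".csv"))
                 .sorted() // fixed order, so the same copy is chosen on every run
                 .forEachOrdered(p -> firstByName.putIfAbsent(nameOf(p), p));
        }
        return new ArrayList<>(firstByName.values());
    }

    // File name without its directory, lower-cased for comparison.
    private static String nameOf(Path p) {
        return p.getFileName().toString().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : findUnique(root)) {
            System.out.println(p); // each printed path would be read once
        }
    }
}
```

Because the names live only in a map, nothing is remembered once the run ends, so this covers the "memory" option only.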
A database table of processed files may be the sensible option. You can insert each file into the table once it has been processed successfully, and check the table each time you pick up a new one.
If you don't have a database to hand, SQLite is a good choice; I always use it for this type of activity.
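The check-then-insert pattern itself can be tried out without any database. The sketch below deliberately swaps the database table for a plain text file with one processed file name per line, so it is not SQLite and not a Talend component; the class `ProcessedFileLog` and its methods are hypothetical names chosen for the example. With a real table, `isProcessed` would become a lookup query and `markProcessed` an insert.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a "processed files" table: one file name per line.
public class ProcessedFileLog {
    private final Path logFile;
    private final Set<String> processed = new HashSet<>();

    /** Loads the names recorded by earlier runs, if the log file already exists. */
    public ProcessedFileLog(Path logFile) throws IOException {
        this.logFile = logFile;
        if (Files.exists(logFile)) {
            processed.addAll(Files.readAllLines(logFile, StandardCharsets.UTF_8));
        }
    }

    /** The "check" step: has a file with this name been processed before? */
    public boolean isProcessed(String fileName) {
        return processed.contains(fileName);
    }

    /** The "insert" step: call this only after the file was processed successfully. */
    public void markProcessed(String fileName) throws IOException {
        if (processed.add(fileName)) {
            // Append the new name so that later runs can see it.
            Files.write(logFile,
                    Collections.singletonList(fileName),
                    StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
}
```

A run would call `isProcessed("abc.csv")` before reading a file and `markProcessed("abc.csv")` afterwards, so the second abc.csv it meets, in this run or a later one, is skipped.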
Hi Alan, Thanks for the reply. Could you please explain this in terms of a Talend implementation? Please show me which components I should use, and in what sequence, with a screenshot. Thanks, Sravanth