hello,
We have the scenario below in our project. We have an S3 bucket where we receive third-party files every hour. The number of files varies from 2 to 5 per hour, depending on the volume of data.
The requirement is to extract the latest .csv files every hour and load them into a Redshift database through Talend. Can someone suggest how we can extract ONLY the latest files from the S3 bucket, out of all the files kept there? I would appreciate any input.
You can use tS3List to list all the files in a bucket but I'm not sure how you'd decide which are the 'latest'. Is there any sort of time/date in the bucket name or file name?
if there is no information to determine the latest files:
keep a list of the files available/processed
and, use the tS3List to determine which ones arrived since last time
caveat - this is very inefficient
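The bookkeeping idea above can be sketched roughly as follows. This is only an illustration: `ProcessedKeyTracker` and its methods are made-up names, the set lives in memory, and a real job would presumably persist the processed keys somewhere (a file or a table) between hourly runs. The listed keys are assumed to come from tS3List.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: remembers which S3 keys were already handled.
class ProcessedKeyTracker {
    // In-memory only (assumption); persist this between runs in a real job.
    private final Set<String> processed = new HashSet<>();

    // Returns the keys from the current listing that were not processed before.
    List<String> newArrivals(List<String> listedKeys) {
        List<String> fresh = new ArrayList<>();
        for (String key : listedKeys) {
            if (!processed.contains(key)) {
                fresh.add(key);
            }
        }
        return fresh;
    }

    // Call this once a file has been loaded successfully.
    void markProcessed(String key) {
        processed.add(key);
    }
}
```

The inefficiency mentioned above is still there: every run lists the whole bucket and compares it against everything processed so far.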
@Matt Evans: Thanks. Yes, the date and time in the file name change every hour. How can tS3List decide which file is the latest? What logic can we use there?
@Xuan Junior: Yes, the date and time in the file name change every hour. How can tS3List decide which file is the latest? What logic can we use there?
@Xuan Junior: Any update on this? The file names look like qppxg6dy3oqo_2021-05-25T210000_8fd9627ba6f33235446f8fcb88ca7891_be822a.csv and
qppxg6dy3oqo_2021-05-25T220000_8fd9627ba6f33235446f8fcb88ca7891_2853a9.csv for the files from 21:00 and 22:00.
@Matt Evans: Any update on this? The file names look like qppxg6dy3oqo_2021-05-25T210000_8fd9627ba6f33235446f8fcb88ca7891_be822a.csv and
qppxg6dy3oqo_2021-05-25T220000_8fd9627ba6f33235446f8fcb88ca7891_2853a9.csv for the files from 21:00 and 22:00.
With that setup, you could try to retrieve the date from the filenames.
Did you try using the dates in the filenames?
@Xuan Junior: I didn't understand you completely. If we hardcode the dates in the filenames, the process cannot be automated, and we need to automate it. Can you tell me how to select just the latest (hourly) files from the tS3List component?
There's no easy or clean way that I can see to do this. tS3List can give you the name of each file in a bucket via the CURRENT_KEY after variable, but that's all. You could then extract the date and time from the filename in Java, perhaps using substring if you are certain the filename will always be in that format. Then build a list of the filenames and extracted times, sort them and choose the most recent. Then use tS3Get to download only those files. But as I said, that method is dependent on the date and time always being in the same place in the filename.
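To make that a bit more concrete, here is a rough Java sketch of the extraction step, based only on the two sample filenames posted above. The class and method names (`LatestS3Files`, `extractTimestamp`, `latestKeys`) are invented for illustration, and it assumes the keys have already been collected into a list from tS3List's CURRENT_KEY. It uses a regex rather than substring so it does not depend on the prefix length, and it keeps every key sharing the newest timestamp in a single pass instead of sorting, since 2 to 5 files can arrive in the same hour.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: picks the keys carrying the newest timestamp.
class LatestS3Files {
    // Assumes the stamp looks like _2021-05-25T210000_ somewhere in the key.
    private static final Pattern STAMP =
            Pattern.compile("_(\\d{4}-\\d{2}-\\d{2}T\\d{6})_");
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HHmmss");

    // Returns the timestamp embedded in the key, or empty if none is found.
    static Optional<LocalDateTime> extractTimestamp(String key) {
        Matcher m = STAMP.matcher(key);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(LocalDateTime.parse(m.group(1), FORMAT));
        } catch (DateTimeParseException e) {
            // Digits matched but do not form a valid date/time: skip the key.
            return Optional.empty();
        }
    }

    // Returns all keys that share the most recent timestamp.
    static List<String> latestKeys(List<String> keys) {
        LocalDateTime newest = null;
        List<String> latest = new ArrayList<>();
        for (String key : keys) {
            Optional<LocalDateTime> ts = extractTimestamp(key);
            if (!ts.isPresent()) {
                continue; // ignore keys without a recognisable timestamp
            }
            LocalDateTime t = ts.get();
            if (newest == null || t.isAfter(newest)) {
                newest = t;
                latest.clear();
                latest.add(key);
            } else if (t.equals(newest)) {
                latest.add(key);
            }
        }
        return latest;
    }
}
```

The keys returned by `latestKeys` would then be the ones to hand to tS3Get. As noted above, this only works while the date and time stay in that format in the filename.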