sushantk19
Creator

Extracting the latest files from Amazon S3 folder

Hello,

We have the following scenario in our project. We have an S3 bucket where we receive third-party files every hour. The number of files per hour varies from 2 to 5, depending on the volume of the data.

The requirement is to extract the latest .csv files every hour and load them through Talend into a Redshift database. Can someone suggest how we can extract ONLY the latest files from the S3 bucket, out of all the files kept there? I would appreciate any input.

10 Replies
MattE
Creator II

You can use tS3List to list all the files in a bucket but I'm not sure how you'd decide which are the 'latest'. Is there any sort of time/date in the bucket name or file name?

XJ_1630
Contributor III

If there is no information in the names to determine the latest files:

Keep a list of the files already available/processed, and use tS3List to determine which ones have arrived since the last run.

Caveat: this is very inefficient.
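
A minimal sketch of that bookkeeping idea in plain Java (the language Talend jobs are generated in). The class and method names here are hypothetical, not Talend components, and how the set of processed keys is persisted between runs (a file, a database table) is left open. It assumes the keys listed by tS3List have been collected into a list first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper: compares the current bucket listing with the
// keys that earlier runs have already handled.
class NewKeyFilter {

    // Returns the keys from listedKeys that are not in processedKeys,
    // preserving the order of the listing.
    static List<String> newKeys(List<String> listedKeys, Set<String> processedKeys) {
        List<String> fresh = new ArrayList<>();
        for (String key : listedKeys) {
            if (!processedKeys.contains(key)) {
                fresh.add(key);
            }
        }
        return fresh;
    }
}
```

After a successful load, the returned keys would be added to the processed set. The whole bucket still has to be listed on every run, which is where the inefficiency comes from.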

 

sushantk19
Creator
Author

@Matt Evans: Thanks. Yes, the file name contains the date and time, which change every hour. How can tS3List decide which file is the latest? What logic can we use there?

sushantk19
Creator
Author

@Xuan Junior: Yes, the file name contains the date and time, which change every hour. How can tS3List decide which file is the latest? What logic can we use there?

sushantk19
Creator
Author

@Xuan Junior: Any update on this? The file names look like qppxg6dy3oqo_2021-05-25T210000_8fd9627ba6f33235446f8fcb88ca7891_be822a.csv and qppxg6dy3oqo_2021-05-25T220000_8fd9627ba6f33235446f8fcb88ca7891_2853a9.csv for the files from 21:00 and 22:00.

sushantk19
Creator
Author

@Matt Evans: Any update on this? The file names look like qppxg6dy3oqo_2021-05-25T210000_8fd9627ba6f33235446f8fcb88ca7891_be822a.csv and qppxg6dy3oqo_2021-05-25T220000_8fd9627ba6f33235446f8fcb88ca7891_2853a9.csv for the files from 21:00 and 22:00.

XJ_1630
Contributor III

With that setup, you could try to retrieve the date from the filenames.

Did you try using the dates in the filenames?

 

sushantk19
Creator
Author

@Xuan Junior: I didn't completely understand you. If we hardcode the dates in the filenames, the process cannot be automated, and we need to automate it. Can you tell me how to select just the latest (hourly) files from the tS3List component?

MattE
Creator II

There's no easy or clean way that I can see to do this. tS3List can give you the name of each file in a bucket via the CURRENT_KEY after variable, but that's all. You could then extract the date and time from the filename in Java, perhaps using substring if you are certain the filename will always be in that format. Then build a list of the filenames and extracted times, sort them, and choose the most recent. Then use tS3Get to download only those files. But as I said, that method depends on the date and time always being in the same place in the filename.
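
A rough sketch of that approach in plain Java, for illustration only. The class and method names are hypothetical, and it assumes the keys from tS3List have already been collected into a list. It also assumes the timestamp is always the second underscore-separated field of the file name (as in the sample names above) and that all files from the same hour carry the same timestamp. It splits on "_" rather than using fixed substring offsets, and instead of sorting it makes a single pass that keeps every key sharing the newest timestamp.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Talend.
class LatestKeys {

    // Matches stamps such as 2021-05-25T210000 (taken from the sample file names).
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HHmmss");

    // Extracts the timestamp from a key; any folder prefix before the last '/' is ignored.
    static LocalDateTime timestampOf(String key) {
        String name = key.substring(key.lastIndexOf('/') + 1);
        String[] parts = name.split("_");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Unexpected file name: " + key);
        }
        // Throws DateTimeParseException if the field is not in the assumed format.
        return LocalDateTime.parse(parts[1], STAMP);
    }

    // Returns all keys that carry the most recent timestamp found in the list.
    static List<String> latest(List<String> keys) {
        LocalDateTime newest = null;
        List<String> result = new ArrayList<>();
        for (String key : keys) {
            LocalDateTime ts = timestampOf(key);
            if (newest == null || ts.isAfter(newest)) {
                newest = ts;      // found a newer hour: discard older keys
                result.clear();
            }
            if (ts.equals(newest)) {
                result.add(key);
            }
        }
        return result;
    }
}
```

The returned keys are the ones to hand to tS3Get. A file name that does not follow the assumed layout makes the parse fail with an exception rather than being silently skipped, which seems safer for an unattended hourly job.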