NMangaba_TAP
Contributor

Retrieving files in an S3 bucket using the latest modified date

Hi all,
Every 2 hours I get a new JSON file in an S3 bucket, and I have to take the latest modified file so that I can map it to the relevant SQL table for output. The file names differ, as they are generated depending on the day they are processed.

 

Examples:

Fri Mar 09 2018 11:22:54 GMT+0000 (UTC).json

Wed Mar 14 2018 10:09:15 GMT+0000 (UTC).json


Can someone help me implement this using Talend?
Thanks in advance.

Naledi

9 Replies
Anonymous
Not applicable

Hello,

To get the newest file, we first download the files with tS3Get, then list them locally and retrieve the properties of each one. We then sort the file properties by "mtime" (the last-modified time) and take the newest file for further processing.

1) tFileList: this component is configured to look for files.
2) tFileProperties: this component will retrieve the properties for each file. 
3) tBufferOutput: this component will store the file properties in memory so we can sort them once we've got info on all the files.
4) tBufferInput: this component will read from the buffer we populated with file property information
5) tSortRow: this component will sort the files by mtime descending (meaning the newest file will be first in the list)
6) tSampleRow: this component is how we grab only the first row coming out of tSortRow
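
Outside of Talend, the same selection logic can be sketched in plain Java. This is only an illustration of the steps above, assuming the files have already been downloaded to a local directory; the class and method names (`NewestLocalFile`, `newest`) are made up for this sketch. Taking the maximum by last-modified time is equivalent to sorting by mtime descending and keeping the first row.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical helper, not part of Talend: picks the most recently modified file.
class NewestLocalFile {

    // Returns the regular file in dir with the latest last-modified time,
    // or an empty Optional if the directory contains no regular files.
    static Optional<Path> newest(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                .filter(Files::isRegularFile)
                .max(Comparator.comparing(NewestLocalFile::lastModified));
        }
    }

    // Reads the mtime, wrapping the checked exception so it can be used in a stream.
    private static FileTime lastModified(Path file) {
        try {
            return Files.getLastModifiedTime(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `NewestLocalFile.newest(Paths.get("/some/download/dir"))` (the path is a placeholder) would give the file to pass on to the rest of the flow.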

[Image: 0683p000009LtFW.png]

Let us know if it is OK with you.

Best regards

Sabrina

 

LI1
Contributor

What if your files on S3 are large?

It's unrealistic to pull all files locally and then get properties.

Ideally, you could use tS3List to get the modified date as a parameter and then use it to decide which key to pull down locally.

 

It's a shame, as there is a tFTPFileProperties component too, but nothing equivalent for S3.
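
The idea of choosing the key from listing metadata, without downloading anything, can be sketched in plain Java. S3's object listing API does return a last-modified timestamp per key, but whether tS3List exposes it is not confirmed in this thread, so the sketch below simply assumes a list of (key, last modified) pairs is available from somewhere. `LatestS3Key` and `Entry` are hypothetical names for illustration, not Talend or AWS SDK types.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: selects which key to download, based on listing metadata only.
class LatestS3Key {

    // Assumed shape of one listing entry: the object key and its last-modified time.
    static final class Entry {
        final String key;
        final Instant lastModified;

        Entry(String key, Instant lastModified) {
            this.key = key;
            this.lastModified = lastModified;
        }
    }

    // Returns the key with the most recent last-modified time,
    // or an empty Optional if the listing is empty.
    static Optional<String> latestKey(List<Entry> listing) {
        return listing.stream()
            .max(Comparator.comparing((Entry e) -> e.lastModified))
            .map(e -> e.key);
    }
}
```

Only the single key returned by `latestKey` would then need to be fetched, which avoids pulling large files that will not be used.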

 

Vijay_K_N
Contributor

You said tFileList --> tFileProperties --> tBufferOutput --> tBufferInput --> tSortRow --> tSampleRow, which gives the file with the latest date. But how can we process that file? I want to pass the file path dynamically to the following flow.

Anonymous
Not applicable

@lli, you can store it in a context variable for later use in other components, e.g.:
tFileList --> tFileProperties --> tBufferOutput --> tBufferInput --> tSortRow --> tSampleRow --> tJavaRow
on tJavaRow:
context.filename=input_row.columnName;

Regards
Shong
Vijay_K_N
Contributor

 

Why do we need so many components? tFileList itself has options to order by modified date, ascending or descending. After that we can use tIterateToFlow --> tSampleRow to get the file with the latest date.

Anyway, thank you, my doubt is cleared. Here we are using tFileList, but how can we fetch the latest file from the S3 bucket? Please give me a high-level overview.

Anonymous
Not applicable

@lli, after you get the latest file name, you can download that file from S3 to the local system using tS3Get and then read it.
Vijay_K_N
Contributor

I didn't follow. Some files are in an S3 bucket, and I want to get the latest one. With tS3Get we can fetch a specific file, but I want the latest file. tFileList lists files from a local directory or folder, but how can we find the latest file in S3? There is no component like an S3 file list.

 

Anonymous
Not applicable

@lli, looking at the S3 components, there is no component that can be used to read the file properties and get the latest file.
If you don't know which file to download, you need to download all files in the specified bucket to the local system.
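
If the key names reliably encode the generation time, as in the examples from the original question, one possible alternative is to rank the keys by the timestamp parsed from the name. The sketch below assumes every key has exactly the form `Fri Mar 09 2018 11:22:54 GMT+0000 (UTC).json` with no folder prefix; `KeyTimestamp` is a hypothetical name, and the name timestamp may not match the object's actual modified date.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper: orders keys by the timestamp embedded in their names.
class KeyTimestamp {

    private static final String SUFFIX = ".json";

    // Matches names such as "Fri Mar 09 2018 11:22:54 GMT+0000 (UTC)".
    private static final DateTimeFormatter NAME_FORMAT =
        DateTimeFormatter.ofPattern("EEE MMM dd uuuu HH:mm:ss 'GMT'Z '(UTC)'", Locale.ENGLISH);

    // Parses the timestamp from a key; throws DateTimeParseException on other formats.
    static OffsetDateTime parse(String key) {
        String stem = key.substring(0, key.length() - SUFFIX.length());
        return OffsetDateTime.parse(stem, NAME_FORMAT);
    }

    // Returns the .json key whose embedded timestamp is the latest, if any.
    static Optional<String> latest(List<String> keys) {
        return keys.stream()
            .filter(k -> k.endsWith(SUFFIX))
            .max(Comparator.comparing(KeyTimestamp::parse));
    }
}
```

The resulting key could then be handed to the download step, so only one file is fetched.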


Regards
Shong
Vijay_K_N
Contributor

Yes, thank you. After downloading all the files, we can apply the logic to get the file with the latest modified date.