Hello,
I have a job which needs the latest file from a directory with a lot of files (over 56,000, of which about 700 match the filemask I am searching for).
The file I need can be searched for and contains a date/time stamp in the file (though not always from today or yesterday).
On a local disk it runs adequately (it finds the file in about 2 seconds), but if I try it on a Windows share which holds the files it takes over 40 minutes. What's wrong with it?
The filename I'm searching for is : "test." + context.customernumber + "*.txt"
The tFileList settings are sorted by date desc, followed by an iterate to a tJava which sets a global variable if it's unset and otherwise does nothing (so this way I get the latest file).
I have also tried sorting by date asc and then keeping the last iteration, but the time remains the same.
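To make the "first iteration wins" logic concrete, here is a minimal sketch in plain Java. It is not the actual tJava code from the job: the map is only a stand-in for Talend's globalMap, and the key name latestFile and the file names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FirstIterationWins {

    // Hypothetical key name for the global variable set by the tJava.
    public static final String LATEST_KEY = "latestFile";

    /** Stores the path only if nothing has been stored yet; later calls do nothing. */
    public static void onIteration(Map<String, Object> globalMap, String currentFilePath) {
        if (globalMap.get(LATEST_KEY) == null) {
            globalMap.put(LATEST_KEY, currentFilePath);
        }
    }

    public static void main(String[] args) {
        // Stand-in for Talend's globalMap (assumption for this sketch).
        Map<String, Object> globalMap = new HashMap<>();

        // Hypothetical file names, in the order a "date desc" listing would deliver them.
        List<String> newestFirst = Arrays.asList(
                "test.1001_c.txt", "test.1001_b.txt", "test.1001_a.txt");

        for (String path : newestFirst) {
            onIteration(globalMap, path);
        }
        System.out.println(globalMap.get(LATEST_KEY));
    }
}
```

With a descending sort only the first iteration has any effect, so every later iteration is wasted work; the result still depends entirely on the list arriving in sorted order.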
The setup :
Client (which runs Talend): Win7
Server (which has the files on a samba share): Windows 2003 Server
I am almost desperate enough to create a subjob which gets the complete filelisting unsorted and then sort them in the subjob. But I don't think this is the correct way to go.
Hi
Welcome to Talend Community!
Could you explain your job logic in detail?
I need to know what the job will do if there is a latest file. Will it copy this file or move it?
Sometimes when a Talend job tries to handle a file (e.g. Excel) which is opened by another user, the job will wait until the file is no longer in use.
Or am I missing some detail?
Regards,
Pedro
Job logic is pretty simple (in the test job I created for finding the problem):
There are some things coming from a context.
tJava_1 : if the context property is null then the filemask should be different from when the context property is filled.
tFileList_1 : search for all files with the filemask specified in the tJava_1 property (this takes 30 minutes in this example)
tJava_2 : print the last record found
tFileExist_1 : the start of the job if there is a last file.
In this example I was searching without a context property so the filemask should be : class.*
The file-specs are :
Total files : 16655
class.* files : 486
I don't see where the 30 minutes goes in this job. So there is no opening / closing of files involved. All the files on the server are closed.
Ok, some further investigation revealed that tFileList with sort options set is terribly slow. It's about 100x faster to build tFileList (without sorting) -> tFileInfo -> tSortRow than to use the sorting possibilities in the tFileList settings.
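For comparison outside Talend, the same idea (list once without sorting, then pick the newest entry) can be sketched in plain Java. This is only an illustration and not what tFileList does internally; the class name LatestFileFinder and the method findLatest are made up for this sketch. The assumption is that reading each file's attributes exactly once keeps the number of round trips to the share low, which may or may not be the reason the built-in sort is slow.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;
import java.util.Optional;

public class LatestFileFinder {

    /**
     * Returns the most recently modified regular file in dir whose name
     * matches the glob (e.g. "class.*"), or empty if nothing matches.
     * Each file's attributes are read once; no sorting is done.
     */
    public static Optional<Path> findLatest(Path dir, String glob) throws IOException {
        Path newest = null;
        FileTime newestTime = null;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path candidate : stream) {
                // One attribute read per file covers both the type and the timestamp.
                BasicFileAttributes attrs =
                        Files.readAttributes(candidate, BasicFileAttributes.class);
                if (!attrs.isRegularFile()) {
                    continue;
                }
                FileTime modified = attrs.lastModifiedTime();
                if (newestTime == null || modified.compareTo(newestTime) > 0) {
                    newest = candidate;
                    newestTime = modified;
                }
            }
        }
        return Optional.ofNullable(newest);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: LatestFileFinder <directory> <glob>");
            return;
        }
        Optional<Path> latest = findLatest(Paths.get(args[0]), args[1]);
        System.out.println(latest.map(Path::toString).orElse("no matching file"));
    }
}
```

A single pass that keeps only the current maximum avoids sorting altogether, which is enough when only the latest file is needed. Inside a job, the tFileList -> tFileInfo -> tSortRow chain remains the Talend-native way to get the same result.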
tFileList seems to have some problem when working with network paths... I have a directory containing about 2k files and tFileList freezes in spite of the very good latency time of the connection... I suppose it is a bug?
@nc : What kind of network connection are you using? I was using windows-UNC paths (so I guess it uses the SMB-components).
If you are using FTP or some other network connection the problem may be somewhere else...
I'm using a standard Windows UNC path such as "\\serverName.domainName.local\sharedDirectory". When I open the UNC path in Windows Explorer I see the list of files in a flash and I'm able to walk through each directory without any delay... In spite of the above, when I try to print the directory list with a simple job such as "tFileList->tLogRow" I have to wait many minutes...
Just one note: the "order by" and "order action" settings are left at their default values. I didn't describe the simple test job accurately: it's "tFileList->tIterateToFlow->tLogRow". Thanks, N.