Hello,
I need to loop over a large number of files (about 10,000) based on a configuration file. The configuration file is a text file with about 400 entries (400 lines). I need to extract the first word and the third word from each line (the words are separated by spaces). Based on these two words, I need to search the entire folder of 10,000 files (filenames only) to find the file or files that match the current configuration file entry. Only if there is a match can I move the file to a different location.
I have used tFileInputDelimited to extract the required data from the configuration file, and tFileList and tFileProperties to get each filename. But this approach is taking a really long time.
Is there a more efficient way to get this job done? I cannot move files that have already been checked (but did not match the current entry) to another folder, because subsequent entries in the configuration file may match them.
Thanks,
Smaria
Loading the data from your configuration file won't take long, so I presume it is the gathering of file data with tFileList that is taking the time. I am not sure which system you are running this on, but have you thought of using the operating system's file indexing and doing a command-line search with something like the tSystem or tSSH component? You could build the search dynamically from the configuration data, run it with tSystem or tSSH, and then use the response to identify the files to work on.
tFileList uses the default JDK implementation, and it can get terribly slow when you're dealing with a large number of files. For example, when I had to search for files inside folders, it was orders of magnitude faster to use two tFileList components: one that loops over the folders and another that searches those folders one by one.
If I remember correctly, the tFileList filter is applied after the listing.
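If the filtering step does turn out to be the slow part, one option is to do the listing in plain Java (for example in a custom routine) and let the directory stream apply a glob while it iterates. The following is only a sketch: the class name `GlobListing` and the method `listNames` are made up for illustration, and whether this beats tFileList on a given system would need to be measured.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class GlobListing {

    /**
     * Returns the names of the entries in dir whose file name matches the glob.
     * The glob is applied by the directory stream while it iterates, so
     * non-matching names are never added to the result list.
     */
    static List<String> listNames(Path dir, String glob) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        }
        return names;
    }
}
```

A call such as `GlobListing.listNames(dir, "ABC*X1*")` would return only names that start with `ABC` and contain `X1` later on. How the two configuration words map onto a glob depends on your naming scheme, so the pattern here is an assumption.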
If I understand correctly, you want to find 400 needles in a haystack. Your current approach is likely one needle in one haystack, repeated 400 times.
What I'd try to do is:
As you can see, I've not used tFileProperties, as that is simply overkill just to get the filename. My approach searches the haystack once: for each item we find, we check whether it is one we need and act accordingly.
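To make the single-pass idea concrete, here is a minimal sketch in plain Java, outside of any Talend component. It loads the 400 entries into memory, lists the folder once, and checks each filename against all entries. The names `SinglePassMatcher`, `Entry`, `loadEntries`, `findMatches` and `moveAll` are hypothetical, and the matching rule (the filename contains both words) is an assumption, since the thread does not say how the two words relate to the filename.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class SinglePassMatcher {

    /** Hypothetical holder for the first and third word of one configuration line. */
    static final class Entry {
        final String first;
        final String third;
        Entry(String first, String third) { this.first = first; this.third = third; }
    }

    /** Reads the configuration file; lines with fewer than three words are skipped. */
    static List<Entry> loadEntries(Path configFile) throws IOException {
        List<Entry> entries = new ArrayList<>();
        for (String line : Files.readAllLines(configFile)) {
            String[] words = line.trim().split("\\s+");
            if (words.length >= 3) {
                entries.add(new Entry(words[0], words[2]));
            }
        }
        return entries;
    }

    /** ASSUMPTION: a file matches an entry when its name contains both words. */
    static boolean matches(String fileName, Entry e) {
        return fileName.contains(e.first) && fileName.contains(e.third);
    }

    /** Lists the folder once and checks every filename against all entries. */
    static List<Path> findMatches(Path sourceDir, List<Entry> entries) throws IOException {
        List<Path> matched = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(sourceDir)) {
            for (Path file : stream) {
                if (!Files.isRegularFile(file)) {
                    continue; // ignore sub-folders
                }
                String name = file.getFileName().toString();
                for (Entry e : entries) {
                    if (matches(name, e)) {
                        matched.add(file);
                        break; // one matching entry is enough to move the file
                    }
                }
            }
        }
        return matched;
    }

    /** Moves the matched files after the listing has finished. */
    static void moveAll(List<Path> files, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        for (Path f : files) {
            Files.move(f, targetDir.resolve(f.getFileName()), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Because every file is compared against all entries before anything is moved, the concern about a later entry matching an already-checked file does not arise. At this size the inner loop is at most about 400 x 10,000 string checks, which is normally small next to the cost of listing the folder repeatedly.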
Good luck. It's a nice little job and shouldn't be too hard to implement.