smaria
Contributor

Looping over a large number of files

Hello,

I need to loop over a large number of files (about 10,000) based on a configuration file.

The configuration file is a text file with about 400 entries (400 lines). I need to extract the first word and the third word in each line (the words are separated by spaces).

Based on these two words for each entry, I need to search the entire folder of 10,000 files (filenames only) to find the file or files that match the current configuration file entry. Only if there is a match can I move the file to a different location.

I have used tFileInputDelimited to extract the required data from the configuration file, and I have used tFileList and tFileProperties to get each filename. But this approach is taking a really long time.

Is there a more efficient way to get this job done? I cannot move files that have already been checked without a match to another folder, because a subsequent entry in the configuration file may be a match for the moved file.

Thanks,

Smaria

2 Replies
Anonymous
Not applicable

The loading of the data in your file won't take a long time, so I am presuming that it is the gathering of file data using the tFileList that is taking the time. I am not sure of what system you are doing this on, but have you thought of using the indexing of your files on your machine carried out by the operating system and doing a command line search using something like the tSystem or tSSH component? You could dynamically build your search from the file data, search using the tSystem or tSSH component and then use the response to identify the file to be worked on.

Anonymous
Not applicable

tFileList uses the default JDK implementation, and it can get terribly slow when you're dealing with a large number of files. For example, when I had to search for files inside folders, it was orders of magnitude faster to have two tFileList components: one that loops over the folders and another that searches those folders one by one.

 

If I remember correctly, the tFileList filter is applied after the listing.

 

If I understand right you want to find 400 needles in a haystack. Likely your current approach is 1 needle in 1 haystack repeated 400 times.

What I'd try to do is:

 

  • Prepare an index of the filenames / full path in advance. ( tFileList -> tIterateToFlow -> t*Output ) (Maybe tHSQLDbOutput is an overkill but it might help for filtering, otherwise CSV is fine)
  • Open your CSV (the configuration file) and define 3 columns with space as the delimiter (we only care about the 1st and 3rd, so there is no reason to read any more).
  • In a tMap do a lookup to the filenames and do the join accordingly (maybe a filter can match multiple files)
    • HSQL approach: We do the filter via SQL:
      • In the lookup, make sure you set "Reload at each row" and put col1 and col3 into the globalMap
      • Write an SQL query that returns the file path, e.g. "select '" + (String)globalMap.get("col1") + "' as key, filePath from myTable where filename like '%" + (String)globalMap.get("col1") + "%'"
    • CSV approach you have to define the join condition inside the tMap. There's a trick you can use:
      • row2.fileName.length() != row2.fileName.toLowerCase().replace(row1.col1.toLowerCase(), "").length()
      • Basically if we have a match, our length after the replace will be shorter, hence the condition becomes true 🙂
  • Then from the output(s) of tMap you can handle your files accordingly.
    • I'd probably also enable Catch Inner Join rejects on one of my outputs, which should list all the entries of your CSV for which there were no matches 😉
    • INNER join result should give you the filename you're looking for
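
To make the length trick from the CSV approach concrete, here is a minimal Java sketch of the same check outside of Talend. The class and method names are hypothetical; in a tMap the body would be written inline as the join or filter expression. It compares the lowered filename with itself after the key has been removed, which for a non-empty key amounts to a case-insensitive "contains" test.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not a Talend API.
class ContainsByLengthSketch {

    // Returns true when fileName contains key, ignoring case.
    // If the key occurs in the name, removing it makes the string shorter.
    static boolean matches(String fileName, String key) {
        String lowered = fileName.toLowerCase(Locale.ROOT);
        String stripped = lowered.replace(key.toLowerCase(Locale.ROOT), "");
        return stripped.length() != lowered.length();
    }

    public static void main(String[] args) {
        System.out.println(matches("Report_ACME_2020.csv", "acme"));  // prints true
        System.out.println(matches("Report_OTHER_2020.csv", "acme")); // prints false
    }
}
```

An empty key never matches here, because removing an empty string leaves the length unchanged.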

 

As you can see, I've not used tFileProperties, as that is simply overkill just to get the filename. My approach walks the haystack once and, for each file found, checks whether it's one we need and acts accordingly.
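
For anyone who wants to see the idea outside of Talend components, the following is a rough Java sketch of it: list the folder once, parse the configuration lines once, and do all the matching in memory. Everything here is an assumption for illustration: the class and method names are made up, and the matching rule (the filename contains both the 1st and the 3rd word, ignoring case) is a guess, since the original question does not say exactly how the words relate to the filenames.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; names and the matching rule are assumptions.
class OnePassMatchSketch {

    // Keep the 1st and 3rd space-separated word of each line; skip short lines.
    static List<String[]> parseConfig(List<String> lines) {
        List<String[]> needles = new ArrayList<>();
        for (String line : lines) {
            String[] words = line.trim().split("\\s+");
            if (words.length >= 3) {
                needles.add(new String[] {words[0], words[2]});
            }
        }
        return needles;
    }

    // List the folder a single time and keep only the names of regular files.
    static List<String> listFileNames(Path folder) throws IOException {
        try (Stream<Path> entries = Files.list(folder)) {
            return entries.filter(Files::isRegularFile)
                          .map(p -> p.getFileName().toString())
                          .collect(Collectors.toList());
        }
    }

    // For each config entry, collect the names that contain both words
    // (case-insensitive). Entries with no hit map to an empty list, which
    // plays the role of the inner join rejects.
    static Map<String, List<String>> match(List<String> fileNames, List<String[]> needles) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String[] needle : needles) {
            String first = needle[0].toLowerCase(Locale.ROOT);
            String third = needle[1].toLowerCase(Locale.ROOT);
            List<String> hits = new ArrayList<>();
            for (String name : fileNames) {
                String lowered = name.toLowerCase(Locale.ROOT);
                if (lowered.contains(first) && lowered.contains(third)) {
                    hits.add(name);
                }
            }
            result.put(needle[0] + " " + needle[1], hits);
        }
        return result;
    }

    public static void main(String[] args) {
        // In-memory sample data; a real run would call listFileNames(folder).
        List<String> config = Arrays.asList("acme x 2020 extra", "globex y 2021");
        List<String> names = Arrays.asList("ACME_report_2020.txt", "globex_2019.txt");
        System.out.println(match(names, parseConfig(config)));
    }
}
```

With about 400 entries and 10,000 filenames this is roughly 4 million in-memory string checks, which should be cheap compared with listing the folder 400 times. Moving the matched files would be a separate step once the map is built.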

 

Good luck! It's a nice little job and shouldn't be too hard to implement.