benu
Contributor III

Iterating over a directory and parsing XML files runs progressively slower

Hey folks... I've got a pretty simple job that reads XML files from a directory and passes them to a tFileInputXML component, parsing out half a dozen fields and appending them to an Excel file. As it runs, it gets progressively slower: it starts out at maybe 10 files/second, and now that I'm 2,600 files in, it's down to about 1 file every 3 seconds. I still have three times that many left to process, so this job is going to take hours.

Has anyone seen this behavior? Any recommendations? I'll post a screenshot of my job setup, and I'm happy to provide any component parameters if it'll help us figure out how to make it faster or more linear.

Thanks in advance! -Ben

(Screenshot of the job setup: 0695b00000H8O59AAF.png)

1 Solution

Accepted Solutions
Anonymous
Not applicable

Hello

Try the following changes to see if they improve the performance.

1. Allocate more memory to the job execution: open the Run view, click the Advanced settings panel, check the 'Use specific JVM parameters' box, and raise the JVM memory parameters.

2. Remove the tLogRow component from the job; this component is only for debugging purposes.

3. Click the Iterate connector and enable parallel execution.

4. If the amount of data in the files is not very large, cache the data in memory using tUnite before tFileOutputExcel to avoid too much file I/O.
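To illustrate the first suggestion: the JVM arguments box under Run view > Advanced settings accepts standard JVM flags, and Talend Studio's defaults are often modest. A hypothetical example (the right values depend on your machine and data volume):

```
-Xms1024M
-Xmx4096M
```

`-Xms` sets the initial heap size and `-Xmx` the maximum; raising `-Xmx` gives the job more headroom before garbage collection starts dominating run time.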

 

Regards

Shong


4 Replies

gjeremy1617088143

Hi,

You are doing extraction, transformation, and load all at the same time, which can make the job very slow.

Maybe you can try sending all the data to memory with tHashOutput, for example, and then, after an OnSubjobOk link, tHashInput --> main row --> tFileOutputExcel.

You can test it like this: deactivate tFileOutputExcel and watch your rows/s processing speed. If the average row processing speed is then good, my solution should work for you.
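To make the in-memory caching idea concrete, here is a plain-Java sketch (illustrative names only, not Talend-generated code) of the pattern tHashOutput/tHashInput implements: buffer all parsed rows in memory, then open the output once and write everything in a single pass, instead of reopening and appending the output file on every iteration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class BufferedWriteDemo {
    // Buffer every row in memory (the tHashOutput role), then write the
    // output file once at the end (the tHashInput -> tFileOutputExcel role).
    static int writeBuffered(int fileCount) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < fileCount; i++) {
            rows.add("file_" + i + ",value_" + i); // stand-in for one parsed XML file
        }
        try {
            Path out = Files.createTempFile("rows", ".csv");
            Files.write(out, rows);                // single open/write/close
            return Files.readAllLines(out).size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeBuffered(1000)); // prints 1000
    }
}
```

Appending to an Excel workbook once per iteration likely explains the progressive slowdown in the original question: each append rereads and rewrites the growing file, so iteration N costs roughly N units of I/O, while buffering reduces the total write cost to a single pass.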

Send me Love and Kudos

 

benu
Contributor III
Author

Hello folks. @Shong - those suggestions worked great! I implemented all but the parallel execution setting on the Iterate connector, because when that was enabled my job would not compile (error: "Detail Message: Local variable tos_count_tFileOutputExcel_1 defined in an enclosing scope must be final or effectively final. There may be some other errors caused by JVM compatibility. Make sure your JVM setup is similar to the studio.").
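For background on that compiler message (a generic Java sketch, not Talend's generated code): enabling parallel execution plausibly wraps the iterated flow in a lambda or anonymous class, and Java only allows such code to capture local variables that are final or effectively final, which a reassigned counter like tos_count_tFileOutputExcel_1 is not. A minimal illustration of the rule and the usual single-element-array workaround:

```java
public class EffectivelyFinalDemo {
    // A lambda or anonymous class may only capture locals that are final or
    // effectively final. Capturing a plain mutable int would not compile, but
    // a final array reference works: the reference never changes, only its
    // contents do.
    static int incrementViaLambda(int times) {
        final int[] counter = {0};
        Runnable r = () -> counter[0]++; // legal: 'counter' is effectively final
        for (int i = 0; i < times; i++) {
            r.run();
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(incrementViaLambda(3)); // prints 3
    }
}
```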

 

However, when the other changes were made, the job executed in a few seconds, rather than taking hours.

 

Thanks very much for your prompt response and helpful advice! -Ben

Anonymous
Not applicable

@ben uphoff​, I tried it and have the same issue. I am not sure if it is a bug; I will check it with our developers.

As gjeremy1617088143 suggested, send all the data to memory with tHashOutput, for example, and then, after an OnSubjobOk link, tHashInput --> main row --> tFileOutputExcel.

 

Regards

Shong