Anonymous

Reading large flat files in stream mode: doubts

Hi,
There is a scenario where we need to read millions of records from a database and look them up against a table with several million records...
- To ease the lookup, the lookup data is stored on a temp path, set via a variable inside tMap.
- To save memory for the main data, a file is used as the input in place of the database: the data from the database table is saved to a flat file before the process starts.
The question is: if we enable the option to read the data from a file stream, will it help solve the memory issues of the previous case? By previous case I mean reading the data from the file/database without streaming.
Will this load the data from the flat file in a continuous stream and look it up against the data available on the temp path using tMap?
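To illustrate what I mean by stream mode, in plain Java (the file path is just an example):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class StreamVsLoad {
    public static void main(String[] args) throws IOException {
        Path data = Path.of("/tmp/main_data.csv"); // example file dumped from the DB table

        // Without streaming: the whole file is materialized in memory at once,
        // which is what runs out of memory with millions of records.
        List<String> allRows = Files.readAllLines(data);
        System.out.println("Loaded rows: " + allRows.size());

        // With streaming: only the current row is held in memory, so usage
        // stays flat no matter how large the file is.
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(data)) {
            String row;
            while ((row = reader.readLine()) != null) {
                count++; // process the row here, then let it be garbage-collected
            }
        }
        System.out.println("Streamed rows: " + count);
    }
}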
https://help.talend.com/search/all?query=tFileOutputDelimited&content-lang=en
Vaibhav
6 Replies
Anonymous

I would create a dedicated lookup table containing only the essential columns, with an index that matches the search condition exactly.
I would then use this table (probably prepared in a previous job) and set the lookup load type to "Reload at each row", using globalMap expressions to configure the lookup WHERE conditions.
With this scenario you avoid memory problems, and it can run very fast (of course not as fast as holding the lookup in memory itself).
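As a rough sketch, the query of the lookup database input then looks something like this (table, column and globalMap key names are only examples; the key is stored into globalMap by the tMap lookup settings for each main row):

// Query of the lookup database input; executed once per main row:
"SELECT id, name, price FROM lookup_table WHERE id = " + (Integer) globalMap.get("lookupKey")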
Anonymous

I would create a dedicated lookup table containing only the essential columns and an index which points exactly to the search condition.
>> I can't have an index on the lookup column in this case.
I am also in doubt whether loading the data via a file input stream would enhance performance compared to normal data loading from a file/database.
Vaibhav
Anonymous

Hi Vaibhav,
I do not think that using a file input stream will improve your performance, because the stream has to be read all the way through, e.g. to its end, to reach the record you need. A database does this task much more effectively than anything you could implement yourself.
If your main database does not allow creating additional tables or indexes, you could try a database like Java DB (Derby) or H2. Both can run as in-memory databases and provide respectable speed. You could copy the massive lookup data into such a database; see the sketch below.
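In plain JDBC terms, the idea could look roughly like this (table and column names are made up, and the H2 driver jar has to be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class H2LookupSketch {
    public static void main(String[] args) throws Exception {
        // Pure in-memory H2 database; DB_CLOSE_DELAY=-1 keeps it alive for
        // the whole JVM so several job steps can share it.
        try (Connection h2 = DriverManager.getConnection(
                "jdbc:h2:mem:lookup;DB_CLOSE_DELAY=-1", "sa", "")) {

            // Only the essential columns, with the search key indexed
            // (the PRIMARY KEY provides the index here).
            try (Statement st = h2.createStatement()) {
                st.execute("CREATE TABLE lookup_data (id INT PRIMARY KEY, name VARCHAR(100))");
            }

            // Copy the massive lookup data in; batched inserts keep this fast.
            try (PreparedStatement ins = h2.prepareStatement(
                    "INSERT INTO lookup_data VALUES (?, ?)")) {
                ins.setInt(1, 42);           // in the real job these values
                ins.setString(2, "example"); // come from the source database
                ins.addBatch();
                ins.executeBatch();
            }

            // Point lookups against the index are now cheap, row by row.
            try (PreparedStatement sel = h2.prepareStatement(
                    "SELECT name FROM lookup_data WHERE id = ?")) {
                sel.setInt(1, 42);
                try (ResultSet rs = sel.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }
}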
Reading a file as a lookup with the option "Reload at each row" is definitely a performance killer.
Anonymous

Thanks Jan,
OK, so based on this, what I take away is to break the lookup as well as the source data into chunks/batches and process them one by one... this should certainly improve the performance of the job execution and make it possible to handle millions of records quickly. A rough sketch of what I mean is below.
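Roughly, as a plain Java sketch (the file path and chunk size are only examples):

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ChunkedProcessing {
    static final int CHUNK_SIZE = 10_000; // tune to the memory you can spare

    public static void main(String[] args) throws Exception {
        Path source = Path.of("/tmp/main_data.csv"); // example source dump
        List<String> chunk = new ArrayList<>(CHUNK_SIZE);

        try (BufferedReader reader = Files.newBufferedReader(source)) {
            String row;
            while ((row = reader.readLine()) != null) {
                chunk.add(row);
                if (chunk.size() == CHUNK_SIZE) {
                    processChunk(chunk);
                    chunk.clear(); // free memory before the next batch
                }
            }
            if (!chunk.isEmpty()) {
                processChunk(chunk); // trailing partial chunk
            }
        }
    }

    static void processChunk(List<String> chunk) {
        // Here you would load only the lookup rows whose keys occur in this
        // chunk (e.g. WHERE key IN (...)), join, and write the output.
        System.out.println("Processing " + chunk.size() + " rows");
    }
}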
I will try this breakup and I am sure it will help with managing large data in the lookup as well as the source...
Thanks 
Vaibhav
Anonymous

Of course you should do that. And if the source does not depend on particular chunks of lookup data, simply drop the lookup records that are not needed.
Anonymous

Hi @lli

I saw your reply about using in-memory databases like H2 and Derby. I am in a situation where I need to use this kind of database.

The problem is that I don't see how to import and use it in TOS. I also want to know whether it is fully transparent for the user (I develop and test, but someone else deploys and runs the jobs), so that the user does not have to configure or install anything and can just run the job as they are used to.

Can you help me with this, please?

Thanks a lot.