Our reference file has over 100M rows of data.
The job compares a file of new data against the reference file and routes non-matching rows from the new data to a rejects file.
The tMap buffers the reference file's data to .bin files in a temp folder on disk before the lookup is performed.
How can I force Talend DI to keep the reference data in the .bin files in the temp folder on disk after the job finishes, so the job runs quicker next time? Currently, the files in the temp folder are deleted when the job is done and have to be recreated on the next run.
The job looks like:
tFileInputDelimited_2 (lookup)
⇩
tFileInputDelimited_1 > tMap > tFileOutputDelimited
Hi,
You cannot keep the temp data permanently, as that would defeat the original design goal of clearing the temp space once processing is complete. Since the temp data is itself stored as files, there will still be file I/O operations that you cannot avoid. Another downside of your approach is the overhead and file-management burden whenever the lookup file is modified.
Considering the file size and processing needs, why don't you do this operation with a Big Data Spark Batch job? It will be much faster for these types of huge file operations.
Warm Regards,
Nikhil Thampi
I just downloaded and installed TOS Big Data.
Is the next step to find or create an HDFS Hadoop cluster?
How is TOS Big Data different from the normal TOS DI?
Do the Big Data components perform differently?
Hi,
You are right. You will have to create a cluster to host your files.
TOS Big Data contains all the features of TOS DI, plus specialized components and job types for running big data batch jobs.
Big Data jobs use either the MapReduce or the Spark framework to process the data flows.
Warm Regards,
Nikhil Thampi