We are seeing a weird problem with TOS 7.3. We build and deploy a standalone job and run it on a Windows machine, where a task scheduler executes it every 10 minutes.
Here's a small extract of a successful log file:
2021-12-09 16:55:04,692 [INFO] d.t.TALE35_AbsencesAndHolidaysService [Thread-2] tFileList_3 - Start to list files
2021-12-09 16:55:04,692 [INFO] d.t.TALE35_AbsencesAndHolidaysService [Thread-2] tFileList_3 - Current file or directory path : \\xxxxxxxx\xxxxxxxx\xxxxxxxxx\xxxxxxxx\xxxxxxxx\xxxxxxxx\xxxxxxxxxxxxxxxxxxxxxxxxxx.csv
2021-12-09 16:55:04,707 [INFO] d.t.TALE35_AbsencesAndHolidaysService [Thread-2] tRunJob_5 - The child job 'dataintegration.absencesandholidaystotable_1_2.AbsencesAndHolidaysToTable' starts on the version '1.2' with the context 'Production'.
2021-12-09 16:55:04,707 [INFO] d.a.AbsencesAndHolidaysToTable [Thread-2] TalendJob: 'AbsencesAndHolidaysToTable' - Start.
Received file '\\xxxxxxxx\xxxxxxxx\xxxxxxxxx\xxxxxxxx\xxxxxxxx\xxxxxxxx\xxxxxxxxxxxxxxxxxxxxxxxxxx.csv' as input parameter
2021-12-09 16:55:05,364 [INFO] d.a.AbsencesAndHolidaysToTable [Thread-2] tFileInputDelimited_2 - Retrieving records from the datasource.
2021-12-09 16:55:05,364 [INFO] d.a.AbsencesAndHolidaysToTable [Thread-2] tFileInputDelimited_2 - Retrieved records count: 1.
The job might execute successfully 120 times (every 10 minutes for 24, 25, or 26 hours), but then, without any clear reason, one of the executions dies like this:
2021-12-09 15:25:03,932 [INFO] d.t.TALE35_AbsencesAndHolidaysService [Thread-2] tFileList_3 - Start to list files
2021-12-09 15:25:03,932 [INFO] d.t.TALE35_AbsencesAndHolidaysService [Thread-2] tFileList_3 - Current file or directory path : \\xxxxxxxx\xxxxxxxx\xxxxxxxxx\xxxxxxxx\xxxxxxxx\xxxxxxxx\xxxxxxxxxxxxxxxxxxxxxxxxxx.csv
2021-12-09 15:25:03,963 [INFO] d.t.TALE35_AbsencesAndHolidaysService [Thread-2] tRunJob_5 - The child job 'dataintegration.absencesandholidaystotable_1_2.AbsencesAndHolidaysToTable' starts on the version '1.2' with the context 'Production'.
2021-12-09 15:25:03,963 [INFO] d.a.AbsencesAndHolidaysToTable [Thread-2] TalendJob: 'AbsencesAndHolidaysToTable' - Start.
Received file '\\xxxxxxxx\xxxxxxxx\xxxxxxxxx\xxxxxxxx\xxxxxxxx\xxxxxxxx\xxxxxxxxxxxxxxxxxxxxxxxxxx.csv' as input parameter
2021-12-09 15:25:04,526 [INFO] d.a.AbsencesAndHolidaysToTable [Thread-2] tFileInputDelimited_2 - Retrieving records from the datasource.
2021-12-09 15:25:04,542 [INFO] d.a.AbsencesAndHolidaysToTable [Thread-2] tFileInputDelimited_2 - Retrieved records count: 0.
[statistics] disconnected
[statistics] disconnected
[statistics] disconnected
[statistics] disconnected
[statistics] disconnected
[statistics] disconnected
[statistics] disconnected
[statistics] disconnected
The file definitely contains data, and by all accounts valid data. From this point on, the tFileInputDelimited component can no longer consume the FTP files, no matter how many times we rerun the job. And here is the weird thing:
If we re-build the job without changing anything, just rebuild it over the old publish, then on the next run the job works again and can consume the files. Then, some 24, 25 or 26 hours later, it dies again and does not recover on its own. If we again rebuild the job to an executable jar, the problem goes away. It is as if the deployed package breaks every now and then, but I can't imagine any scenario where that might happen.
So I'm looking for debugging ideas: what could cause the deployed package to repeatedly break, such that rebuilding the job fixes the problem?
Are the files located in a shared folder, and does each file have data? I see one log message mentions "[Thread-2] tFileInputDelimited_2 - Retrieved records count: 0.", so it looks like the error occurs while accessing the file. For debugging, check the 'die on error' box on the tFileInputDelimited component and the 'die on error' box on tRunJob; this makes the component throw the Java exception as soon as an error occurs.
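If the die-on-error boxes still don't surface an exception, a small tJava probe placed right before the read can show the state of the file exactly as the job sees it at that moment. This is only an illustrative sketch; the path argument stands in for whatever context variable carries the current file in your job:

```java
// Hypothetical pre-read probe for a tJava step just before tFileInputDelimited_2.
// The path passed in is a placeholder for the job's own file variable.
import java.io.File;

public class FileProbe {
    // Summarise how the JVM sees the file at read time.
    public static String probe(String path) {
        File f = new File(path);
        return "exists=" + f.exists()
             + " canRead=" + f.canRead()
             + " length=" + f.length();
    }

    public static void main(String[] args) {
        // Logging to System.err keeps the probe output in the same
        // log file as the component messages.
        for (String path : args) {
            System.err.println(path + " -> " + probe(path));
        }
    }
}
```

If the failing runs show `exists=false` or `canRead=false` here while the file looks fine afterwards, that would point at the share rather than the component.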
Regards
Shong
"The files are located in a shared folder? and each file has data?"
Correct. The main job first downloads the files to the shared folder (there might be 1 or 200), and after tFTPGet is done, we iterate over the downloaded files and pass them to the child job. Every file we have investigated contains correct data, the same kind of data that has previously passed. We even checked newline characters and file encodings in case those differed in the files that fail to be processed, but that is not the case either. The shared folder seems by all accounts to be accessible all the time. No errors indicate a file access issue.
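For the record, comparing a failing file against a good one byte for byte (BOM, encoding, line endings) can be done with a throwaway dump along these lines; the class and file names are just placeholders:

```java
// Throwaway diff aid, not part of the job: hex-dump the first bytes of a
// file so a failing file and a good one can be compared byte for byte.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class HeadDump {
    // Return the first n bytes of the file as space-separated hex.
    public static String headHex(String path, int n) throws IOException {
        byte[] all = Files.readAllBytes(Paths.get(path));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(n, all.length); i++) {
            sb.append(String.format("%02x ", all[i]));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // Pass e.g. a failed file and a good file and eyeball the difference.
        for (String path : args) {
            System.out.println(path + ": " + headHex(path, 32));
        }
    }
}
```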
"I see one log message mentions "[Thread-2] tFileInputDelimited_2 - Retrieved records count: 0.", so it looks like the error occurs while accessing the file. For debugging, check the 'die on error' box on the tFileInputDelimited component and the 'die on error' box on tRunJob; this makes the component throw the Java exception as soon as an error occurs."
These are already checked. The last thing in the job logs is this:
2021-12-09 15:25:04,542 [INFO] d.a.AbsencesAndHolidaysToTable [Thread-2] tFileInputDelimited_2 - Retrieved records count: 0.
[statistics] disconnected
[statistics] disconnected
[statistics] disconnected
[statistics] disconnected
[statistics] disconnected
[statistics] disconnected
[statistics] disconnected
[statistics] disconnected
The death of the job does not seem clean or controlled. When the job succeeds, the last line is:
2021-12-10 07:35:04,684 [INFO] d.t.TALE35_AbsencesAndHolidaysService [main] TalendJob: 'TALE35_AbsencesAndHolidaysService' - Done.
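Since the failing run never reaches that "Done." line, one thing we might try is installing a default uncaught-exception handler early in the job (for instance from a tJava at the very start of the standalone job), so that anything silently killing a thread gets written somewhere. A minimal sketch, with the log path as a placeholder:

```java
// Sketch of a crash trap that could be installed from a tJava at job start.
// The log file path is an example, not the job's real log destination.
import java.io.FileWriter;
import java.io.PrintWriter;

public class CrashTrap {
    public static void install(String logPath) {
        Thread.setDefaultUncaughtExceptionHandler((t, e) -> {
            // Append the thread name and full stack trace of whatever
            // exception escaped, so an unclean death leaves evidence.
            try (PrintWriter out = new PrintWriter(new FileWriter(logPath, true))) {
                out.println("Uncaught in " + t.getName());
                e.printStackTrace(out);
            } catch (Exception ignored) {
                // Nothing more we can do from a dying thread.
            }
        });
    }
}
```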
Hello,
Did you try to modify the context variables in the parent job's tRunJob, and select the Transmit whole context and Die on child error check boxes?
Would you mind posting screenshots of your job design here? That will help us address your issue. Please mask your sensitive data.
Best regards
Sabrina
I have not experimented with the Transmit whole context option, but Die on child error is checked. As you review these settings, please keep in mind that the job works 99.9% of the time. When it fails, there are no clear abnormalities to be detected: the files that fail to process are the same size as previous ones that succeeded, the folder and file names follow the same syntax, and they look like they should have processed. In fact, they do, after rebuilding and rerunning the job.
Here is the main job:
Here is the tRunJob (one of those):
Here is the child job. The component "Absence or Holiday" is the tFileInputDelimited_2 at the heart of our issue. That is the component that randomly fails to read a file which, by every means of inspection (including Notepad++), contains valid data. And after the job is re-built without modifications and re-run, the file is read. Go figure.
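To test the "deployed package breaks" theory directly, we are thinking of hashing the job jar before each scheduled run and comparing a failing run's hash against a good run's. If the hashes match, the bytes on disk are not the problem and the rebuild must be fixing something else. A minimal sketch; the jar name is a placeholder:

```java
// Sketch: checksum a deployed artifact so runs can be compared over time.
// The jar path passed on the command line is a placeholder.
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class JarHash {
    public static String sha256(String path) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(path));
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // e.g. java JarHash AbsencesAndHolidaysService.jar >> hashes.log
        for (String path : args) {
            System.out.println(sha256(path) + "  " + path);
        }
    }
}
```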