Hello Everyone,
Since the beginning of this week, I've been getting lots of failed tasks. The error always looks like this:
2018-07-19 13:30:16 UTC
Max retries reached (0)
Changing task state from Queued to FinishedFail
Message from ReloadProvider: Task failed due to timeout getting engine connection
Changing task state from Triggered to Queued
Trying to start task. Sending task to slave scheduler qliksense.dts
Changing task state to Triggered
How do I go about diagnosing what the root problem is? I've checked the logs (mostly in the Engine directory) and I don't see anything telling me what the problem could be.
Many of the failing scripts belong to end-user apps, so they don't reach out to the DB for their data but instead read it from QVD files.
Since it looks like the reload process is waiting for 30 minutes to get a connection, I'll go see if anything has had a sharp increase in load time over the past week.
Any help on getting to the bottom of this would be greatly appreciated.
Possible Diagnosis and Solution:
So I think I've figured out what's happening. It all centers on my having decreased the number of tasks that can run concurrently from 4 to 3, with a Task Timeout of 30 minutes. Here is what I think happens.
- 00:00 Task A starts and completes successfully
- 00:05 Task A's completion triggers tasks [SLOWTASK-2, SLOWTASK-3, NOTSLOW-4...NOTSLOW-12] (abbreviated ST and NST below)
- 00:10 ST-2 Still Running
        ST-3 Still Running
        NST-4 Runs & Completes
        NST-5 Runs & Completes
        NST-6..12 Queued
- 00:25 ST-2 Still Running
        ST-3 Still Running
        NST-6 Runs & Completes
        NST-7 Runs & Completes
        NST-8..12 Queued
- 00:35 ST-2 Still Running <--- Time's up!
        ST-3 Still Running
        NST-8 Runs & Completes
        NST-9 Runs & Completes
        NST-10..12 Queued <-- Incomplete and never started tasks
Since NOTSLOW-10 through NOTSLOW-12 have been queued for longer than the Task Timeout limit, they get listed as failed. These tasks didn't fail in the ways we are accustomed to, which are usually a script error, file contention (someone is writing the file while a task wants to read it), or a problem with the data source (DB permissions, a file path that doesn't exist). The failure is literally that nothing happened: each task got queued up but never got to take the stage, or "get an engine connection" as the error message puts it.
I'm going to increase my concurrent task limit and see if that provides any relief.
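To sanity-check the timeline above, here is a rough Python sketch of the queueing behaviour I'm describing. It is not how the Qlik scheduler is actually implemented; it is just a toy model. The function name `simulate`, the 60-minute duration for the slow tasks, and the 5-minute duration for the not-slow tasks are all assumptions I made up so the numbers line up with the timeline.

```python
import heapq

def simulate(tasks, slots, timeout, triggered_at=0):
    """Toy model: tasks wait in order for a free slot and fail if they
    have waited `timeout` minutes or more without starting.

    tasks: list of (name, duration_minutes), all triggered at `triggered_at`.
    Returns (started, failed) as two lists of task names.
    """
    # Each heap entry is the minute at which one slot becomes free.
    free_at = [triggered_at] * slots
    heapq.heapify(free_at)
    started, failed = [], []
    for name, duration in tasks:
        start = heapq.heappop(free_at)  # earliest free slot
        if start - triggered_at >= timeout:
            # Waited too long in the queue: never reaches the engine.
            failed.append(name)
            heapq.heappush(free_at, start)  # slot was not used
        else:
            started.append(name)
            heapq.heappush(free_at, start + duration)
    return started, failed

# Assumed durations (hypothetical): slow tasks 60 min, not-slow tasks 5 min.
TASKS = [("SLOWTASK-2", 60), ("SLOWTASK-3", 60)] + [
    (f"NOTSLOW-{i}", 5) for i in range(4, 13)
]

for slots in (3, 4):
    _, failed = simulate(TASKS, slots=slots, timeout=30, triggered_at=5)
    print(f"{slots} concurrent tasks -> failed: {failed}")
```

With these assumed durations, the 3-slot run fails NOTSLOW-10 through NOTSLOW-12 and the 4-slot run fails nothing, which matches what I'm seeing. Passing a larger `timeout` with 3 slots also lets every task start in this model.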
I solved it by changing EngineTimeout(minutes) from 30 to 120.