JustinDallas
Specialist III

Task failed due to timeout getting engine connection

Hello Everyone,

Since the beginning of this week, I've been getting a lot of failed tasks. The error is always the same (newest entry first):

2018-07-19 13:30:16 UTC  Max retries reached (0)
2018-07-19 13:30:16 UTC  Changing task state from Queued to FinishedFail
2018-07-19 13:30:16 UTC  Message from ReloadProvider: Task failed due to timeout getting engine connection
2018-07-19 13:00:16 UTC  Changing task state from Triggered to Queued
2018-07-19 13:00:15 UTC  Trying to start task. Sending task to slave scheduler qliksense.dts
2018-07-19 13:00:14 UTC  Changing task state to Triggered


How do I go about diagnosing what the root problem is? I've checked the logs (mostly in the Engine directory) and I don't see anything telling me what the problem could be.

Many of the failing scripts are end-user apps, so they don't query the database; they read their data from QVD files instead.

Since the reload appears to wait 30 minutes for a connection before failing, I'll check whether anything has had a sharp increase in reload time over the past week.
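The 30 minutes comes straight from the timestamps: the task entered Queued at 13:00:16 and was failed at 13:30:16 without ever running. If you want to check this across many task logs, a small Python sketch like the one below works, assuming you've copied the entries as (timestamp, message) pairs. The helper names `parse` and `stamp_of` are hypothetical, not part of any Qlik API.

```python
from datetime import datetime

# Task log entries copied from the post, newest first as shown in the QMC.
LOG = [
    ("2018-07-19 13:30:16 UTC", "Max retries reached (0)"),
    ("2018-07-19 13:30:16 UTC", "Changing task state from Queued to FinishedFail"),
    ("2018-07-19 13:30:16 UTC", "Message from ReloadProvider: Task failed due to timeout getting engine connection"),
    ("2018-07-19 13:00:16 UTC", "Changing task state from Triggered to Queued"),
    ("2018-07-19 13:00:15 UTC", "Trying to start task. Sending task to slave scheduler qliksense.dts"),
    ("2018-07-19 13:00:14 UTC", "Changing task state to Triggered"),
]


def parse(ts: str) -> datetime:
    # The stamps end in a literal "UTC", which strptime matches as plain text.
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S UTC")


def stamp_of(log, fragment: str) -> datetime:
    """Timestamp of the first log entry whose message contains `fragment`."""
    return next(parse(ts) for ts, msg in log if fragment in msg)


queued = stamp_of(LOG, "from Triggered to Queued")
failed = stamp_of(LOG, "from Queued to FinishedFail")
print(failed - queued)  # time spent waiting in the queue
```

If the gap between Queued and FinishedFail always equals the configured timeout, the task never got to run at all, which points at scheduling rather than at the load script.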

Any help on getting to the bottom of this would be greatly appreciated.

JustinDallas
Specialist III
Author

Possible Diagnosis and Solution:

I think I've figured out what's happening. It all started when I decreased the maximum number of concurrent tasks from 4 to 3, with a Task Timeout of 30 minutes. Here is what I think happens:

- 00:00 Task A starts and completes successfully.
- 00:05 Task A's completion triggers tasks [SLOWTASK-2, SLOWTASK-3, NOTSLOW-4 ... NOTSLOW-12], abbreviated below as ST-n and NST-n.
- 00:10 ST-2 still running
        ST-3 still running
        NST-4 runs and completes
        NST-5 runs and completes
        NST-6..12 queued
- 00:25 ST-2 still running
        ST-3 still running
        NST-6 runs and completes
        NST-7 runs and completes
        NST-8..12 queued
- 00:35 ST-2 still running  <--- Time's up! (30 minutes after the 00:05 trigger)
        ST-3 still running
        NST-8 runs and completes
        NST-9 runs and completes
        NST-10..12 queued  <-- never started, so they fail

Because NOTSLOW-10 through NOTSLOW-12 sit in the queue longer than the Task Timeout, they are marked as failed. They didn't fail in any of the ways we're used to, such as a script error, file contention (someone writing a file while a task wants to read it), or a data source problem (DB permissions, a file path that doesn't exist). The failure is literally that nothing happened: each task got queued but never got its turn on stage, or, as the log puts it, never got an engine connection.
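This hypothesis can be sanity-checked with a small queue model. The Python sketch below is not Qlik's scheduler; it assumes a FIFO queue, a fixed number of concurrent reload slots, and that a task waiting longer than the timeout for a free slot is failed without starting. The run times (60 minutes for the slow reloads, 7 for the others) are made up for illustration, and `simulate` and `Outcome` are hypothetical names.

```python
import heapq
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Outcome:
    name: str
    status: str             # "completed" or "failed"
    start: Optional[float]  # minute the task got a slot; None if it never started
    end: float              # minute the task finished, or was failed while queued


def simulate(tasks: List[Tuple[str, float]], slots: int,
             queue_timeout: float, trigger_time: float = 0.0) -> List[Outcome]:
    """FIFO dispatch of tasks that are all triggered at the same moment.

    Assumed model: at most `slots` tasks run at once, and a task that waits
    more than `queue_timeout` minutes for a free slot fails without running.
    """
    running: List[float] = []  # min-heap of finish times of tasks holding a slot
    outcomes: List[Outcome] = []
    for name, run_minutes in tasks:
        # Earliest time a slot is available for this task.
        start = trigger_time if len(running) < slots else running[0]
        if start - trigger_time > queue_timeout:
            # Never got a slot in time: the "timeout getting engine connection" case.
            outcomes.append(Outcome(name, "failed", None, trigger_time + queue_timeout))
            continue
        if len(running) >= slots:
            heapq.heappop(running)  # the task finishing at `start` frees its slot
        heapq.heappush(running, start + run_minutes)
        outcomes.append(Outcome(name, "completed", start, start + run_minutes))
    return outcomes


# Hypothetical workload shaped like the timeline: two slow reloads, nine fast ones.
TASKS = [("ST-2", 60), ("ST-3", 60)] + [(f"NST-{i}", 7) for i in range(4, 13)]

if __name__ == "__main__":
    for slots, timeout in [(4, 30), (3, 30), (3, 120)]:
        failed = [o.name for o in simulate(TASKS, slots, timeout) if o.status == "failed"]
        print(f"slots={slots} timeout={timeout}: failed={failed}")
```

With these made-up durations, 4 slots and a 30-minute timeout let every task start in time, 3 slots leave NST-9 through NST-12 failing, and 3 slots with a 120-minute timeout let everything through again. Raising the concurrency limit and lengthening the timeout both act on the same quantity: how long a task can sit in the queue.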

I'm going to increase my concurrent task limit and see if that provides any relief.


michaelfreitas
Creator

I solved this on my side by changing EngineTimeout(minutes) from 30 to 120.

 
