Re: Qlik Sense - Task running on node1 and is bloc... - Qlik Community

kallay · ‎2023-02-23

I have configured an environment with:
- Central/Failover
- 5 nodes (consumer/scheduler/development):
- 1 dev node
- 4 nodes consumer/scheduler
All the tasks run on the 4 nodes (consumer/scheduler).

In some cases the dev node gets overloaded and needs some time to recover. At the same time, some tasks that are supposed to start get stuck on "Triggered", the dev node appears as offline in the QMC, and it can no longer be used for opening apps via load balancing.
After restarting the services on the dev node, apps can be opened on it again, but in the QMC Nodes menu the dev node is still shown as offline, with all services running except the repository.
To resolve this I have to restart the services on the central node (I think I only have to restart the service dispatcher on the central node, but I can't confirm that right now).

My problem is the tasks that get stuck on "Triggered" because of a node that has nothing to do with those tasks.

Is this expected behavior?

Alan_Slaughter · ‎2023-02-23

Hi Kallay, maybe the issue was caused by the failover.
Because the scheduler service must be active, the failover candidate will function as a worker scheduler and will receive reload jobs. If you need to minimize the number of reload jobs this node performs, you can set Max concurrent reloads to 1. If this method is used, the customer is responsible for setting it, and if it breaks your failover, it is the customer's responsibility to find the root cause. Modifying the failover node in this way could cause failover to not perform as defined. The failover node is there to back up your central node in case of failure.

https://community.qlik.com/t5/Official-Support-Articles/Qlik-Sense-Failover-Central-node-Requirement...

https://community.qlik.com/t5/Official-Support-Articles/Concurrent-Reload-Settings-in-Qlik-Sense-Ent...

kallay · ‎2023-02-28

Thank you for your response. I'm the client 🙂 and I set up the environment myself, but this is not the case you are describing. The environment is set up with custom properties that assign dedicated apps to dedicated nodes, for reloading as well.

The failover node is set only to be the backup of the central node if it fails, and nothing runs on it, nor on the central node. To be clearer, the reloads are set via custom properties with load balancing rules to use only dedicated nodes, and none of the apps use the central or failover node. I've even set the built-in apps (monitoring and license) to run on consumer nodes. If an app doesn't have a dedicated consumer node, it can't be opened and it can't be reloaded.
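The routing described above can be pictured with a small toy model. This is only a sketch of the idea, not Qlik's rule engine: the `Node` and `App` classes, the `groups` field standing in for a custom property, and the values `prod-a`, `prod-b` and `dev` are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    name: str
    # Hypothetical stand-in for the values of a custom property on the node.
    groups: frozenset = field(default_factory=frozenset)


@dataclass(frozen=True)
class App:
    name: str
    # Hypothetical stand-in for the values of the same custom property on the app.
    groups: frozenset = field(default_factory=frozenset)


def eligible_nodes(app, nodes):
    """Return the names of nodes that share at least one property value with the app."""
    return [n.name for n in nodes if app.groups & n.groups]


nodes = [
    Node("Central"),                              # no value: nothing is routed here
    Node("Failover"),                             # no value: nothing is routed here
    Node("Consumer A", frozenset({"prod-a"})),
    Node("Consumer B", frozenset({"prod-b"})),
    Node("Development", frozenset({"dev"})),
]

sales = App("Sales", frozenset({"prod-a"}))
orphan = App("Orphan")  # no property value set on the app

print(eligible_nodes(sales, nodes))   # ['Consumer A']
print(eligible_nodes(orphan, nodes))  # []
```

An app without a matching value ends up with an empty list, which corresponds to "it can't be opened and it can't be reloaded".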

The environment works as intended until one of the nodes gets blocked by over use of resources.

I'll leave an example to be clearer:
Node 0 - Central
Node 1 - Failover
Node 2 - Consumer
Node 3 - Consumer
Node 4 - Development

It's 10:30.

10 reload tasks are set, via load balancing rules and custom properties, to run:
- 6 tasks on Node 2: each task finishes reloading in 10 minutes, and they are set to start 2 minutes apart beginning at 10:30
- 4 tasks on Node 3: each task finishes reloading in 15 minutes, and they are set to start 5 minutes apart beginning at 10:30

At 10:25 Node 4 gets overloaded and goes down, and some errors appear when opening apps (this is understandable and I suppose it is as intended).

10:31 - 1 task is triggered to start on Node 2 but is stuck in "Triggered". 1 task is triggered to start on Node 3 but is also stuck in "Triggered".
10:45 - now we have 2 tasks from Node 2 and 2 tasks from Node 3 stuck in "Triggered".

If Node 4 is not restarted (sometimes the services can't be restarted, especially the repository, and it's necessary to restart the machine or kill the service that can't be restarted), all the tasks that are supposed to run next will get stuck in "Triggered".
After a restart of Node 4 everything gets back to normal. Even the tasks that were blocked resume normally.

My expectation is that if a node gets blocked, only the apps and tasks that were supposed to run on that node stop working as expected, without causing problems in other, unconnected areas.
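To make the symptom easier to spot, a check along these lines could flag tasks that sit in "Triggered" too long and are not assigned to the blocked node. It is a minimal sketch under assumptions: the record shape (`name`, `node`, `status`, `triggered_at`), the 5-minute limit, the task names and the date are all hypothetical, and the records would have to be collected by whatever means is available in a given environment.

```python
from datetime import datetime, timedelta


def stuck_tasks(tasks, now, max_wait=timedelta(minutes=5)):
    """Return names of tasks that have been in 'Triggered' longer than max_wait."""
    return [
        t["name"]
        for t in tasks
        if t["status"] == "Triggered" and now - t["triggered_at"] > max_wait
    ]


def unrelated_to(stuck, tasks, offline_node):
    """Return the stuck task names whose assigned node is not the offline node."""
    by_name = {t["name"]: t for t in tasks}
    return [name for name in stuck if by_name[name]["node"] != offline_node]


# Hypothetical snapshot; the date is arbitrary, only the times matter.
day = datetime(2023, 2, 28)
tasks = [
    {"name": "N2-T1", "node": "Node 2", "status": "Triggered",
     "triggered_at": day.replace(hour=10, minute=30)},
    {"name": "N2-T2", "node": "Node 2", "status": "Triggered",
     "triggered_at": day.replace(hour=10, minute=32)},
    {"name": "N3-T1", "node": "Node 3", "status": "Triggered",
     "triggered_at": day.replace(hour=10, minute=30)},
]

now = day.replace(hour=10, minute=36)
stuck = stuck_tasks(tasks, now)
print(stuck)                                 # ['N2-T1', 'N3-T1']
print(unrelated_to(stuck, tasks, "Node 4"))  # ['N2-T1', 'N3-T1']
```

Any name in the second list is a task that is stuck even though its own node is healthy, which is exactly the behavior being questioned here.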

Alan_Slaughter · ‎2023-03-28

Hi I would suggest applying the following:

1. https://community.qlik.com/t5/Official-Support-Articles/Qlik-Sense-Randomly-some-streams-or-applicat...

2. https://community.qlik.com/t5/Official-Support-Articles/Qlik-Sense-experiences-port-exhaustion-issue...

kallay · ‎2023-03-30

Hello!

I've tried the two solutions and retested the case.
It's the same.

Tnx.

Alan_Slaughter · ‎2023-03-30

Hi Kallay, I would suggest opening a case with Qlik Support for further investigation.

Thanks

kallay · ‎2023-03-30

Tnx. I already opened one.

Have a nice day.

BTIZAG_OA · ‎2023-11-14

Hello Kallay,

I have the same issue here. Did you find any solution for this?

The same problem occurs with the following setup:

- Multi-node cluster
- App load balancing with custom properties
- The master/manager scheduler is the central node
- There is 1 failover node for the central node
- 8 servers that run tasks with the corresponding load balancing/custom property

There is a memory leak problem on our failover node. When memory usage reaches about 97% or higher, the node starts to be shown as "Node is offline" in the QMC. I know this part of the problem is related to Microsoft Windows, but somehow, when this failover node becomes unresponsive (seen as offline in the QMC), all of our tasks hang and newly triggered tasks get stuck in "Triggered" status. If we restart the failover node, all of the tasks continue without a problem.

This node is a worker/slave and only runs UDC tasks. I don't get how this completely shuts down all tasks. I know that the root cause of the node becoming unresponsive is related to MS Windows, but it is unacceptable that this stops the entire task load in the Qlik site.

kallay · ‎2023-11-14

Hello,

No. I had a case open with Qlik Support and the answer was that we need more resources, so I chose not to continue with the case.

My solution was to hire an administrator to monitor all servers and react faster when the dev node is used locally.

My conclusion is that this situation is a Qlik Sense bug.

PS: the bug now occurs less often because our devs have been more careful with local resource usage on the dev node.

The nodes are set by default not to exceed 90% RAM usage, but our developers also log in to the dev node locally to do QlikView development and sometimes push RAM usage over 90%. If the server sits with RAM above 95% for more than 1 minute, the machine freezes and this bug appears. The solution? Again, restart the blocked server and sometimes restart the services on Central and Failover (depending on which tasks were running on which node at that time).
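The "above 95% for more than 1 minute" condition is something a monitoring script could watch for before the machine freezes. The following is a rough sketch, assuming RAM usage is already being sampled somewhere as `(seconds, percent)` pairs; the function name, the sample format and the 15-second interval are hypothetical choices, not anything Qlik provides.

```python
def sustained_over(samples, threshold=95.0, min_seconds=60):
    """Return True if RAM stayed above threshold for more than min_seconds in one run.

    samples: list of (seconds_since_start, ram_percent) pairs, sorted by time.
    """
    run_start = None  # time of the first sample in the current above-threshold run
    for t, pct in samples:
        if pct > threshold:
            if run_start is None:
                run_start = t
            if t - run_start > min_seconds:
                return True
        else:
            run_start = None  # usage dropped back, so the run is broken
    return False


# Hypothetical samples taken every 15 seconds.
freeze_risk = [(0, 88), (15, 96), (30, 97), (45, 96), (60, 98), (75, 97), (90, 99)]
short_spikes = [(0, 96), (15, 97), (30, 80), (45, 96), (60, 97)]

print(sustained_over(freeze_risk))   # True
print(sustained_over(short_spikes))  # False
```

Alerting on a lower threshold or a shorter window than the freeze condition would leave time to react before the node goes offline.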

Open a case with Qlik Support; maybe you'll have more luck than I did.

Tnx and have fun.

BTIZAG_OA · ‎2023-11-14

Hello again Kallay,

Thanks for the detailed answers. I will open a new ticket, but I have little hope it will be investigated seriously. I will write back here if I find a solution.

Have a nice day

Qlik Sense - Task running on node1 and is blocked in "triggered" after node2 is losing connection.

General Question