Hi all!
I'm looking for input regarding a fairly large Sense client-managed environment.
I'm helping a customer with an 8-node cluster. It's quite a back-end-heavy setup, so there is a lot of data being handled.
The cluster consists of 4 reload nodes, 2 front-end nodes, 1 development node and 1 main/central node. One fileshare for the cluster root and one fileshare for QVDs and other files. External PostgreSQL. Main authentication via Entra, and TLS certificates handled in an external load balancer/reverse proxy.
The development node is a fallback central. The reload nodes do not have a running proxy service. The two front-end nodes sit behind an external load balancer and also share each other's engines. The main and development nodes have separate proxies and only use their own engines.
One of the specifics I'm looking for is how to set up the fallback node so it can take over for the main node if that node goes down. We don't want the development node to run any tasks, but we do want it to schedule tasks when it is the current central node.
Has anyone figured out a setup that supports this? We've got an 'almost OK' setup now where the development node is set to "Both" and only allows 1 task to run.
The next challenge we have is improving the task load balancing. Tasks and task chains seem to mainly favor backend-1 and backend-2. Some task chains are long, mostly for good reasons, and I believe that the load balancing decision is made at the start of the chain.
Has anyone figured out a good way to handle tasks to better utilize the capacity?
The environment is functional but things can always improve.
Any other tips or input on configuring, managing and improving an environment like this is hugely appreciated.
TIA!
/lars
Hm... the smell of a well-written question... so good! LOL
The overall environment is very interesting.
If you want to reduce the hits on a particular server, try reducing the max number of concurrent tasks it can run. The Qlik Scheduler algorithm does an excellent job of taking into consideration the percentage of concurrent task slots in use and the overall RAM use. If RAM is under control on one server and there is still availability to run more tasks on that server, Qlik will, most of the time, opt to continue sending traffic to that server.
So, reducing the max number of concurrent tasks should help.
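If it helps, this is roughly how I'd script that change instead of clicking through the QMC. It's an untested sketch against the QRS API using certificate access on port 4242; the host name, certificate paths, the sa_api account and the maxConcurrentEngines property are placeholders/from memory, so verify them against the QRS API docs for your version:

```python
# Untested sketch: lower the max concurrent reloads on one scheduler node via the QRS API.
# Assumes certificate access to the repository on port 4242 (certs exported from the QMC).
import requests

XRF = "abcdefghijklmnop"                    # any 16-character value, sent as header + query param
QRS = "https://central-node:4242/qrs"       # placeholder central host
KW = dict(
    headers={"X-Qlik-Xrfkey": XRF,
             "X-Qlik-User": "UserDirectory=INTERNAL; UserId=sa_api"},
    cert=("client.pem", "client_key.pem"),  # exported client certificate + key
    verify=False,                           # or the path to root.pem
)

TARGET_HOST = "backend-1"                   # node whose scheduler you want to throttle
NEW_MAX = 2                                 # down from the default of 4

for svc in requests.get(f"{QRS}/schedulerservice/full",
                        params={"xrfkey": XRF}, **KW).json():
    if svc["serverNodeConfiguration"]["hostName"].startswith(TARGET_HOST):
        svc["settings"]["maxConcurrentEngines"] = NEW_MAX
        resp = requests.put(f"{QRS}/schedulerservice/{svc['id']}",
                            params={"xrfkey": XRF}, json=svc, **KW)
        print(TARGET_HOST, "->", resp.status_code)
```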
Live and Breathe Qlik & AWS.
Follow me on my LinkedIn | Know IPC Global at ipc-global.com
Skage,
I would enable the Proxy on the Scheduler servers and modify the monitoring_apps REST connections to use "https://localhost". That way, you can protect the Central server and it enhances your availability during the fail-over.
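Something along these lines is what I mean for the monitoring apps. Untested sketch: it rewrites the host part of every monitor_apps_REST* data connection to https://localhost through the QRS API. The dataconnection endpoint and the connectionstring property are from memory, and the host/cert/account values are placeholders:

```python
# Untested sketch: point every monitor_apps_REST* data connection at https://localhost
# so the monitoring apps reload through the proxy on whichever node is currently central.
import re
import requests

XRF = "abcdefghijklmnop"
QRS = "https://central-node:4242/qrs"       # placeholder central host
KW = dict(
    headers={"X-Qlik-Xrfkey": XRF,
             "X-Qlik-User": "UserDirectory=INTERNAL; UserId=sa_api"},
    cert=("client.pem", "client_key.pem"),
    verify=False,
)

for dc in requests.get(f"{QRS}/dataconnection/full",
                       params={"xrfkey": XRF}, **KW).json():
    if not dc["name"].startswith("monitor_apps_REST"):
        continue
    # Swap whatever host the connection string points at for localhost, keep the rest intact.
    new = re.sub(r"https://[^/;:]+", "https://localhost", dc["connectionstring"], count=1)
    if new != dc["connectionstring"]:
        dc["connectionstring"] = new
        resp = requests.put(f"{QRS}/dataconnection/{dc['id']}",
                            params={"xrfkey": XRF}, json=dc, **KW)
        print(dc["name"], "->", resp.status_code)
```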
I'm putting topics on different responses so we can discuss each one individually.
Live and Breathe Qlik & AWS.
Follow me on my LinkedIn | Know IPC Global at ipc-global.com
Skage,
Regarding the fail-over node, I would opt for a dedicated server, a small box used only in the event of a failure. This server would not use its Engine for anything; it would just perform the Central server role.
On an additional note, I would recommend leaving the Main/Central as Scheduler Master only. That way, when it fails over to the other node, it won't run tasks there either, and you protect the functionality of the Central server node.
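A quick way to audit this after a fail-over is to list every scheduler's role straight from the QRS API. Untested sketch; the schedulerServiceType value is a numeric enum that you'd map against the QRS docs, and the host/cert/account values are placeholders:

```python
# Untested sketch: list each node's scheduler role and concurrency cap from the QRS API,
# handy for auditing the cluster right after a fail-over.
import requests

XRF = "abcdefghijklmnop"
QRS = "https://central-node:4242/qrs"       # placeholder central host
KW = dict(
    headers={"X-Qlik-Xrfkey": XRF,
             "X-Qlik-User": "UserDirectory=INTERNAL; UserId=sa_api"},
    cert=("client.pem", "client_key.pem"),
    verify=False,
)

for svc in requests.get(f"{QRS}/schedulerservice/full",
                        params={"xrfkey": XRF}, **KW).json():
    node = svc["serverNodeConfiguration"]["hostName"]
    settings = svc["settings"]
    # schedulerServiceType is a numeric enum (master / slave / both); map it against the QRS docs.
    print(f"{node:30s} type={settings['schedulerServiceType']} "
          f"maxConcurrentEngines={settings['maxConcurrentEngines']}")
```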
I've created a diagram real-quick to help visualize the concept.
Live and Breathe Qlik & AWS.
Follow me on my LinkedIn | Know IPC Global at ipc-global.com
Thank you.
I'll take some time to investigate when tasks and task chains are executed, take a look at concurrency, and see how that correlates with the max-task setting.
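My rough plan is something like the sketch below: pull recent reload execution results from the QRS API and count them per executing node, then compare that against each scheduler's max-concurrent setting. Untested, and the executionresult endpoint and executingNodeName field are my recollection of the QRS entity, so I'll verify before running anything:

```python
# Untested sketch: count recent reload executions per node to see how skewed scheduling really is.
from collections import Counter

import requests

XRF = "abcdefghijklmnop"
QRS = "https://central-node:4242/qrs"       # placeholder central host
KW = dict(
    headers={"X-Qlik-Xrfkey": XRF,
             "X-Qlik-User": "UserDirectory=INTERNAL; UserId=sa_api"},
    cert=("client.pem", "client_key.pem"),
    verify=False,
)

# Only look at executions started after this date (adjust as needed).
flt = "startTime gt '2024-01-01T00:00:00.000Z'"
results = requests.get(f"{QRS}/executionresult/full",
                       params={"xrfkey": XRF, "filter": flt}, **KW).json()

per_node = Counter(r.get("executingNodeName") or "unknown" for r in results)
for node, count in per_node.most_common():
    print(f"{node:30s} {count}")
```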
/lars
I'll check on this next week. I remember having to do something to get the monitoring apps to work properly but I'll have to revisit this.
Good advice!
/lars
Great information and it sparks some ideas.
I think I'll get some resistance if we want even more machines, so I was hoping to get this working with the 8 we've got.
The thing that bit us was the fact that nothing gets scheduled unless the fail-over node is set to Both, scheduler master and worker.
We initially set the development node to Development mode and worker. That worked exactly the way we wanted, until we accidentally tested the failover.
It took a while before we realized that the dev node had taken over the role of central, but since it wasn't set to Both it wouldn't schedule any tasks.
The environment was working as it should in every other way, so the failover did take over the functionality, but not 100%.
The fail-over can't be set to master, since only one master is allowed in the cluster, so it will have to be set as Both, but then it will receive tasks unless rules are involved. I'd prefer if this would work without rules.
What size of machine do you use for the central and the fail-over?
One option for us is to split the current central into two machines. But since the fail-over WILL have to take tasks, it can't be too small.
I wish there was more information/best practice regarding this topic. It might also be me not finding it or understanding it properly.
The details can be found in isolation, but the devil is in the details AND in the full picture when all the details have to work together. Getting battle scars is important, but certain types of wounds are costly and sometimes hard to heal.
This IS a complex topic and it would be nice to have more information & recommendations going into the project instead.
/lars
Hi @Skage ,
The fail-over can't be set to master, since only one master is allowed in the cluster, so it will have to be set as Both, but then it will receive tasks unless rules are involved. I'd prefer if this would work without rules.
I opted for isolating central entirely because of that. By that, I mean disassociating the Engine from all Virtual Proxies and setting it as Master only. On the secondary master, you can leave it as Master/Slave as long as you uncheck the "Scheduler to do reloads" option in the Nodes section of the QMC.
What size of machine do you use for the central and the fail-over?
I usually pick 32 GB RAM and 4 CPU servers for these roles. I know they can be smaller, and I would make them smaller for other tiers (TEST/QA environments).
One option for us is to split the current central into two machines. But since the fail-over WILL have to take tasks, it can't be too small.
Ideally, you would then create the fail-over node with the "Scheduler to do reloads" option unchecked (like below). That way it should respect the settings you configured during node creation.
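To verify the "isolated central" part, I use something like the sketch below: it lists which engine nodes each virtual proxy load balances to, so you can confirm the central node is not attached to any of them. Untested; the virtualproxyconfig endpoint and loadBalancingServerNodes property are my recollection of the QRS entity, and the host/cert/account/node names are placeholders:

```python
# Untested sketch: show which engine nodes each virtual proxy load balances to,
# to confirm the central node's Engine is not attached to any of them.
import requests

XRF = "abcdefghijklmnop"
QRS = "https://central-node:4242/qrs"       # placeholder central host
KW = dict(
    headers={"X-Qlik-Xrfkey": XRF,
             "X-Qlik-User": "UserDirectory=INTERNAL; UserId=sa_api"},
    cert=("client.pem", "client_key.pem"),
    verify=False,
)

CENTRAL = "central-node"                    # placeholder name of the central node

for vp in requests.get(f"{QRS}/virtualproxyconfig/full",
                       params={"xrfkey": XRF}, **KW).json():
    nodes = [n["name"] for n in vp.get("loadBalancingServerNodes", [])]
    flag = "  <-- central still attached!" if CENTRAL in nodes else ""
    prefix = vp["prefix"] or "(default)"
    print(f"{prefix:15s} {', '.join(nodes) or '(none)'}{flag}")
```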
I've implemented redundant environments similar to yours. The pain you are facing is familiar!
The way you mentioned scars reminds me of an event where my fail-over was triggered and some of the new features were not available after the secondary node became the central. For example, all the Qlik Cloud distributions failed with messages regarding no node available to distribute.
Let me ask you one question: is any down-time or maintenance window allowed?
I've learned that protecting my central server role by "muting" its Engine has proven to be extremely reliable. Then, I have mechanisms to re-provision the Central server in the event of a failure, especially if it is virtual. In this scenario, I do not have a fail-over server set up for the Central.
Live and Breathe Qlik & AWS.
Follow me on my LinkedIn | Know IPC Global at ipc-global.com
So many great tips to include in my communication with the customer!
I'd like to see more best-practice, hands-on, battle-scarred, real-world content regarding this from Qlik. The detailed "why" and "how", and the reasoning surrounding the decisions made, would boost everyone's confidence in many situations.
Learning by doing, failing and learning is an option, but having a path, probably with options depending on requirements, with clear information would be really useful.
Very few have the luxury of a cluster they can experiment and train on, setting up and tearing down complex environments, including external load balancers and federated authentication.
The documentation contains the details, but not how things really fit together or affect each other, or the examples are too shallow.
We have not even touched on many topics like unbalanced clusters, app-pinning, log handling, upgrading, migrating, monitoring and many other things that can be small or large, simple or complex. The list goes on...
/lars
@Skage ,
Agree with you. I'm more than happy to continue debating new items if you have them.
Have a great weekend.
Live and Breathe Qlik & AWS.
Follow me on my LinkedIn | Know IPC Global at ipc-global.com