Hi all!
I'm looking for input regarding a fairly large Sense client-managed environment.
I'm helping a customer with an 8-node cluster. It's quite a back-end-heavy setup, so there is a lot of data being handled.
The cluster consists of 4 reload nodes, 2 front-end nodes, 1 development node and 1 main/central node. One fileshare for the cluster root and one fileshare for QVDs and other files. External PostgreSQL. Main authentication via Entra, and TLS certificates handled in an external load balancer/reverse proxy.
The development node is a fallback central. The reload nodes do not have a running proxy service. The two front-end nodes sit behind an external load balancer and also share each other's engines. The main and development nodes have separate proxies and only use their own engines.
One of the specifics I'm looking for is how to set up the fallback node so it can take over for the main node if that node goes down. We don't want the development node to run any tasks, but we do want it to schedule tasks when it is the current central node.
Has anyone figured out a setup that supports this? We've got an 'almost OK' setup now where the development node is set to "Both" and only allows 1 task to run.
The next challenge we have is improving the task load balancing. Tasks and task chains seem to mainly favor backend-1 and backend-2. Some task chains are long, mostly for good reasons, and I believe that the load balancing decision is made at the start of the chain.
Has anyone figured out a good way to handle tasks to better utilize the capacity?
The environment is functional but things can always improve.
Any other tips or input on configuring, managing and improving an environment like this is hugely appreciated.
TIA!
/lars
Hm... the smell of a well-written question... so good! LOL
The overall environment is very interesting.
If you want to reduce the hits on a particular server, try reducing the max number of concurrent tasks it can run. The Qlik Scheduler algorithm does an excellent job of taking into consideration the percentage of concurrent task slots in use and the overall RAM use. If RAM is under control on one server and there is still availability to run more tasks on that server, Qlik will, most of the time, opt to continue sending traffic to that server.
So, reducing the max number of concurrent tasks should help.
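If it helps, this is roughly how I'd script that change instead of clicking through the QMC. It's an untested sketch against the QRS API using certificate access on port 4242; the host name, certificate paths, the sa_api account and the maxConcurrentEngines property are placeholders/from memory, so verify them against the QRS API docs for your version:

```python
# Untested sketch: lower the max concurrent reloads on one scheduler node via the QRS API.
# Assumes certificate access to the repository on port 4242 (certs exported from the QMC).
import requests

XRF = "abcdefghijklmnop"                    # any 16-character value, sent as header + query param
QRS = "https://central-node:4242/qrs"       # placeholder central host
KW = dict(
    headers={"X-Qlik-Xrfkey": XRF,
             "X-Qlik-User": "UserDirectory=INTERNAL; UserId=sa_api"},
    cert=("client.pem", "client_key.pem"),  # exported client certificate + key
    verify=False,                           # or the path to root.pem
)

TARGET_HOST = "backend-1"                   # node whose scheduler you want to throttle
NEW_MAX = 2                                 # down from the default of 4

for svc in requests.get(f"{QRS}/schedulerservice/full",
                        params={"xrfkey": XRF}, **KW).json():
    if svc["serverNodeConfiguration"]["hostName"].startswith(TARGET_HOST):
        svc["settings"]["maxConcurrentEngines"] = NEW_MAX
        resp = requests.put(f"{QRS}/schedulerservice/{svc['id']}",
                            params={"xrfkey": XRF}, json=svc, **KW)
        print(TARGET_HOST, "->", resp.status_code)
```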
Live and Breathe Qlik & AWS.
Follow me on my LinkedIn | Know IPC Global at ipc-global.com
Skage,
I would enable the Proxy on the Scheduler servers and modify the monitoring_apps REST connections to use "https://localhost". That way, you can protect the Central server and it enhances your availability during the fail-over.
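Something along these lines is what I mean for the monitoring apps. Untested sketch: it rewrites the host part of every monitor_apps_REST* data connection to https://localhost through the QRS API. The dataconnection endpoint and the connectionstring property are from memory, and the host/cert/account values are placeholders:

```python
# Untested sketch: point every monitor_apps_REST* data connection at https://localhost
# so the monitoring apps reload through the proxy on whichever node is currently central.
import re
import requests

XRF = "abcdefghijklmnop"
QRS = "https://central-node:4242/qrs"       # placeholder central host
KW = dict(
    headers={"X-Qlik-Xrfkey": XRF,
             "X-Qlik-User": "UserDirectory=INTERNAL; UserId=sa_api"},
    cert=("client.pem", "client_key.pem"),
    verify=False,
)

for dc in requests.get(f"{QRS}/dataconnection/full",
                       params={"xrfkey": XRF}, **KW).json():
    if not dc["name"].startswith("monitor_apps_REST"):
        continue
    # Swap whatever host the connection string points at for localhost, keep the rest intact.
    new = re.sub(r"https://[^/;:]+", "https://localhost", dc["connectionstring"], count=1)
    if new != dc["connectionstring"]:
        dc["connectionstring"] = new
        resp = requests.put(f"{QRS}/dataconnection/{dc['id']}",
                            params={"xrfkey": XRF}, json=dc, **KW)
        print(dc["name"], "->", resp.status_code)
```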
I'm putting topics on different responses so we can discuss each one individually.
Live and Breathe Qlik & AWS.
Follow me on my LinkedIn | Know IPC Global at ipc-global.com
Skage,
Regarding the fail-over node, I would opt for a dedicated server, a small box used only in the event of a failure. This server would not use its Engine for anything; it would just perform the Central server role.
On an additional note, I would recommend leaving the Main/Central as Scheduler Master only. That way, when it fails over to the other node, it won't run tasks there either, and you protect the functionality of the Central server node.
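A quick way to audit this after a fail-over is to list every scheduler's role straight from the QRS API. Untested sketch; the schedulerServiceType value is a numeric enum that you'd map against the QRS docs, and the host/cert/account values are placeholders:

```python
# Untested sketch: list each node's scheduler role and concurrency cap from the QRS API,
# handy for auditing the cluster right after a fail-over.
import requests

XRF = "abcdefghijklmnop"
QRS = "https://central-node:4242/qrs"       # placeholder central host
KW = dict(
    headers={"X-Qlik-Xrfkey": XRF,
             "X-Qlik-User": "UserDirectory=INTERNAL; UserId=sa_api"},
    cert=("client.pem", "client_key.pem"),
    verify=False,
)

for svc in requests.get(f"{QRS}/schedulerservice/full",
                        params={"xrfkey": XRF}, **KW).json():
    node = svc["serverNodeConfiguration"]["hostName"]
    settings = svc["settings"]
    # schedulerServiceType is a numeric enum (master / slave / both); map it against the QRS docs.
    print(f"{node:30s} type={settings['schedulerServiceType']} "
          f"maxConcurrentEngines={settings['maxConcurrentEngines']}")
```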
I've created a diagram real-quick to help visualize the concept.
Live and Breathe Qlik & AWS.
Follow me on my LinkedIn | Know IPC Global at ipc-global.com
Thank you.
I'll take some time to investigate when tasks and task chains are executed, take a look at concurrency, and see how that correlates with the max-task setting.
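My rough plan is something like the sketch below: pull recent reload execution results from the QRS API and count them per executing node, then compare that against each scheduler's max-concurrent setting. Untested, and the executionresult endpoint and executingNodeName field are my recollection of the QRS entity, so I'll verify before running anything:

```python
# Untested sketch: count recent reload executions per node to see how skewed scheduling really is.
from collections import Counter

import requests

XRF = "abcdefghijklmnop"
QRS = "https://central-node:4242/qrs"       # placeholder central host
KW = dict(
    headers={"X-Qlik-Xrfkey": XRF,
             "X-Qlik-User": "UserDirectory=INTERNAL; UserId=sa_api"},
    cert=("client.pem", "client_key.pem"),
    verify=False,
)

# Only look at executions started after this date (adjust as needed).
flt = "startTime gt '2024-01-01T00:00:00.000Z'"
results = requests.get(f"{QRS}/executionresult/full",
                       params={"xrfkey": XRF, "filter": flt}, **KW).json()

per_node = Counter(r.get("executingNodeName") or "unknown" for r in results)
for node, count in per_node.most_common():
    print(f"{node:30s} {count}")
```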
/lars
I'll check on this next week. I remember having to do something to get the monitoring apps to work properly but I'll have to revisit this.
Good advice!
/lars
Great information and it sparks some ideas.
I think I'll get some resistance if we want even more machines, so I was hoping to get this working with the 8 we've got.
The thing that bit us was the fact that nothing gets scheduled unless the fail-over node is set to Both, scheduler master and worker.
We initially set the development node to Development mode and worker. That worked exactly the way we wanted, until we accidentally tested the failover.
It took a while before we realized that the dev node had taken over the role of central, but since it wasn't set to Both it wouldn't schedule any tasks.
The environment was working as it should in every other way, so the failover did take over the functionality, but not 100%.
The fail-over can't be set to master, since only one master is allowed in the cluster, so it will have to be set as Both, but then it will receive tasks unless rules are involved. I'd prefer if this would work without rules.
What size of machine do you use for the central and the fail-over?
One option for us is to split the current central into two machines. But since the fail-over WILL have to take tasks, it can't be too small.
I wish there was more information/best practice regarding this topic. It might also be me not finding it or understanding it properly.
The details can be found in isolation, but the devil is in the details AND in the full picture when all the details have to work together. Getting battle scars is important, but certain types of wounds are costly and sometimes hard to heal.
This IS a complex topic and it would be nice to have more information & recommendations going into the project instead.
/lars
Hi @Skage ,
The fail-over can't be set to master, since only one master is allowed in the cluster, so it will have to be set as Both, but then it will receive tasks unless rules are involved. I'd prefer if this would work without rules.
I opted for isolating central entirely because of that. By that, I mean disassociating the Engine from all Virtual Proxies and setting it as Master only. On the secondary master, you can leave it as Master/Slave as long as you uncheck the "Scheduler to do reloads" option in the Nodes section of the QMC.
What size of machine do you use for the central and the fail-over?
I usually pick 32 GB RAM and 4 CPU servers for these roles. I know they can be smaller, and I would make them smaller for other tiers (TEST/QA environments).
One option for us is to split the current central into two machines. But since the fail-over WILL have to take tasks, it can't be too small.
Ideally, you would then create the fail-over node with the "Scheduler to do reloads" option unchecked (like below). That way it should respect the settings you configured during node creation.
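To verify the "isolated central" part, I use something like the sketch below: it lists which engine nodes each virtual proxy load balances to, so you can confirm the central node is not attached to any of them. Untested; the virtualproxyconfig endpoint and loadBalancingServerNodes property are my recollection of the QRS entity, and the host/cert/account/node names are placeholders:

```python
# Untested sketch: show which engine nodes each virtual proxy load balances to,
# to confirm the central node's Engine is not attached to any of them.
import requests

XRF = "abcdefghijklmnop"
QRS = "https://central-node:4242/qrs"       # placeholder central host
KW = dict(
    headers={"X-Qlik-Xrfkey": XRF,
             "X-Qlik-User": "UserDirectory=INTERNAL; UserId=sa_api"},
    cert=("client.pem", "client_key.pem"),
    verify=False,
)

CENTRAL = "central-node"                    # placeholder name of the central node

for vp in requests.get(f"{QRS}/virtualproxyconfig/full",
                       params={"xrfkey": XRF}, **KW).json():
    nodes = [n["name"] for n in vp.get("loadBalancingServerNodes", [])]
    flag = "  <-- central still attached!" if CENTRAL in nodes else ""
    prefix = vp["prefix"] or "(default)"
    print(f"{prefix:15s} {', '.join(nodes) or '(none)'}{flag}")
```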
I've implemented redundant environments similar to yours. The pain you are facing is familiar!
The way you mentioned scars reminds me of an event where my fail-over was triggered and some of the new features were not available after the secondary node became the central. For example, all the Qlik Cloud distributions failed with messages regarding no node available to distribute.
Let me ask you one question: is any down-time or maintenance window allowed?
I've learned that protecting my central server role by "muting" its Engine has proven to be extremely reliable. Then, I have mechanisms to re-provision the Central server in the event of a failure, especially if it is virtual. In this scenario, I do not have a fail-over server set up for the Central.
Live and Breathe Qlik & AWS.
Follow me on my LinkedIn | Know IPC Global at ipc-global.com
So many great tips to include in my communication with the customer!
I'd like to see more best-practice, hands-on, battle-scarred, real-world content regarding this from Qlik. The detailed "why" and "how", and the reasoning surrounding the decisions made, would boost everyone's confidence in many situations.
Learning by doing, failing and learning is an option, but having a path, probably with options depending on requirements, with clear information would be really useful.
Very few have the luxury of a cluster they can experiment and train on, setting up and tearing down complex environments, including external load balancers and federated authentication.
The documentation contains the details, but not how things really fit together or affect each other, or the examples are too shallow.
We have not even touched on many topics like unbalanced clusters, app-pinning, log handling, upgrading, migrating, monitoring and many other things that can be small or large, simple or complex. The list goes on...
/lars
@Skage ,
Agree with you. I'm more than happy to continue debating new items if you have them.
Have a great weekend.
Live and Breathe Qlik & AWS.
Follow me on my LinkedIn | Know IPC Global at ipc-global.com