Solved: Engine servers are going offline randomly - Page 2 - Qlik Community

Facundo · 2022-07-12

TL;DR: Our engine servers are going offline randomly, we don't know why and we've tried everything.

Hi, I'm posting this because we have exhausted every troubleshooting option we could think of, and we are looking for any additional help we can get.
We have been having connection issues on the engine servers for almost a year. We have a multi-node architecture of 5 servers: 1 proxy (virtual), 1 master, and 3 engines which are also configured as schedulers, since we hardly ever reload applications during working hours and needed the hardware.

We had Qlik load balancing configured, but we deliberately unbalanced it for more insight and stability (I'll go deeper into this later).

The connection issue is as follows, and can happen on any engine at any given time:
1 - In the QMC, the node appears offline.
2 - The qlik services on the server are up and running.
3 - On the hub, all applications on that engine stop responding or disappear (this is because of the unbalancing we do).

After this happens, we may have several outcomes:
1 - It comes online again and everything keeps working fine (if it comes back, it does so within five minutes of going down; if more than five minutes have passed, we know it's not coming back).
2 - It comes online again and eventually it goes down again.
3 - It doesn't come online again until we restart the qlik engine service on that server.
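The five-minute rule from outcome 1 can be written down as a small watchdog decision. This is only a sketch: `GRACE_PERIOD` and `should_restart_engine` are hypothetical names, the five minutes come from our own observation rather than from any Qlik documentation, and how the offline time is detected is left out.

```python
from datetime import datetime, timedelta

# Assumption: nodes that recover on their own do so within five minutes
# (an observation from this thread, not a documented Qlik behaviour).
GRACE_PERIOD = timedelta(minutes=5)


def should_restart_engine(offline_since: datetime, now: datetime,
                          grace: timedelta = GRACE_PERIOD) -> bool:
    """Return True once the node has been offline longer than the grace period."""
    return now - offline_since > grace


went_offline = datetime(2022, 7, 12, 10, 0)
print(should_restart_engine(went_offline, datetime(2022, 7, 12, 10, 3)))
print(should_restart_engine(went_offline, datetime(2022, 7, 12, 10, 6)))
```

The first call is still inside the grace period, the second is past it, so only the second would trigger a manual restart of the engine service.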

At what time of day does it happen?
- It can happen during the busiest hour, when almost no one is using it, and sometimes during the night, before, after or during the scheduled reloads.

What is happening with the resources on the server at the moment it goes down?
1 - Sometimes the CPU had been at 100% for a long time and the RAM was at 95%, so it is understandable that it goes down.
2 - Other times the CPU was at 1% doing nothing and the RAM was flat at 70%, which is the "min memory usage" we've configured in the QMC. This is the case we don't understand.
3 - Networking-wise, we have never found anything relevant. We have a 1GB network, and all the servers are in the same rack (yes, they are physical, only the proxy server is virtual).
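To keep the two cases apart when going through monitoring data, each outage can be labelled from the CPU and RAM readings taken when the node went offline. A minimal sketch, assuming a 90% cut-off for both metrics; the function name and the thresholds are hypothetical and not Qlik settings.

```python
def classify_outage(cpu_pct: float, ram_pct: float,
                    cpu_peak: float = 90.0, ram_peak: float = 90.0) -> str:
    """Label an outage by the resource readings taken when the node went offline.

    The 90% cut-offs are illustrative assumptions, not values from Qlik.
    """
    if cpu_pct >= cpu_peak or ram_pct >= ram_peak:
        return "resource-peak"
    return "idle"


# The two situations described above.
print(classify_outage(cpu_pct=100, ram_pct=95))
print(classify_outage(cpu_pct=1, ram_pct=70))
```

Counting how many outages fall into each label would show how much of the problem is plain performance and how much is the unexplained idle case.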

What about the event viewer logs?
1 - When the resources are at peak, we receive one or several "fiber loop stall detected" errors, which we understand are caused by performance issues, and we're working on optimizing the apps and trying to get more hardware.
2 - Otherwise, nothing relevant is logged: only information-level entries and, eventually, the engine error produced when we stop the service to restart it.

Why did you go for an unbalanced approach?
We did this because when we let Qlik balance the load, all servers eventually went down and we never knew which application was causing the problem (we thought one or more applications were responsible). Now we've customized the security rules and split the applications across the 3 engines: one with the most demanding and most consulted ones, another with the middle ones, and the last one with the smallest and least consulted ones. That way, when one server goes down, we can focus on the applications on that server and try to figure out what is happening (spoiler alert: we still don't know).

What have you tried so far?
We have:
1 - Monitored everything Qlik-, Windows- and networking-wise.
2 - Done a clean install of everything, Windows Server and Qlik, both with the latest updates.
3 - Configured the antivirus, BIOS and hardware following the Qlik recommendations.
4 - Optimized the models, calculations and overall performance of almost all the applications: no linked tables (whenever possible), 1/0 flags instead of yes/no values, and formula conditions resolved in the source as a flag, so the set analysis is just flag = number instead of multiple conditions.
5 - Tried different settings for app cache time (currently 4 hours), hypercube memory and time limit (currently 120s), and min/max memory usage (currently 70/90), with memory usage mode on hard limit and CPU throttle at 90%.
6 - Limited table-like visualizations to show and export only 100k rows (though we never saw a problem in the network/export logs or in the resources).
7 - Tried 2 engines and 1 scheduler instead of 3 engine/schedulers.
8 - Tried different configurations of the page file and virtual memory (currently system managed).
9 - Scheduled a restart of the engine services at night, before the first scheduled task starts.
10 - Still on hold, but we're going to try it: PostgreSQL shared_buffers set to 1/4 of the physical memory (currently the default configuration of 1GB).
11 - I'm probably missing other configurations we tested.

We understand it's logical for the servers to go down when the resources are at peak for a long time, and that's fine, it's a performance issue. What we don't understand is why it also happens when the servers are relaxed, with the resources stable or doing nothing.

So, if you can think of something else we can try, please let me know. We'll test anything at this point.

Hardware details:
Master: ThinkSystem SR250, 32 GB RAM, Xeon E-2134 3.5 GHz, 1 socket, 4 cores, 8 logical processors
Proxy: VMware, 16 GB RAM, Xeon E5-2620 v3 2.4 GHz, 4 sockets, 4 virtual processors
Engine 1 (best): ThinkSystem SR570, 256 GB RAM, Xeon Gold 6130 2.1 GHz, 2 sockets, 32 cores, 64 logical processors
Engine 2 (worst): Sun Fire X4470 M2, 288 GB RAM, Xeon E7-4820 2 GHz, 4 sockets, 32 cores, 64 logical processors
Engine 3 (good): ThinkSystem SR570, 256 GB RAM, Xeon Silver 4110 2.1 GHz, 2 sockets, 16 cores, 32 logical processors
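For reference, the percentage settings mentioned earlier can be translated into absolute sizes for this hardware. This is just arithmetic on the numbers in this post; `working_set_gb` is a hypothetical helper, and the shared_buffers line assumes the repository database runs on the master node, which may not match every deployment.

```python
# RAM sizes in GB, taken from the hardware details above.
ENGINE_RAM_GB = {"Engine 1": 256, "Engine 2": 288, "Engine 3": 256}
MASTER_RAM_GB = 32

# Current QMC settings from this post: min/max memory usage of 70% / 90%.
MIN_PCT, MAX_PCT = 70, 90


def working_set_gb(ram_gb: int, pct: int) -> float:
    """Convert a memory-usage percentage into an absolute size in GB."""
    return ram_gb * pct / 100


for name, ram in ENGINE_RAM_GB.items():
    low = working_set_gb(ram, MIN_PCT)
    high = working_set_gb(ram, MAX_PCT)
    print(f"{name}: min {low} GB, max {high} GB")

# Assumption: the repository database (PostgreSQL) runs on the master node.
shared_buffers_gb = MASTER_RAM_GB // 4
print(f"shared_buffers candidate: {shared_buffers_gb} GB")
```

On Engine 1, for example, RAM sitting flat at 70% corresponds to roughly 179 GB in use, and the 1/4 rule on the 32 GB master would mean 8 GB instead of the current 1GB.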

Thanks in advance!

Anil_Babu_Samineni · 2022-08-27

@Facundo If that is fixed, please close the thread by marking the correct reply as the solution.

Best, Anil. When applicable, please mark the correct/appropriate replies as "solution" (you can mark up to 3 "solutions"). Please LIKE threads if the provided solution is helpful.

Facundo · 2022-09-05

It is "fixed": the servers are still going offline, but at least they have always restarted on their own since the update, so unless we check the "engine is running since" date daily, we might not even notice.
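Checking the "engine is running since" date by hand is easy to forget, so the comparison could be automated. A rough sketch, assuming each check stores the start value as an opaque string (for instance the `started` field of the engine's `/engine/healthcheck` response, which is my assumption about where to read it from); `find_restarts` and the sample data are made up for illustration.

```python
from typing import Iterable, List, Tuple


def find_restarts(samples: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (checked_at, new_started) for every sample whose start value changed.

    Each sample is (checked_at, started); 'started' is treated as an opaque string.
    """
    restarts = []
    previous = None
    for checked_at, started in samples:
        if previous is not None and started != previous:
            restarts.append((checked_at, started))
        previous = started
    return restarts


# Made-up observations for illustration only.
samples = [
    ("2022-09-01", "2022-08-31T23:00"),
    ("2022-09-02", "2022-08-31T23:00"),
    ("2022-09-03", "2022-09-02T14:17"),
]
print(find_restarts(samples))
```

With a scheduled nightly restart in place, the expected restart would have to be filtered out so that only the unplanned ones are reported.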

Albert_Candelario · 2022-09-05

Hello @Facundo ,

Has the upgrade been performed already?

Thanks for sharing.

Cheers,

Albert

Please remember to mark the thread as solved once you get the correct answer.

Facundo · 2022-09-05

Yes, we've been on the May SR4 release since August.
