I am new to managing QlikView and I am working on an environment that has some sporadic issues. I feel like I am hitting a wall in terms of new ideas to try, so I'm posting this in hopes that somebody might have a few suggestions.
First, a bit of info on our architecture. We run a clustered environment with the following roles, all running QV 11.20 SR 6 on Windows Server 2012:
Server 1: Publisher & QMC
Server 2: QVS, DSC & Access Point
Server 3: QVS
File Server Cluster (2 servers): Presents SAN CIFS share, which is our document root. Path to root is defined as \\<ipaddress>\qvprd.
SAN: Hosts qvprd share as well as other non-QlikView shares.
Our systems appear to be over-sized for current performance needs, so I wouldn't think this is a resource issue. All have 512 GB RAM and 4x 6-core processors at 2.89GHz (24 cores total). I watch these systems closely, and the QVS servers never break a sweat and are almost always below 60% memory allocation. The Publisher server is even lower, with heavy CPU utilization only during our busy reload times (3am-6am). We currently have a maximum of 12 reload engines running concurrently.
All systems run System Center Endpoint Protection and have exclusions for QlikView-related folders and files. And I have confirmed that backups are not running during the times that we experience problems.
Two or three times a week, we will encounter an error that affects QVS on Servers 2 & 3. Most of the time the errors we see start off on one of the two servers with the following:
- CQvXmlInterfaceRequestHandler - Catch: Threw an error...
This is accompanied by the "No Server" message in Access Point, and our only recourse is to restart QVS on that system, which resolves the issue immediately. Then, within a few minutes, we usually see either the same error message on the other QVS server, or we see an inconsistency (usually type D, but sometimes type F):
- Restart: Server aborted trying to recover by restart. Reason for restart: Internal inconsistency, type F, detected.
- qvpx: Exception while handling request
This usually seems to happen between 8:00 and 10:00 AM. Most of our reloads finish up by 6:00 AM, so there's not a lot of action on these systems outside of the QVS services. At any given time, we usually have between 20 and 35 concurrent users spread across the two systems.
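One way to test that time-of-day pattern is to tally the error signatures quoted above by hour across the QVS logs. Below is a minimal Python sketch, not a QlikView tool. The function name `tally_errors_by_hour` and the timestamp pattern are assumptions for illustration, so adjust the regular expression to whatever your log lines actually contain.

```python
import re
from collections import Counter

# Error signatures quoted in this post; extend the tuple as needed.
SIGNATURES = (
    "CQvXmlInterfaceRequestHandler",
    "Internal inconsistency",
    "Exception while handling request",
    "PGO: Failed to open",
)

# Assumption: each log line carries a timestamp that looks like
# "2014-05-15 08:05:12" or "20140515T080512". Adjust for your logs.
TIMESTAMP = re.compile(
    r"(\d{4})-?(\d{2})-?(\d{2})[ T](\d{2}):?(\d{2}):?(\d{2})"
)


def tally_errors_by_hour(lines):
    """Count lines matching a known signature, keyed by (hour, signature)."""
    counts = Counter()
    for line in lines:
        stamp = TIMESTAMP.search(line)
        if stamp is None:
            continue  # skip lines without a recognizable timestamp
        hour = int(stamp.group(4))
        for signature in SIGNATURES:
            if signature in line:
                counts[(hour, signature)] += 1
    return counts
```

Passing it an open log file (opened with `errors="replace"` so odd bytes do not stop the scan) and printing the result sorted by hour would show whether the errors really cluster between 8:00 and 10:00 AM or are spread more widely.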
To make things more confusing, we get PGO errors every week or two. We'll have extension-less 0KB files named things like 'CalID' appear in the root folder and C:\ProgramData\QlikTech\QlikViewServer\. The error messages in the logs usually have invalid characters in the path, like the message below. These are always accompanied by a QVS crash or restart:
- PGO: Failed to open C:\P\CalD鮵ꁑ. Error : C:\P\CalD鮵ꁑ contains an incorrect path.. Time: 0 ms
So far, the way we have dealt with those problems is to take a QlikView outage, clean up the PGO files, and restart services. That usually sets us straight for another week or two (max).
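To catch the stray files before they pile up, a small script could list the 0KB extension-less files in a folder. This is only a sketch under my own assumptions: the helper name `find_empty_extensionless_files` is made up, it only looks at the top level of the folder, and it reports files without deleting anything.

```python
from pathlib import Path


def find_empty_extensionless_files(folder):
    """Return sorted paths of 0-byte files with no extension directly in folder."""
    return sorted(
        entry
        for entry in Path(folder).iterdir()
        # Regular file, no suffix such as ".qvw" or ".pgo", and zero bytes long.
        if entry.is_file() and entry.suffix == "" and entry.stat().st_size == 0
    )
```

Running it on a schedule against the document root and C:\ProgramData\QlikTech\QlikViewServer\ would at least give a timestamped record of when files like 'CalID' first appear, which could then be compared against the crash times.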
I will attach some logs from both server 1 and 2 from this morning, in which we had the exact issue described above. I'm hoping that somebody may have some insight into this problem or suggestions on what to look at. I'm running out of ideas, and I'm really struggling to correlate this to any one thing. There are some weeks where it'll happen 3 times, and some weeks where it happens maybe once. And other than these random crashes, our system runs like a champ. It just has this annoying hiccup.
Anybody have any ideas? I'm all ears.