Debugging Qlik Replicate Crashes

Ted_Manka · Feb 2, 2022 5:45:14 AM

A Qlik Replicate task (or service) crashes when it encounters a critical error and the Operating System (OS) steps in and immediately terminates it. Because no cleanup is done, the Replicate log file (task or service) ends abruptly without the usual closing message, e.g.,

00001650: 2019-01-14T07:42:40 [AT_GLOBAL ]I: Closing log file at Mon Jan 14 07:42:40 2019 (at_logger.c:2436)

Additionally, the OS will log the crash in its own log file. On Windows, you can view the information from the OS in the Windows Event Viewer applet (Event ID 1001 on repctl.exe). On Linux, the location and the method for viewing it depend on the system configuration. (If necessary, ask the Linux Admin.)

The basic idea of what needs to be done after a Replicate crash is the same whether Replicate is running on Windows or Linux. You need to get a crash dump file from the OS and match it with a Replicate log file. Ideally, you want a log file in which the logger module where the crash occurs is set to Verbose, but this may not be possible. Replicate core/crash dump files can be rather large, 6 GB or larger. Make sure the disk/partition where the file(s) will be written has enough free space. In general, it is good practice NOT to use the system disk, because if it fills up the machine will become unstable.
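On a Linux host, the free-space check described above can be scripted. The following is a minimal sketch, assuming the dumps will be written under /tmp (the directory used in this article's core_pattern example) and taking the 6 GB figure as the threshold. has_free_space is a hypothetical helper name, not part of Replicate.

```shell
# Hypothetical helper: succeeds when the filesystem holding DIR has at
# least REQUIRED_KB kilobytes available.
has_free_space() {
    dir=$1
    required_kb=$2
    # df -P prints one header line, then one line per filesystem;
    # with -k, column 4 is the available space in 1 KB blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$required_kb" ]
}

# 6 GB expressed in KB (6 * 1024 * 1024). /tmp is an assumed dump location.
if has_free_space /tmp 6291456; then
    echo "/tmp has room for a 6 GB dump"
else
    echo "/tmp has less than 6 GB free; choose another partition"
fi
```

Adjust the directory and threshold to match the partition that will actually receive the dump files.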

Windows

Capturing:

Microsoft DebugDiag2 needs to be installed on the Windows Replicate Server machine to get a Windows crash dump file. As of Fall 2019 the latest version is v3. Download DebugDiag2 from Microsoft directly.
Once installed, run the Collection Tool as a user with Administrator permissions on the machine.
Launch the program from: Start menu > Debug Diagnostics Tool 2 > DebugDiag 2
Create a crash rule. You can use all the defaults except you will need to configure the crash rule for all repctl.exe processes.
The default crash dump folder location is "%SystemDrive%\Program Files\DebugDiag\Logs\Crash rule for all instances of repctl.exe".
This will create crash dumps for the first 10 repctl.exe crashes.

Note: A crash dump file may take 5 GB or more. If you do not have much free disk space on the System Drive you should change the folder to one on another drive that does have enough free disk space.

When Replicate crashes there will be a DMP file in the crash dump folder. The PID of the process will be part of the file name. Find the Replicate task log whose first line contains the same PID. It will look like "PID: #)" where # is the PID in the crash dump file name.

Example:

00005408: 2020-03-06T13:55:36 [AT_GLOBAL ]I: Task Server Log - SQL_Test (V6.5.0.423 USREM-LOV.qliktech.com Microsoft Windows 8 Enterprise Edition (build 9200) 64-bit, PID: 6812) started at Fri Mar 06 13:55:36 2020 (at_logger.c:2654)

Reading:

Crash dumps can be analyzed either on the Replicate Server machine or on any Windows PC that has DebugDiag2 installed.
Launch the Analysis Tool from Start > Debug Diagnostics Tool 2 > DebugDiag 2 Analysis.
Click on:
1. The first Analysis Rule, CrashHangAnalysis;
2. Add Data Files icon to load the crash dump file; and
3. Start Analysis.
The Analysis Tool will display the results in the Edge browser. The first section, Analysis Summary, will contain a link to the thread that crashed.
Click the link and note the System ID #. The System ID is the first word of each message in the Replicate task log.
Go to the end of the Replicate log file and search backwards for that System ID. The logger for that System ID will be the third word in that message (e.g., TARGET_LOAD).
If possible, recreate the problem with that particular logger set to the Verbose level. I.e., if the thread that is crashing has the TARGET_LOAD logger, then set the TARGET_LOAD logger to Verbose.
If you get a new crash dump attach both the latest crash dump and a corresponding log file (with the logger set to Verbose) to the issue for R&D to analyze. If there is no new crash dump file attach the original (only) crash dump file and 2 Replicate log files (one corresponding to crash dump and new one with logger set to Verbose) to the issue for R&D to analyze.

Linux

Capturing:

No extra software is needed to create and analyze core dump files on Linux; however, you do need to ensure that core dump file creation is enabled and to know where the files are created.
To ensure that core dump files are enabled you need to run the command:
```
ulimit -c
```
If the returned value is NOT unlimited you can change it with the command:
```
ulimit -c unlimited
```
Verify that the value has changed by running the first command again. Note: Linux has both a soft limit and a hard limit for this value. root can always raise either one; a non-root user can raise the soft limit only as far as the hard limit allows, so setting it to unlimited works only if the hard limit is already unlimited. A change made with ulimit applies only to the current shell session and to processes started from it.
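Because the soft and hard limits can differ, it can help to print both before trying to change anything. A minimal sketch, using the -S and -H options of the shell's ulimit builtin; core_limits is a hypothetical helper name:

```shell
# Hypothetical helper: print the soft and hard core-file size limits
# that apply to the current shell session.
core_limits() {
    soft=$(ulimit -S -c)   # limit currently in effect
    hard=$(ulimit -H -c)   # ceiling a non-root user cannot exceed
    printf 'soft=%s hard=%s\n' "$soft" "$hard"
}

core_limits
```

If the hard limit is reported as 0 or some other finite value, a non-root user will not be able to set the soft limit to unlimited in that session.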
To see where core dump files are created run the command:
```
sysctl kernel.core_pattern
```
If you need to change the value use the -w option. Note: While any user can run sysctl to see the value of a variable, only root can change variable values with the -w option. For example:
```
sysctl -w kernel.core_pattern=/tmp/%e_%p.dmp
```
If the returned value starts with the pipe symbol (|), e.g., "|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t %e %P %I" (without quotes), then the first word after the pipe symbol is a program that processes the core dump file.
Check the manual page for that program to see whether, where, and how it creates core dump files. Otherwise, the returned value is the template for the name and location of the core dump file.
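The two cases above (piped to a program versus a plain file template) can be told apart mechanically. A minimal sketch, assuming the value is readable from /proc/sys/kernel/core_pattern, which is where sysctl reads it from; classify_core_pattern is a hypothetical helper name:

```shell
# Hypothetical helper: report whether a core_pattern value pipes the dump
# to a program or is a file-name template.
classify_core_pattern() {
    pattern=$1
    case $pattern in
        '|'*)
            cmd=${pattern#'|'}   # drop the leading pipe symbol
            cmd=${cmd%% *}       # keep only the first word (the program)
            printf 'pipe:%s\n' "$cmd"
            ;;
        *)
            printf 'file:%s\n' "$pattern"
            ;;
    esac
}

# Classify the live setting when the proc file is available.
if [ -r /proc/sys/kernel/core_pattern ]; then
    classify_core_pattern "$(cat /proc/sys/kernel/core_pattern)"
fi
```

A "pipe:" result means the named program decides where the dump ends up; a "file:" result means the kernel writes the dump itself using the template.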
The template can contain % specifiers which are substituted by the following values when a core file is created:
- %% - a single % character
- %p - PID of dumped process
- %u - (numeric) real UID of dumped process
- %g - (numeric) real GID of dumped process
- %s - number of signal causing dump
- %t - time of dump, expressed as seconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC)
- %h - hostname (same as nodename returned by uname(2))
- %e - executable filename (without path prefix)
- %E - pathname of executable, with slashes ('/') replaced by exclamation marks ('!').
- %c - core file size soft resource limit of crashing process (since Linux 2.6.24)
A single % at the end of the template is dropped from the core filename, as is the combination of a % followed by any character other than those listed above. All other characters in the template become a literal part of the core filename. The template may include '/' characters, which are interpreted as delimiters for directory names. The maximum size of the resulting core filename is 128 bytes (64 bytes in kernels before 2.6.19). The default value of kernel.core_pattern is "core". For backward compatibility, if the value does not include "%p" and the kernel.core_uses_pid sysctl variable is nonzero, then .PID will be appended to the core filename.
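To preview what file name a template will produce, the substitution rules can be imitated in the shell. This is only an illustrative sketch, not the kernel's implementation: it handles just %%, %e, %p and %t, and it drops every other specifier, which matches the kernel only for specifiers that are genuinely unknown. expand_core_pattern is a hypothetical helper name.

```shell
# Hypothetical helper: expand a subset of core_pattern specifiers.
# Usage: expand_core_pattern TEMPLATE EXE_NAME PID EPOCH_SECONDS
expand_core_pattern() {
    template=$1
    exe=$2
    pid=$3
    epoch=$4
    out=
    while [ -n "$template" ]; do
        rest=${template#?}        # template without its first character
        ch=${template%"$rest"}    # the first character
        template=$rest
        if [ "$ch" != "%" ]; then
            out=$out$ch           # ordinary characters are copied literally
            continue
        fi
        # A single % at the end of the template is dropped.
        [ -n "$template" ] || break
        rest=${template#?}
        spec=${template%"$rest"}  # the character following the %
        template=$rest
        case $spec in
            %) out=$out% ;;
            e) out=$out$exe ;;
            p) out=$out$pid ;;
            t) out=$out$epoch ;;
            *) : ;;               # not modelled in this sketch: dropped
        esac
    done
    printf '%s\n' "$out"
}

expand_core_pattern '/tmp/%e_%p.dmp' repctl 6812 0
# prints /tmp/repctl_6812.dmp
```

The executable name and PID passed in are example values taken from this article; substitute the ones relevant to your system.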

Since version 2.4, Linux has also provided a more primitive method of controlling the name of the core dump file. If the kernel.core_uses_pid sysctl variable contains the value 0, then a core dump file is simply named core. If it contains a nonzero value, then the core dump file includes the process ID in a name of the form core.PID.

Since Linux 3.6, if the sysctl fs.suid_dumpable variable is set to 2 ("suid‐safe"), the pattern must be either an absolute pathname (starting with a leading '/' character) or a pipe.

If possible, you will want to have the PID in the core dump file name, since that is critical in finding the right Replicate log file. Use the -w option of the sysctl command to change the value of any of the previously mentioned variables.
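Whether the PID will end up in the file name depends on both kernel.core_pattern and kernel.core_uses_pid, so it is worth checking the two together. A minimal sketch; core_name_has_pid is a hypothetical helper name, and the return code 2 for piped patterns is a convention chosen here for illustration:

```shell
# Hypothetical helper.
# Returns 0 if the core file name will contain the PID,
#         1 if it will not,
#         2 if the pattern is piped to a program (naming is up to that program).
core_name_has_pid() {
    pattern=$1
    uses_pid=$2
    case $pattern in
        '|'*) return 2 ;;
    esac
    # Remove literal "%%" pairs so that "%%p" is not mistaken for "%p".
    stripped=$(printf '%s' "$pattern" | sed 's/%%//g')
    case $stripped in
        *%p*) return 0 ;;
    esac
    # Without %p, the kernel appends .PID only when core_uses_pid is nonzero.
    [ "$uses_pid" != 0 ]
}

# Check the live settings when the proc files are available.
if [ -r /proc/sys/kernel/core_pattern ] && [ -r /proc/sys/kernel/core_uses_pid ]; then
    if core_name_has_pid "$(cat /proc/sys/kernel/core_pattern)" \
                         "$(cat /proc/sys/kernel/core_uses_pid)"; then
        echo "core file names will include the PID"
    else
        echo "core file names may not include the PID; review the settings"
    fi
fi
```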

Reading:

To analyze a core file you will need the gdb program. Since Replicate is not compiled and linked with debug enabled, the only thing you'll be able to get from gdb is a stack backtrace with the bt command. If you can get one, great. The backtrace lists the functions that were on the call stack at the time of the crash. The names may give Customer Support (or R&D) an idea of which logger to set to Verbose when reproducing the problem to get a Replicate log file with more information. If you cannot, that's fine. Just attach the core dump files you have along with the corresponding Replicate log files. (You get the Replicate log files in Linux the same way as in Windows: by matching the PID in the first line of the Replicate log file with the PID in the core dump file name.)
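Both steps, finding the matching log and pulling a backtrace, can be scripted. The following is a minimal sketch under several assumptions: the task logs are *.log files in a single directory, the first line of each contains the "PID: #)" text shown in the Windows example, and the paths to the repctl binary and the core file are supplied by you. find_logs_for_pid and core_backtrace are hypothetical helper names; -batch and -ex are standard gdb options.

```shell
# Hypothetical helper: print every *.log file in LOG_DIR whose first line
# contains "PID: <pid>)".
find_logs_for_pid() {
    log_dir=$1
    pid=$2
    for f in "$log_dir"/*.log; do
        [ -f "$f" ] || continue
        # -F treats the search text as a fixed string, so ")" is literal.
        if head -n 1 "$f" | grep -qF "PID: $pid)"; then
            printf '%s\n' "$f"
        fi
    done
}

# Hypothetical helper: print the backtrace of the crashing thread.
# Requires gdb; BINARY is the path to the repctl executable that crashed.
core_backtrace() {
    binary=$1
    core=$2
    gdb -batch -ex bt "$binary" "$core"
}
```

For example, with a core file named /tmp/repctl_6812.dmp, `find_logs_for_pid /path/to/replicate/logs 6812` would list the matching task log, where the log directory is a placeholder for your installation's actual log location.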

In addition, please make sure that TRACE ON DEMAND is not turned on when collecting VERBOSE logs for crashes. Make sure that the option "Store trace/verbose logging in memory, but if an error occurs write to the logs" is not enabled.
