Qlik Replicate LOGSTREAM Timeout while waiting to get data from audit file

Pedro_Lopez · Feb 1, 2022 7:38:40 AM

In some cases when Replicate can't read from the Staging Folder of a LogStream task for whatever reason (corrupted folder, lack of disk space etc), there could be difficulties to resume the LogStream task after the initial issue is solved.

You might see errors like the below when trying to resume from timestamp:

[UTILITIES ]E: Failed to write to audit file <audit_folder directory>
[UTILITIES ]E: Timeout while waiting to get data from audit file [1002521] (at_audit_file.c:637)

[UTILITIES ]E: Error reading audit batch [1002509] (at_audit_file.c:679)

Environment

Qlik Replicate 2021.11 (not exclusive)
Linux RHEL / Windows

Resolution

Stop all LogStream and Replication tasks
Kill the replicate sessions and process as the audit file is being locked by the process
Rename the audit folder of the problematic audit file (the folder is the one described under the endpoint settings: "Staging Folder" in the Replicate UI)
Resume LogStream task from timestamp (a few hours before the initial error), then resume replication tasks from the same timestamp
If not solved, a reload of the task will be needed

Cause

Usually, the Replicate process (repctl) is locking the audit file that was being written/read while the issue occurred.

joseph_jbh · ‎2022-09-20

Hey @Pedro_Lopez , thanks for the info!
When I tried to follow the steps, the task was smart enough to look into the renamed folder:

00021812: 2022-09-20T14:13:58 [UTILITIES ]T: open audit file K:\Replicate\logstream\DCS_OUTBOUND\lspDCS_OUTBOUND\LOG_STREAM\audit_service\20210205123530927654_beforeCorruption\7012 for write (at_audit_writer.c:506)
00021812: 2022-09-20T14:13:58 [UTILITIES ]T: Reading audit file 'K:\Replicate\logstream\DCS_OUTBOUND\lspDCS_OUTBOUND\LOG_STREAM\audit_service\20210205123530927654_beforeCorruption\7012' with header version '1' (at_audit_file.c:399)

That's after a resume-by-timestamp. Any thoughts?

Sonja_Bauernfeind · ‎2022-09-21

Hello @joseph_jbh

Have you attempted the reload of the task (last step if the resume does not work), rather than only to resume by time stamp?

All the best,
Sonja

joseph_jbh · ‎2022-09-21

Hi Sonia - Thanks for replying. I'm sure a Reload would work, even if I have to clean out the log_stream folder....But I'm following Pedro's tip as a way to avoid that. Some of our log stream parents supply nearly 75 child tasks which would need to be reloaded too....

Perchance, have you guys seen this symptom on non-HA deployments of Replicate? Ours is an HA deployment, using a Windows failover cluster, and shared storage. I'm curious if this is contributing to the problem.

Sonja_Bauernfeind · ‎2022-09-30

Hello Joseph,

At this point, I would recommend sending that query over to our Qlik Replicate forum directly as it would require additional investigation.

All the best,
Sonja

joseph_jbh · ‎2022-09-30

Understood, thanks Sonja.

Kohila · ‎2023-12-26

Team,

Any solutions (other from reloading) found for the Parent task timeout failure that occurred when the audit file's data was being fetched? We encountered the identical issue today, which required us to reload both Log stream and log replicate.

Task errored and in stopped state. No command line shown up in process tab for the failed task to kill the session locking audit file. No luck with advanced timestamp. Please let us know if already figured out the solution or workaround for resuming the task

Thanks,

Kohila

joseph_jbh · ‎2023-12-28

I think the article is in error...A rename of the folder isn't sufficient: I've since discovered that a log stream parent (LSP) opens every subfolder in its root, and examines the contents. The folder would need to be moved out of the root in order to hide it, if that's the goal.

At any rate, resuming by timestamp will start a new LSP timeline. You can then resume the children by SCN=0, or with the same timestamp you started the parent with.

Out of curiosity, what was the root cause of the timeout? Mine was a corrupted file, seemingly related to multiple, quick failovers during patching (we now bring down the Qlik services before patching).

aarun_arasu · ‎2024-01-04

Hi @Kohila ,

Have you tried the below steps as mentioned in this article

1.Stop all LogStream and Replication tasks

2.Kill the replicate sessions and process as the audit file is being locked by the process

3.Rename the audit folder of the problematic audit file (the folder is the one described under the endpoint settings: "Storage path:" in the Replicate UI) --> This normally means you are creating a new stagging folder

4.Resume LogStream task from timestamp (a few hours before the initial error), then resume replication tasks from the same timestamp

If the above steps did not help then a reload is required.

But , there are few cases where these lock gets released after a server reboot , you can consider this option if its feasible.

aarun_arasu · ‎2024-01-04

Hello @joseph_jbh ,

The timeout may have occurred due to the replicating process (repctl) locking the audit file while it was being written or read when the issue happened.

Or

The file might have been corrupted, causing the replicate to make multiple attempts to read it, resulting in continuous failures due to the corruption.

The above are the possible cause for this issue

Regards

Arun

Qlik Replicate LOGSTREAM Timeout while waiting to get data from audit file