Qlik Replicate Slot in Postgres Not Advancing

gkaur · ‎2024-08-10

I have a Qlik Replicate task with Postgres as source and Snowflake as target.

On many occasions, I have noticed that the task would indicate zero latency, but the slot size on the source DB kept increasing while the restart_lsn and confirmed_flush_lsn never changed.

I tried running task with confirmed_flush_lsn setting via Advanced Run option but it did not change any thing. nor resuming task.

Please advise if I should change some configuration on source/target side, as if slot size reaches its max[set on PG side to avoid database outage], then tasks stops and we are forced to run reload and we cannot keep truncate/reload tables in production.

Snippet of the task with growing log size and indicating no latency -

john_wang · ‎2024-08-10

Hello @gkaur ,
Thanks for reaching out to Qlik Community!

Based on the available information, it appears that the monitored tables have not undergone any changes. This is likely why we see that the "Change Processing → Apply Latency" is 00:00:00, and there are no Incoming Changes (0 transactions). However, other tables, which are not being monitored by Qlik Replicate, are experiencing significant changes. These changes continue to occupy the LSN position in the replication slot, thereby preventing truncation of the WAL. Since the WAL is a server-wide resource used by all PostgreSQL processes, it may grow extremely large if no changes are captured for an extended period.

To prevent this from happening, please enable the "WAL heartbeat" option (by default, it's disabled), for example:

Hope this helps.

John.

Help users find answers! Do not forget to mark a solution that worked for you! If already marked, give it a thumbs up!

gkaur · ‎2024-08-11

Thanks John for your quick response but I already have WAL heartbeat ON -

john_wang · ‎2024-08-11

Hello @gkaur ,

Thanks for your update. We need additional information to understand the behavior:

1. Set SOURCE_CAPUTRE to Verbose, PERFORMANCE to Trace, and re-create the issue (keep task running more than 10 minutes), then download the Diagnostics Packages;

2. Open a support ticket and upload the Diag Packages.

3. Do you have more than 1 tasks which connect to the same PostgreSQL database and running in parallel?

4. List all logic replication slots and their status in your DB.

Our support team will be more than happy to assist you.

Regards,

John.

Help users find answers! Do not forget to mark a solution that worked for you! If already marked, give it a thumbs up!

gkaur · ‎2024-09-02

Sorry I am back here again as I am not getting much help from anywhere else, we continue to lose data and real time replication is no way near replication. Also, noticed something in the log file... per my understanding it is indicting that at 21:56:32 task is still reading LSN as of 04:31:28 [ a total of 6.5 hrs or lag ] and yet the task indicates zero latency

------------

03138525: 2024-09-01T21:56:32:478018 [SOURCE_CAPTURE ]T: Throughput monitor: Last DB time scanned: 2024-09-01 04:31:28.232388-04. Last LSN scanned: 00019153/2E5BF958. #scanned events: 0. (postgres_endpoint_wal_engine.c:430)

Also attaching one of the log with verbose on

DesmondWOO · ‎2024-09-02

Hi @gkaur ,

According to your debug.txt file,

2024-09-01T13:20:32:767725 [SOURCE_CAPTURE ]T: Throughput monitor: Last DB time scanned: 2024-09-01 04:31:26.033978-04. Last LSN scanned: 00019153/2C1683C8. #scanned events: 4. (postgres_endpoint_wal_engine.c:430)
2024-09-01T13:22:32:815257 [SOURCE_CAPTURE ]T: Throughput monitor: Last DB time scanned: 2024-09-01 04:31:26.033978-04. Last LSN scanned: 00019153/2C1683C8. #scanned events: 0. (postgres_endpoint_wal_engine.c:430)
2024-09-01T13:24:32:844730 [SOURCE_CAPTURE ]T: Throughput monitor: Last DB time scanned: 2024-09-01 04:31:26.033978-04. Last LSN scanned: 00019153/2C1683C8. #scanned events: 0. (postgres_endpoint_wal_engine.c:430)

It seems there are no changes in that slot.

Please run

select * from pg_replication_slots;

and check the following:
1. Are there any unused replication slots?
2. LSN status

If problem persists, I would recommend creating a support ticket.

Regards,
Desmond

Help users find answers! Do not forget to mark a solution that worked for you! If already marked, give it a thumbs up!

SushilKumar · ‎2024-09-03

Hello @gkaur

Thanks for reaching out to Qlik via Qlik Community Support page. Such issue requires through analysis of Settings and logs, hence request you to reach out to technical Support via a case and provide Diagnostics package.

And request you to not to upload any Log here as it can be access by larger Community member then the intended Support personal.

Please refer to Replicate https://community.qlik.com/t5/Knowledge/How-to-collect-Diagnostics-Package-from-Qlik-Replicate/ta-p/...

Regards,

Sushil Kumar

john_wang · ‎2024-09-03

Hello @gkaur ,

From the log file, we can see the LSN proceeds eg in lines #2742 and #2750, The LSN number keeps increasing:

2024-09-01T13:17:42:783856 [SOURCE_CAPTURE ]V: WAL-tracker: RECTYPE: 'DML', Current LSN '00019153/2C12F178'

2024-09-01T13:17:42:783884 [SOURCE_CAPTURE ]V: WAL-tracker: RECTYPE: 'DML', Current LSN '00019153/2C12F348'

However as the task log file was encrypted we cannot recognize what's the DML is in line #2751:

2024-09-01T13:17:42:783886 [SOURCE_CAPTURE ]V: WAL event raw details: RecType=DML, Last XID encountered=-657438885

2024-09-01T13:17:42:783889 [SOURCE_CAPTURE ]V: ~{RgYAAAhJjfO3EUrafmQmPSTJRu7pBeJad3/iDCw01efxSKjCPQp8w6rTeUXCK589K5...

Please decrypt the task log file to understand what's the DML operations are.

And we also need the complete task log file (instead of piece only) to understand if the settings are correct eg:

wal_level = logical

Thanks,

John.

Help users find answers! Do not forget to mark a solution that worked for you! If already marked, give it a thumbs up!

Errors - Unexpected Behavior