Solved: Re: Replicate task impacted our entire DB - Qlik Community

gkaur · ‎2024-01-18

Unfortunately, we ran into a FULL database outage today and as per DBA, it was due to our replicate task -

Some info about our task

Source end point is a Postgres DB and Target is Snowflake

Task with only 2 tables, implemented on 1/5

Error Logs-

1/18/2024 11:17:35 AM: Task reported “WAL reader terminated with broken connection / recoverable error. WAL stream loop ended abnormally. (STATUS_RECOVERABLE)”

1/18/2024 11:42:51 AM: WAL reader terminated with broken connection / recoverable error. WAL stream loop ended abnormally. (STATUS_RECOVERABLE)

1/18/2024 12:07:24 PM: WAL reader terminated with broken connection / recoverable error. WAL stream loop ended abnormally. (STATUS_RECOVERABLE)

1/18/2024 12:23:10 PM :

Creating Metadata Manager's utility components failed

Cannot create the source utility component

Failed while preparing stream component 'XXXXX'.

Failed to connect

RetCode: SQL_ERROR SqlState: 08001 NativeError: 101 Message: FATAL: the database system is in recovery mode

This is the time when DATABASE went down. DBA killed our replicate slot and found following message in the logs-

ERROR: could not write to data file for XID 1896253644: No space left on device

Need Replicate experts to advice, what could have caused the issue, did we miss any setting/ best practices and growing logs, inactive slots etc...?

john_wang · ‎2024-01-19

Hello @gkaur ,

Thanks for the update.
While using the PostgreSQL Logical Replication Slots, we must be careful enough at the WAL file size as an orphaned slot, or inactive slot, or a slow slot cannot reply the WAL fast enough, or other slot exceptions may lead the WAL files size increasing and impacts the whole PostgreSQL database. So monitor the WAL files sizes and the storage spaces is important for the DBA.
Here are a few limitations of PostgreSQL Replication Slots:

Orphaned Replication Slot: The WAL files are retained by the master when the replica disconnects. This also means that the pg_wal directory may run out of space. Imagine a scenario where the replica fails forever and cannot be recovered (an orphaned replication slot), or when a replica cannot replay the WAL segments fast enough.
Manually Monitoring pg_wal Directory: The WAL files will just pile up. So you need to monitor the slots and manually drop them. Only when you do that will the master delete anything from the pg_wal directory. You will see how you can monitor and drop replication slots later. So while WAL retention is taken care of without manual settings, pg_wal directory space needs to be manually monitored.

Some suggestions to avoid the storage spaces issues:
1. Allocate reasonable free spaces for WAL files, depends on the peak transactions volume size of the system
2. Enable the WAL heartbeat in the PostgreSQL source endpoint

3. Limit the size of a slot in PostgreSQL (You can apply max_slot_wal_keep_size on the source database in PostgreSQL 13 and later) ( AWS Database Migration Service (AWS DMS) is OEM of Qlik Replicate).

4. Delete the useless slot(s). Stop the Qlik Replicate task will not drop the slot, the slot should be dropped by the DBA.

BTW, if the slot is dropped to free up the spaces, then the slot should be created again. All the data changes during the previous task failure to the new slot creation cannot be captured anymore. The task should be reload and start capture changes from the slot creation time.

Hope this helps.
John.

Help users find answers! Do not forget to mark a solution that worked for you! If already marked, give it a thumbs up!

View solution in original post

john_wang · ‎2024-01-19

Hello @gkaur ,

Thanks for reaching out to Qlik Community!

Unfortunately, we ran into a FULL database outage today and as per DBA, it was due to our replicate task -

Are you meaning the Qlik Replicate task(s) caused the PostgreSQL source database outage (storage spaces, or memory, or other outages)?

Error Logs-

1/18/2024 11:17:35 AM: Task reported “WAL reader terminated with broken connection / recoverable error. WAL stream loop ended abnormally. (STATUS_RECOVERABLE)”

In general this is a network connection broken issue, or a timeout issue, it's hard to tell from the error message only. May be enlarge wal_sender_timeout or TCP/IP related parameters help. Please open a support ticket and attach the task Diagnostics Packages (It's better to set SOURCE_CAPTURE to Trace or Verbose, recreate the issue, then download the Diag Packages), Qlik Support team will help you further.

RetCode: SQL_ERROR SqlState: 08001 NativeError: 101 Message: FATAL: the database system is in recovery mode

This is the time when DATABASE went down. DBA killed our replicate slot and found following message in the logs-

pginst11_prod 2024-01-18 11:17:35 EST [88436]: [12-1] db=spin,user=gen_qlik_auth,app=QlikReplicate/2023.5 (spin),client=10.1.30.150 ERROR: could not write to data file for XID 1896253644: No space left on device

Looks like Qlik Replicate failed to recover from connection failure as the database system is in recovery mode.

Please elaborate "DBA killed our replicate slot": the slot process was forcely stopped, or the slot was dropped? Was some spaces was released and the PostgreSQL database works fine now?

Please provide above information in the support ticket, and support team will pick it up from there.

Hope this helps.

John.

Help users find answers! Do not forget to mark a solution that worked for you! If already marked, give it a thumbs up!

deepaksahirwar · ‎2024-01-19

Dear @gkaur ,

Based on yourr query you are experiencing an issue with your Qlik Replicate task, which is causing a full database outage.
The error logs you've provided indicates that the problem might be related to the database running out of space.
This is indicated by the error message `ERROR: could not write to data file for XID 1896253644: No space left on device`.

In such a scenario,the first step is to ensure that there is enough disk space available for the database operations.
If the disk is full, it could lead to errors and potentially cause a database outage.
You might need to free up some space or add more storage to your database server.

Go to the server and explore the disk drive where qlik is installed. inside that search for logs. Possible delete all the past period logs after taking backup on secondary drive.This will empty some space.

Next, you should check the configuration of your Qlik Replicate task. Make sure that the settings are optimized for your specific use case. For instance, you might want to look at the error handling settings and adjust them if necessary.

It's also important to regularly monitor and manage the logs generated by Qlik Replicate. If the logs are growing rapidly, they could consume significant disk space. You might need to implement a log rotation strategy to manage the size of the logs.

Regular database maintenance is crucial to ensure optimal performance. This includes tasks like updating statistics, rebuilding indexes, and managing database files.

Lastly, ensure that you're following the best practices for using Qlik Replicate. This includes things like proper sizing, testing, and administration.

Remember, it's always a good idea to work closely with your DBA and Qlik Replicate administrators when troubleshooting issues like this. They can provide valuable insights and help you implement the necessary changes. If the issue persists, you might want to consider reaching out to Qlik Support for further assistance.

In addition to these steps, there are a few more specific solutions that might help resolve your issue. For instance, if you're seeing errors related to "Replicate LOB columns", you might need to enable this setting in your Log Stream Staging task. If the issue is related to transaction logs filling up on the target database, you could consider disabling transaction logging on the target or setting up a job to backup and truncate the transaction logs regularly. If you're encountering errors related to insufficient system memory, you might need to increase the memory allocation for the SQL Server or optimize your queries to reduce the amount of memory required to execute them.

I hope this helps! Let me know if you have any other queries.

If our response has been helpful, please consider clicking "Accept as Solution". This will assist other users in easily finding the answer. 🙂

Best Regards,

Deepak Ahirwar

Prabha-Ande · ‎2024-01-19

Hi @john_wang ,

Are you meaning the Qlik Replicate task(s) caused the PostgreSQL source database outage (storage spaces, or memory, or other outages)?

Yes - It was storage space exhausted on production database because of inactive slot storing transaction logs

Please elaborate "DBA killed our replicate slot": the slot process was forcely stopped, or the slot was dropped? Was some spaces was released and the PostgreSQL database works fine now?

Inactive replication slot was dropped to create space in database and PG database works fine now.

But we are looking at best practices to handle these scenario proactively. How to stop replication tasks to consume database space.

We will raise a case with Qlik Support but would appreciate if we can get some pointers around how to avoid these Inactive WALSLOT Issues

SushilKumar · ‎2024-01-19

Hello team,

to add more on to Tse comment replication slot is a feature in PostgreSQL that ensures that the master server will retain the WAL logs that are needed by the replicas even when they are disconnected from the master and Logical replication brings over only the SQL-like changes. It does not work without replication slots.

hence its more of kind of capacity planning if Qlik replicate have to process more data then its quite obvious WAL created which is used by Qlik also need expansion. Hence its important to revisit the Space allocation and do the needful.

Regards,

Sushil Kumar

john_wang · ‎2024-01-19

Hello @gkaur ,

Thanks for the update.
While using the PostgreSQL Logical Replication Slots, we must be careful enough at the WAL file size as an orphaned slot, or inactive slot, or a slow slot cannot reply the WAL fast enough, or other slot exceptions may lead the WAL files size increasing and impacts the whole PostgreSQL database. So monitor the WAL files sizes and the storage spaces is important for the DBA.
Here are a few limitations of PostgreSQL Replication Slots:

Orphaned Replication Slot: The WAL files are retained by the master when the replica disconnects. This also means that the pg_wal directory may run out of space. Imagine a scenario where the replica fails forever and cannot be recovered (an orphaned replication slot), or when a replica cannot replay the WAL segments fast enough.
Manually Monitoring pg_wal Directory: The WAL files will just pile up. So you need to monitor the slots and manually drop them. Only when you do that will the master delete anything from the pg_wal directory. You will see how you can monitor and drop replication slots later. So while WAL retention is taken care of without manual settings, pg_wal directory space needs to be manually monitored.

Some suggestions to avoid the storage spaces issues:
1. Allocate reasonable free spaces for WAL files, depends on the peak transactions volume size of the system
2. Enable the WAL heartbeat in the PostgreSQL source endpoint

3. Limit the size of a slot in PostgreSQL (You can apply max_slot_wal_keep_size on the source database in PostgreSQL 13 and later) ( AWS Database Migration Service (AWS DMS) is OEM of Qlik Replicate).

4. Delete the useless slot(s). Stop the Qlik Replicate task will not drop the slot, the slot should be dropped by the DBA.

BTW, if the slot is dropped to free up the spaces, then the slot should be created again. All the data changes during the previous task failure to the new slot creation cannot be captured anymore. The task should be reload and start capture changes from the slot creation time.

Hope this helps.
John.

Help users find answers! Do not forget to mark a solution that worked for you! If already marked, give it a thumbs up!

SushilKumar · ‎2024-01-22

Hello team,

If our response has been helpful, please consider clicking "Accept as Solution". This will assist other users in easily finding the answer.

Regards,

Sushil Kumar

Replicate task impacted our entire DB

Best Practices

Errors - Unexpected Behavior

General Question