vinayak_m
Contributor II

Qlik Replicate Functionality Understanding (Source IBM DB2 for LUW)

Hi Team,

Our team is working through a few scenarios, and we would appreciate your expert guidance and best practices for navigating them.

Information:
Source endpoint: IBM DB2 for LUW
Target endpoint: AWS S3
Task type: Full Load + CDC
Replicate version: May 2023 (build 363)

Scenarios are as follows,

  1. Source System Failure:
    Let's say there is unplanned downtime or maintenance on DB2. What happens next?

    My understanding: Qlik Replicate tasks with retries configured will keep retrying (and failing) until the system is back up; once it is up, the tasks will continue as usual without missing any data. Is that correct?

    Is there any possibility that the transaction logs are no longer available on the system, so that we would not be able to continue the tasks?

    (If the logs are retained, the understanding above should hold. But if they are not, how do we proceed? Would picking the last saved state (LSN) and starting the task from there help, or would a full load be required? If the logs are gone, is there any significance to the LSN at all?)


    Let's say there is planned downtime/maintenance on DB2.

    My understanding: all Qlik Replicate tasks will be stopped; after the maintenance is finished, we can resume them and things will go smoothly.

    Is there anything else we should plan or configure to ensure there are no failures in data extraction?
  2. What happens if the on-premises Qlik Replicate server crashes?

    My understanding: after the server crash, all tasks would be stopped, so we can check the logs, identify the last LSN, and then resume the tasks from that LSN.

    Would this be the correct approach?
    And if there are a lot of tasks, how can we automate this process or make it dynamic?

    Let's say we also have a cloud Qlik Replicate server. Would creating identical tasks on it and then starting CDC from the last LSN on the cloud server help ensure that data integrity is maintained?

    Please let me know if my understanding is incorrect or if there is information I am not aware of.

    Thanks in advance team.
    Any inputs/approaches/best practices are welcome.

 

2 Solutions

Accepted Solutions
DesmondWOO
Support

Hi @vinayak_m ,

Replicate saves the last successful stream position in the repository. If the server crashes, you can just resume the task, as that should be enough to continue from the last saved position.

However, if you need to resume the task from a specific LSN, you may need to enable the Replicate status control table (attrep_status) to check the position. This is because if the server crashes, Replicate does not have a chance to report the last stream position in the task log.
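When the attrep_status control table is enabled, the last reported stream position can be read with a simple query. The sketch below simulates the control table in an in-memory SQLite database so it runs anywhere; in a real deployment you would run the same SELECT against the target database where Replicate writes its control tables (note that control tables are not supported on an S3-only target). The column names follow the documented Status control table, but verify them against your Replicate version; the server name, task name, and LSN value here are made up.

```python
import sqlite3

# Simulate the attrep_status control table with in-memory SQLite so this
# sketch is runnable; in practice, connect to the real target database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE attrep_status (
        server_name TEXT, task_name TEXT, task_status TEXT,
        status_time TEXT, source_current_position TEXT,
        source_current_timestamp TEXT
    )
""")
conn.execute(
    "INSERT INTO attrep_status VALUES (?, ?, ?, ?, ?, ?)",
    ("rep01", "orders_to_s3", "RUNNING",            # hypothetical values
     "2024-06-26 10:15:00", "0000A1B2C3D4", "2024-06-26 10:14:58"),
)

# The query you would run against the real control table to find the last
# reported stream position (LSN) for a given task:
row = conn.execute(
    """SELECT task_name, source_current_position, source_current_timestamp
       FROM attrep_status WHERE task_name = ?""",
    ("orders_to_s3",),
).fetchone()
print(row)
```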

As mentioned earlier, simply resuming the tasks should be enough in most cases. You can write a script to start the tasks using the "repctl" command.
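To handle many tasks, the repctl call can be wrapped in a small loop. The exact repctl command line for resuming a task depends on your Replicate version and installation, so this sketch takes the command as a parameter and defaults to a harmless `echo` stub; the task names are hypothetical. Replace the stub with the repctl invocation documented for your environment.

```python
import subprocess

# PLACEHOLDER: substitute the repctl invocation documented for your
# Replicate version; ["echo", ...] is a stub so the sketch runs anywhere.
RESUME_CMD = ["echo", "repctl-resume"]

# Hypothetical task names -- replace with your own.
TASKS = ["orders_to_s3", "customers_to_s3", "invoices_to_s3"]

def resume_all(tasks, cmd=RESUME_CMD):
    """Try to resume each task in turn; return the names that failed."""
    failed = []
    for task in tasks:
        result = subprocess.run(cmd + [task], capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(task)
    return failed

failed = resume_all(TASKS)   # with the echo stub, nothing fails
print("failed tasks:", failed or "none")
```

Running the loop sequentially also makes it easy to log which tasks did not come back up, so they can be retried or escalated individually.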

Regards,
Desmond


Help users find answers! Do not forget to mark a solution that worked for you! If already marked, give it a thumbs up!


john_wang
Support

Hello @vinayak_m ,

In addition to @DesmondWOO's comments:

Assume you have set up similar Qlik Replicate tasks on a 'standby' machine. There are several options depending on whether you can retrieve the necessary stream position information from the old Replicate server:

  1. Retrieve LSN: If you can get the Log Sequence Number (LSN) from the 'crashed' server task log files, resume the tasks from the last LSN. If this is not possible, proceed to step 2.
    Please note that the control table (attrep_status) is not supported on an S3 target.

  2. Restart from a Timestamp: Restart the tasks from a recent timestamp (e.g., 2-5 minutes before the crash) to ensure no records are missed. Be aware that multiple transactions may have occurred at the same timestamp, which could cause Qlik Replicate to retrieve duplicate records, leading to primary key violation warnings/errors, duplicate records, or deletion operation failures on the target side.

  3. Reload the Tasks: If neither of the above options works, reload the tasks entirely.

The recovery approach is the same for all tasks. However, since the stream position differs for each task, you need to process them individually.
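Since each task must be handled individually, the three options above can be encoded as a small decision helper that picks the best available recovery method per task. This is only an illustration of the order of preference, not a Replicate API; the function name and the five-minute safety margin are assumptions (the suggestion above is 2-5 minutes before the crash).

```python
from datetime import datetime, timedelta

def recovery_plan(last_lsn=None, crash_time=None, safety_margin_min=5):
    """Pick a recovery option per task, in the order of preference above."""
    if last_lsn is not None:
        # Option 1: resume from the LSN recovered from the old server's logs.
        return ("resume_from_lsn", last_lsn)
    if crash_time is not None:
        # Option 2: restart from a timestamp slightly before the crash;
        # expect possible duplicates around the boundary on the target.
        return ("start_from_timestamp",
                crash_time - timedelta(minutes=safety_margin_min))
    # Option 3: no position information at all -> full reload.
    return ("full_reload", None)

# Example: the LSN is known for this task, so option 1 wins.
print(recovery_plan(last_lsn="0000A1B2C3D4"))
```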

Hope this helps.

John.


4 Replies
john_wang
Support

Hello @vinayak_m ,

Thanks for reaching out to Qlik Community!

Qlik Replicate is a Change Data Capture (CDC) product that reads change records from the DB2 transaction logs. Here's how it works and what to do during maintenance:

  1. Stream Position Maintenance:

    • Qlik Replicate maintains the stream position (including the DB2 Log Sequence Number, LSN) in its repository.
    • The task can resume from the stored stream position on the next startup.
  2. Maintenance Recommendations:

    • Before Maintenance: It is recommended to stop the Qlik Replicate task before performing maintenance on DB2.
    • After Maintenance: Resume the task once the maintenance is complete. Alternatively, you can start the task from a specific LSN (e.g., immediately after the maintenance job is done) to skip unnecessary DML operations performed during maintenance or DB2 upgrades.
  3. Handling Connection Interruptions:

    • Qlik Replicate attempts to recover connections from unexpected source database connection failures (e.g., unplanned downtime).
    • You can define the retry interval and other related settings.
    • Ensure all DB2 transaction logs are intact. If any transaction logs are missing, a reload will be necessary.
  4. Task Resumption:

    • Qlik Replicate tasks can be resumed if the repository is not corrupted and the DB2 transaction logs are available.
    • If the task fails to resume automatically, you can start the task from a specific timestamp or LSN. Refer to the example below.

[Screenshot: john_wang_0-1719391065792.png — starting a task from a specific timestamp or LSN]

Hope this helps.

John.

vinayak_m
Contributor II
Author

Hi @john_wang ,

Thank you for your explanation; I now have a fair understanding of what you've described.
However, I don't see a response to the second part of my post, so I'm re-posting it here:


What happens if the on-premises Qlik Replicate server crashes?

My understanding: after the server crash, all tasks would be stopped, so we can check the logs, identify the last LSN, and then resume the tasks from that LSN.

Would this be the correct approach?
And if there are a lot of tasks, how can we automate this process or make it dynamic?

Let's say we also have a cloud Qlik Replicate server. Would creating identical tasks on it and then starting CDC from the last LSN on the cloud server help ensure that data integrity is maintained? (Assuming we point to the same target and location, since the tasks are identical.)

Please let me know if my understanding is incorrect or if there is information I am not aware of.
