During periods of high load, Redshift may queue users' workloads, and those users wait until the workload manager allows their queries to execute. During the same periods, CPU contention may also cause Qlik Replicate tasks to slow down.
The timeout in Qlik Replicate is already configured for X seconds (up to 2 hours, depending on the scenario) to avoid timeouts. Increasing it further will likely not improve the situation.
When a timeout causes the task to switch from batch mode to one-by-one mode, the apply latency increases. Each time one-by-one mode is observed, someone has to intervene manually to stop the task and restart it. After the restart, batch mode resumes and the latency recovers very quickly.
The enhancement is to handle TIMEOUT and BAD CONNECTION errors differently in the Qlik Replicate Redshift endpoint, by adding internal parameters for the endpoint called something like “timeout batch retry count:5” and “timeout batch retry interval:300”:
- When these internal parameters are set in the Redshift endpoint and a non-data-related error such as TIMEOUT or BAD CONNECTION is raised in Redshift, the task would retry the batch a number of times in batch mode rather than going into one-by-one mode.
- With “timeout batch retry count:5”, five attempts would be executed on the batch.
- With “timeout batch retry interval:300”, each attempt would wait 300 seconds (5 minutes) before retrying.
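To make the requested behavior concrete, here is a minimal Python sketch of the retry logic. It is only an illustration, not Qlik Replicate code: the names `TransientEndpointError`, `apply_with_batch_retry`, `apply_batch` and `apply_one_by_one` are hypothetical. It also assumes two things the request does not spell out: that the five attempts are retries made after the initial failed batch apply, and that the task falls back to one-by-one mode only once all retries are exhausted.

```python
import time
from typing import Callable, Sequence


class TransientEndpointError(Exception):
    """Hypothetical stand-in for a non-data error such as TIMEOUT or BAD CONNECTION."""


def apply_with_batch_retry(
    apply_batch: Callable[[Sequence], None],
    apply_one_by_one: Callable[[Sequence], None],
    batch: Sequence,
    retry_count: int = 5,        # mirrors "timeout batch retry count:5"
    retry_interval: int = 300,   # mirrors "timeout batch retry interval:300" (seconds)
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Apply a batch, retrying in batch mode on transient errors.

    Returns "batch" if a batch attempt succeeded, or "one-by-one" if every
    retry failed and the fallback was used (the fallback is an assumption).
    Any other exception (for example a data error) propagates unchanged.
    """
    try:
        apply_batch(batch)
        return "batch"
    except TransientEndpointError:
        pass  # initial attempt failed; start the retry loop

    for _ in range(retry_count):
        sleep(retry_interval)  # wait before each retry
        try:
            apply_batch(batch)
            return "batch"
        except TransientEndpointError:
            continue  # still unavailable; try again if retries remain

    # Assumption: only after all retries fail does the task go one-by-one.
    apply_one_by_one(batch)
    return "one-by-one"
```

With the suggested values, a batch would be retried for up to 25 minutes of waiting (5 x 300 seconds) before one-by-one mode is considered, which should cover most short periods of queueing or contention in Redshift.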
This removes the manual intervention and provides a more hands-off approach to running Replicate in BAU operations. TIMEOUT and BAD CONNECTION errors are likely to be regular occurrences in Redshift during BAU, so the less manual work they require, the better.