Restart of Replicate task

suvbin · ‎2023-06-11

Hi Team,

Why does resuming a Replicate task fix some problems? For example, if the task is stuck, if changes are not being captured, or if latency is increasing, resuming the task is generally the solution.

So could you please briefly explain how resuming the task works in these scenarios? And are there any other benefits of the "resume" option in Replicate?

Thanks,

kng · ‎2023-06-11

Hello team,

We have some good community articles that explain the stop and start process in Replicate. Please check the links below for more information:

https://community.qlik.com/t5/Official-Support-Articles/Qlik-Replicate-Task-Stop-timeout-occurred/ta...

https://community.qlik.com/t5/Qlik-Replicate/Replicate-stop-and-resume-task-period/td-p/2073549

Let me know if this helps.

Regards,
Shivananda

sureshkumar · ‎2023-06-11

Hello @suvbin

Additional notes

"if the task is stuck or if the changes are not getting captured or if the latency is increasing"

--> In these cases we also use Advanced Run Options, which let you go back in time and reprocess records from the past without running a full load.

Advanced Run Options | Qlik Replicate Help

Regards,

Suresh

suvbin · ‎2023-06-12

Thank you for the response, but I still don't have clarity.

Why does resuming a Replicate task fix some issues? For example, if the task is stuck, if changes are not being captured, or if latency is increasing, we generally do a "resume" and the task starts working fine.

So why does it work? What happens when we click "resume"?

Thanks.

Heinvandenheuvel · ‎2023-06-12

@suvbin, you are correct that in a perfect world a resume would be no different from just letting the task run.

The fact that a stop + resume sometimes clears things up suggests that certain soft error conditions are not optimally handled. Perhaps a missing timeout on certain queries, perhaps a timing issue for retries. Please realize that such issues could be within Replicate, but could just as well be an issue in the database endpoint. Perhaps a slow memory leak.

With the stop + resume you'll get fresh DB connections, so memory caches there will be released, and you'll get a fresh set of electrons ( 🙂 ) in the Replicate process. There could be tables which are no longer touched, or sorter space to be released. I admit that this is not a hard answer, and that there is possibly a 50/50 chance that something fixed by a restart is in fact a soft bug. But until there is something reproducible there isn't much you can do, and reproducing is tricky as these situations often arise after days or longer of running. In principle you could take a process memory dump when it is 'stuck', but that would become a very painful, costly debugging process. You should certainly check the open endpoint DB queries when 'stuck' to get an idea of whether the problem is inside Replicate or outside, and you might want to set logging to VERBOSE for all components for a few minutes before you decide to stop.

Personally, if I were to manage a Replicate environment, I might stop and resume once a week, whether it needs it or not. You can then do so at a controlled, selected time instead of unexpectedly, under stress, during full production. Whether that should be once a week, once a day, or once a month I cannot judge. I'd say do a first experiment: check memory consumption (DB side and Replicate side) and connection counts after a long run, and compare them with the values right after a resume and perhaps again after an hour, when a new stability point has been reached.
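
The experiment above could be recorded and compared with something like the following. This is only a minimal sketch: it assumes you collect the numbers yourself (from OS tools and your database's own monitoring views), and every name in it (`Snapshot`, `reclaimed`, `looks_like_leak`, the field names, the 25% threshold) is hypothetical and not part of Replicate. The values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One manual measurement; all field names are hypothetical."""
    label: str
    replicate_rss_mb: float  # resident memory of the Replicate task process
    db_session_mb: float     # memory used by Replicate's sessions on the DB side
    connections: int         # open endpoint connections


def reclaimed(before: Snapshot, after: Snapshot) -> dict:
    """Return how much each metric dropped from `before` to `after`."""
    return {
        "replicate_rss_mb": before.replicate_rss_mb - after.replicate_rss_mb,
        "db_session_mb": before.db_session_mb - after.db_session_mb,
        "connections": before.connections - after.connections,
    }


def looks_like_leak(long_run: Snapshot, steady: Snapshot,
                    threshold: float = 0.25) -> bool:
    """Flag a possible slow leak when the long-running process uses more
    than `threshold` (as a fraction) above the post-resume steady state."""
    if steady.replicate_rss_mb <= 0:
        raise ValueError("steady-state memory must be positive")
    growth = (long_run.replicate_rss_mb - steady.replicate_rss_mb) / steady.replicate_rss_mb
    return growth > threshold


# Made-up numbers, purely for illustration.
long_run = Snapshot("after 7 days of running", 4200.0, 900.0, 14)
fresh = Snapshot("right after resume", 800.0, 150.0, 6)
steady = Snapshot("1 hour after resume", 1500.0, 300.0, 8)

print(reclaimed(long_run, steady))
print(looks_like_leak(long_run, steady))
```

If the long-run figures sit well above the one-hour steady state every time, that points toward a slow leak or unreleased resources, and the size of the gap gives a rough idea of how often a scheduled stop + resume is worthwhile.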

No full answer, but I hope this helps a little,

Hein
