aftertaf_alv
Contributor

Understanding SQL Agent jobs when troubleshooting

Hello,

Background

We are using Replicate to stream data from MSSQL 2016/2019 into a logstream, and from there we push data to multiple targets.

Under advice from Qlik support, we have set the source database connection in our logstream task to have "Exclusively use sp_repldone within a single task".

The TLOG truncation prevention policy is set to 1800 seconds.

Our backup solution, CommVault, is reading the MDF & LDF via a CommVault Agent and storing the backups on a remote server (not on the SQL box).

"Select virtual backup device types" is ticked.

The internal parameters cdcTimeout and executeTimeout are both set to 1200.

As we are enabling the SQL Replication feature on the source MSSQL and configuring it to be its own Distributor, many SQL Agent jobs get created during this process.

Settings Summary

  • cdcTimeout : 1200
  • database : MYDatabase
  • executeTimeout : 1200
  • safeguardFrequency : 1800
  • safeguardPolicy : EXCLUSIVE_AUTOMATIC_TRUNCATION
  • server : MYServer
  • use3rdPartyBackupDevice : true

 

Situation

When the source servers are under heavy load (index rebuild/maintenance jobs), we have seen that the logstream task can end up failing.

If we are not able to react quickly enough, this can lead to the source system's LDF file filling up, and production is affected.
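(For reference, a standard first check when the LDF starts growing is the log_reuse_wait_desc column in sys.databases; this is a plain SQL Server DMV query, not anything Replicate-specific. A value of REPLICATION indicates the replication/CDC harvest point is holding truncation:)

```sql
-- Why is the transaction log not truncating?
SELECT name,
       log_reuse_wait_desc   -- REPLICATION = held back by the replication/CDC harvest point
FROM   sys.databases
WHERE  name = N'MYDatabase'; -- source database from the settings summary above

-- How full is the log right now?
DBCC SQLPERF(LOGSPACE);
```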

This has happened frequently enough on various servers for us to be concerned about how to get things running properly again, and what 'normal' should look like.

 

Regarding SQL Agent jobs

We see these jobs:

[Screenshot: sqlagent_qlik.png]

Is it safe to delete them?

Are any of them needed in case we have to recover after a failure of some kind?
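(For reference, the jobs in the screenshot can be listed with a standard msdb query; the REPL-* category names are SQL Server's own replication job categories, so this is an assumption about how the jobs shown are categorized:)

```sql
-- List SQL Agent jobs created by the replication/Distributor setup
SELECT j.name    AS job_name,
       c.name    AS category,
       j.enabled
FROM   msdb.dbo.sysjobs        AS j
JOIN   msdb.dbo.syscategories  AS c ON c.category_id = j.category_id
WHERE  c.name LIKE 'REPL-%'    -- e.g. REPL-LogReader, REPL-Distribution cleanup
ORDER  BY c.name, j.name;
```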

 

Also, database maintenance, reindexing, or even heavy load are situations that we cannot avoid but need to be able to mitigate. Are there settings we should change so that the Replicate tasks do not fail when the database is under duress?

 

2 Replies
Heinvandenheuvel
Specialist III

@aftertaf_alv thank you for a refreshingly detailed problem description. I wish more folks understood that this is the only way to get good help without wasting time on back-and-forth questioning. Thanks.

That said, there is one major detail omitted: "the Logstream task can end up failing." What are the error messages?

My guess is that it cannot read an archived log when the CDC information is only in the archived log, not in the active TLOG. I'm thinking this because you wrote "storing the backups on a remote server (not on the SQL box)" without indicating how the SQL Server box has access to those backups. What does a typical physical_device_name in table dbo.backupmediafamily look like? Is it a valid, accessible filename for Replicate to pass as an argument to its fn_dump_dblog calls? Are you perhaps using a "Backup folder preprocessing command" to make a required archived TLOG accessible?
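(The check described above can be run with the standard msdb catalog views; the database name is taken from the original post, adjust as needed:)

```sql
-- Where did the most recent TLOG backups of the source DB actually go?
SELECT TOP (10)
       bs.backup_finish_date,
       bmf.physical_device_name,  -- must be a path/device Replicate can hand to fn_dump_dblog
       bmf.device_type            -- 7 = virtual device (third-party agents such as CommVault)
FROM   msdb.dbo.backupset          AS bs
JOIN   msdb.dbo.backupmediafamily  AS bmf ON bmf.media_set_id = bs.media_set_id
WHERE  bs.database_name = N'MYDatabase'
  AND  bs.type = 'L'               -- log backups only
ORDER  BY bs.backup_finish_date DESC;
```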

I don't know much about those backup agents and excessively growing TLOGs, but I do know that normally SQL Server has a mechanism to truncate the TLOG when possible. When configuring Replicate for sp_repldone (as per the User Guide), "If the Log Reader Agent is running when the task starts, Qlik Replicate will forcibly stop it" — which appears to be the case here. So truncation is up to sp_repldone, but that is set to 1800 seconds (half an hour), which feels like way too long. What was wrong with the default of 300 (5 minutes)? I'd be tempted to go more aggressive, not less!

Considering all the other requirements for using sp_repldone, I would also consider just using the "Start transactions in the database" method until it is proven that this causes trouble.

Good luck!

Hein


 

Steve_Nguyen
Support

@aftertaf_alv did Hein's advice help?

Help users find answers! Don't forget to mark a solution that worked for you! If already marked, give it a thumbs up!