Qlik Replicate: MS_REPLICATION vs MS-CDC

Sonja_Bauernfeind · Jun 3, 2022 9:05:33 AM

The objective of this document is to provide an overview of the SQL Server replication techniques available for Qlik Replicate and assist the reader to determine the best approach for a particular use case.

SQL Server data replication
What is MS-REPLICATION based CDC
Overview
Architecture
Specifics
What is MS-CDC based CDC
Overview
Architecture
Specifics
What if BOTH MS-CDC and MS-REPLICATION are enabled?
Choosing the right approach
MS-CDC based Qlik CDC will be helpful when:
MS-Replication based Qlik CDC will be mostly helpful when
Glossary
Additional Resources

SQL Server data replication

Replication is a set of technologies for copying and distributing data and database objects from one database to another and then synchronizing between databases to maintain consistency. There are two replication methods supported by Qlik Replicate – MS-REPLICATION based and MS-CDC.

What is MS-REPLICATION based CDC

Overview

Qlik CDC based on the MS-REPLICATION method leverages the transactional replication mechanism available in the SQL Server. Transactional replication is implemented by the SQL Server Snapshot Agent, Log Reader Agent, and Distribution Agent. The Snapshot Agent prepares snapshot files containing schema and data of published tables and database objects, stores the files in the snapshot folder, and records synchronization jobs in the distribution database on the Distributor.

In the transactional replication the Log Reader Agent monitors the transaction log of each database configured for transactional replication and copies the transactions marked for replication from the transaction log into the distribution database, which acts as a reliable store-and-forward queue. The Distribution Agent copies the initial snapshot files from the snapshot folder and the transactions held in the distribution database tables to Subscribers.

Having said that, Qlik replicate does not explicitly use any of the MS-Replication mechanisms (neither agents nor the distributor), but rather leverages the SQL Server publications and articles, to write additional data about DML and DDL operations into SQL Server TLOG. This information will be later read and then parsed by Qlik Replicate to be distributed to the target endpoint of your choice.

Transactional logs can be read using several Change Processing Modes:

Prioritize Online Logs - This is the default. Qlik Replicate will first look for the changes in the online transaction logs. If Replicate cannot find the changes in the online transaction logs, it will look for them in the backup transaction logs instead.
Prioritize Backup Logs - When this option is enabled, Qlik Replicate will first look for the changes in the backup transaction logs. This can improve performance when reading from the online transaction log is slow (e.g due to lock contention) or when using file-level access to access the backup transaction logs.

If Replicate cannot find the changes in the backup transaction logs, it will look for them in the online transaction logs instead.

Backup Logs Only - When this option is selected, Qlik Replicate will try and find the changes in the backup transaction logs only. Selecting this method results in increased latency due to the interval between backups. The actual latency time will remain constant but will vary according to the backup schedule.
Online Logs Only - When this option is selected, Qlik Replicate will try and find the changes in the online transaction logs only.

There are few different methods available in Qlik Replicate to read the SQL Server Transaction logs:

Read the online\backup logs using SQL server functions:
- fn_dblog() stored procedure is used to read the online Tlog data
- fn_dblog_dump() stored procedure is used to read the backup Tlog
Direct File access to backup logs in native SQL server format

Please review Qlik Replicate documentation, chapter “Using Microsoft SQL Server as a source” à “Setting advanced connection properties”

Architecture

The image below depicts a subset of MS-REPLICATION which is utilized by Qlik Replicate. This article does not cover the complete MS-REPLICATION process.

Qlik Replicate SQL Server and Replication DB Architecture MS-REPLICATION .png

In Qlik’s specific architecture, a database will be configured for replication using the standard Studio Wizard. As part of this configuration, the distribution database will be created on the publisher instance. When Qlik Replicate will set up the publication and the articles, it will also add an always-false filter (1=0) to prevent any changes be written into the Distribution Database. The Distribution Database will however contain stored procedures and settings used to allow the replication. Log Reader Agent will read the transactional log and mark the activity as replicated to allow active TLOG truncation.

As a next step Qlik replicate will access the log either using internal SQL Server functions (fn_dblog \ fn_dump_dblog) or direct TLOG backup access to read the content of the log and parse it inside Replicate to extract the DML\DDL activity.

Specifics

Distribution configuration done on the server while the server being its own distributor
Only tables with a PK can be replicated

BULK LOGGED or FULL Recovery Model is required. Full backup to be executed after changing the recovery model

What is MS-CDC based CDC

Overview

Change Data Capture, also known as CDC, introduced the first time in SQL Server 2008 version, as a helpful feature to track and capture the changes that are performed on the SQL Server database tables, with no additional programming efforts. Before SQL Server 2016, Change Data Capture could be enabled on a SQL Server database only under the SQL Server Enterprise edition, which is not required starting from SQL Server 2016.

Change Data Capture tracks the INSERT, UPDATE and DELETE operations on the database table, and records detailed information about these changes in a mirrored table, with the same columns structure of the source tables, and additional columns to record the description of these changes. SQL Server writes one record for each INSERT statement showing the inserted values, one record for each DELETE statement showing the deleted data and two records for each UPDATE statement, the first one showing the data BEFORE the change and the second one showing the data AFTER performing the change.

Architecture

Change Data Capture requires that a SQL Server Agent be running on a SQL Server instance. When the feature is enabled on a SQL Server database table, two SQL Server Agent jobs will be created for that database; the first job is responsible for populating database change tables with the changes information and the second job is responsible for the cleaning up the change tables by deleting the records older than the configurable retention period of several days.

Qlik Replicate SQL Server and Replication DB Architecture.png

In this context Qlik replicate will access the Change Tables on the source to get all the changes since the last iteration.

Active Guard specifics

To find out more about Active Guard MS-CDC specifics, please review the dedicated Miscrosoft Documentation page

Specifics

There are some limitations that come together with this method. Those include:

DDL changes may not be captured
If the SQL Server Agent service is not running, CDC capture job will not work, and the database log file will grow rapidly, even if the database recovery model is Simple, as the log truncation will not advance, even if a CHECKPOINT is performed, till all the changes that are waiting for capture will be gathered by CDC capture process.

These limitations of MS-CDC are documented both by Microsoft and Qlik

What if BOTH MS-CDC and MS-REPLICATION are enabled?

In the case that MS-Replication and MS-CDC are both enabled on the server there will be no Capture job running on the source database. The job of capturing the changes and writing to the change tables will be done by the replication log-reader job. Qlik Replicate will also need special configuration to handle that type of setup. E.g. adding the internal parameter “skipMscdcJobFitnessCheck” set to “True” in the source endpoint advanced settings. Also note that you cannot use the truncation prevention method ‘Exclusively use sp_repldone within a single task’ as this stops the log-reader job

Choosing the right approach

After having understood the 2 approaches, the chapter below identifies scenarios where MS-CDC based replication will be preferred over the MS-REPLICATION

MS-CDC based Qlik CDC will be helpful when:

SaaS based Microsoft Cloud DB Offerings where MS-REPLICATION cannot be enabled by definition
Replication of tables without a defined Primary Key
On-prem 3^rd party backup forcing to go online log only
- Variant - Extremely short retention period for native TLOG backups which does not allow to consume the TLOG backups consistently
Read the changes from secondary replica in the AG
Limited Access or Slow access to TLOGs
- Frequent Maintenance (index rebuild) contributing to backlogs in data replication
- Large volume of OLTP
- Large, compressed logs take long time to be processed

MS-Replication based Qlik CDC will be mostly helpful when

Database architecture allows for this type of approach (non-SaaS offerings)
Proximity to a source database can be secured
If backups available - done in native format
- When reading compressed TLOGs directly
  - Fast and large disk may be required for the installation purposes, as the TLOG will be decompressed to it during its transfer from the file system
  - Log Read Buffer Size may require additional tweaks to increase - configure the MS-REPLICATE source endpoint with the internal variable “backupFileReadBufferSize” set to 4194304 (4 times the default). Bigger buffer could also be assigned if needed.

Glossary

Article - an article is the basic unit of SQL Server Replication. An article can consist of tables, stored procedures, and views. It is possible to scale the article, horizontally and vertically using a filter option. We can also create multiple articles on the same object with some restrictions and limitations.

Publication - a publication is a logical collection of articles from a database. The entity allows us to define and configure article properties at the higher level so that the properties are inherited to all the articles in that group.

Publisher database - The publisher is a database that contains a list of objects that are designated as SQL Server replication articles are known as publication database. The publisher can have one or more publications. Each publisher defines a data propagation mechanism by creating several internal replication stored procedures.

Publisher - The Publisher is a database instance that makes data available to other locations through SQL Server replication. The Publisher can have one or more publications, each defining a logically related set of objects and data to replicate.

Distributor - The Distributor is a database that acts as a storehouse for replication specific data associated with one or more Publishers. In many cases, the distributor is a single database that acts as both the Publisher and the Distributor. In the context of SQL Server replication, this is commonly known as a “local distributor”. On the other hand, if it’s configured on a separate server, then it is known as a “remote distributor”. Each Publisher is associated with a single database known as a “distribution database” aka the “Distributor”.

The distribution database identifies and stores SQL Server replication status data, metadata about the publication, and, in some cases, acts as a queue for data moving from the Publisher to the Subscribers.

Distribution databases - Each Distributor must have at least one distribution database. The distribution database consists of article detail, replication meta-data and data. A Distributor can hold more than one distribution database; however, all publications defined on a single Publisher must use the same distribution database.

Replication agents - SQL Server replication uses a pre-defined set of standalone programs and events are known as agents, to carry out the tasks associated with data. By default, SQL Server replication agents run as scheduled jobs under SQL Server Agent. Replication agents can also be run from the command line and by applications that use Replication Management Objects (RMO). SQL Server replication agents can be monitored and administered using Replication Monitor and SQL Server Management Studio.

Replication snapshot Agent - the Replication snapshot Agent is used with all types of SQL Server replication technology as it provides the required data set to perform the initial data synchronization of the publication database with the subscription database. It prepares schema and initial data of published articles, snapshot files, and records information about the synchronization type in the distribution database.

Log Reader Agent - the Log Reader Agent is used only with transactional replication. It moves replication transactions from the online transaction log of the publication database to the distribution database.

Distribution Agent - the Distribution Agent is used only with Replication snapshot and Transactional SQL Server replication. This agent applies the initial replication snapshot to the subscription database and later, the data changes are tracked and recorded in the distribution database and applied to the subscription database.

Merge Agent - the Merge Agent is used with the merge replication model. By default, the Merge Agent uploads changes from the Subscriber to the Publisher and then downloads changes from the Publisher to the Subscriber. Each subscription has its own Merge Agent that connects to both the Publisher and the Subscriber and updates both. The Merge Agent runs at either the Distributor for push subscriptions or the Subscriber for pull subscriptions. Here, the synchronization is bi-directional. The data conflicts are handled by a set of triggers that supports the entire process

fn_dblog() - is one of several undocumented functions for SQL Server; it allows you to view the transaction log records in the active part of the transaction log file for the current database

fn_dump_dblog() - is another undocumented function used to analyze the transaction log backup file of the database

Additional Resources

Replication, change tracking, & change data capture - Always On availability groups
SQL Server replication: Overview of components and topography

Qlik Replicate: MS_REPLICATION vs MS-CDC