Process integrity checks

Prabodh · ‎2021-08-06

Hello All,

We are using Replicate to move data from our operational systems like Oracle, SQL Server etc. to our cloud based data warehouse. We run Replicate 24x7 to get a realtime data ingestion to warehouse.

We are looking to implement process integrity checks to determine if we received all the data from source to target.

Has anyone implemented solutions to solve this issue?

Thanks!

badgerriaan · ‎2022-02-02

Hi. We are also at a place where this is becoming needed. Did you end up finding a solution to this or still looking? I was thinking of maybe having to build Qlik sense Apps that compare the Landing to the Warehouse layer, or something like that 🙂

Heinvandenheuvel · ‎2022-02-02

There are no hard and fast solutions for this.

Several folks run count on source and target and compare. This is best done with filter perhaps on 'last update' to ask for specific time window to avoid latency effects.

Others just watch max('last update') for a known-to-be-have-frequent-changes table if they have such column to see if they are sufficiently in sync. Some even run CRC checks on concatenated row columns to see if not only all rows are there, but are identical.

It's all a mess, and rarely solid. Best thing to do is trust the process!. Of course you trust and verify some. Check those reptask logs for errors; check the attrep_apply_exceptions table; check coarse change counts against expectations; and may after that add a count rows on source and target.

There is one thing I like for confidence and (performance) monitoring:

Add a 'heartbeat' or 'tickle' table to the source, probably in a special schema.
Give it Identity PK column, a TS_Source timestamp defaulting to current timestamp (sysdate) and possible a generic text column.
Add a procedure/job on the source to add a row to the heartbeat table every minute (or 10 seconds or whatever)
Add the table to the task and add a TS_Replicate column with datetime('now') and a TaskName column with fixed tekst or $AR_V_TASK_NAME. Opionally add $AR_V_TASK_START_TIME to catch/document restarts.
On the target add a column TS_Target defaulting to the current timestamp.

This table is end-to-end: no need to ask Replicate whether all was well!

It should get 1440 rows per day per task (assuming 1 minute interval). It allows you to plot or report avg/min/max end-to-end latency over multiple days.

hope this helps,

Hein

Best Practices

Functionality