Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
So we have an ETL Process that once the data comes in from our source into snowflake that process transforms the data. We are using the replication date to know when we receive new data. One issue we have run into is when we have to reload a table because of an error or whatever reason. It then updates all of the replication dates to when the table was reloaded. The problem is that those records already exist downstream and our process wants to reprocess all of those records. I am trying to find out if there is any flag or message I can grab that would tell the downstream process that this table was just reloaded.
Hi @pkagel ,
Go to the Expression Builder--> Variables--> Then select the $AR_V_TASK_RELOAD_TIME.
Below is the attachment for your reference:
Hi @pkagel
Please correct me if I'm wrong but the end result we trying to achieve is that when we do a reload, downstream application should not process all the records from the table but just the ones which came in new. So the question is does the table maintain create_date/last_update date in the source?
If Yes then we have to build the downstream logic based on the last_update column which is the right representation of when the record was changed.
If not then it would be difficult since the Full Load process might drop and recreate the table or truncate the existing data and reload which will wipe out any history we had before.
There is a Do Nothing option for Full Load which is documented to process only new changes but from my experience I've seen it creating duplicates and might not help. But can try it in a non-Prod environment and check if it helps.
Thank you,
Hello,
You can play around with this and maybe it will help.
I created a global transformation and added a new column for reload_date.
CASE
WHEN $AR_H_STREAM_POSITION = ''
then datetime('NOW','localtime')
end
The field is only populated when a full load is triggered. I tested with 2 tables. On the initial load, both tables (table a and b) have a value of "2022-08-05 10:28:17.000000" when i reloaded table a, this now has a value of "2022-08-05 10:29:34.000000" and table b retains the original value however on CDC this column will be set to NULL
I will run additional tests - this is what i have for now 🙂
Thanks
Lyka
Hi @pkagel ,
You can find the information in the task log as below:
00011956: 2022-08-04T11:25:56 [TARGET_LOAD ]I: Load finished for table 'dbo'.'literal_pk' (Id = 1). 5 rows received. 0 rows skipped. Volume transferred 2224. (streamcomponent.c:3976)
00011956: 2022-08-04T11:25:56 [TARGET_LOAD ]I: Load finished for table 'dbo'.'literal_pk' (Id = 1). 5 rows received. 0 rows skipped. Volume transferred 2224. (streamcomponent.c:3976)00043892: 2022-08-04T11:25:59 [TASK_MANAGER ]I: Loading finished for table 'dbo'.'literal_pk' (Id = 1) by subtask 1. 5 records transferred. (replicationtask.c:3012)
Thanks,
Swathi
Hi @pkagel ,
Using global transformation, you can add different columns to populate the Reload date, Insert time, and update time at Target.
Thanks,
Swathi
So what field would tell me that it was a reload and not just a normal insert?
Hi @pkagel ,
Go to the Expression Builder--> Variables--> Then select the $AR_V_TASK_RELOAD_TIME.
Below is the attachment for your reference:
Hello @pkagel ,
You can also use the "AR_H_TIMESTAMP" in global transformation, When you initiate the full load it will just append the "Current time" when the actual load is done and For Change processing "The timestamp of the Change operation, which is represented as the local time on Replicate Server. If your database and Replicate Server are located in the same timezone, the timestamp will be approximately representative of the actual Change operation"
Here is the screenshot for your reference.
Hope this helps for you.
Regards,
Sachin B
This field doesn't work when you only reload a table. Not a full reload of all the tables in the task I just want to know when a table gets reloaded. Is there any other option?
Hi @pkagel
Please correct me if I'm wrong but the end result we trying to achieve is that when we do a reload, downstream application should not process all the records from the table but just the ones which came in new. So the question is does the table maintain create_date/last_update date in the source?
If Yes then we have to build the downstream logic based on the last_update column which is the right representation of when the record was changed.
If not then it would be difficult since the Full Load process might drop and recreate the table or truncate the existing data and reload which will wipe out any history we had before.
There is a Do Nothing option for Full Load which is documented to process only new changes but from my experience I've seen it creating duplicates and might not help. But can try it in a non-Prod environment and check if it helps.
Thank you,
Thank you. Unfortunately the application does not always properly populate those fields so we can't use them. I was referring to the other suggestion about using the creation_date update_date fields