I have a Talend Project that moves data from MS SQL Server to Snowflake. Out of about 60 tables, there are 3 that are missing records.
Table 1:
2,248 source records
917 destination records
Table 2:
44,358 source records
44,314 destination records
Table 3:
2,915 source records
105 destination records
All the tDBOutput components are set to:
Table Action: NONE
Output Action: INSERT
I have looked at the query history in the Snowflake database and found some irregular behavior during the staging and insert for these 3 tables. The tDBOutput component normally performs the same sequence of staging and load statements for each table.
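Roughly, and only as a sketch (the stage, file, and table names below are placeholders, and the exact statements vary by component version and settings; the replies below describe a temporary-table variant), that normal per-table sequence in the query history looks something like:
-- upload the staging file to a stage
PUT file:///tmp/talend_chunk_0.csv @"STAGE_NAME";
-- create a temporary table matching the target
CREATE TEMPORARY TABLE "TEMP_TABLE_NAME" LIKE "TARGET_TABLE_NAME";
-- single bulk load from the staged file
COPY INTO "TEMP_TABLE_NAME" FROM @"STAGE_NAME" file_format=( field_optionally_enclosed_by='"' empty_field_as_null=true);
-- apply the Output Action (INSERT) against the target table
INSERT INTO "TARGET_TABLE_NAME" SELECT * FROM "TEMP_TABLE_NAME";
-- staging files are then cleaned up
REMOVE @"STAGE_NAME";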
However, on the 3 tables in question I see different behavior. There are 2 different COPY INTO commands being issued for each table. The 1st COPY INTO has the options:
on_error='CONTINUE' file_format=( field_optionally_enclosed_by='"' empty_field_as_null=true)
The 2nd COPY INTO has the options:
validation_mode='return_all_errors' file_format=( field_optionally_enclosed_by='"' empty_field_as_null=true)
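Written out in full (only the options above come from the actual query history; the table and stage names here are placeholders), the pair looks something like:
-- 1st pass: bulk load that skips any rows that fail
COPY INTO "TEMP_TABLE_NAME" FROM @"STAGE_NAME" on_error='CONTINUE' file_format=( field_optionally_enclosed_by='"' empty_field_as_null=true);
-- 2nd pass: validation only, loads nothing; returns an error row for each record that could not be loaded
COPY INTO "TEMP_TABLE_NAME" FROM @"STAGE_NAME" validation_mode='return_all_errors' file_format=( field_optionally_enclosed_by='"' empty_field_as_null=true);
In that pattern, the first pass loads the rows it can and the second pass reports the ones it could not, which is consistent with the record counts below.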
The sum of the record counts for both COPY INTO commands equals the record count of the source table. Here are the details for each table:
Table 1:
2,248 source records
Table 2:
44,358 source records
Note: the sum of the COPY INTO records for this table is 44,359.
Table 3:
2,915 source records
At this point, I am mostly troubleshooting in the dark. Without those staging files it is difficult to determine where and why records are being dropped, but the tDBOutput component deletes the staging files before I can grab them.
I am continuing to troubleshoot, but a lot of it is guesswork: trying to locate data irregularities in the source record sets with little information about where exactly the problem might be. So I am also hoping someone here might be able to offer some suggestions.
Thanks in advance!
Edit:
I have discovered all of the issues that caused the dropped records. They all essentially came down to bad data and/or a mismatch between the source table schema and the destination table schema. It was a tedious task to locate all of the issues. If I had the staging files, I could have located the issues much more quickly. I am hoping someone might be able to chime in with some ideas that could help expedite troubleshooting something like this. Perhaps there is a way to retrieve the staging files that I am unaware of.
Hi
For debugging, link the tSnowFlakeOutput to a tLogRow using the Reject connector; you will see whether any records are rejected and can view the detailed error message.
Regards
Shong
Hi.
I don't know if you have resolved the issue. If not, maybe this will help you.
We ran into similar behaviour, and here is what the reason was.
At some point the JDBC connector was modified (I don't know in which version), and we observed different behaviour before and after the JDBC modification/update.
Please check in the Snowflake query history how the temporary table is created.
Previously (old JDBC driver) it was:
CREATE TEMPORARY TABLE "TEMP_TABLE_NAME" AS SELECT "updated_col1","updated_col2",... FROM "TARGET_TABLE_NAME" WHERE FALSE
For the newer JDBC connector it looks like this:
CREATE TEMPORARY TABLE "TEMP_TABLE_NAME" LIKE "TARGET_TABLE_NAME"
and then all unnecessary (not updated) columns are dropped from TEMP_TABLE_NAME.
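As a hedged sketch (column names are placeholders and the exact statements may differ by driver version), that newer sequence can look like:
CREATE TEMPORARY TABLE "TEMP_TABLE_NAME" LIKE "TARGET_TABLE_NAME";
-- columns that are not part of the update are then dropped one by one
ALTER TABLE "TEMP_TABLE_NAME" DROP COLUMN "COL_NOT_UPDATED_1";
ALTER TABLE "TEMP_TABLE_NAME" DROP COLUMN "COL_NOT_UPDATED_2";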
But the second approach (new driver) means that the column order can differ between TEMP_TABLE_NAME and TARGET_TABLE_NAME.
When you insert data (INSERT option) without changing the column order between the Input (stream/link) and Output components, there is no issue.
But when you update data (UPDATE option) and not all columns of the output table (Output component) are populated/updated, the column order in the Input (stream/link) component can differ from the Output.
In that case there is an issue because COPY INTO <temp table> FROM <staging file> receives different structures for the temp table and the staging stream. It means that some or all records can be dropped because of constraints, data types, etc.
The same issue probably occurs (I didn't test it) if you insert data with a different column order between the Input (stream/link) and the Output.
Simply put, the issue arises because the structure of the input data set differs from the Output object structure (different column order).
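For illustration only (made-up columns): if the staging file is written in the stream order NAME, ID while TEMP_TABLE_NAME keeps the target's order ID, NAME, then a plain COPY INTO "TEMP_TABLE_NAME" FROM @"STAGE_NAME" with no column list loads by position, so NAME values land in the ID column and vice versa, and those rows fail on data types or constraints and are dropped.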
This is how you can check exactly what your input data set for the COPY command (COPY INTO <temp table> FROM <staging file>) looks like, and what the error messages are for the dropped records (to view the results, you need to be logged in as the same user as your ETL process):
select * from table(result_scan('QUERY_ID'));
QUERY_ID is the id of the second COPY INTO statement (the one with validation_mode='return_all_errors').
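If you are not sure which query id that is, one way to look it up (query_history and result_scan are standard Snowflake; the filter below is just an example) is:
-- list recent validation passes and grab the query_id
select query_id, start_time, query_text
from table(information_schema.query_history(result_limit => 1000))
where query_text ilike '%return_all_errors%'
order by start_time desc;
-- then re-read that statement's output
select * from table(result_scan('<QUERY_ID_FROM_ABOVE>'));
Note that result_scan only works for queries run within the last 24 hours.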