PaulyWally
Contributor III

tDBOutput (Snowflake) Dropping Records

I have a Talend Project that moves data from MS SQL Server to Snowflake. Out of about 60 tables, there are 3 that are missing records.

 

Table 1:

2,248 source records

917 destination records

 

Table 2:

44,358 source records

44,314 destination records

 

Table 3:

2,915 source records

105 destination records

 

All the tDBOutput components are set to:

Table Action: NONE

Output Action: INSERT

 

I have looked at the query history in the Snowflake database and found some irregular behavior during the staging and insert for these 3 tables. The tDBOutput component normally performs the following for each table (a rough SQL sketch of this sequence follows the list):

 

  1. Upload a staging file with all records to the Snowflake database. This is done using the Snowflake PUT command.
  2. Create an empty temporary table that is a clone of the destination table.
  3. Insert all records into the temporary table from the staging file. This is done using the Snowflake COPY INTO command. The tDBOutput component automatically uses these COPY INTO command options: on_error='CONTINUE' file_format=( field_optionally_enclosed_by='"' empty_field_as_null=true)
  4. Insert all records into the destination table from the staging table using INSERT INTO.
  5. Remove the staging file. This is done using the Snowflake RM command.
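
For reference, here is roughly what that sequence looks like as plain Snowflake SQL. The file, stage, and table names below are hypothetical, just to illustrate the pattern; the exact statements Talend generates will differ:

-- 1. upload the local staging file to a stage (PUT compresses it to .gz by default)
PUT file:///tmp/table1.csv @~/talend_stage;

-- 2. create an empty temporary clone of the destination table
CREATE TEMPORARY TABLE "TMP_TABLE1" LIKE "TABLE1";

-- 3. load the staged file into the temp table; with on_error='CONTINUE', rows that fail parsing are skipped rather than loaded
COPY INTO "TMP_TABLE1" FROM @~/talend_stage/table1.csv.gz
  on_error='CONTINUE'
  file_format=( field_optionally_enclosed_by='"' empty_field_as_null=true );

-- 4. move the successfully loaded rows into the destination table
INSERT INTO "TABLE1" SELECT * FROM "TMP_TABLE1";

-- 5. clean up the staging file
RM @~/talend_stage/table1.csv.gz;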

 

However, on the 3 tables in question I see different behavior. There are 2 different COPY INTO commands being issued for each table. The 1st COPY INTO has the options:

  

  on_error='CONTINUE' file_format=( field_optionally_enclosed_by='"' empty_field_as_null=true)

 

The 2nd COPY INTO has the options:

 

  validation_mode='return_all_errors' file_format=( field_optionally_enclosed_by='"' empty_field_as_null=true)

 

The sum of the record counts for the two COPY INTO commands equals the record count from the source table. (A COPY INTO run with validation_mode does not load any data; it only validates the staged file and returns the errors for the rows it could not load, so the second command's record count appears to be the number of rejected rows.) Here are the details for each table:

 

Table 1:

2,248 source records

  1. PUT <staging file>
  2. CREATE <temp table>
  3. COPY INTO <temp table> FROM <staging file> on_error='CONTINUE' file_format=( field_optionally_enclosed_by='"' empty_field_as_null=true)
    1. 917 records
  4. COPY INTO <temp table> FROM <staging file> validation_mode='return_all_errors' file_format=( field_optionally_enclosed_by='"' empty_field_as_null=true)
    1. 1,331 records
  5. INSERT INTO <destination table> FROM <temp table>
    1. 917 records
  6. RM <staging file>

 

Table 2:

44,358 source records

  1. PUT <staging file>
  2. CREATE <temp table>
  3. COPY INTO <temp table> FROM <staging file> on_error='CONTINUE' file_format=( field_optionally_enclosed_by='"' empty_field_as_null=true)
    1. 44,314 records
  4. COPY INTO <temp table> FROM <staging file> validation_mode='return_all_errors' file_format=( field_optionally_enclosed_by='"' empty_field_as_null=true)
    1. 45 records
  5. INSERT INTO <destination table> FROM <temp table>
    1. 44,314 records
  6. RM <staging file>

 

Note: the sum of the COPY INTO records for this table is 44,359.

 

Table 3:

2,915 source records

  1. PUT <staging file>
  2. CREATE <temp table>
  3. COPY INTO <temp table> FROM <staging file> on_error='CONTINUE' file_format=( field_optionally_enclosed_by='"' empty_field_as_null=true)
    1. 105 records
  4. COPY INTO <temp table> FROM <staging file> validation_mode='return_all_errors' file_format=( field_optionally_enclosed_by='"' empty_field_as_null=true)
    1. 2,810 records
  5. INSERT INTO <destination table> FROM <temp table>
    1. 105 records
  6. RM <staging file>

 

At this point, I am mostly troubleshooting in the dark. Without having those staging files, it is difficult to determine where and why the records are being dropped. But the tDBOutput component is deleting the staging files before I can grab them.
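
(One thought, untested: if the job could be paused or forced to stop before the final RM step, the staged file would still be sitting on the stage and could be listed and downloaded with plain Snowflake commands. The stage path below is a placeholder; the real one is visible in the PUT command in the query history.)

-- list whatever is currently on the stage used by the PUT command
LIST @~/talend_stage;

-- download the staged file to a local directory for inspection (run from SnowSQL or another client)
GET @~/talend_stage/table1.csv.gz file:///tmp/snowflake_staging/;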

 

I am continuing to troubleshoot, but a lot of it is just guessing and trying to locate data irregularities in the source record sets with little information about where exactly the problem might be. So I am also hoping someone here might be able to offer some suggestions.

 

Thanks in advance!

 

Edit:

 

I have discovered all of the issues that caused the dropped records. They all essentially came down to bad data and/or a mismatch between the source table schema and the destination table schema. It was a tedious task to locate all of the issues. If I had the staging files, I could have located the issues much more quickly. I am hoping someone might be able to chime in with some ideas that could help expedite troubleshooting something like this. Perhaps there is a way to retrieve the staging files that I am unaware of.
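
For anyone hitting something similar: one way to speed up finding schema mismatches is to compare the column metadata on both sides. A rough sketch, assuming a table named TABLE1 exists in both systems (the table name and any schema/database filters are placeholders):

-- run against the MS SQL Server source
SELECT ORDINAL_POSITION, COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'TABLE1'
ORDER BY ORDINAL_POSITION;

-- run against the Snowflake destination
SELECT ORDINAL_POSITION, COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'TABLE1'
ORDER BY ORDINAL_POSITION;

Comparing the two result sets side by side makes type, length, and nullability differences easy to spot.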

2 Replies
Anonymous
Not applicable

Hi

For debugging, link tSnowFlakeOutput to a tLogRow using the Reject connector; you will see whether any records are rejected and can view the detailed error messages.

 

Regards

Shong

JackStrong
Contributor II

Hi.

 

I don't know if you have resolved the issue. If you haven't, maybe this will help you.

 

We encountered similar behaviour, and here is what the reason was.

At some point the JDBC connector was modified (I don't know in which version), and we observed different behaviour before and after the JDBC update.

 

Please check in the Snowflake query history how the temporary table is created.

 

Previously (old JDBC driver) it was:

CREATE TEMPORARY TABLE "TEMP_TABLE_NAME" AS SELECT "updated_col1","updated_col2",... FROM "TARGET_TABLE_NAME" WHERE FALSE

 

With the newer JDBC connector it looks like this:

CREATE TEMPORARY TABLE "TEMP_TABLE_NAME" LIKE "TARGET_TABLE_NAME"

 

and then all unnecessary (not updated) columns are dropped from TEMP_TABLE_NAME.
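
Presumably that clean-up is something along the lines of the statement below (the column names are hypothetical, just to illustrate the mechanism):

ALTER TABLE "TEMP_TABLE_NAME" DROP COLUMN "COL_NOT_UPDATED_1", "COL_NOT_UPDATED_2";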

 

But the second approach (new driver) means that the column order can be different in TEMP_TABLE_NAME and TARGET_TABLE_NAME.

When you insert the data (INSERT option) without changing the column order between the Input (stream/link) and Output components, there is no issue.

But when you update the data (UPDATE option), where not all columns of the output table (Output component) are populated/updated, the column order in the Input (stream/link) component can be different than in the Output.

In such a case there is an issue, because COPY INTO <temp table> FROM <staging file> receives different structures for the temp table and the staging stream. This means that some or all records can be dropped because of constraints, data types, etc.

And there will probably be the same issue (I didn't test it) if you insert data with a different column order between Input (stream/link) and Output.

 

Simply put, the issue occurs because the structure of the input data set is different from the structure of the Output object (different column order).
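
A quick way to confirm this is to compare the column order of the two tables, for example with:

DESCRIBE TABLE "TEMP_TABLE_NAME";
DESCRIBE TABLE "TARGET_TABLE_NAME";

Keep in mind that the temporary table only exists in the session that created it, so in practice you may instead have to compare the target table's columns with the CREATE/ALTER statements shown in the query history.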

 

 

This is how you can check what exactly your input data set for the COPY command (COPY INTO <temp table> FROM <staging file>) looks like, and what the error messages for the dropped records are (to validate the results you need to be logged in as the same user as your ETL process):

select * from table(result_scan('QUERY_ID'));

QUERY_ID is the id of the second COPY INTO statement (the one with validation_mode='return_all_errors').
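
If you don't know the QUERY_ID, you can look it up with the standard INFORMATION_SCHEMA.QUERY_HISTORY table function (a sketch; it returns the recent queries of the current user, so again run it as the ETL user):

SELECT QUERY_ID, START_TIME, QUERY_TEXT
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 1000))
WHERE QUERY_TEXT ILIKE '%validation_mode%'
ORDER BY START_TIME DESC;

Then plug the returned QUERY_ID into the RESULT_SCAN query above.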