mohan_1105 (Partner - Creator III)

Data Loss Issue in Talend Incremental Load with Oracle ATP and Parallel Processing

Hello Talend Community,

I’m experiencing a data loss issue in an incremental load job in Talend and would appreciate your help in identifying the root cause and finding a solution. Here’s a detailed overview of my setup, the problem, and the possibilities I’ve considered:

I have a Talend job designed to handle three flows: full load, incremental load, and custom load, triggered by user-defined start and end dates. The focus of my issue is the incremental load, which processes data from an Oracle SaaS source (via SOAP API calls) and writes to an Oracle ATP target. The job flow is as follows:

  1. Initial Step: Capture the maximum end date from the main table (context.table_name) using a tOracleInput component (a SQL sketch of this step and the temp table load follows this list).
  2. Insert/Update Flow: Process data between the max end date and the user-defined end date for inserts and updates, executed only in the first flow.
  3. Temp Table Load:
    • Truncate the temp table (context.table_name_L).
    • Pull all ID columns (e.g., FULFILL_LINE_ID) from the Oracle SaaS source via SOAP API calls.
    • Write these IDs to the _L temp table in Oracle ATP using parallel threads (set to 4) and tFlowToIterate for iteration.
    • The process uses parallel execution for both API calls and writes, with onSubjobOK triggers to sequence the temp table write and subsequent delete operation.
  4. Delete Operation: Use the following DELETE query in a tOracleOutput component to remove records from the main table that don’t exist in the temp table:
    DELETE FROM "+context.table_name+" t
    WHERE NOT EXISTS (
      SELECT 1
      FROM "+context.table_name+"_L l
      WHERE t.FULFILL_LINE_ID = l.FULFILL_LINE_ID
    )
    This delete operation starts only after the temp table write succeeds, triggered by onSubjobOK.
  5. Post-Job: Update an audit table with the max date for the next incremental run.
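For context, the lookups behind steps 1 and 3 are conceptually equivalent to the SQL below. This is only a minimal sketch: MY_TABLE stands in for context.table_name, and the END_DATE column name is an assumption, since the post does not show the actual date column driving the increment.

    -- Step 1: capture the incremental high-water mark (tOracleInput).
    -- END_DATE is a placeholder for the real date column.
    SELECT MAX(END_DATE) AS MAX_END_DATE
    FROM MY_TABLE;

    -- Step 3: reset the temp table, then each of the 4 parallel threads
    -- inserts the IDs it pulled from the SOAP API, one row per iteration.
    TRUNCATE TABLE MY_TABLE_L;
    INSERT INTO MY_TABLE_L (FULFILL_LINE_ID) VALUES (:fulfill_line_id);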

AutoCommit is enabled for all database operations, so each statement commits immediately in its own transaction.

Problem:

I’m experiencing data loss in the incremental load, with the following specific observations:

  • Most of the time, data loss occurs at the beginning of the load.
  • Sometimes, the row count in the temp table (_L) is less than the source row count from the API.
  • In some cases, the row count in _L matches the source, but data is still deleted from the main table incorrectly.

The job fails if the insert/update flow errors, stopping the subsequent steps (temp table write and delete), but the data loss persists even when the job completes successfully.

My questions:

  • Has anyone experienced similar data loss issues in Talend incremental loads with parallel processing and Oracle ATP? What was the root cause, and how did you resolve it?
  • Are there known issues with tFlowToIterate or parallel tOracleOutput components that could lead to race conditions or data loss?
  • How can I better synchronize the 4 parallel threads writing to the _L table to prevent overwrites or missing records?
  • Should I adjust the parallel thread count, or use partitioning or one temporary table per thread, merging the results into _L? (One possible shape of the per-thread approach is sketched below.)
  • Are there best practices for validating SOAP API responses in Talend to ensure all source data is captured, especially at the start of the load?
  • Could the DELETE query’s NOT EXISTS subquery be improved to prevent incorrect deletions (for example, guarded by a row-count check, as sketched below), or is the issue solely with the _L data?
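To make the per-thread idea concrete, here is a minimal sketch under assumed names: each thread writes to its own staging table, and one sequential step merges the four into _L after all threads finish. The table names MY_TABLE_L_T1 through MY_TABLE_L_T4 are illustrative placeholders, not part of the actual job.

    -- Merge the four per-thread staging tables into _L in a single
    -- sequential step; UNION also removes duplicate IDs across threads.
    INSERT INTO MY_TABLE_L (FULFILL_LINE_ID)
    SELECT FULFILL_LINE_ID FROM MY_TABLE_L_T1
    UNION
    SELECT FULFILL_LINE_ID FROM MY_TABLE_L_T2
    UNION
    SELECT FULFILL_LINE_ID FROM MY_TABLE_L_T3
    UNION
    SELECT FULFILL_LINE_ID FROM MY_TABLE_L_T4;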

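And one hedged sketch of what a guarded delete could look like: the delete removes rows only when _L holds the row count expected from the API, so a short-loaded temp table cannot wipe valid data. The :expected_count bind variable and the MY_TABLE names are assumptions for illustration only.

    -- Refuse to delete anything unless _L contains exactly the number of
    -- IDs the API reported; a partial _L load then deletes nothing.
    DELETE FROM MY_TABLE t
    WHERE NOT EXISTS (
            SELECT 1
            FROM MY_TABLE_L l
            WHERE l.FULFILL_LINE_ID = t.FULFILL_LINE_ID
          )
      AND :expected_count = (SELECT COUNT(*) FROM MY_TABLE_L);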
Job Design:

[Screenshot of the Talend job design: mohan_1105_0-1741190204055.png]

Thanks,

Mohan
