Skip to main content
Announcements
SYSTEM MAINTENANCE: Thurs., Sept. 19, 1 AM ET, Platform will be unavailable for approx. 60 minutes.
cancel
Showing results for 
Search instead for 
Did you mean: 
Anonymous
Not applicable

Comparing more than one column from two different files in tmap

I/p:Two text files having 5 columns each

Schema:Both files having same schema

O/p: Need to compare all 5 columns for each row and assign a flag if any of the column is not matching.

Approach taken:

Using t map on a key and trying to check each column using .equals() method and assigning values to a flag based on outcome of comparison.

 

Issue:

Not able to write the statement to implement the approach.

Ask:

Need pointers to write the statement or suggestion if i should take another approach.

 

 

Labels (2)
1 Solution

Accepted Solutions
vapukov
Master II
Master II

🙂

 

for what?!!!!! 

 

seriously - without understanding full picture, not possible to make a proper solution!

 

is your program (application, service and etc) - work with plain csv files? 

 

example1 - pure CDC:

- MySQL, PostgreSQL - proper CDC technologies already available on market for free

- SQL Server, Oracle and etc - in enterprise versions, but could be easily realized by triggers 

catch transactions (filtered by table if need), send to Kafka, parse by many subscribers

 

example 2:

load both files in any database ( including local SQLite) and all CDC between 2 files it is just a 3 SQL queries

 

example 3:

use JDBC for csv - again 3 simple queries

 

example 4:

pure Talend - 3 flows:

- find id from 1st files which is not 2nd - for INSERT - 1st file in main, 2nd is lookup, INNER JOIN catch rejected

- find id in 2nd but not in 1st - for DELETE- 2st file in main, 1nd is lookup, INNER JOIN catch rejected

- INNER JOIN in tMAP by id and filter where

(row1.col1 != row2.col1 || row1.col2 != row2.col2 || row1.col3 != row2.col3 || row1.col4 != row2.col4 || row1.col5 != row2.col5)

instead of != possible to use .equals() for string and etc

 

 

 

View solution in original post

6 Replies
Anonymous
Not applicable
Author

Hi,

 

    If my understanding about your use case is correct, you are trying to match the records from two files having same schema and you want to pick the matched and non matched records between main flow and lookup flow.

 

     If my assumption is right, then the most easy way to do the matching is to add an inner join between the columns of main and lookup flow. If there are matched records (where matching is done on all 5 columns), then you can send the data to output flow and you can add the flag value as "Y". You can separate all non-matched records of inner join by creating another output flow and marking the inner join reject as true for that flow. Here you can add the flag value as "N"

0683p000009M26p.pngFlag value set as true for inner join reject

 

 

      You can store the output to separate file or tHashOutput for later process. If you want to merge the output in separate Sub job by reading the output data from both output flows and merge them using tUnite component.

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved 🙂

vapukov
Master II
Master II

Hi Nikhil,

 

think it needs little more complicated logic, because of inner join lookup (catch rejected):

  • catch rejected rows only from Main flow (not from lookup)
  • what if a different number of rows? for example, lookup file contains more rows than Main

 

so, first need to clarify the goal and environment and after think -how to better approach the goal 0683p000009MACn.png

Anonymous
Not applicable
Author

Thanks Vapukov,

Intention is to implement CDC.I have open studio version which doesn't have  CDC components.

Moreover CDC need to be implemented on files ,No DB involved.

 

Trying to used to identify unmatched records from both files in a single job.

 

I could implement logic to get the unmatched from one file using option provided by Nikhil, but not from both i/p files.

 

Also by creating two sub jobs i could get unmatched from both file, but  to check if there is any update on columns except key field what should i do?

I am thinking to write a variable in tmap and compare each column by using some function(I don't know which function i should use here to compare).

vapukov
Master II
Master II

let talk about "CDC"

 

  • how is all look?
  • why 2 files?
  • what real source of information?
  • what size of files?
  • for any CDC you must have primary key, do you have them in files?

 

 

Anonymous
Not applicable
Author

Yes I do have primary key.
Size is less only.
Files created with some sample data and trying to implement CDC.

vapukov
Master II
Master II

🙂

 

for what?!!!!! 

 

seriously - without understanding full picture, not possible to make a proper solution!

 

is your program (application, service and etc) - work with plain csv files? 

 

example1 - pure CDC:

- MySQL, PostgreSQL - proper CDC technologies already available on market for free

- SQL Server, Oracle and etc - in enterprise versions, but could be easily realized by triggers 

catch transactions (filtered by table if need), send to Kafka, parse by many subscribers

 

example 2:

load both files in any database ( including local SQLite) and all CDC between 2 files it is just a 3 SQL queries

 

example 3:

use JDBC for csv - again 3 simple queries

 

example 4:

pure Talend - 3 flows:

- find id from 1st files which is not 2nd - for INSERT - 1st file in main, 2nd is lookup, INNER JOIN catch rejected

- find id in 2nd but not in 1st - for DELETE- 2st file in main, 1nd is lookup, INNER JOIN catch rejected

- INNER JOIN in tMAP by id and filter where

(row1.col1 != row2.col1 || row1.col2 != row2.col2 || row1.col3 != row2.col3 || row1.col4 != row2.col4 || row1.col5 != row2.col5)

instead of != possible to use .equals() for string and etc