Anonymous
Not applicable

Checking for duplicate content across 2 or more records

I have a continuous stream of files arriving from a server, and I need to load them into HDFS one by one. Because multiple files arrive, I need to check whether two files have the same content; if they do, I need to reject the duplicate file and skip loading it.

Could anyone please help me achieve this?

 

Thanks

8 Replies
TRF
Champion II

As a trivial solution:

- Use tFileInputRaw to read each file as a single column, then a tMap to inner join the two inputs.

- If the number of records in the join result equals the number of records in the input, the files do not differ.
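
For readers who want to see the check itself, here is a minimal sketch in plain Java, for example for a tJava component or a custom routine. It is not the tMap join described above but a stricter stand-in: two files count as identical only when they hold the same lines in the same order. The class name `FileContentCompare` and the method `sameContent` are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class FileContentCompare {

    // Hypothetical helper: true when both files contain exactly
    // the same lines in the same order.
    static boolean sameContent(Path first, Path second) throws IOException {
        List<String> a = Files.readAllLines(first);
        List<String> b = Files.readAllLines(second);
        return a.equals(b);
    }

    public static void main(String[] args) throws IOException {
        // Demo with two temporary files holding identical rows.
        Path first = Files.createTempFile("first", ".txt");
        Path second = Files.createTempFile("second", ".txt");
        Files.write(first, Arrays.asList("id;name", "1;alice"));
        Files.write(second, Arrays.asList("id;name", "1;alice"));

        System.out.println("Same content: " + sameContent(first, second));

        Files.delete(first);
        Files.delete(second);
    }
}
```

Both files are read fully into memory, so this only suits files of moderate size.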

Anonymous
Not applicable
Author

Hi,

 

You can create a mediation route with a cFile and a cIdempotentConsumer component.

 

https://help.talend.com/reader/YtJvt25ynUgZ~sfL~L5dAg/2Gc7F~jW8E2LUycJ0xpLdQ
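
To illustrate the idea behind an idempotent consumer, here is a rough sketch in plain Java. It does not use the Talend or Camel API; it only shows the principle, assuming the message key is a SHA-256 digest of the file content and that the set of keys already seen fits in memory. The class name `ContentDeduplicator` and the method `acceptIfNew` are made up for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

class ContentDeduplicator {

    // Digests of every file content accepted so far (in memory only).
    private final Set<String> seenDigests = new HashSet<>();

    // Hypothetical helper: returns true the first time a given content
    // is seen, false when a file with the same bytes was already accepted.
    boolean acceptIfNew(Path file) throws IOException, NoSuchAlgorithmException {
        byte[] content = Files.readAllBytes(file);
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
        String key = Base64.getEncoder().encodeToString(digest);
        // Set.add returns false when the key is already present.
        return seenDigests.add(key);
    }

    public static void main(String[] args) throws Exception {
        ContentDeduplicator dedup = new ContentDeduplicator();
        Path first = Files.createTempFile("in1", ".txt");
        Path second = Files.createTempFile("in2", ".txt");
        Files.write(first, "same payload".getBytes());
        Files.write(second, "same payload".getBytes());

        System.out.println("first accepted: " + dedup.acceptIfNew(first));
        System.out.println("second accepted: " + dedup.acceptIfNew(second));

        Files.delete(first);
        Files.delete(second);
    }
}
```

A file would be loaded into HDFS only when `acceptIfNew` returns true. Since the files arrive continuously, the set of digests would have to be persisted somewhere to survive a restart; that part is left out here.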

 

Eric

Anonymous
Not applicable
Author

Thanks for replying.

 

I tried that approach, using two tFileInputRaw components for the 2 files. tFileInputRaw only offers Object as the data type, so I couldn't get the row count.

 

My situation is this:

I have multiple records coming in from the server, and I need to validate whether their contents are the same. Perhaps tFileInputRaw limits the number of input files.

 

Thanks for the suggestion. Could you please elaborate on the solution, in case I misunderstood the approach?

TRF
Champion II

Sorry, I meant tFileInputFullRow instead of tFileInputRaw.

Anonymous
Not applicable
Author

Thanks for replying. 

 

I am looking for a solution in Talend Open Studio.

 

Thanks

TRF
Champion II

tFileInputFullRow is available in TOS
Anonymous
Not applicable
Author

Sorry, that went to you; I meant to send that message to Eric.

 

Anyway, tFileInputFullRow will only read one file. What should the approach be if I am comparing some 50-60 files?

 

thanks

TRF
Champion II

Use 1 parent job to iterate over the file list (tFileList), plus 1 child job (called using tRunJob) that receives the current filename (to avoid comparing a file with itself), iterates over the same file list, and compares the files two by two.

This needs some thought for optimization (for example, don't compare both file1/file2 and file2/file1).
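
The pairing logic can be sketched in plain Java as follows. This is only an illustration of the loop structure, not a Talend job: the inner loop starts at `i + 1`, so a file is never compared with itself and each pair is checked once. The class name `PairwiseDuplicateFinder` and the method `findDuplicates` are made up, and the sketch assumes every file fits in memory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class PairwiseDuplicateFinder {

    // Hypothetical helper: returns the files whose bytes equal those of
    // an earlier file in the list; the earliest copy is kept.
    static Set<Path> findDuplicates(List<Path> files) throws IOException {
        Set<Path> duplicates = new LinkedHashSet<>();
        for (int i = 0; i < files.size(); i++) {
            if (duplicates.contains(files.get(i))) {
                continue; // already matched an earlier file
            }
            byte[] reference = Files.readAllBytes(files.get(i));
            // Start at i + 1: no self comparison, each pair visited once.
            for (int j = i + 1; j < files.size(); j++) {
                Path candidate = files.get(j);
                if (duplicates.contains(candidate)) {
                    continue;
                }
                if (Arrays.equals(reference, Files.readAllBytes(candidate))) {
                    duplicates.add(candidate);
                }
            }
        }
        return duplicates;
    }

    public static void main(String[] args) throws IOException {
        List<Path> files = new ArrayList<>();
        String[] contents = {"alpha", "beta", "alpha"};
        for (String content : contents) {
            Path file = Files.createTempFile("pair", ".txt");
            Files.write(file, content.getBytes());
            files.add(file);
        }
        System.out.println("Duplicates found: " + findDuplicates(files).size());
        for (Path file : files) {
            Files.delete(file);
        }
    }
}
```

With 50-60 files this still re-reads files many times, so it mirrors the parent/child iteration rather than the fastest possible approach.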