Do not input private or sensitive data. View Qlik Privacy & Cookie Policy.
Skip to main content

Announcements
Qlik Open Lakehouse is Now Generally Available! Discover the key highlights and partner resources here.
cancel
Showing results for 
Search instead for 
Did you mean: 
Anonymous
Not applicable

Remove all duplicate rows

Hi all,

I'm new to Talend and I have a question,

Have any component let me remove all duplicate rows?

Sample data:

Name, Age

HieuDoan, 15

LinhNa, 16

HieuDoan, 20

NamL, 17

 

I have tried tUniqRow but it still keep the first duplicate row and not remove all duplicate:

The ouput I got with tUniqRow :

HieuDoan, 15

LinhNa, 16

NamL, 17

 

The output i need:

LinhNa, 16

NamL, 17

 

Can you give me a suggest? Thanks

Labels (2)
8 Replies
Anonymous
Not applicable
Author

Hi
Not a component can achieve it directly, you can cache the result into memory and do an inner join in the next subjob to get the expected result. eg:
tfileinputdelimimited--main--tuniqrow--unique--thashoutput1
--duplicated--thashoutput2
           |
onsubjobok
          |
tHashinput1--main--tmap--tlogrow
                                 |
                           lookup
                                |
                      thashinput2
thashinput1: read data from thashoutput1
thashinput2: read data from thashoutput2

on tMap: do an inner join based on the keys you defined on tuniqrow, set the 'Catch lookup inner join reject' option as true in the output table.

Regards
Shong

Anonymous
Not applicable
Author

Hi Shong,
My data source is Redshift and it have a lots data, where can I cache result into to use later ?
Anonymous
Not applicable
Author

Store the data on disk instead of memory if there are lots of data.

vapukov
Master II
Master II

Talend do absolutely correct - tUniqRow DO NOT REMOVE all rows with duplicated, it is remove duplicates and leave original unique value

 

Your example - illustrate it great

 

if You want remove all rows which value in table more than 1 time

it could be easy done by SQL query, if You prefer Talend, You can do

 

tRedshiftInput -> tAggregate (count by value) -> tFilterRwo (condition - count == 1)

result will be as You wish :

 

LinhNa, 16
NamL, 17

 

 

as addition:

HieuDoan, 15 and HieuDoan, 20

generally  also 2 people with same name, but different age :-), but this is other topic

wangbinlxx
Creator
Creator

If your source is Redshift, why not do this in DB, with a simple query?

 

SELECT *

FROM TABLE

MINUS 

SELECT distinct *

FROM TABLE;

 

Depending on the size of the table, you need to tune the query. However, most likely you can get the best performance doing it within DB.  

 

 

Anonymous
Not applicable
Author

Thank Vapukov,
How to do it by SQL query?

wangbinlxx
Creator
Creator

Did you try?

SELECT *

FROM TABLE

MINUS 

SELECT distinct *

FROM TABLE;

vapukov
Master II
Master II

in general case (if we accept You are try to exclude all names which only 1 in database, and not include any of records with more than 1 value)

it work as:

 

SELECT 
     t1.*
FROM table_name t1 INNER JOIN 
     (
     SELECT
            -- next could be changed for MAX depending from logic
            MIN(id) 
     FROM table_name
     GROUP BY "name"
     HAVING count(*) = 1 
     ) t2 ON t1.id = t2.id



code above mean - table have primary key and it name  "id", if table do not have primary key, logic could be different - for example

 

SELECT 
     t1.*
FROM table_name t1 INNER JOIN 
     (
     SELECT
            name 
     FROM table_name
     GROUP BY "name"
     HAVING count(*) = 1 
     ) t2 ON t1.name = t2.name

 

it could be adjusted for real structure, because now I type "theoretical code" 0683p000009MACn.png