topic Re: Remove all duplicate rows in Talend Studio

Remove all duplicate rows

Anonymous — Fri, 11 Aug 2017 03:11:59 GMT

Hi all,

I'm new to Talend and I have a question,

Have any component let me remove all duplicate rows?

Sample data:

Name, Age

HieuDoan, 15

LinhNa, 16

HieuDoan, 20

NamL, 17

I have tried tUniqRow but it still keep the first duplicate row and not remove all duplicate:

The ouput I got with tUniqRow :

HieuDoan, 15

LinhNa, 16

NamL, 17

The output i need:

LinhNa, 16

NamL, 17

Can you give me a suggest? Thanks

Re: Remove all duplicate rows

Anonymous — Fri, 11 Aug 2017 04:52:21 GMT

Hi
Not a component can achieve it directly, you can cache the result into memory and do an inner join in the next subjob to get the expected result. eg:
tfileinputdelimimited--main--tuniqrow--unique--thashoutput1
--duplicated--thashoutput2
|
onsubjobok
|
tHashinput1--main--tmap--tlogrow
|
lookup
|
thashinput2
thashinput1: read data from thashoutput1
thashinput2: read data from thashoutput2

on tMap: do an inner join based on the keys you defined on tuniqrow, set the 'Catch lookup inner join reject' option as true in the output table.

Regards
Shong

Re: Remove all duplicate rows

Anonymous — Fri, 11 Aug 2017 07:39:10 GMT

Hi Shong,
My data source is Redshift and it have a lots data, where can I cache result into to use later ?

Re: Remove all duplicate rows

Anonymous — Fri, 11 Aug 2017 08:01:51 GMT

Store the data on disk instead of memory if there are lots of data.

Re: Remove all duplicate rows

vapukov — Sat, 12 Aug 2017 09:45:43 GMT

Talend do absolutely correct - tUniqRow DO NOT REMOVE all rows with duplicated, it is remove duplicates and leave original unique value

Your example - illustrate it great

if You want remove all rows which value in table more than 1 time

it could be easy done by SQL query, if You prefer Talend, You can do

tRedshiftInput -> tAggregate (count by value) -> tFilterRwo (condition - count == 1)

result will be as You wish :

LinhNa, 16
NamL, 17

as addition:

HieuDoan, 15 and HieuDoan, 20

generally also 2 people with same name, but different age :-), but this is other topic

Re: Remove all duplicate rows

wangbinlxx — Sun, 13 Aug 2017 14:24:26 GMT

If your source is Redshift, why not do this in DB, with a simple query?

SELECT *

FROM TABLE

MINUS

SELECT distinct *

FROM TABLE;

Depending on the size of the table, you need to tune the query. However, most likely you can get the best performance doing it within DB.

Re: Remove all duplicate rows

Anonymous — Mon, 14 Aug 2017 05:30:09 GMT

Thank Vapukov,
How to do it by SQL query?

Re: Remove all duplicate rows

wangbinlxx — Mon, 14 Aug 2017 06:59:10 GMT

Did you try?

SELECT *

FROM TABLE

MINUS

SELECT distinct *

FROM TABLE;

Re: Remove all duplicate rows

vapukov — Mon, 14 Aug 2017 07:10:40 GMT

in general case (if we accept You are try to exclude all names which only 1 in database, and not include any of records with more than 1 value)

it work as:

SELECT 
     t1.*
FROM table_name t1 INNER JOIN 
     (
     SELECT
            -- next could be changed for MAX depending from logic
            MIN(id) 
     FROM table_name
     GROUP BY "name"
     HAVING count(*) = 1 
     ) t2 ON t1.id = t2.id

code above mean - table have primary key and it name "id", if table do not have primary key, logic could be different - for example

SELECT 
     t1.*
FROM table_name t1 INNER JOIN 
     (
     SELECT
            name 
     FROM table_name
     GROUP BY "name"
     HAVING count(*) = 1 
     ) t2 ON t1.name = t2.name

it could be adjusted for real structure, because now I type "theoretical code"