Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi all,
I'm new to Talend and I have a question,
Have any component let me remove all duplicate rows?
Sample data:
Name, Age
HieuDoan, 15
LinhNa, 16
HieuDoan, 20
NamL, 17
I have tried tUniqRow but it still keep the first duplicate row and not remove all duplicate:
The ouput I got with tUniqRow :
HieuDoan, 15
LinhNa, 16
NamL, 17
The output i need:
LinhNa, 16
NamL, 17
Can you give me a suggest? Thanks
Hi
Not a component can achieve it directly, you can cache the result into memory and do an inner join in the next subjob to get the expected result. eg:
tfileinputdelimimited--main--tuniqrow--unique--thashoutput1
--duplicated--thashoutput2
|
onsubjobok
|
tHashinput1--main--tmap--tlogrow
|
lookup
|
thashinput2
thashinput1: read data from thashoutput1
thashinput2: read data from thashoutput2
on tMap: do an inner join based on the keys you defined on tuniqrow, set the 'Catch lookup inner join reject' option as true in the output table.
Regards
Shong
Talend do absolutely correct - tUniqRow DO NOT REMOVE all rows with duplicated, it is remove duplicates and leave original unique value
Your example - illustrate it great
if You want remove all rows which value in table more than 1 time
it could be easy done by SQL query, if You prefer Talend, You can do
tRedshiftInput -> tAggregate (count by value) -> tFilterRwo (condition - count == 1)
result will be as You wish :
LinhNa, 16 NamL, 17
as addition:
HieuDoan, 15 and HieuDoan, 20
generally also 2 people with same name, but different age :-), but this is other topic
If your source is Redshift, why not do this in DB, with a simple query?
SELECT *
FROM TABLE
MINUS
SELECT distinct *
FROM TABLE;
Depending on the size of the table, you need to tune the query. However, most likely you can get the best performance doing it within DB.
Thank Vapukov,
How to do it by SQL query?
Did you try?
SELECT *
FROM TABLE
MINUS
SELECT distinct *
FROM TABLE;
in general case (if we accept You are try to exclude all names which only 1 in database, and not include any of records with more than 1 value)
it work as:
SELECT t1.* FROM table_name t1 INNER JOIN ( SELECT -- next could be changed for MAX depending from logic MIN(id) FROM table_name GROUP BY "name" HAVING count(*) = 1 ) t2 ON t1.id = t2.id
code above mean - table have primary key and it name "id", if table do not have primary key, logic could be different - for example
SELECT t1.* FROM table_name t1 INNER JOIN ( SELECT name FROM table_name GROUP BY "name" HAVING count(*) = 1 ) t2 ON t1.name = t2.name
it could be adjusted for real structure, because now I type "theoretical code"