topic Re: unique values extraction in Talend Studio

unique values extraction

Anonymous — Sat, 16 Nov 2024 03:41:09 GMT

Hi All,

need a help regarding extraction of unique values of a multiple columns individually in a table by using Talend/SQL.

EXAMPLE:

INPUT:

ID	NAME	SALARY	AGE
1	MOHAN	5000	27
2	AJAY	2700	26
3	SAI	5000	29
4	RAMA	3000	27
5	MOHAN	4500	29

EXPECTED OUTPUT:

ID	NAME	SALARY	AGE
1	MOHAN	5000	27
2	AJAY	2700	26
3	SAI	3000	29
4	RAMA	4500
5

Thanks in advance.

regards;

Bhargav

Re: unique values extraction

Anonymous — Fri, 03 Jan 2020 14:51:13 GMT

Hi,

You can either use tAggregaterow or tUniquerow to select the unique values where you need to pass only relevant fields as input to these components (you can use a tMap to select the right columns).

One thing I would like to say is that your input and output record do not have any correlation. For example, you were having two input records as shown below.

MOHAN	5000	27
MOHAN   4500    29

But you have taken the unique value as

MOHAN

5000

You might be joining records of two persons with name Mohan with different age to a single output record. Since the output record details are a combination of the earlier two records it may result in a totally new imaginary person.

So please double check how you would like to aggregate the input records.

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved

Re: unique values extraction

tnewbie — Fri, 03 Jan 2020 15:27:17 GMT

In addition to all the valid inferences from Nikhil, another approach i would suggest is:

1) Sort your data based on the column you want to check on (In your example Name)

2) Check the subsequent rows of data if they match the column value, if they match then ignore else carry forward. With this approach you would only process the first row of data for that column. However you are have to double ensure that your sort criteria is correct and valid

Eg. Input Data

SNo. Name Age

100 Mohan 23

101 Mike 45

102 Mohan 32

Sorted Data

SNo. Name Age

100 Mohan 23

102 Mohan 32

101 Mike 45

Process Detail

when the current row is Sno=100 you have nothing to compare so that will go through, when the current row is 102 you find that for the previous record Name= the name of the current record, so you can ignore that record, when you get to the next record Name=Mike which is different from Mohan so that will go through

This approach would greatly help in performance as well, because if the incoming data volume is huge, aggregator will quickly turn out to be performance hindrance. This is a age old technique where you compare rows on the fly... Hope that helps