Anonymous
Not applicable

How to filter row data by comparing rows with each other

Hi there,

 

I’d like to update original data with delta data.
(The original data is named “sample_org.tsv”, the delta is “sample_delta.tsv”)

 

Original Data

id name timestamp
------------------------------------------
1 Jorn 2019-05-11 12:49:01
2 Ken 2019-05-11 13:39:11
3 Smith 2019-05-11 17:53:33
4 Taro 2019-05-12 13:42:21
5 Hanako 2019-05-12 17:55:06

 

Delta Data
2 Kana 2019-05-13 11:54:21
3 Kenji 2019-05-13 11:54:51
6 James 2019-05-13 11:55:06

First of all, I united both data files into one file with the tUnite component.
After that, I separated the data into two groups using tMatchGroup: Unique and Match (which could also be called Duplicate).
(The unique data file is named “sample_unique.tsv”, the duplicated one “sample_dup”)

 

Unique Data
5 Hanako 2019-05-12 17:55:06
6 James 2019-05-13 11:55:06
4 Taro 2019-05-12 13:42:21
1 Jorn 2019-05-11 12:49:01

 

Duplicate Data
2 Kana 2019-05-13 11:54:21
2 Ken 2019-05-11 13:39:11
3 Kenji 2019-05-13 11:54:51
3 Smith 2019-05-11 17:53:33


Then I want to filter the Duplicate Data by comparing the timestamps of rows that share the same ID, and delete the older rows.

 

Duplicate Data
2 Kana 2019-05-13 11:54:21
2 Ken 2019-05-11 13:39:11   <- deleted, because this row is older than the row above
3 Kenji 2019-05-13 11:54:51
3 Smith 2019-05-11 17:53:33  <- deleted, because this row is older than the row above

 

After this filtering, I want to unite the Unique Data with the processed data above.


Which components should I use, and how can I realize this in Talend Studio?
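
For reference, here is the end-to-end logic I'm after, sketched in plain Java rather than Talend components (the file names are the ones above; I'm assuming the actual TSV files have no header row):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MergeLatest {
    public static void main(String[] args) throws IOException {
        // Read both files; each line is "id<TAB>name<TAB>timestamp"
        // (assumption: no header row in the actual files).
        List<String> rows = new ArrayList<>(Files.readAllLines(Paths.get("sample_org.tsv")));
        rows.addAll(Files.readAllLines(Paths.get("sample_delta.tsv")));

        // Keep, per id, the row with the largest timestamp. The format
        // "yyyy-MM-dd HH:mm:ss" sorts correctly as a plain string.
        Map<String, String[]> latest = new TreeMap<>();
        for (String line : rows) {
            String[] f = line.split("\t");
            String[] cur = latest.get(f[0]);
            if (cur == null || f[2].compareTo(cur[2]) > 0) {
                latest.put(f[0], f);
            }
        }

        // Write the merged result (ids are compared as strings here,
        // so e.g. "10" would sort before "2").
        List<String> out = new ArrayList<>();
        for (String[] f : latest.values()) {
            out.add(String.join("\t", f));
        }
        Files.write(Paths.get("sample_merged.tsv"), out);
    }
}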

 

5 Replies
Anonymous
Not applicable
Author

Hi,

 

    Your intention is a good and interesting one. But if you try to do it in a single subjob, it may not be possible, as it would create a cyclic dependency.

 

     I hope you are clear up to the point of segregating the Unique and Duplicate data, and I would advise stopping your subjob at this logical point. You can capture the output either in an interim tHashOutput or in interim files for further processing.

 

    The next step is to filter out the old data in the duplicate data set. You can pass the data to tAggregateRow to identify the oldest or latest record per ID using the min and max functions. Once you have this data set, you can use it as a lookup and join it with the duplicate data set (the main flow) using an inner join condition. In this way, you will get only the newest record for each ID from the duplicate data set.
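
    To illustrate the logic only (this is not what Talend generates, just a plain-Java sketch, with field names assumed from your sample data):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LatestFromDuplicates {
    // One row of the duplicate data set.
    record Row(String id, String name, String ts) {}

    public static void main(String[] args) {
        List<Row> duplicates = List.of(
                new Row("2", "Kana", "2019-05-13 11:54:21"),
                new Row("2", "Ken", "2019-05-11 13:39:11"),
                new Row("3", "Kenji", "2019-05-13 11:54:51"),
                new Row("3", "Smith", "2019-05-11 17:53:33"));

        // Step 1 ("tAggregateRow"): group by id, keep max(timestamp).
        Map<String, String> maxTs = duplicates.stream()
                .collect(Collectors.toMap(Row::id, Row::ts,
                        (a, b) -> a.compareTo(b) > 0 ? a : b));

        // Step 2 (inner join of the main flow with the lookup on id + timestamp):
        // only the newest row per id survives.
        duplicates.stream()
                .filter(r -> r.ts().equals(maxTs.get(r.id())))
                .forEach(r -> System.out.println(r.id() + "\t" + r.name() + "\t" + r.ts()));
    }
}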

 

    After that you can use tUnite to merge the two flows according to your requirement.

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved 🙂

 

Anonymous
Not applicable
Author

Thank you for the quick reply and useful advice!

 

By using tAggregateRow, I could filter the duplicate data set into a unique data set containing the newest rows.

 

2 Ken 2019-05-13 11:54:21
3 Smith 2019-05-13 11:54:51

 

After this filtering, I'm going to unite the unique data set created by tMatchGroup with the unique data set created by tAggregateRow, as mentioned above.

 

If there is any problem with this process, please advise me again.


(Attachment: flow image_02.PNG)
Anonymous
Not applicable
Author

Hi,

 

Please let me ask one more question.

It's a very simple one.

 

After executing the tMatchGroup component, I found extra columns added to my output data set: GID, GRP_SIZE, MASTER, SCORE and GRP_QUALITY.

 

How can I trim these columns after tMatchGroup?

What component should I use for this?

 

Thanks in advance.


(Attachment: tMatchGroup configuration.PNG)
Anonymous
Not applicable
Author

Hi,

 

    You can pass the columns to a tMap and, in the output section, simply leave out the unwanted columns.
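
    Conceptually, the mapping tMap applies boils down to assignments like the ones below (out1/row1 are the default flow names; treat the column names as assumptions from your description):

// tMap output table: only the business columns get an expression.
out1.id        = row1.id;
out1.name      = row1.name;
out1.timestamp = row1.timestamp;
// GID, GRP_SIZE, MASTER, SCORE and GRP_QUALITY are simply not present
// in the output schema, so they are dropped.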

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved 🙂

 

akumar2301
Specialist II

Another simple way is to implement this as SCD Type 1. (Attachment: 0683p000009M2JE.jpg)

 

 Merge Main and Delta --> Sort on ID and TimeStamp --> aggregate --> Output
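
A rough plain-Java sketch of that same pipeline, using the sample values from the question (in Talend I'd use tUnite, tSortRow and tAggregateRow for the merge, sort and aggregate steps):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ScdType1 {
    record Row(String id, String name, String ts) {}

    public static void main(String[] args) {
        // Merge: main + delta in one flow (tUnite).
        List<Row> all = new ArrayList<>(List.of(
                new Row("1", "Jorn", "2019-05-11 12:49:01"),
                new Row("2", "Ken", "2019-05-11 13:39:11"),
                new Row("3", "Smith", "2019-05-11 17:53:33"),
                new Row("4", "Taro", "2019-05-12 13:42:21"),
                new Row("5", "Hanako", "2019-05-12 17:55:06"),
                new Row("2", "Kana", "2019-05-13 11:54:21"),
                new Row("3", "Kenji", "2019-05-13 11:54:51"),
                new Row("6", "James", "2019-05-13 11:55:06")));

        // Sort on id ascending, timestamp descending (tSortRow).
        all.sort(Comparator.comparing(Row::id)
                .thenComparing(Row::ts, Comparator.reverseOrder()));

        // Aggregate: the first row per id is now the newest one
        // (tAggregateRow with the "first" function).
        Map<String, Row> result = new LinkedHashMap<>();
        for (Row r : all) {
            result.putIfAbsent(r.id(), r);
        }

        // Output.
        result.values().forEach(r ->
                System.out.println(r.id() + "\t" + r.name() + "\t" + r.ts()));
    }
}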