
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Create unique id based on two columns
I have the following situation and would like some help on what sort of data pipeline design I should follow.
I have a csv file that consists of several columns including, call it csv 1:
Region | Year | other columns ...
I have another csv file that also consists of several columns including, call it csv 2:
Region | Year | different other columns than csv1...
I would like to create a new key ID that based on grouping the region and the year.
For example
Region | Year | ID
74 | 2016 | 1
67 | 2017 | 2
My last target is to join/merge these two csv files based on the ID column generated.
I am aware that I need to store the values of these ID somewhere globally, so when I get to the second csv file to generate my ID column, I just need to check if the ID that corresponds to the region and year already exist. If not, then I create a new ID based for region-year pair.
Any help or insights is appreciated.
Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Never mind. I think the issue was resolved by upgrading from version 7.2.1 to 7.3.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Keep it simple, give an example for both inputs and expected result.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
Take one CSV file as input, you have to concatenate these 3 columns(Region, Year, ID) and create a new column as Key ID. Once it generated you can merge both CSV files. Hope this helps.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
There you go :
Expected Input
2 csv files.
csv 1 has this schema format :
region id | country | year | % population | number of population
csv 2 has this schema format :
region id | country | year | gdp | gdp per capita | % current price
Expected Result
A one table that joins both csv1 and csv2 based on ( region id and combination). The reason I am joining based on two fields is because the region id is not unique and gets replicated couple times in the rows. for example
region id | country | year | ......
3 | africa | 2018 | ....
3 | africa | 2017 | ....
Therefore my expected output would be as follows
region id | country | year | % population | number of population | gdp | gdp per capita | % current price
The region id for a country in csv1 is the same as the region id for a country in csv2. However, they are not listed in sorted order. And sometimes there are some countries that are specified in csv1 but not in csv2, I would have to discard these from the join process indeed.
Any help is appreciated

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I agree with @TRF. It would be great if you could provide 5 sample records clearly for each input and expected output for those 5 records. We are seeing your data for the first time and if the description is not correct (like join condition in Expected result section), it will not yield right results.
Right now the description seems to be vague. But I am guessing that you are trying to do an inner join between two csv based on region id and country. You should be able to do it using tMap. Could you please share what is the error you are getting?
Warm Regards,
Nikhil Thampi
Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi
Thanks for you reply and apologies on the lack of explanation I have provided before. After testing with some other simpler data sets I noticed that the strategy I was attempting to do to create the "inner join based on region id and year" was correct. However, what's creating the issue is the instability of the column orders when editing the schema of an imported csv file.
For example:
I am using tfileInputDelimited to import an existing csv file with this schema :
Region | Country | Year | GDP % | GDP per capita
When Editing the schema for that under Basic settings, the region column automatically reorders itself to become the third column in the list :
Country | Year |Region | GDP % | GDP per capita
Even when I attempt to adjust that and hit apply and save to save the schema, when I check it again
It gets back to this weird order of having Region as the third column.
I believe this is known bug in Talend 7.2 as I have researched this on the forum here? I am using a MAC OS High Sierra.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Never mind. I think the issue was resolved by upgrading from version 7.2.1 to 7.3.
