Skip to main content
Announcements
A fresh, new look for the Data Integration & Quality forums and navigation! Read more about what's changed.
cancel
Showing results for 
Search instead for 
Did you mean: 
Anonymous
Not applicable

Create unique id based on two columns

I have the following situation and would like some help on what sort of data pipeline design I should follow.

 

I have a csv file that consists of several columns including, call it csv 1:

Region | Year | other columns ...

I have another csv file that also consists of several columns including, call it csv 2:

Region | Year | different other columns than csv1...

 

I would like to create a new  key ID that based on grouping the region and the year. 

For example

Region | Year | ID

74 | 2016 | 1 

67 | 2017 | 2 

 

My last target is to join/merge these two csv files based on the ID column generated.

I am aware that I need to store the values of these ID somewhere globally, so when I get to the second csv file to generate my ID column, I just need to check if the ID that corresponds to the region and year already exist. If not, then I create a new ID based for region-year pair.

 

Any help or insights is appreciated.

Labels (4)
1 Solution

Accepted Solutions
Anonymous
Not applicable
Author

Never mind. I think the issue was resolved by upgrading from version 7.2.1 to 7.3.

View solution in original post

6 Replies
TRF
Champion II

Keep it simple, give an example for both inputs and expected result.

Anonymous
Not applicable
Author

Hi,

Take one CSV file as input, you have to concatenate these 3 columns(Region, Year, ID) and create a new column as Key ID. Once it generated you can merge both CSV files. Hope this helps.

Anonymous
Not applicable
Author

There you go :

 

Expected Input 

2 csv files. 

csv 1 has this schema format :

region id | country | year | % population | number of population 

 

csv 2 has this schema format :

region id | country | year | gdp | gdp per capita | % current price 

 

Expected Result

A one table that joins both csv1 and csv2 based on ( region id and combination). The reason I am joining based on two fields is because the region id is not unique and gets replicated couple times in the rows. for example 

region id | country | year | ......

3 | africa | 2018 | ....

3 | africa | 2017 | ....

 

Therefore my expected output would be as follows 

region id | country | year | % population | number of population | gdp | gdp per capita | % current price 

 

The region id for a country in csv1 is the same as the region id for a country in csv2. However, they are not listed in sorted order. And sometimes there are some countries that are specified in csv1 but not in csv2, I would have to discard these from the join process indeed.

Any help is appreciated 

 

Anonymous
Not applicable
Author

@cymric97  

 

I agree with @TRF. It would be great if you could provide 5 sample records clearly for each input and expected output for those 5 records. We are seeing your data for the first time and if the description is not correct (like join condition in Expected result section), it will not yield right results.

 

Right now the description seems to be vague. But I am guessing that you are trying to do an inner join between two csv based on region id and country. You should be able to do it using tMap. Could you please share what is the error you are getting?

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved

Anonymous
Not applicable
Author

Hi 

 

Thanks for you reply and apologies on the lack of explanation I have provided before. After testing with some other simpler data sets I noticed that the strategy I was attempting to do to create the "inner join based on region id and year" was correct. However, what's creating the issue is the instability of the column orders when editing the schema of an imported csv file. 

 

For example:

I am using tfileInputDelimited to import an existing csv file with this schema :

Region | Country | Year | GDP % | GDP per capita 

 

When Editing the schema for that under Basic settings, the region column automatically reorders itself to become the third column in the list :

Country | Year |Region | GDP % | GDP per capita 

 

Even when I attempt to adjust that and hit apply and save to save the schema, when I check it again 

It gets back to this weird order of having Region as the third column. 

 

I believe this is known bug in Talend 7.2 as I have researched this on the forum here? I am using a MAC OS High Sierra.

 

 

 

 

Anonymous
Not applicable
Author

Never mind. I think the issue was resolved by upgrading from version 7.2.1 to 7.3.