Solved: Create unique id based on two columns - Qlik Community

Anonymous · ‎2019-11-21

I have the following situation and would like some help on what sort of data pipeline design I should follow.

I have a csv file that consists of several columns including, call it csv 1:

Region | Year | other columns ...

I have another csv file that also consists of several columns including, call it csv 2:

Region | Year | different other columns than csv1...

I would like to create a new key ID that based on grouping the region and the year.

For example

Region | Year | ID

74 | 2016 | 1

67 | 2017 | 2

My last target is to join/merge these two csv files based on the ID column generated.

I am aware that I need to store the values of these ID somewhere globally, so when I get to the second csv file to generate my ID column, I just need to check if the ID that corresponds to the region and year already exist. If not, then I create a new ID based for region-year pair.

Any help or insights is appreciated.

Anonymous · ‎2019-11-23

Never mind. I think the issue was resolved by upgrading from version 7.2.1 to 7.3.

View solution in original post

TRF · ‎2019-11-21

Keep it simple, give an example for both inputs and expected result.

Anonymous · ‎2019-11-21

Hi,

Take one CSV file as input, you have to concatenate these 3 columns(Region, Year, ID) and create a new column as Key ID. Once it generated you can merge both CSV files. Hope this helps.

Anonymous · ‎2019-11-22

There you go :

Expected Input

2 csv files.

csv 1 has this schema format :

region id | country | year | % population | number of population

csv 2 has this schema format :

Expected Result

A one table that joins both csv1 and csv2 based on ( region id and combination). The reason I am joining based on two fields is because the region id is not unique and gets replicated couple times in the rows. for example

region id | country | year | ......

3 | africa | 2018 | ....

3 | africa | 2017 | ....

Therefore my expected output would be as follows

The region id for a country in csv1 is the same as the region id for a country in csv2. However, they are not listed in sorted order. And sometimes there are some countries that are specified in csv1 but not in csv2, I would have to discard these from the join process indeed.

Any help is appreciated

Anonymous · ‎2019-11-22

@cymric97

I agree with @TRF. It would be great if you could provide 5 sample records clearly for each input and expected output for those 5 records. We are seeing your data for the first time and if the description is not correct (like join condition in Expected result section), it will not yield right results.

Right now the description seems to be vague. But I am guessing that you are trying to do an inner join between two csv based on region id and country. You should be able to do it using tMap. Could you please share what is the error you are getting?

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved

Anonymous · ‎2019-11-23

Hi

Thanks for you reply and apologies on the lack of explanation I have provided before. After testing with some other simpler data sets I noticed that the strategy I was attempting to do to create the "inner join based on region id and year" was correct. However, what's creating the issue is the instability of the column orders when editing the schema of an imported csv file.

For example:

I am using tfileInputDelimited to import an existing csv file with this schema :

Region | Country | Year | GDP % | GDP per capita

When Editing the schema for that under Basic settings, the region column automatically reorders itself to become the third column in the list :

Country | Year |Region | GDP % | GDP per capita

Even when I attempt to adjust that and hit apply and save to save the schema, when I check it again

It gets back to this weird order of having Region as the third column.

I believe this is known bug in Talend 7.2 as I have researched this on the forum here? I am using a MAC OS High Sierra.

Anonymous · ‎2019-11-23

Never mind. I think the issue was resolved by upgrading from version 7.2.1 to 7.3.

Create unique id based on two columns

Studio

v5.x

v6.x

v7.x