JuMacedo
Partner - Contributor III

Delete Duplicates with distinct IDs

Hi!

I thought I'd ask if anyone can think of a solution to my problem. Since I'm new to Qlik, I'm not sure whether there is a function that can do the job.

I have a dataset with many duplicated records, but every record has a different Id. I'd like to delete the duplicates and keep only the first of the duplicated records.

In this example, rows 2-6 are duplicated.

| Row_number | Id | Start Station | Departure Time | Final Station | Arrival Time | DepartureTime_Retur | ArrivalTime_Retur | Count Changes | score |
|---|---|---|---|---|---|---|---|---|---|
| 1 | bb945103 | City1 | 07:09:00 | City2 | 12:38:00 | 16:07:00 | 21:46:00 | 1 | -16860 |
| 2 | 3d63b7b4 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 3 | 658a5c90 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 4 | e12f4755 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 5 | f0770ed2 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 6 | a107b4e0 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 7 | fc6fea72 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 05:42:00 | 2 | -44460 |

 

 What I'd like to see instead is:

| Row_number | Id | Start Station | Departure Time | Final Station | Arrival Time | DepartureTime_Retur | ArrivalTime_Retur | Count Changes | score |
|---|---|---|---|---|---|---|---|---|---|
| 1 | bb945103 | City1 | 07:09:00 | City2 | 12:38:00 | 16:07:00 | 21:46:00 | 1 | -16860 |
| 2 | 3d63b7b4 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
| 7 | fc6fea72 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 05:42:00 | 2 | -44460 |

 

Is there a way to achieve this? I'm also attaching a small sample of the data, in case someone would like to try a solution.

I tried Load Distinct, but it doesn't work since I have a distinct Id for every row.

 

 

1 Solution

Accepted Solutions
stefanmereuta
Partner - Contributor

This can be done in one load (adapting from @igoralcantara's answer):

 

Tablename:
Load
Min(Row_number) as Row_number,
FirstSortedValue(Id, Row_number) as Id,
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
From [YOUR SOURCE HERE]
Group By
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
;
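For anyone new to FirstSortedValue(): it returns its first argument (here Id) taken from the record with the lowest sort-weight (here Row_number) within each Group By bucket, which is exactly the "first of the duplicates". A minimal, self-contained illustration (table and field names are made up for the demo):

```
Tmp:
Load * Inline [
Row_number, Id, Key
1, a, X
2, b, X
3, c, Y
];

// For Key X the lowest Row_number is 1, so FirstId is a;
// for Key Y it is 3, so FirstId is c.
Result:
Load Key,
     FirstSortedValue(Id, Row_number) as FirstId
Resident Tmp
Group By Key;

Drop Table Tmp;
```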

 

 


12 Replies
igoralcantara
Partner Ambassador/MVP

Hello and welcome to Qlik. This is how I would do it:

// Exclude the non distinct values
Tablename:
Load Distinct
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
From [YOUR SOURCE HERE];


// Add the min row_number
Inner Join(Tablename)
Load Distinct
Min(Row_number) as Row_number,
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
From [YOUR SOURCE HERE]
Group By
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
;

// Add what else is missing
Inner Join(Tablename)
Load
Row_number,
Id
From [YOUR SOURCE HERE];

Check out my latest posts at datavoyagers.net
rubenmarin

Hi, after loading the first table you can do an inner join to only keep the lower Row_number of each combination, like:

Inner Join ([DataTableName])
LOAD Min(Row_number) as Row_number
Resident [DataTableName]
Group By [All fields except Row_number and Id];
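Filled in with the field names from the sample data above (adjust if your real table differs), that join could look like:

```
Inner Join ([DataTableName])
LOAD Min(Row_number) as Row_number
Resident [DataTableName]
Group By [Start Station], [Departure Time], [Final Station], [Arrival Time],
         DepartureTime_Retur, ArrivalTime_Retur, [Count Changes], score;
```

Because the aggregated load only outputs Row_number, the inner join matches on Row_number alone and keeps exactly the rows holding the minimum per group.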

 

Clement15
Partner - Specialist

Hello, could this work?

 

Load
max(Row_number) as Row_number,
max(Id) as Id,
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Retur,
[Count Changes],
score
From ...
Group By [Start Station], [Departure Time], [Final Station], [Arrival Time], DepartureTime_Retur, ArrivalTime_Retur, [Count Changes], score;

igoralcantara
Partner Ambassador/MVP
Partner Ambassador/MVP

This would not work because the max Row_number does not always link to the max Id. Also, Id is not a number, so Max() does not work on it; it would have to be MaxString() instead.

Check out my latest posts at datavoyagers.net
marcus_sommer

For the usual view requirements you won't need Row_number and Id, and therefore the simplest approach would be just to remove them and load the data with Distinct.

Another approach might be to combine the relevant fields within a first load, like:

F1 & '|' & F2 & '|' & F3 ... as Key

and within a second one apply something like:

exists(Key) as Flag

and a third load may then filter on the flag with: Where Flag = 0;
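A sketch of this idea using the sample field names. Note that a plain Exists(Key) on a resident load would also match the rows already loaded in the first table, so this common variant registers the key under a helper name instead (TmpKey is introduced here only for the filter):

```
Data:
Load *,
     [Start Station] & '|' & [Departure Time] & '|' & [Final Station] & '|' &
     [Arrival Time] & '|' & DepartureTime_Retur & '|' & ArrivalTime_Retur & '|' &
     [Count Changes] & '|' & score as Key
From [YOUR SOURCE HERE];

// The first record of each Key passes because no TmpKey value exists yet;
// "Key as TmpKey" then registers it, so every later duplicate is filtered out.
Unique:
NoConcatenate
Load *, Key as TmpKey
Resident Data
Where not Exists(TmpKey, Key);

Drop Table Data;
Drop Fields TmpKey, Key;
```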


JuMacedo
Partner - Contributor III
Author

Hi Marcus, thanks for the answer. I'll need the Id, as it will be used to filter a table that comes next; the Row_number won't be needed, though. So I guess this solution might not work.

marcus_sommer

You may consider changing the order of the script statements within your ETL chain and/or extracting the ids into a separate table to apply them in any mapping/exists approaches.

Beside this, the ids are all different - how will you ensure that you remove/keep the right ones? Or do all ids exist within the other source, too - and then which ones are taken from there?

Therefore I suggest reviewing the sources and the data quality first, before looking for technical solutions.

JuMacedo
Partner - Contributor III
Author

Thank you for the contributions, both @igoralcantara and you. I decided to go for this solution as it seemed to be the simplest, with just one load. But I guess @igoralcantara's approach would have worked as well, since you adapted the script from his idea.
This worked, so thanks!