Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi!
Thought about asking if anyone can think of a solution for my problem here, since I'm new to Qlik I'm not sure if there is a function that can do that job.
I have a dataset with many duplicated records, but every record has a different Id. I'd like to delete the duplicates and maintain just the first of the duplicated records.
In this example, rows 2-6 are duplicated.
Row_number | Id | Start Station | Departure Time | Final Station | Arrival Time | DepartureTime_Retur | ArrivalTime_Retur | Count Changes | score |
1 | bb945103 | City1 | 07:09:00 | City2 | 12:38:00 | 16:07:00 | 21:46:00 | 1 | -16860 |
2 | 3d63b7b4 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
3 | 658a5c90 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
4 | e12f4755 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
5 | f0770ed2 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
6 | a107b4e0 | City1 | 07:09:00 | City2 | 12:38:00 | 20:07:00 | 06:45:00 | 2 | -48240 |
7 | fc6fea72 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 05:42:00 | 2 | -44460 |
What I'd like to see instead is:
Row_number | Id | Start Station | Departure Time | Final Station | Arrival Time | DepartureTime_Retur | ArrivalTime_Retur | Count Changes | score |
1 | bb945103 | City1 | 07:09:00 | City2 | 12:38:00 | 16:07:00 | 21:46:00 | 1 | -16860 |
2 | 3d63b7b4 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 06:45:00 | 2 | -48240 |
7 | fc6fea72 | City1 | 07:09:00 | City2 | 12:38:00 | 21:07:00 | 05:42:00 | 2 | -44460 |
Is there a way to achieve this? I'm attaching a small sample of data as well if there is someone who woul'd like to try a solution here.
I tried Load Distinct, but it doesn't work as I have distinct ID for every row.
This can be done in one load (adapting from @igoralcantara's answer):
Tablename:
Load
Min(Row_number) as Row_number,
FirstSortedValue(Id, Row_number) as Id,
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Return,
[Count Changes],
score
From [YOUR SOURCE HERE]
Group By
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Return,
[Count Changes],
score
;
Hello and welcome to Qlik. This is how I would do:
// Exclude the non distinct values
Tablename:
Load Distinct
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Return,
[Count Changes],
score
From [YOUR SOURCE HERE];
// Add the min row_number
Inner Join(Tablename)
Load Distinct
Min(Row_number) as Row_number,
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Return,
[Count Changes],
score
From [YOUR SOURCE HERE]
Group By
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Return,
[Count Changes],
score
;
// Add what else is missing
Inner Join(Tablename)
Load
Row_number,
id
From [YOUR SOURCE HERE];
Hi, after loading the first table you can do an inner join to only keep the lower Row_number of each combination, like:
Inner Join ([DataTableName])
LOAD Min(Row_number) as Row_number
Resident [DataTableName]
Group By [All fields except Id];
Hello, could this work?
Load
max(Row_number),
max(Id),
Start Station,
Departure Time,
Final Station,
Arrival Time,
DepartureTime_Retur,
ArrivalTime_Retur,
Count,
Changes score
From ...
group by Start Station,Departure Time,Final Station,Arrival Time,DepartureTime_Retur,ArrivalTime_Retur,Count,Changes score
This would not work because the Max Row_Number does not always links to the Max Id. Also, Id is not a number, Max does not work. It would be MaxString instead.
For the usual view-requirements you won't need the Row_number and Id and therefore the simplest approach would be just to remove them and loading the data distinct.
Another approach might be to combine the relevant fields within a first load, like:
F1 & '|' & F2 & '|' & F3 ... as Key
and within a second one you applies something like:
exists(Key) as Flag
and a third load may filter on the flag with: where Flag = 0;
This can be done in one load (adapting from @igoralcantara's answer):
Tablename:
Load
Min(Row_number) as Row_number,
FirstSortedValue(Id, Row_number) as Id,
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Return,
[Count Changes],
score
From [YOUR SOURCE HERE]
Group By
[Start Station],
[Departure Time],
[Final Station],
[Arrival Time],
DepartureTime_Retur,
ArrivalTime_Return,
[Count Changes],
score
;
Hi Marcus, thanks for the answer. I'll need the Id as it will be used to filter a table that comes next. The Row_number though won't be needed. So I guess this solution might not work.
You may consider to change the order of script-statements within your ETL chain and/or to extract the id within a separate table to apply them in any mapping/exists-approaches.
Beside this the id's are all different - how will you ensure to remove/remain the needed ones? Or all id's exists within the other source, too - then which taken from there?
Therefore I suggest to review the sources and the data-quality at first before looking for technically solutions.
Thank you for the contributions both @igoralcantara and you. I decided to go for this solutions as it seemed to be the simplest, with just one load. But I guess what @igoralcantara would have worked as well since you adapted the script from his idea.
This worked, so thanks!