andymanu
Creator II

Remove duplicate values based on a selected column value in data load

Hi All,

I am currently working on developing an application and need to remove duplicate records, but only for selected rows, in the data load script.

Please refer to the sample table below:

 
Customer_ID | Transaction_Date | Funding_Arrangement | Class | Hours | Employee
100 | 2/01/2020  | CWH  | B | 3   | David
101 | 5/01/2020  | SCO  | A | 2   | David
101 | 5/01/2020  | SCO  | A | 2   | David
102 | 3/01/2020  | HACC | C | 1   | David
103 | 5/01/2020  | CWH  | B | 4   | Andy
104 | 7/01/2020  | SCO  |   | 3   | Andy
104 | 8/01/2020  | SCO  |   | 3   | David
105 | 8/01/2020  | HACC |   | 2   | David
106 | 9/01/2020  | CWH  | B | 1.5 | David
107 | 10/01/2020 | HACC | C | 2   | Andy
107 | 10/01/2020 | HACC | C | 2   | Andy
108 | 12/01/2020 | CWH  |   | 3   | Andy
109 | 13/01/2020 | HACC | A | 2.5 | Andy
110 | 10/01/2020 | SCO  | B | 2   | Andy

 

Based on the above data set, I identify the duplicate values using three columns: "Customer_ID", "Transaction_Date" and "Class".

Further, the removal of duplicate values has to happen only for records where the "Funding_Arrangement" is 'SCO', and not for any other "Funding_Arrangement" (for example, CWH, HACC, etc.).

In order to achieve this, I thought of creating a "Key" field in the data load script, such as:

Funding_Arrangement & Customer_ID & Transaction_Date & "Class" AS KEY

and then removing the duplicate records whose key STARTS with the letters "SCO".

Additionally, I thought of using the Peek() function in the WHERE clause of the load script, something like:

WHERE Peek(KEY) <> KEY;

However, I am not sure how to combine the Peek() function with a "starts with" condition so that the duplicates are removed only where the "Funding_Arrangement" equals 'SCO'.
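
To illustrate, the sort of thing I have in mind is the sketch below (untested; the Funded_Deduped table name and the wildcard "like" test are only my guesses, assuming the source is loaded into a table called Funded_Data with the KEY field above):

Funded_Deduped:
NoConcatenate
LOAD *
Resident Funded_Data
// keep a row unless its KEY starts with 'SCO' and repeats the previous row's KEY
WHERE not (KEY like 'SCO*' and Peek(KEY) = KEY)
Order By KEY; // sorting should make duplicate keys adjacent so Peek() can compare them

DROP TABLE Funded_Data;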

Please advise me on this requirement; I would appreciate it a lot if you could let me know a better way of working around this issue.

Thank you in advance.

Kind regards,

Andy

6 Replies
Rodj
Luminary Alumni

Here is how I'd go about it; I think you're on the right track with creating a key. There's probably another way to do this that might be more efficient, and I'll have a think about it if I have time. Note that I created a quick Excel file as my source, but if you replace that reference with yours it should work for you.

// load the table into Qlik memory for faster work
DeDupTemp:
LOAD
[Customer_ID],
[Transaction_Date],
[Funding_Arrangement],
[Class],
[Hours],
[Employee],
[Customer_ID]&'|'&[Transaction_Date]&'|'&[Funding_Arrangement] as DeDupKey
FROM [lib://AttachedFiles/test.xlsx]
(ooxml, embedded labels, table is Sheet1);

// Start to build our final table
DeDup:
NoConcatenate
Load Distinct // This will give us a list of unique keys to join back on to
DeDupKey
resident DeDupTemp
where [Funding_Arrangement] = 'SCO';
left Join // so that we only get the one record from the original table
LOAD
[Customer_ID],
[Transaction_Date],
[Funding_Arrangement],
[Class],
[Hours],
[Employee],
DeDupKey
resident DeDupTemp;

DeDup:
LOAD
[Customer_ID],
[Transaction_Date],
[Funding_Arrangement],
[Class],
[Hours],
[Employee],
DeDupKey
resident DeDupTemp
where [Funding_Arrangement] <> 'SCO';

drop table DeDupTemp;

 

Cheers,

Rod

andymanu
Creator II
Author

Hi Rod,

Thank you very much for your response.

I ran your code, and it still gives me the same results as my previous attempt.

Yes, it removes the duplicate record related to "Customer_ID" 101, where the "Funding_Arrangement" is 'SCO', but it also removes the duplicate record related to "Customer_ID" 107, where the respective "Funding_Arrangement" is 'HACC'.

My code was something like this:

Funded_Data:
LOAD
RecNo(),
Customer_ID,
Attendance,
Transaction_Date,
Funding_Arrangement,
Program,
"Class",
Customer_ID & Transaction_Date & "Class" AS KEY,
"Hours",
Employee
FROM [lib://ANDREW/Test_Apps/Fill Blank Cells/Funded Data.xlsx]
(ooxml, embedded labels, table is [Funded Data]);

NoConcatenate
LOAD *
Resident Funded_Data
WHERE Peek(KEY) <> KEY;
DROP TABLE Funded_Data;

I have attached my sample data file herewith. 

Appreciate your valuable feedback.

Thank you.

Kind regards,

Andy

Rodj
Luminary Alumni
(Accepted Solution)

That is really weird! If I load just the records <> 'SCO', it leaves the duplicate HACC records in; however, if I also load (and deduplicate) the 'SCO' records, it removes the HACC duplicate as well, even though logically there's no way it should be doing so. I tried this another way that assumes you aren't going to add any other columns, and just did a simple distinct:

(with the same temp table load as before)

DeDup:
NoConcatenate
LOAD
[Customer_ID],
[Transaction_Date],
[Funding_Arrangement],
[Class],
[Hours],
[Employee],
RowNo() as Counter
resident DeDupTemp
where [Funding_Arrangement] <> 'SCO';

Concatenate
Load Distinct
[Customer_ID],
[Transaction_Date],
[Funding_Arrangement],
[Class],
[Hours],
[Employee]
Resident DeDupTemp
Where [Funding_Arrangement] = 'SCO';

drop table DeDupTemp;
drop Field Counter;

NOTE that I added a RowNo() into the load of the non-'SCO' records (and drop the Counter field afterwards); this appears to keep the duplicated HACC record and achieves what you want. I still can't fathom why the HACC record goes missing without it. There's either a gap in my understanding somewhere or a bug in the Qlik engine.

Your method of using Peek will work provided you:

  • Do a separate load of the 'SCO' records and concatenate it to a load of the non-'SCO' records
  • Sort the records so that the duplicate records are always adjacent (it's safer not to assume they will arrive in order, as they do in your sample data); see the sketch below
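
For completeness, that Peek() version might look something like this (an untested sketch, reusing the DeDupTemp table from my earlier reply):

DeDup:
NoConcatenate
LOAD *
Resident DeDupTemp
where [Funding_Arrangement] <> 'SCO';

Concatenate
LOAD *
Resident DeDupTemp
where [Funding_Arrangement] = 'SCO'
and Peek(DeDupKey) <> DeDupKey // skip a row that repeats the previous key
Order By DeDupKey; // the sort makes duplicate keys adjacent so Peek() sees them

drop table DeDupTemp;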

I reckon the method I have included here will be quicker on large volumes, but you might not have a large enough volume to worry about.

 

Rodj
Luminary Alumni

Another way you could do this is to flag the duplicates. You can then keep them in the data in case you want to be aware of them later (e.g., if this is a data quality issue, it can be useful to make them visible), or you can just load the ones not flagged as duplicates:

DeDup:
NoConcatenate
Load
[Customer_ID],
[Transaction_Date],
[Funding_Arrangement],
[Class],
[Hours],
[Employee],
DeDupKey,
if(Peek(DeDupKey) = DeDupKey, 'Y', 'N') as Duplicate
Resident DeDupTemp
order by DeDupKey;

That is probably a neater solution that gives you more flexibility and doesn't seem to have the issue with the unintended disappearing records.
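
If you do want to drop just the flagged 'SCO' duplicates (while keeping any flagged duplicates under other arrangements), a follow-up resident load along these lines should do it (the Final table name is only for illustration):

Final:
NoConcatenate
LOAD *
Resident DeDup
// keep everything except rows flagged as duplicates within 'SCO'
Where not (Duplicate = 'Y' and [Funding_Arrangement] = 'SCO');

drop table DeDup;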

 

andymanu
Creator II
Author

Hi Rod,

Thank you very much. It's working now as intended.

I was able to learn something new.

Kind regards,

Andy

Rodj
Luminary Alumni

No problem. Don't forget to mark a post as the solution in order to help others.

Cheers,

Rod