andymanu
Creator II

Remove duplicate values based on a selected column value in data load

Hi All,

I am currently working on developing an application and need to remove duplicate records, but only for selected rows, in the data load script.

Please refer to the sample table below:

 
Customer_ID | Transaction_Date | Funding_Arrangement | Class | Hours | Employee
100 | 2/01/2020  | CWH  | B | 3   | David
101 | 5/01/2020  | SCO  | A | 2   | David
101 | 5/01/2020  | SCO  | A | 2   | David
102 | 3/01/2020  | HACC | C | 1   | David
103 | 5/01/2020  | CWH  | B | 4   | Andy
104 | 7/01/2020  | SCO  |   | 3   | Andy
104 | 8/01/2020  | SCO  |   | 3   | David
105 | 8/01/2020  | HACC |   | 2   | David
106 | 9/01/2020  | CWH  | B | 1.5 | David
107 | 10/01/2020 | HACC | C | 2   | Andy
107 | 10/01/2020 | HACC | C | 2   | Andy
108 | 12/01/2020 | CWH  |   | 3   | Andy
109 | 13/01/2020 | HACC | A | 2.5 | Andy
110 | 10/01/2020 | SCO  | B | 2   | Andy

 

Based on the above data set, I identify the duplicate values using three columns: "Customer_ID", "Transaction_Date" and "Class".

Further, the removal of duplicate values has to happen only for records where the "Funding_Arrangement" is 'SCO', and not for any other "Funding_Arrangement" (for example, CWH, HACC, etc.).

In order to achieve this, I thought of creating a "Key" field in the data load script, such as:

Funding_Arrangement & Customer_ID & Transaction_Date & "Class" AS KEY

and then removing the duplicate records whose key STARTS with the letters "SCO".

Additionally, I thought of using the Peek() function in the WHERE clause of the load script, something like:

WHERE Peek(KEY) <> KEY;

However, I am not sure how to combine the Peek() function with a "starts with" condition so that the duplicates are removed only where the "Funding_Arrangement" equals 'SCO'.
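
To illustrate, the sort of thing I have in mind is the sketch below (untested; the Funded_Deduped table name and the wildcard "like" test are only my guesses, assuming the source is loaded into a table called Funded_Data with the KEY field above):

Funded_Deduped:
NoConcatenate
LOAD *
Resident Funded_Data
// keep a row unless its KEY starts with 'SCO' and repeats the previous row's KEY
WHERE not (KEY like 'SCO*' and Peek(KEY) = KEY)
Order By KEY; // sorting should make duplicate keys adjacent so Peek() can compare them

DROP TABLE Funded_Data;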

Please advise me on this requirement; I would appreciate it a lot if you could let me know a better way of working around this issue.

Thank you in advance.

Kind regards,

Andy

6 Replies
Rodj
Luminary Alumni

Here is how I'd go about it; I think you're on the right track with creating a key. There's probably another way to do this that might be more efficient, and I'll have a think about it if I have time. Note that I created a quick Excel file as my source, but if you replace that reference with yours it should work for you.

// load the table into Qlik memory for faster work
DeDupTemp:
LOAD
[Customer_ID],
[Transaction_Date],
[Funding_Arrangement],
[Class],
[Hours],
[Employee],
[Customer_ID]&'|'&[Transaction_Date]&'|'&[Funding_Arrangement] as DeDupKey
FROM [lib://AttachedFiles/test.xlsx]
(ooxml, embedded labels, table is Sheet1);

// Start to build our final table
DeDup:
NoConcatenate
Load Distinct // This will give us a list of unique keys to join back on to
DeDupKey
resident DeDupTemp
where [Funding_Arrangement] = 'SCO';
left Join // so that we only get the one record from the original table
LOAD
[Customer_ID],
[Transaction_Date],
[Funding_Arrangement],
[Class],
[Hours],
[Employee],
DeDupKey
resident DeDupTemp;

DeDup:
LOAD
[Customer_ID],
[Transaction_Date],
[Funding_Arrangement],
[Class],
[Hours],
[Employee],
DeDupKey
resident DeDupTemp
where [Funding_Arrangement] <> 'SCO';

drop table DeDupTemp;

 

Cheers,

Rod

andymanu
Creator II
Author

Hi Rod,

Thank you very much for your response.

I ran your code, and it still gives me the same results as my previous attempt.

Yes, it removes the duplicate record related to "Customer_ID" 101, where the "Funding_Arrangement" is 'SCO', but it also removes the duplicate record related to "Customer_ID" 107, where the respective "Funding_Arrangement" is 'HACC'.

My code was something like this:

Funded_Data:
LOAD
RecNo(),
Customer_ID,
Attendance,
Transaction_Date,
Funding_Arrangement,
Program,
"Class",
Customer_ID & Transaction_Date & "Class" AS KEY,
"Hours",
Employee
FROM [lib://ANDREW/Test_Apps/Fill Blank Cells/Funded Data.xlsx]
(ooxml, embedded labels, table is [Funded Data]);

NoConcatenate
LOAD *
Resident Funded_Data
WHERE Peek(KEY) <> KEY;
DROP TABLE Funded_Data;

I have attached my sample data file herewith. 

Appreciate your valuable feedback.

Thank you.

Kind regards,

Andy

Rodj
Luminary Alumni
(Accepted Solution)

That is really weird! If I load just the records <> 'SCO', it leaves the duplicate HACC records in; however, if I also load (and deduplicate) the 'SCO' records, it removes the HACC duplicate as well, even though logically there's no way it should be doing so. I tried this another way that assumes you aren't going to add any other columns, and just did a simple distinct:

(with the same temp table load as before)

DeDup:
NoConcatenate
LOAD
[Customer_ID],
[Transaction_Date],
[Funding_Arrangement],
[Class],
[Hours],
[Employee],
RowNo() as Counter
resident DeDupTemp
where [Funding_Arrangement] <> 'SCO';

Concatenate
Load Distinct
[Customer_ID],
[Transaction_Date],
[Funding_Arrangement],
[Class],
[Hours],
[Employee]
Resident DeDupTemp
Where [Funding_Arrangement] = 'SCO';

drop table DeDupTemp;
drop Field Counter;

NOTE that I added a RowNo() into the load of the non-'SCO' records (and drop the Counter field afterwards); this appears to keep the duplicated HACC record and achieves what you want. I still can't fathom why the HACC record goes missing without it. There's either a gap in my understanding somewhere or a bug in the Qlik engine.

Your method of using Peek will work provided you:

  • Do a separate load of the 'SCO' records and concatenate it to a load of the non-'SCO' records
  • Sort the records so that the duplicate records are always adjacent (it's safer not to assume they will arrive in order, as they do in your sample data); see the sketch below
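
For completeness, that Peek() version might look something like this (an untested sketch, reusing the DeDupTemp table from my earlier reply):

DeDup:
NoConcatenate
LOAD *
Resident DeDupTemp
where [Funding_Arrangement] <> 'SCO';

Concatenate
LOAD *
Resident DeDupTemp
where [Funding_Arrangement] = 'SCO'
and Peek(DeDupKey) <> DeDupKey // skip a row that repeats the previous key
Order By DeDupKey; // the sort makes duplicate keys adjacent so Peek() sees them

drop table DeDupTemp;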

I reckon the method I have included here will be quicker on large volumes, but you might not have a large enough volume to worry about.

 

Rodj
Luminary Alumni

Another way you could do this is to flag the duplicates. You can then keep them in the data in case you want to be aware of them later (e.g., if this is a data quality issue, it can be useful to make them visible), or you can just load the ones not flagged as duplicates:

DeDup:
NoConcatenate
Load
[Customer_ID],
[Transaction_Date],
[Funding_Arrangement],
[Class],
[Hours],
[Employee],
DeDupKey,
if(Peek(DeDupKey) = DeDupKey, 'Y', 'N') as Duplicate
Resident DeDupTemp
order by DeDupKey;

That is probably a neater solution that gives you more flexibility and doesn't seem to have the issue with the unintended disappearing records.
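
If you do want to drop just the flagged 'SCO' duplicates (while keeping any flagged duplicates under other arrangements), a follow-up resident load along these lines should do it (the Final table name is only for illustration):

Final:
NoConcatenate
LOAD *
Resident DeDup
// keep everything except rows flagged as duplicates within 'SCO'
Where not (Duplicate = 'Y' and [Funding_Arrangement] = 'SCO');

drop table DeDup;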

 

andymanu
Creator II
Author

Hi Rod,

Thank you very much. It's working now as intended.

I was able to learn something new.

Kind regards,

Andy

Rodj
Luminary Alumni

No problem. Don't forget to mark a post as the solution in order to help others.

Cheers,

Rod