Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi,
I am working on data Quality issues.
for filtering purpose I need to flag all duplicate records from one particulate field.
I used a method described below.
Is there any other method- better - exist?
---------------------------------------------------------------------------
Flagging Duplicates of Field - [Key]
data:
LOAD * Inline [
ID, Key, Flag, Date
1, A1, 1, 10
2, A1, 1, 20
3, A2, 1, 10
4, A2, 1, 30
5, A3, 1, 04
6, A4, 1, 30
7, A1, 1, 10
8, A5, 1, 30
9, A1, 1, 04
];
Map_Duplicate:
Mapping load
Key,
sum(Flag) as Frq
Resident data Group by Key;
data2:
load *,
if( ApplyMap('Map_Duplicate',Key) >1,'Yes', 'No') as [Is Duplicate]
Resident data;
drop Table data;
Message was edited by: Prasanna Balachandran
Can you explain your requirement. What is expected output?
May be a simpler way:
data:
LOAD * Inline [
ID, Key, Flag, Date
1, A1, 1, 10
2, A1, 1, 20
3, A2, 1, 10
4, A2, 1, 30
5, A3, 1, 04
6, A4, 1, 30
7, A1, 1, 10
8, A5, 1, 30
9, A1, 1, 04
];
Join
Map_Duplicate:
//Mapping load
Load
Key,
//sum(Flag) as Frq,
If(Count(Key)>1, 'Yes', 'No') as Duplicate
Resident data Group by Key;
//data2:
//load *,
//if(ApplyMap('Map_Duplicate',Key)>1,'Yes', 'No') as Duplicate
//
//
//Resident data;
//
//drop Table data;
for filtering purpose I need to flag all duplicate records
Excellent , Thank you
I think it should be Count(ID)>1
...
Join
Duplicate:
Load
Key,
If(Count(ID)>1, 'Yes', 'No') as Duplicate
Resident data Group by Key;
...
Count(Non_nullable_field)
You can use any non nullable field. You will get same result whether you use Count(ID) or Count(Key)