Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi,
I have an existing tool where it will searches the mistakes in flat files of users and display in the Mistake monitor qvw. Mistakes would be like Field not found, Bad zip file as below.
The above process takes place post extraction layer and before the Transformation layer. The input file of the transformation layer has some duplicate records based on the product description field.
It is acceptable for the product description to appear more than once if the SKU is reporting to a different Market Product Type or Sales Channel , but if the combination of Product Description, Sales Channel and Market Product Type is duplicated then we have trouble.
This scenario is OK as all records are unique:
Sales Channel | SKU | Manufacturer | Market Product Type |
Hypermarkets | UB Long Grain 1kg BiB | Horlicks | Long Grain |
Hypermarkets | UBLong Grain 1kg BiB | Horlicks | Boil In Bag |
Rewe | UB Long Grain 1kg BiB | Horlicks | Long Grain |
Rewe Hypermarkets | UB Long Grain 1kg BiB | Horlicks | Long Grain |
Rewe Hypermarkets | UB Long Grain 1kg BiB | Horlicks | Boil in Bag |
But if we have the below then we will get incorrect calculations in the next layer:
Sales Channel | SKU | Manufacturer | Market Product Type |
Hypermarkets | UB Long Grain 1kg BiB | Horlicks | Long Grain |
Hypermarkets | Uncle Bens Long Grain 1kg BiB | Horlicks | Boil In Bag |
Rewe | Uncle Bens Long Grain 1kg BiB | Horlicks | Long Grain |
Rewe Hypermarkets | Uncle Bens Long Grain 1kg BiB | Horlicks | Long Grain |
Rewe Hypermarkets | Uncle Bens Long Grain 1kg BiB | Horlicks | Boil in Bag |
Hypermarkets | UB Long Grain 1kg BiB | Horlicks | Long Grain |
Is there any way to check for this duplication before the files are uploaded to next layer.
Please provide your suggestions how to remove the duplicates in the existing model.
Thanks in advance!!
Regards
Velan
Message was edited by: Velan Krishna
To avoid duplicated rows you could use a Load DISTINCT. Alternatively to show which rows are duplicates you could count them in a (additionally) load:
CountRows:
Load ID, if(count([Sales Channel]&[SKU]&[Market Product Type])=1, 'yes', 'no') as Duplicates Resident x group by ID;
- Marcus
Thank you for your valuable time Marcus!! Let me try this and update.
-Velan