Hi all,
I have a quick question: I'd like to store the duplicated rows in a separate table so I can deal with them afterwards.
Imagine I have this TABLE_EXAMPLE (my real table is much bigger, with over 200 fields):
field1 | field2 | field3 | field4 |
1 | 1 | 1 | 1 |
1 | 1 | 1 | 1 |
2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 |
Then in the script I run
LOAD DISTINCT * Resident TABLE_EXAMPLE;
and get:
field1 | field2 | field3 | field4 |
1 | 1 | 1 | 1 |
2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 |
Is there a quick way to get the "erased" row into another table, ending up with:
TABLE_DUPLICATED_ROW:
field1 | field2 | field3 | field4 |
1 | 1 | 1 | 1 |
Something like a
LOAD * "duplicated" Resident TABLE_EXAMPLE;?
I want to avoid having to "build" a key across 200 fields.
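To be explicit, here is a sketch of the manual key I want to avoid writing, using the 4-field example above (TABLE_KEYED and RowKey are just illustrative names; with 200+ fields this concatenation becomes unmanageable):
TABLE_KEYED:
LOAD *,
// hand-written concatenation of every field - painful at 200+ fields
Hash128(field1 & '|' & field2 & '|' & field3 & '|' & field4) as RowKey
Resident TABLE_EXAMPLE;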
Kind regards.
I think you need to add a field using Count(duplicatedfield) as Count, grouped by some ID.
Then you can do a resident load of the table with the condition Where Count > 1.
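Something like this sketch on the example table (field names written out by hand; Count and TABLE_DUPLICATED_ROW are just illustrative names):
TABLE_EXAMPLE:
LOAD * INLINE [
field1, field2, field3, field4
1, 1, 1, 1
1, 1, 1, 1
2, 2, 2, 2
3, 3, 3, 3
];
// attach an occurrence count to every field combination
Left Join (TABLE_EXAMPLE)
LOAD field1, field2, field3, field4,
Count(field1) as Count
Resident TABLE_EXAMPLE
Group By field1, field2, field3, field4;
// combinations occurring more than once are the duplicated rows
TABLE_DUPLICATED_ROW:
LOAD DISTINCT field1, field2, field3, field4
Resident TABLE_EXAMPLE
Where Count > 1;
DROP Field Count;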
Without a key it might not be possible, but you can dynamically create the key field like this:
TABLE_EXAMPLE:
LOAD * INLINE [
field1, field2, field3, field4
1, 1, 1, 1
1, 1, 1, 1
2, 2, 2, 2
3, 3, 3, 3
];
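// Transpose the table so the field names end up as values of the Field column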
TempTable:
CrossTable (Field, Value)
LOAD 1 as RowNum,
*
Resident TABLE_EXAMPLE;
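// Build two field-name lists: '&'-joined for expressions, comma-separated for LOAD and Group By clauses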
TempTable2:
LOAD Concat(DISTINCT '[' & Field & ']', '&') as ConcatField1,
Concat(DISTINCT '[' & Field & ']', ',') as ConcatField2
Resident TempTable;
LET vConcatField1 = Peek('ConcatField1');
TRACE $(vConcatField1);
LET vConcatField2 = Peek('ConcatField2');
TRACE $(vConcatField2);
DROP Tables TempTable, TempTable2;
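// Attach an occurrence count to every field combination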
Left Join (TABLE_EXAMPLE)
LOAD $(vConcatField2),
Count($(vConcatField)) as Count
Resident TABLE_EXAMPLE
Group By $(vConcatField2);
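// Keep one copy of each distinct row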
Table:
LOAD DISTINCT $(vConcatField2)
Resident TABLE_EXAMPLE;
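// Combinations with Count > 1 are the duplicated rows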
Duplicate:
LOAD DISTINCT $(vConcatField2),
'Duplicate' as DuplicateFlag
Resident TABLE_EXAMPLE
Where Count > 1;
DROP Table TABLE_EXAMPLE;
Yeah, that's the approach.
But I wanted to know if there was any "out of the box" solution, like a "reverse DISTINCT" 🙂
Seems there isn't.
Thanks!
Thanks a lot.
Just a slight amendment, in case someone copies & pastes your code:
there's a 1 missing in line 3.
Left Join (TABLE_EXAMPLE)
LOAD $(vConcatField2),
Count($(vConcatField1)) as Count
Resident TABLE_EXAMPLE
Group By $(vConcatField2);
The code works fantastically and it's automated, but with 200+ fields per row and 10M+ rows, it makes the server go nuts :)
Anyhow, much appreciated!
Anyone with a less process-hungry idea? :'(
Yup, my bad for the typo...
I guess the number of fields might not be causing the issue, but the Group By is... we might be able to optimize this using Where Exists()... let me play around with it.
Give this a shot
TABLE_EXAMPLE:
LOAD * INLINE [
field1, field2, field3, field4
1, 1, 1, 1
1, 1, 1, 1
2, 2, 2, 2
3, 3, 3, 3
];
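// Transpose only the first record - just to collect the field names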
TempTable:
CrossTable (Field, Value)
LOAD 1 as RowNum,
*
Resident TABLE_EXAMPLE
Where RecNo() = 1;
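// Build two field-name lists: '&'-joined for expressions, comma-separated for LOAD and Order By clauses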
TempTable2:
LOAD Concat(DISTINCT '[' & Field & ']', '&') as ConcatField1,
Concat(DISTINCT '[' & Field & ']', ',') as ConcatField2
Resident TempTable;
LET vConcatField1 = Peek('ConcatField1');
TRACE $(vConcatField1);
LET vConcatField2 = Peek('ConcatField2');
TRACE $(vConcatField2);
DROP Tables TempTable, TempTable2;
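// Inner load: hash each row's full content into Key; the Order By makes identical rows adjacent.
// Outer load: number each occurrence with Previous()/Peek(), so the nth copy of a row gets Key2 = Hash128(content & n)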
TABLE_EXAMPLE2:
LOAD *,
If(Key = Previous(Key), RangeSum(Peek('KeyTemp'), 1), 1) as KeyTemp,
Hash128($(vConcatField1)&AutoNumber(If(Key = Previous(Key), RangeSum(Peek('KeyTemp'), 1), 1), $(vConcatField1))) as Key2;
LOAD *,
Hash128($(vConcatField1)) as Key
Resident TABLE_EXAMPLE
Order By $(vConcatField2);
DROP Table TABLE_EXAMPLE;
RENAME Table TABLE_EXAMPLE2 to TABLE_EXAMPLE;
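// Key3 = Hash128(content & 1), i.e. what Key2 looks like for the first occurrence of a row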
Table:
LOAD DISTINCT $(vConcatField2),
Hash128($(vConcatField1)&1) as Key3
Resident TABLE_EXAMPLE;
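// Rows whose Key2 is not a first-occurrence key are the duplicates - no Group By needed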
Duplicate:
LOAD DISTINCT $(vConcatField2),
'Duplicate' as DuplicateFlag
Resident TABLE_EXAMPLE
Where not Exists(Key3, Key2);
DROP Table TABLE_EXAMPLE;
DROP Field Key3;
Luis, did Sunny's last post and example get you a good working solution? If so, please be sure to come back and use the Accept as Solution button on that post to let others know that it worked and to give Sunny credit for the help. If you have further questions, leave an update.
Regards,
Brett
Sorry Brett, my bad.
Solution accepted!