Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hello Experts
I have a data set like the below and would like to find out the first occurrence of the circuit id
if there are instances of the same circuit id found once again in the next invoice , I would like to treat this as old.
The no of circuits that I have is around 1 billion , how do I find the first occurrence and tag that as New and other occurrences as old without having a sync key.
CircuitID | Invoice no | Results |
1 | ABC-001 | New |
2 | ABC-001 | New |
3 | ABC-001 | New |
4 | ABC-002 | New |
3 | ABC-002 | Old |
2 | ABC-002 | Old |
1 | ABC-002 | Old |
6 | ABC-002 | New |
Hi @Santanunandy, I think the best way is using Exists() function with the invoice field, like this:
Load
CircuitID,
Invoice,
If(Exists(CircuitID), 'Old', 'New') AS Result
Inline [
CircuitID Invoice
1 ABC-001
2 ABC-001
3 ABC-001
4 ABC-002
3 ABC-002
2 ABC-002
1 ABC-002
6 ABC-002
] (delimiter is '\t');
JG
When you use Exists(), the circuit_id field already exists in table NewTableOne, so the values are not new...
You can use an auxiliary field for this:
...
FinalTable:
load
circuit_id,
file_name,
circuit_id AS ExistCircuitID,
If(Exists(ExistCircuitID, circuit_id), 'Old', 'New') AS Results
resident NewTableOne;
Drop Field ExistCircuitID;
...
JG
Hi @Santanunandy, I think the best way is using Exists() function with the invoice field, like this:
Load
CircuitID,
Invoice,
If(Exists(CircuitID), 'Old', 'New') AS Result
Inline [
CircuitID Invoice
1 ABC-001
2 ABC-001
3 ABC-001
4 ABC-002
3 ABC-002
2 ABC-002
1 ABC-002
6 ABC-002
] (delimiter is '\t');
JG
Hello ,
With the inline table the data is getting loaded but while I am trying to pick up the data from the .qvd files I can see only old values are picked up and not New .Can you pls help.
NewTable:
LOAD
circuit_id,
file_name
FROM [lib://QVD/CHARGE.qvd]
(qvd);
NoConcatenate
NewTableOne:
load
distinct circuit_id,
file_name
resident NewTable
order by file_name asc;
drop table NewTable;
NoConcatenate
FinalTable:
load circuit_id,file_name, if(Exists(circuit_id),'Old','New') as Results
resident NewTableOne;
drop table NewTableOne;
exit Script;
When you use Exists(), the circuit_id field already exists in table NewTableOne, so the values are not new...
You can use an auxiliary field for this:
...
FinalTable:
load
circuit_id,
file_name,
circuit_id AS ExistCircuitID,
If(Exists(ExistCircuitID, circuit_id), 'Old', 'New') AS Results
resident NewTableOne;
Drop Field ExistCircuitID;
...
JG
Thank you JD for replying but in the NewTableOne table I had 91 M rows but when the FinalTable was created I had only 79 M but my exception that the final table would also be having the same rows of rows as the Newtableone with results being with Old and New.
There are ~12 M rows missing after the final table have been created , Can you help me understand why is this the case.
Probably this is because you are using a distinct clause, so you have about 12 M duplicated rows.
JG
Thanks JD and Spot on .Thanks for all your help