DGAL1
Contributor III

REMOVE DUPLICATES

I have a database with these columns:

IP_ADDRESS,
SALE_ID,
AGENT_NAME,
DATE,
QUANTITY,
CATEGORY,
AMOUNT,
POOL_NR,
AGENT_ID

In the data source, once a transaction is edited it gets a new SALE_ID while keeping the values of all the other fields.
How do I filter out edited transactions so that my QVD picks only one transaction instead of two?
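
For illustration, a minimal sketch of the pattern described, with entirely hypothetical values; the two rows share every field except SALE_ID because the second is the edited copy of the first:

// Hypothetical sample: SALE_ID 8 is the edited copy of SALE_ID 1.
Sample:
Load * Inline [
SALE_ID, IP_ADDRESS, AGENT_NAME, DATE, QUANTITY, CATEGORY, AMOUNT, POOL_NR, AGENT_ID
1, 1.1.1.1, Alice, 01/01/2024, 2, Books, 150, 7, 42
8, 1.1.1.1, Alice, 01/01/2024, 2, Books, 150, 7, 42
];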

8 Replies
LRuCelver
Partner - Creator III

Do I understand correctly that when a transaction is edited, all values are copied to a new entry with a new SALE_ID?

And you want only the last one?

DGAL1
Contributor III
Author

Yes please

LRuCelver
Partner - Creator III

You can group the data on all fields that are just copied and use e.g. LastValue() or Max() for the SALE_ID:

Data:
NoConcatenate Load * Inline [
IP_ADDRESS,	SALE_ID,	AGENT_NAME,	DATE,		QUANTITY,	CATEGORY,	AMOUNT,	POOL_NR,	AGENT_ID
    1.1.1.1,	1,			Alice,		01/01/2024,	1,			1,			1,		1,			1
    2.2.2.2,	2,			Bob,		02/01/2024,	2,			2,			2,		2,			2
    1.1.1.1,	3,			Alice,		01/01/2024,	1,			1,			1,		1,			1
    1.1.1.1,	4,			Alice,		01/01/2024,	1,			1,			1,		1,			1
    2.2.2.2,	5,			Bob,		02/01/2024,	2,			2,			2,		2,			2
    2.2.2.2,	6,			Bob,		02/01/2024,	2,			2,			2,		2,			2
];


Transactions:
NoConcatenate Load
	IP_ADDRESS,
    LastValue(SALE_ID) as SALE_ID,
    AGENT_NAME,
    DATE,
    QUANTITY,
    CATEGORY,
    AMOUNT,
    POOL_NR,
    AGENT_ID
Resident Data
Group By
	IP_ADDRESS,
    AGENT_NAME,
    DATE,
    QUANTITY,
    CATEGORY,
    AMOUNT,
    POOL_NR,
    AGENT_ID
;

Drop Table Data;
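
Since the end goal is a deduplicated QVD, a minimal follow-up sketch; the library path lib://Data/Transactions.qvd is an assumption, adjust it to your environment:

// Persist the deduplicated table as a QVD file; path is illustrative.
Store Transactions into [lib://Data/Transactions.qvd] (qvd);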
DGAL1
Contributor III
Author

My dataset has more than 5M rows.

BrunPierre
Partner - Master

@DGAL1, do this with Max() or MaxString(), depending on the SALE_ID values.

Data:
LOAD
    IP_ADDRESS,
    SALE_ID,
    AGENT_NAME,
    DATE,
    QUANTITY,
    CATEGORY,
    AMOUNT,
    POOL_NR,
    AGENT_ID
FROM ...;

Right Join (Data)
LOAD
    IP_ADDRESS,
    AGENT_NAME,
    DATE,
    QUANTITY,
    CATEGORY,
    AMOUNT,
    POOL_NR,
    AGENT_ID,
    Max(SALE_ID) as SALE_ID
Resident Data
Group By
    IP_ADDRESS,
    AGENT_NAME,
    DATE,
    QUANTITY,
    CATEGORY,
    AMOUNT,
    POOL_NR,
    AGENT_ID;
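
If SALE_ID is text rather than numeric, only the aggregation changes; a hedged variant of the join load (MaxString() picks the lexically greatest value, which matches the latest edit only if the IDs sort chronologically):

Right Join (Data)
LOAD
    IP_ADDRESS, AGENT_NAME, DATE, QUANTITY, CATEGORY, AMOUNT, POOL_NR, AGENT_ID,
    // MaxString() compares the SALE_ID values as text.
    MaxString(SALE_ID) as SALE_ID
Resident Data
Group By IP_ADDRESS, AGENT_NAME, DATE, QUANTITY, CATEGORY, AMOUNT, POOL_NR, AGENT_ID;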
Bolijay00
Contributor III

Use Distinct on SALE_ID.
LRuCelver
Partner - Creator III

I timed grouping 10M rows into groups of a specific size. 5M rows shouldn't be a problem:

[Image: chart of the measured grouping times in seconds by batch size]

Times: NoConcatenate Load Null() as BatchSize, Null() as Seconds AutoGenerate 0;

For Each vBatchSize in 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000

    Data:
    NoConcatenate Load
        Floor((RecNo()-1) / $(vBatchSize)) as IP_ADDRESS,
        (RecNo()-1) as SALE_ID,
        Floor((RecNo()-1) / $(vBatchSize)) as AGENT_NAME,
        Floor((RecNo()-1) / $(vBatchSize)) as DATE,
        Floor((RecNo()-1) / $(vBatchSize)) as QUANTITY,
        Floor((RecNo()-1) / $(vBatchSize)) as CATEGORY,
        Floor((RecNo()-1) / $(vBatchSize)) as AMOUNT,
        Floor((RecNo()-1) / $(vBatchSize)) as POOL_NR,
        Floor((RecNo()-1) / $(vBatchSize)) as AGENT_ID
    AutoGenerate 10000000;
    
    Let vStart = Now();
    
    Transactions:
    NoConcatenate Load
        IP_ADDRESS,
        LastValue(SALE_ID) as SALE_ID,
        AGENT_NAME,
        DATE,
        QUANTITY,
        CATEGORY,
        AMOUNT,
        POOL_NR,
        AGENT_ID
    Resident Data
    Group By
        IP_ADDRESS,
        AGENT_NAME,
        DATE,
        QUANTITY,
        CATEGORY,
        AMOUNT,
        POOL_NR,
        AGENT_ID
    ;

    Let vSeconds = (Now() - vStart) * 24 * 60 * 60; // Timestamp difference is in days; convert to seconds.
    Concatenate(Times) Load $(vBatchSize) as BatchSize, $(vSeconds) as Seconds AutoGenerate 1;
    
    Drop Tables Data, Transactions;
    
Next vBatchSize
Let vBatchSize;
Let vStart;
Let vSeconds;

 

Pallav
Contributor II

You can try this ->

LOAD Distinct
    SALE_ID,
    IP_ADDRESS,
    AGENT_NAME,
    DATE,
    QUANTITY,
    CATEGORY,
    AMOUNT,
    POOL_NR,
    AGENT_ID
FROM ...;
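
Beyond the Group By approaches above, a sketch of an alternative for large tables, assuming SALE_ID increases with every edit: build a composite key over the copied fields, load the rows sorted by SALE_ID descending, and keep only the first row per key with Exists():

Data:
LOAD
    IP_ADDRESS, SALE_ID, AGENT_NAME, DATE, QUANTITY,
    CATEGORY, AMOUNT, POOL_NR, AGENT_ID,
    // Composite key over all fields that are copied on edit.
    AutoNumberHash256(IP_ADDRESS, AGENT_NAME, DATE, QUANTITY,
                      CATEGORY, AMOUNT, POOL_NR, AGENT_ID) as %RowKey
FROM ...;

Transactions:
NoConcatenate
LOAD *, %RowKey as %SeenKey
Resident Data
// Highest SALE_ID loads first; later rows with an already-seen key are skipped.
Where Not Exists(%SeenKey, %RowKey)
Order By SALE_ID desc;

Drop Fields %RowKey, %SeenKey;
Drop Table Data;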