Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
Anonymous
Not applicable

How to remove row if two fields are the same

I am dealing with a table with running records where one person should only appear once on a day. There are records where one person appear twice on a day. I need to remove these "duplicates" from the table before I can do any analysis. The other fields of the records might not be the same. I plan to keep only the records that have the latest Transaction ID. Eg. if a person has two records on the same day, one Transaction ID is 9011 the other is 9064, I want to keep 9064.

How can I do it? Can I do it in the LOAD step or in expressions?

Thanks in advance!

21 Replies
Lech_Miszkiewicz
Partner Ambassador/MVP
Partner Ambassador/MVP

Hi

i would do it in 2 smal steps during load script:

1 step is to select transactions by aggregating data only to max Transaction Id for each person and then as i assume transaction id is unique you just left joining all transactions to it which will keep only initially loaded records.

remember that in second step you have to use the same field name for transaction Id as in the first step so the join can work properly.

Transactions:

Load

     max(transId) as TransactionId

     person

From

     Your Source

Group by

     person

;

Left Join (Transactions)

Load

     *

From

     Your Source

;

kind regards

Lech

cheers Lech, When applicable please mark the correct/appropriate replies as "solution" (you can mark up to 3 "solutions". Please LIKE threads if the provided solution is helpful to the problem.
Anonymous
Not applicable
Author

Thanks Lech.

Yes the transaction IDs are unique. But I think your method will remove all the other dates for the same person as well?

An example of my table would look like below (of course there are many other fields apart from these three fields). As you can see, one person can attend multiple times over different dates. What I want to remove is only the ones where they attend on the same day. So in below example, I only want to remove the record with TransID 9601, because on that day the person attended twice, and I only want to keep the latest record.

   

TransIDDatePerson ID
961629/11/201641500
960129/11/201641500
715001/03/201641500
957008/11/201644626
817408/06/201644626
674406/11/201544626
392031/03/201544626
197118/11/201444626
846614/07/20162004647
821320/06/20162004647
770502/05/20162004647
618607/10/20152004647
561624/08/20152004647
845427/07/20162005542
824721/06/20162005542
800517/05/20162005542
624920/10/20152005542
140330/10/20142005542

How do I remove this record without removing all the other records with different dates?

And is it possible to do it in expression instead of LOAD (as I might want to keep the original total number of records)?

Thanks!

Lech_Miszkiewicz
Partner Ambassador/MVP
Partner Ambassador/MVP

Hi, then include date in your first load. Just ot make sure it is not timestamp i would use floor(date to convert timestamps to dates) and also use Floor(Date on your second table) or don't use date at all on second one.

Transactions:

Load

     max(transId)      as TransactionId,

     Floor(Date)         as Date,

     person

From

     Your Source

Group by

     person,

     Floor(Date)

;

Left Join (Transactions)

Load

     *

From

     Your Source

;

cheers Lech, When applicable please mark the correct/appropriate replies as "solution" (you can mark up to 3 "solutions". Please LIKE threads if the provided solution is helpful to the problem.
arasantorule
Creator III
Creator III

Hi,

Hope the below works!

A:

LOAD

    TransID,

    "Date",

    "Person ID"

FROM [lib://Desktop/Book1345.xlsx]

(ooxml, embedded labels, table is Sheet1);

B:

Load

     max(TransID) as TransID,

    "Person ID",

    "Date"

Resident A Group by

     "Person ID","Date";

   

     Drop Table A;

Thanks

Anonymous
Not applicable
Author

Thanks Lech.

I have tested your method and the record 9601 does seem to disappear from the TransactionId field. However this record doesn't seem to have been removed, but replaced by 9616, which means that there are two records marked as 9616. And the measures will look at both record instead of the original 9616 one. Eg. the data table is like this (I added an amount field so you can see how measures are done):

  

TransIDDatePerson IDAmount
961629/11/2016415002000
960129/11/201641500600
715001/03/201641500700
957008/11/201644626800
817408/06/201644626900
674406/11/2015446261000
392031/03/2015446261100
197118/11/2014446261200
846614/07/201620046471300
821320/06/201620046471400
770502/05/201620046471500
618607/10/201520046471600
561624/08/201520046471700
845427/07/201620055421800
824721/06/201620055421900
800517/05/201620055422000
624920/10/201520055422100
140330/10/201420055422200

Then when I look at average, Person ID 41500 would show the average of the three figures ((2000+600+700)/3=1100) instead of two figures ((2000+700)/2=1350). Please see below:

1.JPG

And also other measures show the same issue:

2.JPG

I need record 9601 to be completely removed from all calculations. How do I do it?

Thanks!

Lech_Miszkiewicz
Partner Ambassador/MVP
Partner Ambassador/MVP

Hi, record 9601 should be removed. with this method, because join is done on Transaction id, person and date. As you know in first step we are removing that record.

please post your script so i can check where you is mistake.

cheers Lech, When applicable please mark the correct/appropriate replies as "solution" (you can mark up to 3 "solutions". Please LIKE threads if the provided solution is helpful to the problem.
Anonymous
Not applicable
Author

I must have done something wrong then...Below are my LOAD script:

Transactions:

Load

     max(TransID) as TransactionId,

     Date(Date,'DD/MM/YYYY') as Date,

     [Person ID]

FROM [lib://test/test data.xlsx]

(ooxml, embedded labels, table is Sheet1)

Group by

     [Person ID],

     Date

;

Left Join (Transactions)

Load

     *

FROM [lib://test/test data.xlsx]

(ooxml, embedded labels, table is Sheet1);

tamilarasu
Champion
Champion

Something like below,

Data:
LOAD TransID,
Date(Date,'DD/MM/YYYY') as Date,
[Person ID],
Amount
FROM [lib://test/test data.xlsx]
(
ooxml, embedded labels, table is Sheet1);

Inner Join
LOAD [Person ID],
Date,
Max(TransID) as TransID
Resident Data Group by [Person ID], Date
;


Below link might be helpful for you to understand about joins.


Understanding Join, Keep and Concatenate

Anonymous
Not applicable
Author

Thanks Tamil. I think I understand what my problem was: I kept both TransactionId and TransID. I should have kept only one ID....