Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
I am dealing with a table with running records where one person should only appear once on a day. There are records where one person appear twice on a day. I need to remove these "duplicates" from the table before I can do any analysis. The other fields of the records might not be the same. I plan to keep only the records that have the latest Transaction ID. Eg. if a person has two records on the same day, one Transaction ID is 9011 the other is 9064, I want to keep 9064.
How can I do it? Can I do it in the LOAD step or in expressions?
Thanks in advance!
Hi
i would do it in 2 smal steps during load script:
1 step is to select transactions by aggregating data only to max Transaction Id for each person and then as i assume transaction id is unique you just left joining all transactions to it which will keep only initially loaded records.
remember that in second step you have to use the same field name for transaction Id as in the first step so the join can work properly.
Transactions:
Load
max(transId) as TransactionId
person
From
Your Source
Group by
person
;
Left Join (Transactions)
Load
*
From
Your Source
;
kind regards
Lech
Thanks Lech.
Yes the transaction IDs are unique. But I think your method will remove all the other dates for the same person as well?
An example of my table would look like below (of course there are many other fields apart from these three fields). As you can see, one person can attend multiple times over different dates. What I want to remove is only the ones where they attend on the same day. So in below example, I only want to remove the record with TransID 9601, because on that day the person attended twice, and I only want to keep the latest record.
TransID | Date | Person ID |
9616 | 29/11/2016 | 41500 |
9601 | 29/11/2016 | 41500 |
7150 | 01/03/2016 | 41500 |
9570 | 08/11/2016 | 44626 |
8174 | 08/06/2016 | 44626 |
6744 | 06/11/2015 | 44626 |
3920 | 31/03/2015 | 44626 |
1971 | 18/11/2014 | 44626 |
8466 | 14/07/2016 | 2004647 |
8213 | 20/06/2016 | 2004647 |
7705 | 02/05/2016 | 2004647 |
6186 | 07/10/2015 | 2004647 |
5616 | 24/08/2015 | 2004647 |
8454 | 27/07/2016 | 2005542 |
8247 | 21/06/2016 | 2005542 |
8005 | 17/05/2016 | 2005542 |
6249 | 20/10/2015 | 2005542 |
1403 | 30/10/2014 | 2005542 |
How do I remove this record without removing all the other records with different dates?
And is it possible to do it in expression instead of LOAD (as I might want to keep the original total number of records)?
Thanks!
Hi, then include date in your first load. Just ot make sure it is not timestamp i would use floor(date to convert timestamps to dates) and also use Floor(Date on your second table) or don't use date at all on second one.
Transactions:
Load
max(transId) as TransactionId,
Floor(Date) as Date,
person
From
Your Source
Group by
person,
Floor(Date)
;
Left Join (Transactions)
Load
*
From
Your Source
;
Hi,
Hope the below works!
A:
LOAD
TransID,
"Date",
"Person ID"
FROM [lib://Desktop/Book1345.xlsx]
(ooxml, embedded labels, table is Sheet1);
B:
Load
max(TransID) as TransID,
"Person ID",
"Date"
Resident A Group by
"Person ID","Date";
Drop Table A;
Thanks
Thanks Lech.
I have tested your method and the record 9601 does seem to disappear from the TransactionId field. However this record doesn't seem to have been removed, but replaced by 9616, which means that there are two records marked as 9616. And the measures will look at both record instead of the original 9616 one. Eg. the data table is like this (I added an amount field so you can see how measures are done):
TransID | Date | Person ID | Amount |
9616 | 29/11/2016 | 41500 | 2000 |
9601 | 29/11/2016 | 41500 | 600 |
7150 | 01/03/2016 | 41500 | 700 |
9570 | 08/11/2016 | 44626 | 800 |
8174 | 08/06/2016 | 44626 | 900 |
6744 | 06/11/2015 | 44626 | 1000 |
3920 | 31/03/2015 | 44626 | 1100 |
1971 | 18/11/2014 | 44626 | 1200 |
8466 | 14/07/2016 | 2004647 | 1300 |
8213 | 20/06/2016 | 2004647 | 1400 |
7705 | 02/05/2016 | 2004647 | 1500 |
6186 | 07/10/2015 | 2004647 | 1600 |
5616 | 24/08/2015 | 2004647 | 1700 |
8454 | 27/07/2016 | 2005542 | 1800 |
8247 | 21/06/2016 | 2005542 | 1900 |
8005 | 17/05/2016 | 2005542 | 2000 |
6249 | 20/10/2015 | 2005542 | 2100 |
1403 | 30/10/2014 | 2005542 | 2200 |
Then when I look at average, Person ID 41500 would show the average of the three figures ((2000+600+700)/3=1100) instead of two figures ((2000+700)/2=1350). Please see below:
And also other measures show the same issue:
I need record 9601 to be completely removed from all calculations. How do I do it?
Thanks!
Hi, record 9601 should be removed. with this method, because join is done on Transaction id, person and date. As you know in first step we are removing that record.
please post your script so i can check where you is mistake.
I must have done something wrong then...Below are my LOAD script:
Transactions:
Load
max(TransID) as TransactionId,
Date(Date,'DD/MM/YYYY') as Date,
[Person ID]
FROM [lib://test/test data.xlsx]
(ooxml, embedded labels, table is Sheet1)
Group by
[Person ID],
Date
;
Left Join (Transactions)
Load
*
FROM [lib://test/test data.xlsx]
(ooxml, embedded labels, table is Sheet1);
Something like below,
Data:
LOAD TransID,
Date(Date,'DD/MM/YYYY') as Date,
[Person ID],
Amount
FROM [lib://test/test data.xlsx]
(ooxml, embedded labels, table is Sheet1);
Inner Join
LOAD [Person ID],
Date,
Max(TransID) as TransID
Resident Data Group by [Person ID], Date;
Below link might be helpful for you to understand about joins.
Thanks Tamil. I think I understand what my problem was: I kept both TransactionId and TransID. I should have kept only one ID....