Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi All
I have a table with 2 columns. GUID, Date
GUID, Date
1, 1/30/2012
2, 1/01/2012
3, 1/1/2012
1, 1/31/2012
5, 1/5/2012
What's the best to remove the duplicate GUID based on the latest Date?
From the table above,
I want to get this:
2, 1/01/2012
3, 1/1/2012
1, 1/31/2012
5, 1/5/2012
I have tried using the peek function and doing a load ordered by guid desc, date desc, but it still leaves some duplicates and I don't undertand why.
Is there a better way to do this?
You need to use an aggregation function around each field that you don't list with group by clause but that use in the load statement.
(Or in other, the Help file words:
group by is a clause used for defining over which fields the data should be aggregated (grouped). The aggregation fields should be included in some way in the expressions loaded. No other fields than the aggregation fields may be used outside aggregation functions in the loaded expressions. )
So you could use something like
LOAD
GUID,
Date(max(Date),'M/DD/YYYY') as Date,
Date(FirstSortedValue(Closed_date, -Date),'M/DD/YYYY') as Closed_date
resident Table group by GUID;
You could do a group by GUID load and finding the latest Date with max(Date):
LOAD
GUID,
Date(max(Date),'M/DD/YYYY') as Date
resident Table group by GUID;
Field Date must be of Date Type with a numerical representation, so that max() function will work on the numerical representation.
Hope this helps,
Stefan
HI Stefan
It worked great. However, when I had some additional columns in my table, it gave an invalid expression:
GUID, Date, Closed_date
1, 1/30/2012, 1/31/2012
2, 1/01/2012, 1/03/2012
3, 1/1/2012, 1/04/2012
1, 1/31/2012, 2/2/2012
5, 1/5/2012, 1/6/2012
How come it only works with 2 columns?
You need to use an aggregation function around each field that you don't list with group by clause but that use in the load statement.
(Or in other, the Help file words:
group by is a clause used for defining over which fields the data should be aggregated (grouped). The aggregation fields should be included in some way in the expressions loaded. No other fields than the aggregation fields may be used outside aggregation functions in the loaded expressions. )
So you could use something like
LOAD
GUID,
Date(max(Date),'M/DD/YYYY') as Date,
Date(FirstSortedValue(Closed_date, -Date),'M/DD/YYYY') as Closed_date
resident Table group by GUID;
That answered my question. Thank you very much!
I just realized that my data has duplicate sort-weight which is causing null to be returned.
In another words:
GUID, Date, Closed_date
1, 1/30/2012, 1/31/2012
2, 1/01/2012, 1/03/2012
3, 1/1/2012, 1/04/2012
1, 1/31/2012, 2/2/2012
5, 1/5/2012, 1/6/2012
1, 1/30/2012, 1/31/2012
when firstsortedvalue (guid, closed_date) for GUID 1 returns NULL.
Is there another way to eliminate duplicates without using firstsortedvalue?
If you are coping with duplicate rows, you can use a distinct load.
And you can also add distinct keyword to your Firstsortedvalue() function to return the first record if you have duplicate rows with same sort weight (and still get the record with the min/max date for the other records):
firstsortedvalue (distinct guid, closed_date)
You can also use an inner join to get all the columns after getting the max date:
LOAD
GUID,
Date(max(Date),'M/DD/YYYY') as Date
resident Table group by GUID;
inner join load
GUID
Date
Closed_Date
resident Table;
It seems using the previous function worked the way I wanted.
Please let me know if I'm doing anything wrong. I couldn't get peek to work and I couldn't get firstsortedvalue to work.
Thanks!
firstsortedvalue() with distinct should work (see attached). The editor highlights a syntax error, but this is one of the editor bugs, I think.