Skip to main content
Announcements
Customer Spotlight: Discover what’s possible with embedded analytics Oct. 16 at 10:00 AM ET: REGISTER NOW
cancel
Showing results for 
Search instead for 
Did you mean: 
Not applicable

Need help with removing duplicates from a table

Hi All

I have a table with 2 columns.  GUID, Date

GUID, Date

1, 1/30/2012

2, 1/01/2012

3, 1/1/2012

1, 1/31/2012

5, 1/5/2012

What's the best to remove the duplicate GUID based on the latest Date?

From the table above,

I want to get this:

2, 1/01/2012

3, 1/1/2012

1, 1/31/2012

5, 1/5/2012

I have tried using the peek function and doing a load ordered by guid desc, date desc, but it still leaves some duplicates and I don't undertand why.

Is there a better way to do this?

1 Solution

Accepted Solutions
swuehl
MVP
MVP

You need to use an aggregation function around each field that you don't list with group by clause but that use in the load statement.

(Or in other, the Help file words:

group by is a clause used for defining over which fields the data should be aggregated (grouped). The aggregation fields should be included in some way in the expressions loaded. No other fields than the aggregation fields may be used outside aggregation functions in the loaded expressions. )

So you could use something like

LOAD

GUID,

Date(max(Date),'M/DD/YYYY') as Date,

Date(FirstSortedValue(Closed_date,  -Date),'M/DD/YYYY') as Closed_date

resident Table group by GUID;

View solution in original post

12 Replies
swuehl
MVP
MVP

You could do a group by GUID load and finding the latest Date with max(Date):

LOAD

GUID,

Date(max(Date),'M/DD/YYYY') as Date

resident Table group by GUID;

Field Date must be of Date Type with a numerical representation, so that max() function will work on the numerical representation.

Hope this helps,

Stefan

Not applicable
Author

HI Stefan

It worked great.  However, when I had some additional columns in my table, it gave an invalid expression:

GUID, Date, Closed_date

1, 1/30/2012, 1/31/2012

2, 1/01/2012, 1/03/2012

3, 1/1/2012, 1/04/2012

1, 1/31/2012, 2/2/2012

5, 1/5/2012, 1/6/2012

How come it only works with 2 columns?

swuehl
MVP
MVP

You need to use an aggregation function around each field that you don't list with group by clause but that use in the load statement.

(Or in other, the Help file words:

group by is a clause used for defining over which fields the data should be aggregated (grouped). The aggregation fields should be included in some way in the expressions loaded. No other fields than the aggregation fields may be used outside aggregation functions in the loaded expressions. )

So you could use something like

LOAD

GUID,

Date(max(Date),'M/DD/YYYY') as Date,

Date(FirstSortedValue(Closed_date,  -Date),'M/DD/YYYY') as Closed_date

resident Table group by GUID;

Not applicable
Author

That answered my question.  Thank you very much!

Not applicable
Author

I just realized that my data has duplicate sort-weight which is causing null to be returned.

In another words:

GUID, Date, Closed_date

1, 1/30/2012, 1/31/2012

2, 1/01/2012, 1/03/2012

3, 1/1/2012, 1/04/2012

1, 1/31/2012, 2/2/2012

5, 1/5/2012, 1/6/2012

1, 1/30/2012, 1/31/2012

when firstsortedvalue (guid, closed_date) for GUID 1 returns NULL.

Is there another way to eliminate duplicates without using firstsortedvalue? 

swuehl
MVP
MVP

If you are coping with duplicate rows, you can use a distinct load.

And you can also add distinct keyword to your Firstsortedvalue() function to return the first record if you have duplicate rows with same sort weight (and still get the record with the min/max date for the other records):

firstsortedvalue (distinct guid, closed_date)

danielact
Partner - Creator III
Partner - Creator III

You can also use an inner join to get all the columns after getting the max date:

LOAD

GUID,

Date(max(Date),'M/DD/YYYY') as Date

resident Table group by GUID;

inner join load

GUID

Date

Closed_Date

resident Table;

Not applicable
Author

It seems using the previous function worked the way I wanted.

Please let me know if I'm doing anything wrong.  I couldn't get peek to work and I couldn't get firstsortedvalue to work.

Thanks!

swuehl
MVP
MVP

firstsortedvalue() with distinct should work (see attached). The editor highlights a syntax error, but this is one of the editor bugs, I think.