Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
Not applicable

Removing duplicates, getting latest value via multiple criteria

Hi there,

I have a set of data like this

            

Product NoDate CreatedRecord NoDate AlteredReason
14/8/2014RN00014/6/2014Damaged
14/8/2014RN00254/6/2014AutoRepaired
14/1/2015RN90904/11/2014Damaged
25/9/2015RN00023/9/2015Creation
36/8/2015RN00036/7/2015Damaged
36/8/2015RN00046/7/2015AutoRepaired
310/10/2015RN00069/9/2015Damaged

I would like to remove duplicated based on product no, and get the latest value. Which should look like this

(This is to be done in load script)

Product NoDate CreatedRecord NoDate AlteredReason
14/1/2015RN90904/11/2014Damaged
25/9/2015RN00023/9/2015Creation
310/10/2015RN00069/9/2015Damaged

Basically for each product no, i would like to obtain the data that has the LATEST DateCreated,RecordNo,DateAltered.

Thank you.

1 Solution

Accepted Solutions
maxgro
MVP
MVP

another example, in this one I want the oldest date altered, first rec no, ; you can see I changed the fields order in order by and replace descending with ascending (asc is the default, no need to specify)

NewTable:

NoConcatenate load *

Resident Table

Where [Product No]<>Peek('Product No')

order by [Product No], [Date Altered], [Record No], [Date Created];

1.png

View solution in original post

32 Replies
sunny_talwar

Try this:

Table:

LOAD [Product No],

  FirstSortedValue([Date Created], -[Date Created]) as [Date Created],

  FirstSortedValue([Record No], -[Date Created]) as [Record No],

  FirstSortedValue([Date Altered], -[Date Created]) as [Date Altered],

  FirstSortedValue(Reason, -[Date Created]) as Reason

Group By [Product No];

LOAD * INLINE [

    Product No, Date Created, Record No, Date Altered, Reason

    1, 4/8/2014, RN0001, 4/6/2014, Damaged

    1, 4/8/2014, RN0025, 4/6/2014, AutoRepaired

    1, 4/1/2015, RN9090, 4/11/2014, Damaged

    2, 5/9/2015, RN0002, 3/9/2015, Creation

    3, 6/8/2015, RN0003, 6/7/2015, Damaged

    3, 6/8/2015, RN0004, 6/7/2015, AutoRepaired

    3, 10/10/2015, RN0006, 9/9/2015, Damaged

];


Capture.PNG

Not applicable
Author

Hi i used resident table, and your code, the result is that it removed all records that have duplicate, leaving only product code 2 behind when i tried searching for it.

This data set is only a small portion of my data, my real data has around 100k+ but the concept is exactly the same.

---------------------------

I tried your code, i know why it disappears, i was looking at a set of products, that are of all the same values for every field except for recordno.

Is it possible for it to remove all duplicates, and seeing if say we have this

This is the example set of data that disappeared.

    Product No, Date Created, Record No, Date Altered, Reason

    1, 4/8/2014, RN0001, 4/6/2014, Damaged

    1, 4/8/2014, RN0025, 4/6/2014, AutoRepaired

    1, 4/8/2014, RN9090, 4/6/2014, Damaged


Because the code locks it to datecreated, hence i can understand why it would disappear as all the datecreated are the same.

Is there a way for it to look at all 3 fields(datecreated, recordno and date altered)? For example

If all the 3 entries above for Date created are the same, it will look at record no, and pick the last record no.

If even the record no are all the same, then it will look at date altered and find the latest one.


in short is it possible for it to pick, the latest date created & last record no & latest date altered?

sunny_talwar

Try this code:

Table:

LOAD [Product No],

  FirstSortedValue([Date Created], -([Date Created]+RecNo())) as [Date Created],

  FirstSortedValue([Record No], -([Date Created]+RecNo())) as [Record No],

  FirstSortedValue([Date Altered], -([Date Created]+RecNo())) as [Date Altered],

  FirstSortedValue(Reason, -([Date Created]+RecNo())) as Reason

Group By [Product No];

LOAD * INLINE [

    Product No, Date Created, Record No, Date Altered, Reason

    1, 4/8/2014, RN0001, 4/6/2014, Damaged

    1, 4/8/2014, RN0025, 4/6/2014, AutoRepaired

    1, 4/8/2015, RN9090, 4/11/2014, Damaged

    2, 5/9/2015, RN0002, 3/9/2015, Creation

    3, 6/8/2015, RN0003, 6/7/2015, Damaged

    3, 6/8/2015, RN0004, 6/7/2015, AutoRepaired

    3, 10/10/2015, RN0006, 9/9/2015, Damaged

];

Not applicable
Author

Hi do you mind explaining how does adding recno() help? Thanks so much.


Is there another way to solve this, i have another part coming up doing the same thing, And this time the field will be mixture of text and numbers like the field in RECORD NO. Will it be confused if a larger set of say 1 million rows (not sorted) involved?


Thank you.

tyagishaila
Specialist
Specialist

You can try it,

ProdMaxDate:

LOAD * INLINE [

    Product No, Date Created, Record No, Date Altered, Reason

    1, 4/8/2014, RN0001, 4/6/2014, Damaged

    1, 4/8/2014, RN0025, 4/6/2014, AutoRepaired

    1, 4/1/2015, RN9090, 4/11/2014, Damaged

    2, 5/9/2015, RN0002, 3/9/2015, Creation

    3, 6/8/2015, RN0003, 6/7/2015, Damaged

    3, 6/8/2015, RN0004, 6/7/2015, AutoRepaired

    3, 10/10/2015, RN0006, 9/9/2015, Damaged

];

LatestDate:

inner join

Load [Product No],

Date(max([Date Created])) as [Date Created]

Resident ProdMaxDate

group by [Product No];

Not applicable
Author

Hi as from previous top replies, this wont work because if

3, 6/8/2015, RN0003, 6/7/2015, Damaged

    3, 6/8/2015, RN0004, 6/7/2015, AutoRepaired

    3, 6/8/2015, RN0006, 9/9/2015, Damaged

This happens, then it wont get the correct one cause all the date created are the same.

Do you have one that takes max of datecreated, max of recordNo and max of date altered?

For example for cases like this,

3, 6/8/2015, RN0003, 6/7/2015, Damaged

3, 6/8/2015, RN0004, 6/7/2015, AutoRepaired

3, 6/8/2015, RN0006, 6/7/2015, Damaged

So is there a script that takes into account all possible scenarios?

sunny_talwar

Did you try this already?

Table:

LOAD [Product No],

  FirstSortedValue([Date Created], -([Date Created]+RecNo())) as [Date Created],

  FirstSortedValue([Record No], -([Date Created]+RecNo())) as [Record No],

  FirstSortedValue([Date Altered], -([Date Created]+RecNo())) as [Date Altered],

  FirstSortedValue(Reason, -([Date Created]+RecNo())) as Reason

Group By [Product No];

LOAD * INLINE [

    Product No, Date Created, Record No, Date Altered, Reason

    1, 4/8/2014, RN0001, 4/6/2014, Damaged

    1, 4/8/2014, RN0025, 4/6/2014, AutoRepaired

    1, 4/8/2015, RN9090, 4/11/2014, Damaged

    2, 5/9/2015, RN0002, 3/9/2015, Creation

    3, 6/8/2015, RN0003, 6/7/2015, Damaged

    3, 6/8/2015, RN0004, 6/7/2015, AutoRepaired

    3, 10/10/2015, RN0006, 9/9/2015, Damaged

];

sunny_talwar

RecNo() is the order of your rows. So if you see it in table form, you will see this:

    

Product NoDate CreatedRecord NoDate AlteredReasonRecNo
14/8/2014RN00014/6/2014Damaged1
14/8/2014RN00254/6/2014AutoRepaired2
14/1/2015RN90904/11/2014Damaged3
25/9/2015RN00023/9/2015Creation4
36/8/2015RN00036/7/2015Damaged5
36/8/2015RN00046/7/2015AutoRepaired6
310/10/2015RN00069/9/2015Damaged7

So I am sorting by [Date Created] + RecNo(). So when they are all the same the one with the highest RecNo() row will be selected.

Not applicable
Author

Hi SunnyT,

Yes i did try on a small set of my data and it works, Not sure if it is the same on larger scale,

I just would like to understand something. I understand what you explained.

   

Product NoDate CreatedRecord NoDate AlteredReasonRecNo
14/8/2014RN00014/6/2014Damaged1
14/8/2014RN00254/6/2014AutoRepaired2
14/1/2015RN90904/11/2014Damaged3
25/9/2015RN00023/9/2015Creation4
36/8/2015RN00036/7/2015Damaged5
36/8/2015RN00046/7/2015AutoRepaired6
310/10/2015RN00069/9/2015Damaged7

The data above is already pretty much sorted.

What happens if the data is like this?

  

Product NoDate CreatedRecord NoDate AlteredReason
14/8/2014RN00014/6/2014Damaged
14/8/2014RN00954/6/2014AutoRepaired
14/8/2015RN00354/11/2014Damaged
36/8/2015RN00086/7/2015Damaged
36/8/2015RN00046/7/2015AutoRepaired
36/8/2015RN00069/9/2015Damaged

This set of data is unsorted, as you can see the latest record no is the 2nd row for product no 1, and for product no 3, the latest record no is the 4th row.

How would your script handle this as we didnt mention how the data of record no and date altered will be sorted.

Does firstsortedvalue sort unspecified columns(since recordno and date altered are not mentioned)

Thank you so much SunnyT, Appreciate all your replies as it really improved my learning.

Oh another ques, Is it possible to do [Date Created]+[Record No]+[Date Altered] instead of [Date Created] + RecNo() ?

I tried it, it didnt work of course.