Skip to main content
Announcements
Global Transformation Awards! Applications are now open. Submit Entry
cancel
Showing results for 
Search instead for 
Did you mean: 
BS
Contributor III
Contributor III

"where" clause to show the most recent record among duplicates

I have a series of Excels that I have to load in a QVD. These excels are extractions from a ticketing tool. This extraction is done every week and stored in different excels. The problem is that, if a ticket has been modified (for example, its Status changes from "In execution" to "Completed") in the extraction, this ticket returns with the same ID.

In the excels I also have a field called "Updated" that records the day of the modification. In my QVD I need to get ONLY the ticket that has been modified most recently. This is the code section where I load the excels:

PeticionesTareasTME:
LOAD *
FROM
[$(vPathCSVTLF)/*.xlsx]
(ooxml, embedded labels);

My idea was to make a "where" clause similar to the following:

where if (Exists (ID), max (Updated)) // If you find a repeated ID among the excels, return the one with the most recent "Updated" date field.

This sentence has not worked for me but that is what I want to do.

I hope you can help me.

1 Solution

Accepted Solutions
asinha1991
Creator III
Creator III

for identical, this will remove duplicates

PeticionesTareasTME:
LOAD *,rowNo() as RNO
FROM
[$(vPathCSVTLF)/*.xlsx]
(ooxml, embedded labels);

inner join(PeticionesTareasTME)

Load ID,max(updated)as updated Resident PeticionesTareasTME Group by ID;

inner join(PeticionesTareasTME)

Load ID,max(RNO)as RNO Resident PeticionesTareasTME Group by ID;

for non identical you will need to decide logic

 

 

View solution in original post

15 Replies
StarinieriG
Partner - Specialist
Partner - Specialist

Hi,

you could try to sort data by ID and Updated (desc), then use firstsortedvalue

asinha1991
Creator III
Creator III

I don't know how many columns you have hence wont suggest group by all , you can do this

 

PeticionesTareasTME:
LOAD *
FROM
[$(vPathCSVTLF)/*.xlsx]
(ooxml, embedded labels);

inner join(PeticionesTareasTME)

Load ID,max(updated)as updated Resident PeticionesTareasTME Group by ID;

 

BS
Contributor III
Contributor III
Author

It did take some duplicates away but not all of them.

Thank you so much anyway.

Any other ideas?

StarinieriG
Partner - Specialist
Partner - Specialist

Maybe using right join with the same logic suggested by @asinha1991 

PeticionesTareasTME:
LOAD *
FROM
[$(vPathCSVTLF)/*.xlsx]
(ooxml, embedded labels);

Right Join

LOAD 

ID,

Max(Updated) as Updated

Resident PeticionesTareasTME;

 

asinha1991
Creator III
Creator III

you can try what @StarinieriG  had  suggested

but as per my view, both inner join and right join should work, can you share data where you see mismatch?

I think there is  some kind of issue with grouping (looks like what seems to be same ID are creating multiple groups, maybe you have space in some of them?)

 

BS
Contributor III
Contributor III
Author

Here's the results i'm getting ("#" = ID and "Actualizado" = Updated).

https://snipboard.io/YZ836f.jpg (sorry for the external link, I'm getting an error trying to insert a picture)

As you can see, I get the record twice with the two dates of the updated field, when I really only need to get the most recent one.

 

BS
Contributor III
Contributor III
Author

wait, in the same image you can see that there are identical records in both ID and Updated. I think the code you suggested is generating duplicates, maybe?

asinha1991
Creator III
Creator III

are there duplicates in source data?

asinha1991
Creator III
Creator III

try this once and also add trim()  in ID just to make sure
PeticionesTareasTME:
LOAD *,rowNo() as RNO,num(updated) as updated_num
FROM
[$(vPathCSVTLF)/*.xlsx]
(ooxml, embedded labels);
inner join(PeticionesTareasTME)
Load ID, num(max(updated)) as updated_num,max(RNO) as RNO Resident PeticionesTareasTME Group by ID;