Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
The table I have looks like this:
Product ID | Date | Year |
PMX99 | 2022-02-29 | 2022 |
PMX99 | 2021-12-08 | 2021 |
ORA31 | 2022-05-05 | 2022 |
ORA31 | 2021-09-23 | 2021 |
I want to sort descending by Date, select only distinct Product ID, and return the Product ID and Year. The output would look like this (expected)
Product ID | Year |
PMX99 | 2022 |
ORA31 | 2022 |
The code I have is
TempTable:
LOAD
"Product ID",
"Date",
"Year"
FROM [lib://excel.xlsx (ABC)](ooxml, embedded labels, table is [Sheet1]);
MappingTable:
LOAD
DISTINCT "Product ID",
"Year"
RESIDENT TempTable
ORDER BY "Product ID", "Date" DESC;
DROP TABLE TempTable;
Which produces (actual)
Product ID | Year |
PMX99 | 2022 |
PMX99 | 2021 |
ORA31 | 2022 |
ORA31 | 2021 |
Why are there duplicates in Product ID? How do I get the output I expect?
Hi
Distinct is used to decorate Load but not a specific Field.
In your case,each ID matched two different Date in two different year.That means for combination of ID and Year,your record is unique.
If one ID match several different dates in same year,you will got several same records after you load ProductID and Year without distinct.Only in this situation,you need add distinct to your code.
@zhaofeng Your example is very helpful to show why this is happening. Thank you. I'm not sure where to add DISTINCT to my code to correct the issue. Can you show me where to add DISTINCT?
Distinct always added after Load.Your code is correct.
LOAD DISTINCT
"Product ID",
"Year"
RESIDENT TempTable
ORDER BY "Product ID", "Date" DESC;
The output is also correct.However it is different from your expection.I think you may want to load the latest Year for each ID.If that,you need to write the following code.
Load
"Product ID",
max(Year) as Year
Resident TempTable
group by
"Product ID"
;
@zhaofeng Thanks I'll try that. It's slightly more complicated though. In the actual table I need to take the MAX(Date) and return the Year from that row (along with the Product ID). In the actual data set the Year is inconsistent, but the Date is always correct. So I have to use the Date column as a MAX and then grab whatever Year is there (sometimes Year is "2023" sometimes it's "TwentyTwentyThree", etc.).
Yep.The first thing you need to do is unify your data format.Such as modify "Twenty Twenty Three" to 2023.