Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi everyone. I have loaded this input data:
Users:
Name City
Tom Toronto
Tom New York
Nick Philadelphia
Nick Los Angeles
Nick Philadelphia
Sam New York
Sam Chicago
Sam New York
Sam Chicago
John Phoenix
Vincent San Antonio
Vincent San Antonio
Bob Miami
Bob Boston
I want to reload these data in another table, aggregating by Name field, keeping only the record with most City occurrences, and otherwise, in case of the same number of occurrences for both Cities, to keep the distinct records, without repetition. I show you the result I expect:
Name City
Tom Toronto
Tom New York
Nick Philadelphia
Sam New York
Sam Chicago
John Phoenix
Vincent San Antonio
Bob Miami
Bob Boston
What is the code to achieve this result? I thought something like this:
Load Name,
if(not isnull(mode(City)) -> mode(City)
if(isnull(mode(City)) -> all the distinct records
Resident Users
Group by Name;
Any suggestions?
This is a mockup and probably won't work as is, but the general gist is:
Table1:
Load City, Name, count(1) as Count
From YourTable
Group by City, Name;
Map:
Mapping Load Name, Count
Where RowNumber=1;
Load City, Name, Count, Autonumber(City, Name) as RowNumber
Order by Count desc;
Resident Table1;
Final:
Load Name, City
Resident Table1
Where Count = ApplyMap('Map', Name);
Drop Table Table1;
The added complexity with the mapping load and additional final load is to deal with ties, if you just wanted the single biggest (with ties being broken arbitrarily or using another field) you could just use the row=1 approach directly.
There are probably many ways to do this, and this might not be the most efficient, but...
1) Load the city, name, and count of lines for each combination
2) For each name load the maximum number of lines using group by name, might be easiest to use a mapping load here
3) From the first table, load only name/city combinations where the count matches the maximum number from the previous load (2)
I think this approach should get you what you're after without having to use loops, which would likely cause poor performance (depending on the data volume). As I said, there may be a simpler approach with fewer steps, but this is what comes to mind.
This script will fit within a huge script that already takes more than 5 minutes to be executed. So I'm searching for, if it is possible, the best solution in order to not put additional burdens on my actual script. Do you think that your solution is the best in terms of impact on loading times?
I doubt it's the best, but it shouldn't be a problem unless you're iterating over many, many millions of lines for City and Name. Everything would be done using Resident loads and it's only two fields, so I wouldn't expect much of a negative impact under normal circumstances.
Can you write here, briefly, the script of your solution? It takes into account also the cases in which there are two cities with the same number of records for a certain name?
This is a mockup and probably won't work as is, but the general gist is:
Table1:
Load City, Name, count(1) as Count
From YourTable
Group by City, Name;
Map:
Mapping Load Name, Count
Where RowNumber=1;
Load City, Name, Count, Autonumber(City, Name) as RowNumber
Order by Count desc;
Resident Table1;
Final:
Load Name, City
Resident Table1
Where Count = ApplyMap('Map', Name);
Drop Table Table1;
The added complexity with the mapping load and additional final load is to deal with ties, if you just wanted the single biggest (with ties being broken arbitrarily or using another field) you could just use the row=1 approach directly.
It works, thank you!
One last question, to understand better your procedure: what is the specific step in which are kept only the combinations with most occurrences? Because the usage of RowNumber is not so clear for me, since I see sometimes the same combination, with the same count, with RowNumber = 1 and other times = 2.
Anyway, the problem is solved.
This is the condition in the final load. We previously created a table that contains each name and the matching greatest count. It doesn't care which counties are associated with number, just what the number is. We then look for any name/country combination that matches that number and keep all of them. A bit tricky, and possibly not the simplest way to achieve this, but I was thinking it'd work.