Skip to main content
Announcements
Global Transformation Awards! Applications are now open. Submit Entry
cancel
Showing results for 
Search instead for 
Did you mean: 
SerSwagster
Creator
Creator

Aggregating and keeping only the prevalent records

Hi everyone. I have loaded this input data:

Users:

Name     City

Tom    Toronto

Tom   New York   

Nick  Philadelphia

Nick  Los Angeles

Nick  Philadelphia

Sam   New York

Sam   Chicago

Sam  New York

Sam   Chicago

John  Phoenix

Vincent  San Antonio

Vincent  San Antonio

Bob   Miami

Bob   Boston

I want to reload these data in another table, aggregating by Name field, keeping only the record with most City occurrences, and otherwise, in case of the same number of occurrences for both Cities, to keep the distinct records, without repetition. I show you the result I expect:

Name     City

Tom    Toronto

Tom   New York   

Nick  Philadelphia

Sam   New York

Sam   Chicago

John  Phoenix

Vincent  San Antonio

Bob   Miami

Bob   Boston

What is the code to achieve this result? I thought something like this:

Load Name,

if(not isnull(mode(City)) -> mode(City)

if(isnull(mode(City)) -> all the distinct records

Resident Users

Group by Name;

Any suggestions?

Labels (3)
1 Solution

Accepted Solutions
Or
MVP
MVP

This is a mockup and probably won't work as is, but the general gist is:

Table1:

Load City, Name, count(1)  as Count

From YourTable

Group by City, Name;

Map:

Mapping Load Name, Count

Where RowNumber=1;

Load City, Name, Count, Autonumber(City, Name) as RowNumber

Order by Count desc;

Resident Table1;

Final:

Load Name, City

Resident Table1

Where Count =  ApplyMap('Map', Name);

Drop Table Table1;

 

The added complexity with the mapping load and additional final load is to deal with ties, if you just wanted the single biggest (with ties being broken arbitrarily or using another field) you could just use the row=1 approach directly.

View solution in original post

7 Replies
Or
MVP
MVP

There are probably many ways to do this, and this might not be the most efficient, but...

1) Load the city, name, and count of lines for each combination

2) For each name load the maximum number of lines using group by name, might be easiest to use a mapping load here

3) From the first table, load only name/city combinations where the count matches the maximum number from the previous load (2)

I think this approach should get you what you're after without having to use loops, which would likely cause poor performance (depending on the data volume). As I said, there may be a simpler approach with fewer steps, but this is what comes to mind.

SerSwagster
Creator
Creator
Author

This script will fit within a huge script that already takes more than 5 minutes to be executed. So I'm searching for, if it is possible, the best solution in order to not put additional burdens on my actual script. Do you think that your solution is the best in terms of impact on loading times?

Or
MVP
MVP

I doubt it's the best, but it shouldn't be a problem unless you're iterating over many, many millions of lines for City and Name. Everything would be done using Resident loads and it's only two fields, so I wouldn't expect much of a negative impact under normal circumstances.

SerSwagster
Creator
Creator
Author

Can you write here, briefly, the script of your solution? It takes into account also the cases in which there are two cities with the same number of records for a certain name?

Or
MVP
MVP

This is a mockup and probably won't work as is, but the general gist is:

Table1:

Load City, Name, count(1)  as Count

From YourTable

Group by City, Name;

Map:

Mapping Load Name, Count

Where RowNumber=1;

Load City, Name, Count, Autonumber(City, Name) as RowNumber

Order by Count desc;

Resident Table1;

Final:

Load Name, City

Resident Table1

Where Count =  ApplyMap('Map', Name);

Drop Table Table1;

 

The added complexity with the mapping load and additional final load is to deal with ties, if you just wanted the single biggest (with ties being broken arbitrarily or using another field) you could just use the row=1 approach directly.

SerSwagster
Creator
Creator
Author

It works, thank you!

One last question, to understand better your procedure: what is the specific step in which are kept only the combinations with most occurrences? Because the usage of RowNumber is not so clear for me, since I see sometimes the same combination, with the same count, with RowNumber = 1 and other times = 2.

Anyway, the problem is solved.

Or
MVP
MVP

This is the condition in the final load. We previously created a table that contains each name and the matching greatest count. It doesn't care which counties are associated with number, just what the number is. We then look for any name/country combination that matches that number and keep all of them. A bit tricky, and possibly not the simplest way to achieve this, but I was thinking it'd work.