Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
SerSwagster
Creator
Creator

Loading distinct records with conditions

Hi everyone,

I have to load a file, with the following structure (is a sample, there are millions of records):

NAME   SURNAME  CITY    ACTION   LASTPOSITION

Mark     Pitt               NY            Train          Gym 

Luke     Hughes       LA            Play            Park

Luke     Hughes       LA                                 Park

Andy    Land            NO          Work          Office

Sam     Jones          LA            Drive          

Sam     Jones          LA            Work          

Sam     Jones          LA            Work

Sam     Jones         LA

Sam     Jones         LA             Work         Home

.

.

.

When loading  these records, I want to use NAME + SURNAME as a Key, so I want only one record for each NAME+SURNAME combination. In the case of multiple records with the same NAME+SURNAME combination, I want to keep only the record with ACTION or LASTPOSITION field not null. In the case mentioned before, I want the following as result:

NAME   SURNAME  CITY     ACTION   LASTPOSITION

Mark     Pitt                NY           Train           Gym 

Luke     Hughes        LA            Play            Park

Andy    Land             NO          Work          Office   

Sam     Jones           LA           Work         Home

How can I achieve this result? I would appreciate the code to use. Thanks a lot.

9 Replies
anat
Master
Master

can u try like 

where not isnull(ACTION) or not isnull(   LASTPOSITION)

brunobertels
Master
Master

Hi 

try this 

[Table]:
LOAD * INLINE
[
NAME,SURNAME,CITY,ACTION,LASTPOSITION
Mark,Pitt,NY,Train,Gym
Luke,Hughes,LA,Play,Park
Luke,Hughes,LA,,Park
Andy,Land,NO,Work,Office
Sam,Jones,LA,Drive,
Sam,Jones,LA,Work,
Sam,Jones,LA,Work,
Sam,Jones,LA,,
Sam,Jones,LA,Work,Home
](delimiter is ',');

final:
NoConcatenate
load *
resident Table
where len(trim(ACTION))>0 and len(trim(LASTPOSITION))>0 ;

drop table Table;

 

Resulting Table :

brunobertels_0-1644411861577.png

 

AG-gugelbisolutions
Creator II
Creator II

I would try the following code, which basically starts from @brunobertels's solution and adds those NAME&SURNAME that only appears without both ACTION and LASTPOSITION.

Still not sure, though, whether you need/want to treat NAME&SURNAME like a key and therefore enforce unique values. 

FACTS_0:
LOAD * INLINE [
NAME,SURNAME,CITY,ACTION,LASTPOSITION
Mark,Pitt,NY,Train,Gym
Luke,Hughes,LA,Play,Park
Luke,Hughes,LA,,Park
Andy,Land,NO,Work,Office
Sam,Jones,LA,Drive,
Sam,Jones,LA,Work,
Sam,Jones,LA,Work,
Sam,Jones,LA,,
Sam,Jones,LA,Work,Home
John,Doe,,,
](delimiter is ',');

FACTS_1:
NOCONCATENATE LOAD DISTINCT
	NAME&SURNAME as key,
	*
RESIDENT FACTS_0
WHERE
	(not(isnull(ACTION)) and trim(ACTION)<>'')
	or
	(not(isnull(LASTPOSITION)) and trim(LASTPOSITION)<>'')
;

CONCATENATE (FACTS_1)
LOAD DISTINCT *
RESIDENT FACTS_0
WHERE 
	not exists(key,NAME&SURNAME)
	and (
		(isnull(ACTION) or trim(ACTION)='')
		and
		(isnull(LASTPOSITION) or trim(LASTPOSITION)='')
	)
;

DROP TABLE FACTS_0;

 

SerSwagster
Creator
Creator
Author

Hi everyone. My table is not to be inserted manually, the values I mentioned were only an example, so I'm searching for something like this:

Load Distinct Name, Surname... and so on.

In addition, I would like a solution which optimize the usage of data, because I'm dealing with heavy data. So, if the condition of selection could be set before loading all the data, it would be better.

SerSwagster
Creator
Creator
Author

Yes, I need only one occurrence for the combination NAME - SURNAME. There are no records with NAME and SURNAME both null, while there are lots of them with this combination > 1. In this case (NAME-SURNAME>1), I need only the record with ACTION or LASTPOSITION not null.

AG-gugelbisolutions
Creator II
Creator II

Another possible solution is to aggregate grouping by NAME and SURNAME, so that you end up with a table that has exactly one row per distinct value of NAME&SURNAME.

LOAD
	NAME,
	SURNAME,
	maxstring(CITY) as CITY,
	maxstring(ACTION) as ACTION,
	maxstring(LASTPOSITION) as LASTPOSITION
FROM lib://<your-file-connection-name>/file.csv (ansi, txt, delimiter is ',', embedded labels);
GROUP BY
	NAME, SURNAME;

Please, note that this might create records that do not exists in the original file. If this is not what you need, the only way is to work with DISTINCT and filters, but this means you have to think of a multi-step data cleansing strategy, rather than relying on a single LOAD/SELECT.

SerSwagster
Creator
Creator
Author

Sorry, I need help again. I notice that in my dataset, there is also a case like this (new data are bolded):

NAME   SURNAME  CITY    ACTION   LASTPOSITION

Mark     Pitt               NY            Train          Gym 

Luke     Hughes       LA            Play            Park

Luke     Hughes       LA                                 Park

Andy    Land            NO          Work          Office

Sam     Jones          LA            Drive          

Sam     Jones          LA            Work          

Sam     Jones          LA            Work

Sam     Jones         LA

Sam     Jones         LA             Work         Home

Phil      Terry          SA

Phil      Terry         SA

.

.

.

In this case, both of the two last records have their field ACTION and LASTPOSITION empty. I wanna keep only one of them, removing duplicates. Unfortunately, with the Where condition above discussed (action and lastposition not null), both of these records will be cut off because ACTION and LASTPOSITION are null. Is there a way to keep this condition, but avoid it when the duplicates records have both the fields null? This is the final table I would achieve:

NAME   SURNAME  CITY     ACTION   LASTPOSITION

Mark     Pitt                NY           Train           Gym 

Luke     Hughes        LA            Play            Park

Andy    Land             NO          Work          Office   

Sam     Jones           LA           Work         Home

Phil      Terry             SA

AG-gugelbisolutions
Creator II
Creator II

So, basically, you are saying that for each NAME/SURNAME you only need one record, regardless of the values in ACTION and LASTPOSITION. In fact, you consider good rows those having ACTION or LASTPOSITION (thus also both ACTION and LASTPOSITION), and those without ACTION and LASTPOSITION.

This can be achieved this way

FACTS_0:
LOAD *,
	rowno() as [Row Number];
LOAD * INLINE [
NAME,	SURNAME,	CITY,	ACTION,	LASTPOSITION
Mark,	Pitt,       NY,	    Train,  Gym 
Luke,	Hughes,		LA,     Play,   Park
Luke,	Hughes,     LA,		,		Park
Andy,	Land,       NO,     Work,   Office
Sam,	Jones,      LA,     Drive,          
Sam,	Jones,      LA,		Work,
Sam,	Jones,      LA,     Work,
Sam,	Jones,      LA,,
Sam,	Jones,      LA,		Work,	Home
Phil,	Terry,      SA,,
Phil,	Terry,      SA,,
];

FACTS_1:
INNER KEEP (FACTS_0)
LOAD
	NAME,
   	SURNAME,
    max([Row Number]) as [Row Number]
RESIDENT FACTS_0
GROUP BY NAME, SURNAME;

DROP TABLE FACTS_1;

Please,  note that:

  • you can substitute the first load statement with a one that loads data from your dataset directly
    • you only need to add Row Number (preceding load is an efficient way to do that)
    • if you want/need, you can further filter your dataset within the preceding load statement
  • I chose to keep the records with the highest Row Number for each NAME/SURNAME
    • INNER KEEP is efficient but it creates a new table that has to be dropped at the end
SerSwagster
Creator
Creator
Author

Hi, for my purpose, it was almost right your previous code, adding only one thing:

FACTS_0:
LOAD * INLINE [
NAME,SURNAME,CITY,ACTION,LASTPOSITION
Mark,Pitt,NY,Train,Gym
Luke,Hughes,LA,Play,Park
Luke,Hughes,LA,,Park
Andy,Land,NO,Work,Office
Sam,Jones,LA,Drive,
Sam,Jones,LA,Work,
Sam,Jones,LA,Work,
Sam,Jones,LA,,
Sam,Jones,LA,Work,Home
John,Doe,,,
](delimiter is ',');

FACTS_1:
NOCONCATENATE LOAD DISTINCT
	NAME&SURNAME as key,
	*
RESIDENT FACTS_0
WHERE
	(not(isnull(ACTION)) and trim(ACTION)<>'')
	or
	(not(isnull(LASTPOSITION)) and trim(LASTPOSITION)<>'')
;

CONCATENATE (FACTS_1)
LOAD DISTINCT
NAME&SURNAME as key,
	*
RESIDENT FACTS_0
WHERE 
	not exists(key,NAME&SURNAME)
	and (
		(isnull(ACTION) or trim(ACTION)='')
		and
		(isnull(LASTPOSITION) or trim(LASTPOSITION)='')
	)
;

DROP TABLE FACTS_0;

I did not want to give no key to unique records with ACTION and LASTPOSITION null, so putting the key also in that part, this issue has been solved. In this way, also, the duplicate record having a null value in ACTION or LASTPOSITION are removed. Thanks a lot!