Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
 SerSwagster
		
			SerSwagster
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		Hi everyone,
I have to load a file, with the following structure (is a sample, there are millions of records):
NAME SURNAME CITY ACTION LASTPOSITION
Mark Pitt NY Train Gym
Luke Hughes LA Play Park
Luke Hughes LA Park
Andy Land NO Work Office
Sam Jones LA Drive
Sam Jones LA Work
Sam Jones LA Work
Sam Jones LA
Sam Jones LA Work Home
.
.
.
When loading these records, I want to use NAME + SURNAME as a Key, so I want only one record for each NAME+SURNAME combination. In the case of multiple records with the same NAME+SURNAME combination, I want to keep only the record with ACTION or LASTPOSITION field not null. In the case mentioned before, I want the following as result:
NAME SURNAME CITY ACTION LASTPOSITION
Mark Pitt NY Train Gym
Luke Hughes LA Play Park
Andy Land NO Work Office
Sam Jones LA Work Home
How can I achieve this result? I would appreciate the code to use. Thanks a lot.
 anat
		
			anat
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		can u try like
where not isnull(ACTION) or not isnull( LASTPOSITION)
 
					
				
		
 brunobertels
		
			brunobertels
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		Hi
try this
[Table]:
LOAD * INLINE 
[
NAME,SURNAME,CITY,ACTION,LASTPOSITION
Mark,Pitt,NY,Train,Gym
Luke,Hughes,LA,Play,Park
Luke,Hughes,LA,,Park
Andy,Land,NO,Work,Office
Sam,Jones,LA,Drive,
Sam,Jones,LA,Work,
Sam,Jones,LA,Work,
Sam,Jones,LA,,
Sam,Jones,LA,Work,Home
](delimiter is ',');
final:
NoConcatenate
load *
resident Table
where len(trim(ACTION))>0 and len(trim(LASTPOSITION))>0 ;
drop table Table;
Resulting Table :
 AG-gugelbisolut
		
			AG-gugelbisolutI would try the following code, which basically starts from @brunobertels's solution and adds those NAME&SURNAME that only appears without both ACTION and LASTPOSITION.
Still not sure, though, whether you need/want to treat NAME&SURNAME like a key and therefore enforce unique values.
FACTS_0:
LOAD * INLINE [
NAME,SURNAME,CITY,ACTION,LASTPOSITION
Mark,Pitt,NY,Train,Gym
Luke,Hughes,LA,Play,Park
Luke,Hughes,LA,,Park
Andy,Land,NO,Work,Office
Sam,Jones,LA,Drive,
Sam,Jones,LA,Work,
Sam,Jones,LA,Work,
Sam,Jones,LA,,
Sam,Jones,LA,Work,Home
John,Doe,,,
](delimiter is ',');
FACTS_1:
NOCONCATENATE LOAD DISTINCT
	NAME&SURNAME as key,
	*
RESIDENT FACTS_0
WHERE
	(not(isnull(ACTION)) and trim(ACTION)<>'')
	or
	(not(isnull(LASTPOSITION)) and trim(LASTPOSITION)<>'')
;
CONCATENATE (FACTS_1)
LOAD DISTINCT *
RESIDENT FACTS_0
WHERE 
	not exists(key,NAME&SURNAME)
	and (
		(isnull(ACTION) or trim(ACTION)='')
		and
		(isnull(LASTPOSITION) or trim(LASTPOSITION)='')
	)
;
DROP TABLE FACTS_0;
 SerSwagster
		
			SerSwagster
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		Hi everyone. My table is not to be inserted manually, the values I mentioned were only an example, so I'm searching for something like this:
Load Distinct Name, Surname... and so on.
In addition, I would like a solution which optimize the usage of data, because I'm dealing with heavy data. So, if the condition of selection could be set before loading all the data, it would be better.
 SerSwagster
		
			SerSwagster
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		Yes, I need only one occurrence for the combination NAME - SURNAME. There are no records with NAME and SURNAME both null, while there are lots of them with this combination > 1. In this case (NAME-SURNAME>1), I need only the record with ACTION or LASTPOSITION not null.
 AG-gugelbisolut
		
			AG-gugelbisolutAnother possible solution is to aggregate grouping by NAME and SURNAME, so that you end up with a table that has exactly one row per distinct value of NAME&SURNAME.
LOAD
	NAME,
	SURNAME,
	maxstring(CITY) as CITY,
	maxstring(ACTION) as ACTION,
	maxstring(LASTPOSITION) as LASTPOSITION
FROM lib://<your-file-connection-name>/file.csv (ansi, txt, delimiter is ',', embedded labels);
GROUP BY
	NAME, SURNAME;Please, note that this might create records that do not exists in the original file. If this is not what you need, the only way is to work with DISTINCT and filters, but this means you have to think of a multi-step data cleansing strategy, rather than relying on a single LOAD/SELECT.
 SerSwagster
		
			SerSwagster
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		Sorry, I need help again. I notice that in my dataset, there is also a case like this (new data are bolded):
NAME SURNAME CITY ACTION LASTPOSITION
Mark Pitt NY Train Gym
Luke Hughes LA Play Park
Luke Hughes LA Park
Andy Land NO Work Office
Sam Jones LA Drive
Sam Jones LA Work
Sam Jones LA Work
Sam Jones LA
Sam Jones LA Work Home
Phil Terry SA
Phil Terry SA
.
.
.
In this case, both of the two last records have their field ACTION and LASTPOSITION empty. I wanna keep only one of them, removing duplicates. Unfortunately, with the Where condition above discussed (action and lastposition not null), both of these records will be cut off because ACTION and LASTPOSITION are null. Is there a way to keep this condition, but avoid it when the duplicates records have both the fields null? This is the final table I would achieve:
NAME SURNAME CITY ACTION LASTPOSITION
Mark Pitt NY Train Gym
Luke Hughes LA Play Park
Andy Land NO Work Office
Sam Jones LA Work Home
Phil Terry SA
 AG-gugelbisolut
		
			AG-gugelbisolutSo, basically, you are saying that for each NAME/SURNAME you only need one record, regardless of the values in ACTION and LASTPOSITION. In fact, you consider good rows those having ACTION or LASTPOSITION (thus also both ACTION and LASTPOSITION), and those without ACTION and LASTPOSITION.
This can be achieved this way
FACTS_0:
LOAD *,
	rowno() as [Row Number];
LOAD * INLINE [
NAME,	SURNAME,	CITY,	ACTION,	LASTPOSITION
Mark,	Pitt,       NY,	    Train,  Gym 
Luke,	Hughes,		LA,     Play,   Park
Luke,	Hughes,     LA,		,		Park
Andy,	Land,       NO,     Work,   Office
Sam,	Jones,      LA,     Drive,          
Sam,	Jones,      LA,		Work,
Sam,	Jones,      LA,     Work,
Sam,	Jones,      LA,,
Sam,	Jones,      LA,		Work,	Home
Phil,	Terry,      SA,,
Phil,	Terry,      SA,,
];
FACTS_1:
INNER KEEP (FACTS_0)
LOAD
	NAME,
   	SURNAME,
    max([Row Number]) as [Row Number]
RESIDENT FACTS_0
GROUP BY NAME, SURNAME;
DROP TABLE FACTS_1;Please, note that:
 SerSwagster
		
			SerSwagster
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		Hi, for my purpose, it was almost right your previous code, adding only one thing:
FACTS_0:
LOAD * INLINE [
NAME,SURNAME,CITY,ACTION,LASTPOSITION
Mark,Pitt,NY,Train,Gym
Luke,Hughes,LA,Play,Park
Luke,Hughes,LA,,Park
Andy,Land,NO,Work,Office
Sam,Jones,LA,Drive,
Sam,Jones,LA,Work,
Sam,Jones,LA,Work,
Sam,Jones,LA,,
Sam,Jones,LA,Work,Home
John,Doe,,,
](delimiter is ',');
FACTS_1:
NOCONCATENATE LOAD DISTINCT
	NAME&SURNAME as key,
	*
RESIDENT FACTS_0
WHERE
	(not(isnull(ACTION)) and trim(ACTION)<>'')
	or
	(not(isnull(LASTPOSITION)) and trim(LASTPOSITION)<>'')
;
CONCATENATE (FACTS_1)
LOAD DISTINCT
NAME&SURNAME as key,
	*
RESIDENT FACTS_0
WHERE 
	not exists(key,NAME&SURNAME)
	and (
		(isnull(ACTION) or trim(ACTION)='')
		and
		(isnull(LASTPOSITION) or trim(LASTPOSITION)='')
	)
;
DROP TABLE FACTS_0;I did not want to give no key to unique records with ACTION and LASTPOSITION null, so putting the key also in that part, this issue has been solved. In this way, also, the duplicate record having a null value in ACTION or LASTPOSITION are removed. Thanks a lot!
