Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi everyone,
I have to load a file, with the following structure (is a sample, there are millions of records):
NAME SURNAME CITY ACTION LASTPOSITION
Mark Pitt NY Train Gym
Luke Hughes LA Play Park
Luke Hughes LA Park
Andy Land NO Work Office
Sam Jones LA Drive
Sam Jones LA Work
Sam Jones LA Work
Sam Jones LA
Sam Jones LA Work Home
.
.
.
When loading these records, I want to use NAME + SURNAME as a Key, so I want only one record for each NAME+SURNAME combination. In the case of multiple records with the same NAME+SURNAME combination, I want to keep only the record with ACTION or LASTPOSITION field not null. In the case mentioned before, I want the following as result:
NAME SURNAME CITY ACTION LASTPOSITION
Mark Pitt NY Train Gym
Luke Hughes LA Play Park
Andy Land NO Work Office
Sam Jones LA Work Home
How can I achieve this result? I would appreciate the code to use. Thanks a lot.
can u try like
where not isnull(ACTION) or not isnull( LASTPOSITION)
Hi
try this
[Table]:
LOAD * INLINE
[
NAME,SURNAME,CITY,ACTION,LASTPOSITION
Mark,Pitt,NY,Train,Gym
Luke,Hughes,LA,Play,Park
Luke,Hughes,LA,,Park
Andy,Land,NO,Work,Office
Sam,Jones,LA,Drive,
Sam,Jones,LA,Work,
Sam,Jones,LA,Work,
Sam,Jones,LA,,
Sam,Jones,LA,Work,Home
](delimiter is ',');
final:
NoConcatenate
load *
resident Table
where len(trim(ACTION))>0 and len(trim(LASTPOSITION))>0 ;
drop table Table;
Resulting Table :
I would try the following code, which basically starts from @brunobertels's solution and adds those NAME&SURNAME that only appears without both ACTION and LASTPOSITION.
Still not sure, though, whether you need/want to treat NAME&SURNAME like a key and therefore enforce unique values.
FACTS_0:
LOAD * INLINE [
NAME,SURNAME,CITY,ACTION,LASTPOSITION
Mark,Pitt,NY,Train,Gym
Luke,Hughes,LA,Play,Park
Luke,Hughes,LA,,Park
Andy,Land,NO,Work,Office
Sam,Jones,LA,Drive,
Sam,Jones,LA,Work,
Sam,Jones,LA,Work,
Sam,Jones,LA,,
Sam,Jones,LA,Work,Home
John,Doe,,,
](delimiter is ',');
FACTS_1:
NOCONCATENATE LOAD DISTINCT
NAME&SURNAME as key,
*
RESIDENT FACTS_0
WHERE
(not(isnull(ACTION)) and trim(ACTION)<>'')
or
(not(isnull(LASTPOSITION)) and trim(LASTPOSITION)<>'')
;
CONCATENATE (FACTS_1)
LOAD DISTINCT *
RESIDENT FACTS_0
WHERE
not exists(key,NAME&SURNAME)
and (
(isnull(ACTION) or trim(ACTION)='')
and
(isnull(LASTPOSITION) or trim(LASTPOSITION)='')
)
;
DROP TABLE FACTS_0;
Hi everyone. My table is not to be inserted manually, the values I mentioned were only an example, so I'm searching for something like this:
Load Distinct Name, Surname... and so on.
In addition, I would like a solution which optimize the usage of data, because I'm dealing with heavy data. So, if the condition of selection could be set before loading all the data, it would be better.
Yes, I need only one occurrence for the combination NAME - SURNAME. There are no records with NAME and SURNAME both null, while there are lots of them with this combination > 1. In this case (NAME-SURNAME>1), I need only the record with ACTION or LASTPOSITION not null.
Another possible solution is to aggregate grouping by NAME and SURNAME, so that you end up with a table that has exactly one row per distinct value of NAME&SURNAME.
LOAD
NAME,
SURNAME,
maxstring(CITY) as CITY,
maxstring(ACTION) as ACTION,
maxstring(LASTPOSITION) as LASTPOSITION
FROM lib://<your-file-connection-name>/file.csv (ansi, txt, delimiter is ',', embedded labels);
GROUP BY
NAME, SURNAME;
Please, note that this might create records that do not exists in the original file. If this is not what you need, the only way is to work with DISTINCT and filters, but this means you have to think of a multi-step data cleansing strategy, rather than relying on a single LOAD/SELECT.
Sorry, I need help again. I notice that in my dataset, there is also a case like this (new data are bolded):
NAME SURNAME CITY ACTION LASTPOSITION
Mark Pitt NY Train Gym
Luke Hughes LA Play Park
Luke Hughes LA Park
Andy Land NO Work Office
Sam Jones LA Drive
Sam Jones LA Work
Sam Jones LA Work
Sam Jones LA
Sam Jones LA Work Home
Phil Terry SA
Phil Terry SA
.
.
.
In this case, both of the two last records have their field ACTION and LASTPOSITION empty. I wanna keep only one of them, removing duplicates. Unfortunately, with the Where condition above discussed (action and lastposition not null), both of these records will be cut off because ACTION and LASTPOSITION are null. Is there a way to keep this condition, but avoid it when the duplicates records have both the fields null? This is the final table I would achieve:
NAME SURNAME CITY ACTION LASTPOSITION
Mark Pitt NY Train Gym
Luke Hughes LA Play Park
Andy Land NO Work Office
Sam Jones LA Work Home
Phil Terry SA
So, basically, you are saying that for each NAME/SURNAME you only need one record, regardless of the values in ACTION and LASTPOSITION. In fact, you consider good rows those having ACTION or LASTPOSITION (thus also both ACTION and LASTPOSITION), and those without ACTION and LASTPOSITION.
This can be achieved this way
FACTS_0:
LOAD *,
rowno() as [Row Number];
LOAD * INLINE [
NAME, SURNAME, CITY, ACTION, LASTPOSITION
Mark, Pitt, NY, Train, Gym
Luke, Hughes, LA, Play, Park
Luke, Hughes, LA, , Park
Andy, Land, NO, Work, Office
Sam, Jones, LA, Drive,
Sam, Jones, LA, Work,
Sam, Jones, LA, Work,
Sam, Jones, LA,,
Sam, Jones, LA, Work, Home
Phil, Terry, SA,,
Phil, Terry, SA,,
];
FACTS_1:
INNER KEEP (FACTS_0)
LOAD
NAME,
SURNAME,
max([Row Number]) as [Row Number]
RESIDENT FACTS_0
GROUP BY NAME, SURNAME;
DROP TABLE FACTS_1;
Please, note that:
Hi, for my purpose, it was almost right your previous code, adding only one thing:
FACTS_0:
LOAD * INLINE [
NAME,SURNAME,CITY,ACTION,LASTPOSITION
Mark,Pitt,NY,Train,Gym
Luke,Hughes,LA,Play,Park
Luke,Hughes,LA,,Park
Andy,Land,NO,Work,Office
Sam,Jones,LA,Drive,
Sam,Jones,LA,Work,
Sam,Jones,LA,Work,
Sam,Jones,LA,,
Sam,Jones,LA,Work,Home
John,Doe,,,
](delimiter is ',');
FACTS_1:
NOCONCATENATE LOAD DISTINCT
NAME&SURNAME as key,
*
RESIDENT FACTS_0
WHERE
(not(isnull(ACTION)) and trim(ACTION)<>'')
or
(not(isnull(LASTPOSITION)) and trim(LASTPOSITION)<>'')
;
CONCATENATE (FACTS_1)
LOAD DISTINCT
NAME&SURNAME as key,
*
RESIDENT FACTS_0
WHERE
not exists(key,NAME&SURNAME)
and (
(isnull(ACTION) or trim(ACTION)='')
and
(isnull(LASTPOSITION) or trim(LASTPOSITION)='')
)
;
DROP TABLE FACTS_0;
I did not want to give no key to unique records with ACTION and LASTPOSITION null, so putting the key also in that part, this issue has been solved. In this way, also, the duplicate record having a null value in ACTION or LASTPOSITION are removed. Thanks a lot!