Re: Loading distinct records with conditions - Qlik Community

SerSwagster · ‎2022-02-09

Hi everyone,

I have to load a file, with the following structure (is a sample, there are millions of records):

NAME SURNAME CITY ACTION LASTPOSITION

Mark Pitt NY Train Gym

Luke Hughes LA Play Park

Luke Hughes LA Park

Andy Land NO Work Office

Sam Jones LA Drive

Sam Jones LA Work

Sam Jones LA

Sam Jones LA Work Home

.

When loading these records, I want to use NAME + SURNAME as a Key, so I want only one record for each NAME+SURNAME combination. In the case of multiple records with the same NAME+SURNAME combination, I want to keep only the record with ACTION or LASTPOSITION field not null. In the case mentioned before, I want the following as result:

NAME SURNAME CITY ACTION LASTPOSITION

Mark Pitt NY Train Gym

Luke Hughes LA Play Park

Andy Land NO Work Office

Sam Jones LA Work Home

How can I achieve this result? I would appreciate the code to use. Thanks a lot.

anat · ‎2022-02-09

can u try like

where not isnull(ACTION) or not isnull( LASTPOSITION)

brunobertels · ‎2022-02-09

Hi

try this

[Table]:
LOAD * INLINE
[
NAME,SURNAME,CITY,ACTION,LASTPOSITION
Mark,Pitt,NY,Train,Gym
Luke,Hughes,LA,Play,Park
Luke,Hughes,LA,,Park
Andy,Land,NO,Work,Office
Sam,Jones,LA,Drive,
Sam,Jones,LA,Work,
Sam,Jones,LA,Work,
Sam,Jones,LA,,
Sam,Jones,LA,Work,Home
](delimiter is ',');

final:
NoConcatenate
load *
resident Table
where len(trim(ACTION))>0 and len(trim(LASTPOSITION))>0 ;

drop table Table;

Resulting Table :

AG-gugelbisolutions · ‎2022-02-09

I would try the following code, which basically starts from @brunobertels's solution and adds those NAME&SURNAME that only appears without both ACTION and LASTPOSITION.

Still not sure, though, whether you need/want to treat NAME&SURNAME like a key and therefore enforce unique values.

FACTS_0:
LOAD * INLINE [
NAME,SURNAME,CITY,ACTION,LASTPOSITION
Mark,Pitt,NY,Train,Gym
Luke,Hughes,LA,Play,Park
Luke,Hughes,LA,,Park
Andy,Land,NO,Work,Office
Sam,Jones,LA,Drive,
Sam,Jones,LA,Work,
Sam,Jones,LA,Work,
Sam,Jones,LA,,
Sam,Jones,LA,Work,Home
John,Doe,,,
](delimiter is ',');

FACTS_1:
NOCONCATENATE LOAD DISTINCT
	NAME&SURNAME as key,
	*
RESIDENT FACTS_0
WHERE
	(not(isnull(ACTION)) and trim(ACTION)<>'')
	or
	(not(isnull(LASTPOSITION)) and trim(LASTPOSITION)<>'')
;

CONCATENATE (FACTS_1)
LOAD DISTINCT *
RESIDENT FACTS_0
WHERE 
	not exists(key,NAME&SURNAME)
	and (
		(isnull(ACTION) or trim(ACTION)='')
		and
		(isnull(LASTPOSITION) or trim(LASTPOSITION)='')
	)
;

DROP TABLE FACTS_0;

SerSwagster · ‎2022-02-09

Hi everyone. My table is not to be inserted manually, the values I mentioned were only an example, so I'm searching for something like this:

Load Distinct Name, Surname... and so on.

In addition, I would like a solution which optimize the usage of data, because I'm dealing with heavy data. So, if the condition of selection could be set before loading all the data, it would be better.

SerSwagster · ‎2022-02-09

Yes, I need only one occurrence for the combination NAME - SURNAME. There are no records with NAME and SURNAME both null, while there are lots of them with this combination > 1. In this case (NAME-SURNAME>1), I need only the record with ACTION or LASTPOSITION not null.

AG-gugelbisolutions · ‎2022-02-09

Another possible solution is to aggregate grouping by NAME and SURNAME, so that you end up with a table that has exactly one row per distinct value of NAME&SURNAME.

LOAD
	NAME,
	SURNAME,
	maxstring(CITY) as CITY,
	maxstring(ACTION) as ACTION,
	maxstring(LASTPOSITION) as LASTPOSITION
FROM lib://<your-file-connection-name>/file.csv (ansi, txt, delimiter is ',', embedded labels);
GROUP BY
	NAME, SURNAME;

Please, note that this might create records that do not exists in the original file. If this is not what you need, the only way is to work with DISTINCT and filters, but this means you have to think of a multi-step data cleansing strategy, rather than relying on a single LOAD/SELECT.

SerSwagster · ‎2022-02-10

Sorry, I need help again. I notice that in my dataset, there is also a case like this (new data are bolded):

NAME SURNAME CITY ACTION LASTPOSITION

Mark Pitt NY Train Gym

Luke Hughes LA Play Park

Luke Hughes LA Park

Andy Land NO Work Office

Sam Jones LA Drive

Sam Jones LA Work

Sam Jones LA

Sam Jones LA Work Home

Phil Terry SA

.

In this case, both of the two last records have their field ACTION and LASTPOSITION empty. I wanna keep only one of them, removing duplicates. Unfortunately, with the Where condition above discussed (action and lastposition not null), both of these records will be cut off because ACTION and LASTPOSITION are null. Is there a way to keep this condition, but avoid it when the duplicates records have both the fields null? This is the final table I would achieve:

NAME SURNAME CITY ACTION LASTPOSITION

Mark Pitt NY Train Gym

Luke Hughes LA Play Park

Andy Land NO Work Office

Sam Jones LA Work Home

Phil Terry SA

AG-gugelbisolutions · ‎2022-02-10

So, basically, you are saying that for each NAME/SURNAME you only need one record, regardless of the values in ACTION and LASTPOSITION. In fact, you consider good rows those having ACTION or LASTPOSITION (thus also both ACTION and LASTPOSITION), and those without ACTION and LASTPOSITION.

This can be achieved this way

FACTS_0:
LOAD *,
	rowno() as [Row Number];
LOAD * INLINE [
NAME,	SURNAME,	CITY,	ACTION,	LASTPOSITION
Mark,	Pitt,       NY,	    Train,  Gym 
Luke,	Hughes,		LA,     Play,   Park
Luke,	Hughes,     LA,		,		Park
Andy,	Land,       NO,     Work,   Office
Sam,	Jones,      LA,     Drive,          
Sam,	Jones,      LA,		Work,
Sam,	Jones,      LA,     Work,
Sam,	Jones,      LA,,
Sam,	Jones,      LA,		Work,	Home
Phil,	Terry,      SA,,
Phil,	Terry,      SA,,
];

FACTS_1:
INNER KEEP (FACTS_0)
LOAD
	NAME,
   	SURNAME,
    max([Row Number]) as [Row Number]
RESIDENT FACTS_0
GROUP BY NAME, SURNAME;

DROP TABLE FACTS_1;

Please, note that:

you can substitute the first load statement with a one that loads data from your dataset directly
- you only need to add Row Number (preceding load is an efficient way to do that)
- if you want/need, you can further filter your dataset within the preceding load statement
I chose to keep the records with the highest Row Number for each NAME/SURNAME
- INNER KEEP is efficient but it creates a new table that has to be dropped at the end

SerSwagster · ‎2022-02-10

Hi, for my purpose, it was almost right your previous code, adding only one thing:

FACTS_0:
LOAD * INLINE [
NAME,SURNAME,CITY,ACTION,LASTPOSITION
Mark,Pitt,NY,Train,Gym
Luke,Hughes,LA,Play,Park
Luke,Hughes,LA,,Park
Andy,Land,NO,Work,Office
Sam,Jones,LA,Drive,
Sam,Jones,LA,Work,
Sam,Jones,LA,Work,
Sam,Jones,LA,,
Sam,Jones,LA,Work,Home
John,Doe,,,
](delimiter is ',');

FACTS_1:
NOCONCATENATE LOAD DISTINCT
	NAME&SURNAME as key,
	*
RESIDENT FACTS_0
WHERE
	(not(isnull(ACTION)) and trim(ACTION)<>'')
	or
	(not(isnull(LASTPOSITION)) and trim(LASTPOSITION)<>'')
;

CONCATENATE (FACTS_1)
LOAD DISTINCT
NAME&SURNAME as key,
	*
RESIDENT FACTS_0
WHERE 
	not exists(key,NAME&SURNAME)
	and (
		(isnull(ACTION) or trim(ACTION)='')
		and
		(isnull(LASTPOSITION) or trim(LASTPOSITION)='')
	)
;

DROP TABLE FACTS_0;

I did not want to give no key to unique records with ACTION and LASTPOSITION null, so putting the key also in that part, this issue has been solved. In this way, also, the duplicate record having a null value in ACTION or LASTPOSITION are removed. Thanks a lot!