Re: Sample Sets of Data - Qlik Community

Report Inappropriate Content · ‎2017-02-16

I'm looking to create a flag that randomly selects 5% of my unique identifiers and flags them as a "Trial Group". See my code below

if((rand()<=0.05+now()*0) = 0, 'Standard Group', 'Trial Group') as [Trial Flag]

I'm uncertain if this is doing what I want it to do. At a quick look it does, but I'm a little confused on how this is randomly selecting 5% of unique identifiers without referencing the unique identifier in this line of code??

adamdavi3s · ‎2017-02-16

hmmm ignore what I said, I think your method seems to be the accepted solution!

adamdavi3s · ‎2017-02-16

based on my testing this doesn't produce an accurate 5% figure, on 20,000 rows I got figures between 964 and 1037 over 5 reloads.

How precious are you with it being exactly 5%?

Report Inappropriate Content · ‎2017-02-16

Ok thank you. So does my code randomly select 5% of unique identifiers or 5% of all data. In my mind, those two things are different. I would want to capture the former. Perhaps, this is a misunderstanding of what this function returns?

adamdavi3s · ‎2017-02-16

This will flag *roughly* 5% of your data in the table.

Can you share any data at all? or at least your column names and expected results, this problem has nagged me for ages so I will find a solution!

adamdavi3s · ‎2017-02-16

Here is my working, sorry im just heading out of the office but will pick up in the morning.

This is selecting an EXACT 5% sample, not just a rough one, i.e. in my data 1093 rows every time

Sorry its a work in progress, need to tidy it all up

Report Inappropriate Content · ‎2017-02-16

I cannot share any data. However, the column names is not an issue.

Sequence is my unique identifier (customer)

All other fields are related to this unique identifier(customer attributes)

I want to randomly select 5% of customers (5% of Sequences) and flag them as "Trial Group" and leave the others as "Standard Group"

adamdavi3s · ‎2017-02-17

Hi Seth,

I am working on this, but decided to make a blog of it as its vexed me for ages, so I am doing some proper testing.

So far I am happy with the results though

adamdavi3s · ‎2017-02-17

OK so the simple answer is:

1- *i assume* your current set up will be loading 5% of all of your transactions not customers

2- the current setup is not completely accurate so if you want your sample tolerance to be within +/- 0.05% then check out my blog (Accurately selecting a random percentage sample)

If you're happy with a tolerance of what appears to be +/- 1% then you can do something like the below.

If your current code is say:

fact:

Load

transactionid,

sequence,

somotherdata,

if((rand()<=0.05+now()*0) = 0, 'Standard Group', 'Trial Group') as [Trial Flag]

from blah;

Then this will be a 5% (ish) sample of all data.

To make this be a 5% sample of sequence then something like this needs to be done (and you could also left join if you wanted) and apologies if my syntax is out anywhere I have written this on the fly:

tmp:

load distinct

sequence

from blah;

noconcatenate

tmp2:

mapping load

sequence

if((rand()<=0.05+now()*0) = 0, 'Standard Group', 'Trial Group') as [Trial Flag]

resident tmp;

drop table tmp;

fact:

Load

transactionid,

sequence,

somotherdata,

applymap('tmp2',source,null()) as [Trial Flag]

from blah;

Peter_Cammaert · ‎2017-02-17

If you want to randomly select 5% of unique Customer names, you can use this in your script:

CustomerTypes:
LOAD FieldValue('CustomerName', RowNo()) AS CustomerName,
     if((rand()<=0.05+now()*0) = 0, 'Standard Group', 'Trial Group') as [Trial Flag] 
AUTOGENERATE FieldValueCount('CustomerName');

You can use whatever field in whatever facts table to extract only distinct values. Of course, the same would be true if you use a LOAD DISTINCT, but this code loads from the symbol table, which may be faster.

Peter