topic Re: Sample Sets of Data in QlikView

Sample Sets of Data

Thu, 16 Feb 2017 15:03:00 GMT

I'm looking to create a flag that randomly selects 5% of my unique identifiers and flags them as a "Trial Group". See my code below

if((rand()<=0.05+now()*0) = 0, 'Standard Group', 'Trial Group') as [Trial Flag]

I'm uncertain if this is doing what I want it to do. At a quick look it does, but I'm a little confused on how this is randomly selecting 5% of unique identifiers without referencing the unique identifier in this line of code??

Re: Sample Sets of Data

adamdavi3s — Thu, 16 Feb 2017 15:06:27 GMT

hmmm ignore what I said, I think your method seems to be the accepted solution!

Re: Sample Sets of Data

adamdavi3s — Thu, 16 Feb 2017 15:24:11 GMT

based on my testing this doesn't produce an accurate 5% figure, on 20,000 rows I got figures between 964 and 1037 over 5 reloads.

How precious are you with it being exactly 5%?

Re: Sample Sets of Data

Thu, 16 Feb 2017 15:29:16 GMT

Ok thank you. So does my code randomly select 5% of unique identifiers or 5% of all data. In my mind, those two things are different. I would want to capture the former. Perhaps, this is a misunderstanding of what this function returns?

Re: Sample Sets of Data

adamdavi3s — Thu, 16 Feb 2017 15:32:29 GMT

This will flag *roughly* 5% of your data in the table.

Can you share any data at all? or at least your column names and expected results, this problem has nagged me for ages so I will find a solution!

Re: Sample Sets of Data

adamdavi3s — Thu, 16 Feb 2017 15:38:47 GMT

Here is my working, sorry im just heading out of the office but will pick up in the morning.

This is selecting an EXACT 5% sample, not just a rough one, i.e. in my data 1093 rows every time

Sorry its a work in progress, need to tidy it all up

Re: Sample Sets of Data

Thu, 16 Feb 2017 15:40:03 GMT

I cannot share any data. However, the column names is not an issue.

Sequence is my unique identifier (customer)

All other fields are related to this unique identifier(customer attributes)

I want to randomly select 5% of customers (5% of Sequences) and flag them as "Trial Group" and leave the others as "Standard Group"

Re: Sample Sets of Data

adamdavi3s — Fri, 17 Feb 2017 09:05:23 GMT

Hi Seth,

I am working on this, but decided to make a blog of it as its vexed me for ages, so I am doing some proper testing.

So far I am happy with the results though

Re: Sample Sets of Data

adamdavi3s — Fri, 17 Feb 2017 10:40:12 GMT

OK so the simple answer is:

1- *i assume* your current set up will be loading 5% of all of your transactions not customers

2- the current setup is not completely accurate so if you want your sample tolerance to be within +/- 0.05% then check out my blog (Accurately selecting a random percentage sample)

If you're happy with a tolerance of what appears to be +/- 1% then you can do something like the below.

If your current code is say:

fact:

Load

transactionid,

sequence,

somotherdata,

if((rand()<=0.05+now()*0) = 0, 'Standard Group', 'Trial Group') as [Trial Flag]

from blah;

Then this will be a 5% (ish) sample of all data.

To make this be a 5% sample of sequence then something like this needs to be done (and you could also left join if you wanted) and apologies if my syntax is out anywhere I have written this on the fly:

tmp:

load distinct

sequence

from blah;

noconcatenate

tmp2:

mapping load

sequence

if((rand()<=0.05+now()*0) = 0, 'Standard Group', 'Trial Group') as [Trial Flag]

resident tmp;

drop table tmp;

fact:

Load

transactionid,

sequence,

somotherdata,

applymap('tmp2',source,null()) as [Trial Flag]

from blah;

Re: Sample Sets of Data

Peter_Cammaert — Fri, 17 Feb 2017 14:37:36 GMT

If you want to randomly select 5% of unique Customer names, you can use this in your script:

CustomerTypes:
LOAD FieldValue('CustomerName', RowNo()) AS CustomerName,
     if((rand()<=0.05+now()*0) = 0, 'Standard Group', 'Trial Group') as [Trial Flag] 
AUTOGENERATE FieldValueCount('CustomerName');

You can use whatever field in whatever facts table to extract only distinct values. Of course, the same would be true if you use a LOAD DISTINCT, but this code loads from the symbol table, which may be faster.

Peter