Qlik Community

QlikView Scripting

Discussion Board for collaboration on QlikView Scripting.

Announcements

Breathe easy -- you now have more time to plan your next steps with Qlik!
QlikView 11.2 Extended Support is now valid through December 31, 2020. Click here for more information.

Not applicable

Sample Sets of Data

I'm looking to create a flag that randomly selects 5% of my unique identifiers and flags them as a "Trial Group". See my code below

if((rand()<=0.05+now()*0) = 0, 'Standard Group', 'Trial Group') as [Trial Flag]

I'm uncertain if this is doing what I want it to do. At a quick look it does, but I'm a little confused on how this is randomly selecting 5% of unique identifiers without referencing the unique identifier in this line of code??

9 Replies
adamdavi3s
Honored Contributor

Re: Sample Sets of Data

hmmm ignore what I said, I think your method seems to be the accepted solution!

adamdavi3s
Honored Contributor

Re: Sample Sets of Data

based on my testing this doesn't produce an accurate 5% figure, on 20,000 rows I got figures between 964 and 1037 over 5 reloads.

How precious are you with it being exactly 5%?

Not applicable

Re: Sample Sets of Data

Ok thank you. So does my code randomly select 5% of unique identifiers or 5% of all data. In my mind, those two things are different. I would want to capture the former. Perhaps, this is a misunderstanding of what this function returns?

adamdavi3s
Honored Contributor

Re: Sample Sets of Data

This will flag *roughly* 5% of your data in the table.

Can you share any data at all? or at least your column names and expected results, this problem has nagged me for ages so I will find a solution!

adamdavi3s
Honored Contributor

Re: Sample Sets of Data

Here is my working, sorry im just heading out of the office but will pick up in the morning.


This is selecting an EXACT 5% sample, not just a rough one, i.e. in my data 1093 rows every time

Sorry its a work in progress, need to tidy it all up

Not applicable

Re: Sample Sets of Data

I cannot share any data. However, the column names is not an issue.

Sequence is my unique identifier (customer)

All other fields are related to this unique identifier(customer attributes)

I want to randomly select 5% of customers (5% of Sequences) and flag them as "Trial Group" and leave the others as "Standard Group"

adamdavi3s
Honored Contributor

Re: Sample Sets of Data

Hi Seth,

I am working on this, but decided to make a blog of it as its vexed me for ages, so I am doing some proper testing.

So far I am happy with the results though

Capture.PNG

adamdavi3s
Honored Contributor

Re: Sample Sets of Data

OK so the simple answer is:

1- *i assume* your current set up will be loading 5% of all of your transactions not customers

2- the current setup is not completely accurate so if you want your sample tolerance to be within +/- 0.05% then check out my blog (Accurately selecting a random percentage sample)

If you're happy with a tolerance of what appears to be +/- 1% then you can do something like the below.

If your current code is say:

fact:

Load

transactionid,

sequence,

somotherdata,

if((rand()<=0.05+now()*0) = 0, 'Standard Group', 'Trial Group') as [Trial Flag]

from blah;

Then this will be a 5% (ish) sample of all data.

To make this be a 5% sample of sequence then something like this needs to be done (and you could also left join if you wanted) and apologies if my syntax is out anywhere I have written this on the fly:

tmp:

load distinct

sequence

from blah;

noconcatenate

tmp2:

mapping load

sequence

if((rand()<=0.05+now()*0) = 0, 'Standard Group', 'Trial Group') as [Trial Flag]

resident tmp;


drop table tmp;


fact:

Load

transactionid,

sequence,

somotherdata,

applymap('tmp2',source,null()) as [Trial Flag]

from blah;

Re: Sample Sets of Data

If you want to randomly select 5% of unique Customer names, you can use this in your script:

CustomerTypes:

LOAD FieldValue('CustomerName', RowNo()) AS CustomerName,

     if((rand()<=0.05+now()*0) = 0, 'Standard Group', 'Trial Group') as [Trial Flag]

AUTOGENERATE FieldValueCount('CustomerName');

You can use whatever field in whatever facts table to extract only distinct values. Of course, the same would be true if you use a LOAD DISTINCT, but this code loads from the symbol table, which may be faster.

Peter