9 Replies Latest reply: Feb 17, 2017 9:37 AM by Peter Cammaert RSS

    Sample Sets of Data

    Seth Rosenbauer

      I'm looking to create a flag that randomly selects 5% of my unique identifiers and flags them as a "Trial Group". See my code below

       

      if((rand()<=0.05+now()*0) = 0, 'Standard Group', 'Trial Group') as [Trial Flag]

       

       

      I'm uncertain if this is doing what I want it to do. At a quick look it does, but I'm a little confused on how this is randomly selecting 5% of unique identifiers without referencing the unique identifier in this line of code??

        • Re: Sample Sets of Data
          Adam Davies

          hmmm ignore what I said, I think your method seems to be the accepted solution!

          • Re: Sample Sets of Data
            Adam Davies

            Here is my working, sorry im just heading out of the office but will pick up in the morning.


            This is selecting an EXACT 5% sample, not just a rough one, i.e. in my data 1093 rows every time

             

            Sorry its a work in progress, need to tidy it all up

            • Re: Sample Sets of Data
              Adam Davies

              Hi Seth,

               

              I am working on this, but decided to make a blog of it as its vexed me for ages, so I am doing some proper testing.

              So far I am happy with the results though

               

              Capture.PNG

              • Re: Sample Sets of Data
                Adam Davies

                OK so the simple answer is:

                 

                1- *i assume* your current set up will be loading 5% of all of your transactions not customers

                2- the current setup is not completely accurate so if you want your sample tolerance to be within +/- 0.05% then check out my blog (Accurately selecting a random percentage sample)

                 

                If you're happy with a tolerance of what appears to be +/- 1% then you can do something like the below.

                 

                If your current code is say:

                 

                fact:

                Load

                transactionid,

                sequence,

                somotherdata,

                if((rand()<=0.05+now()*0) = 0, 'Standard Group', 'Trial Group') as [Trial Flag]

                from blah;

                 

                Then this will be a 5% (ish) sample of all data.

                 

                To make this be a 5% sample of sequence then something like this needs to be done (and you could also left join if you wanted) and apologies if my syntax is out anywhere I have written this on the fly:

                 

                 

                tmp:

                load distinct

                sequence

                from blah;

                 

                noconcatenate

                 

                tmp2:

                mapping load

                sequence

                if((rand()<=0.05+now()*0) = 0, 'Standard Group', 'Trial Group') as [Trial Flag]

                resident tmp;


                drop table tmp;


                fact:

                Load

                transactionid,

                sequence,

                somotherdata,

                applymap('tmp2',source,null()) as [Trial Flag]

                from blah;

                • Re: Sample Sets of Data
                  Peter Cammaert

                  If you want to randomly select 5% of unique Customer names, you can use this in your script:

                   

                  CustomerTypes:
                  LOAD FieldValue('CustomerName', RowNo()) AS CustomerName,
                       if((rand()<=0.05+now()*0) = 0, 'Standard Group', 'Trial Group') as [Trial Flag] 
                  AUTOGENERATE FieldValueCount('CustomerName');
                  

                   

                  You can use whatever field in whatever facts table to extract only distinct values. Of course, the same would be true if you use a LOAD DISTINCT, but this code loads from the symbol table, which may be faster.

                   

                  Peter