Skip to main content
Announcements
Global Transformation Awards! Applications are now open. Submit Entry
cancel
Showing results for 
Search instead for 
Did you mean: 
rob_vander
Creator
Creator

Randomly populate category values in data based on percenatge

Hi All,

I have dataset like below

ID Date Value
1 01/01/2024 449
2 02/01/2024 76
3 03/01/2024 211
4 04/01/2024 402
5 05/01/2024 197
6 06/01/2024 485
7 07/01/2024 168
8 08/01/2024 55
9 09/01/2024 41
10 10/01/2024 154
11 11/01/2024 391

 

Within this dataset I want to create a Category column in which random values should be populated. There will be fixed three values High,Medium & Low.  High values in data should 50%, Medium values should be 30% and Low values should be 20%.

In the above dataset, I have 11 records, so High values should be 50% of 11 which is 6 (rounding), Medium values should be 30% of 11 which is 3 and Low values should be 20% of 11 which is 2

So output should be below. Note that values are not required to be populated as shown as it could be random(anywhere). Only important point is that High values should be 6, Medium=3 & Low=2 based on records and required % of values

ID Date Value Category
1 01/01/2024 449 High
2 02/01/2024 76 High
3 03/01/2024 211 Low
4 04/01/2024 402 High
5 05/01/2024 197 Medium
6 06/01/2024 485 Low
7 07/01/2024 168 High
8 08/01/2024 55 High
9 09/01/2024 41 High
10 10/01/2024 154 Medium
11 11/01/2024 391 Medium

 

Note: Although Number of Category values remain same, I want them on different position randomly on every reload.

Labels (3)
2 Solutions

Accepted Solutions
maxgro
MVP
MVP

T:
Load * inline [
ID, Date, Value,
1, 01/01/2024, 449
2, 02/01/2024, 76
3, 03/01/2024, 211
4, 04/01/2024, 402
5, 05/01/2024, 197
6, 06/01/2024, 485
7, 07/01/2024, 168
8, 08/01/2024, 55
9, 09/01/2024 ,41
10, 10/01/2024 ,154
11, 11/01/2024 ,391
];
 
// first solution
LEFT JOIN (T) LOAD 
ID, 
if(rand()<= .5, 'High', if(rand()<= (0.5 + 0.3), 'Medium', 'Low')) as Category 
Resident T;
 
 
 
// second solution, more accurate
 
LET vNum = NoOfRows('T');
LET vNumHigh = round(vNum * 0.5);
LET vNumMedium = round(vNum * 0.3);
LET vNumLow = vNum - vNumHigh - vNumMedium;
 
LEFT JOIN (T) LOAD
ID,
rand() as Sort
Resident T;
 
U:
NOCONCATENATE LOAD
*,
rowno(),
if(rowno()<= $(vNumHigh), 'High', if(rowno()<= ($(vNumHigh) + $(vNumMedium)), 'Medium', 'Low')) as Category2  
Resident T
order by Sort;
 
DROP TABLE T;
 
maxgro_0-1732797307206.png

 

View solution in original post

Kushal_Chawda

@rob_vander  In that case you can try below

Data:
LOAD *,
     Rand() as RandomKey
INLINE [
ID	Date	Value
1	01/01/2024	449
2	02/01/2024	76
3	03/01/2024	211
4	04/01/2024	402
5	05/01/2024	197
6	06/01/2024	485
7	07/01/2024	168
8	08/01/2024	55
9	09/01/2024	41
10	10/01/2024	154
11	11/01/2024	100
12	09/01/2024	41
13	10/01/2024	154
14	11/01/2024	100
15	10/01/2024	154

](delimiter is '\t');

let vRecords =NoOfRows('Data');
Let vHighRecords = round(vRecords*0.5);
Let vMediumRecords = round(vRecords*0.3);
Let vLowRecords = round(vRecords*0.2);

Final:
NoConcatenate
Load *,
     if(RowNo()<=$(vHighRecords),'High',
     if(RowNo()<=$(vHighRecords)+$(vMediumRecords),'Medium',
     if(RowNo()<=$(vRecords),'Low'))) as Category
Resident Data
Order by RandomKey;

Drop Table Data;

View solution in original post

9 Replies
Kushal_Chawda

@rob_vander  try below

Data:
LOAD * INLINE [
ID	Date	Value
1	01/01/2024	449
2	02/01/2024	76
3	03/01/2024	211
4	04/01/2024	402
5	05/01/2024	197
6	06/01/2024	485
7	07/01/2024	168
8	08/01/2024	55
9	09/01/2024	41
10	10/01/2024	154
11	11/01/2024	100
12	09/01/2024	41
13	10/01/2024	154
14	11/01/2024	100
15	10/01/2024	154

](delimiter is '\t');

let vRecords =NoOfRows('Data');
Let vHighRecords = round(vRecords*0.5);
Let vMediumRecords = round(vRecords*0.3);
Let vLowRecords = round(vRecords*0.2);

Final:
NoConcatenate
Load *,
     if(RowNo()<=$(vHighRecords),'High',
     if(RowNo()<=$(vHighRecords)+$(vMediumRecords),'Medium',
     if(RowNo()<=$(vRecords),'Low'))) as Category
Resident Data;

Drop Table Data;

 

Screenshot 2024-11-28 at 12.20.49.png

rob_vander
Creator
Creator
Author

@Kushal_Chawda  Thanks for the response. But here category values will be at fixed position although number of values populated are correct. What I want is values should be at different position on each reload. Could you help me on that part? I have updated original post

maxgro
MVP
MVP

T:
Load * inline [
ID, Date, Value,
1, 01/01/2024, 449
2, 02/01/2024, 76
3, 03/01/2024, 211
4, 04/01/2024, 402
5, 05/01/2024, 197
6, 06/01/2024, 485
7, 07/01/2024, 168
8, 08/01/2024, 55
9, 09/01/2024 ,41
10, 10/01/2024 ,154
11, 11/01/2024 ,391
];
 
// first solution
LEFT JOIN (T) LOAD 
ID, 
if(rand()<= .5, 'High', if(rand()<= (0.5 + 0.3), 'Medium', 'Low')) as Category 
Resident T;
 
 
 
// second solution, more accurate
 
LET vNum = NoOfRows('T');
LET vNumHigh = round(vNum * 0.5);
LET vNumMedium = round(vNum * 0.3);
LET vNumLow = vNum - vNumHigh - vNumMedium;
 
LEFT JOIN (T) LOAD
ID,
rand() as Sort
Resident T;
 
U:
NOCONCATENATE LOAD
*,
rowno(),
if(rowno()<= $(vNumHigh), 'High', if(rowno()<= ($(vNumHigh) + $(vNumMedium)), 'Medium', 'Low')) as Category2  
Resident T
order by Sort;
 
DROP TABLE T;
 
maxgro_0-1732797307206.png

 

Kushal_Chawda

@rob_vander  In that case you can try below

Data:
LOAD *,
     Rand() as RandomKey
INLINE [
ID	Date	Value
1	01/01/2024	449
2	02/01/2024	76
3	03/01/2024	211
4	04/01/2024	402
5	05/01/2024	197
6	06/01/2024	485
7	07/01/2024	168
8	08/01/2024	55
9	09/01/2024	41
10	10/01/2024	154
11	11/01/2024	100
12	09/01/2024	41
13	10/01/2024	154
14	11/01/2024	100
15	10/01/2024	154

](delimiter is '\t');

let vRecords =NoOfRows('Data');
Let vHighRecords = round(vRecords*0.5);
Let vMediumRecords = round(vRecords*0.3);
Let vLowRecords = round(vRecords*0.2);

Final:
NoConcatenate
Load *,
     if(RowNo()<=$(vHighRecords),'High',
     if(RowNo()<=$(vHighRecords)+$(vMediumRecords),'Medium',
     if(RowNo()<=$(vRecords),'Low'))) as Category
Resident Data
Order by RandomKey;

Drop Table Data;
marcus_sommer

To get a more random assigning of the categories against the record-position you may apply an order by against n fields and/or a specialized created one in the origin load, for example per: rand() as SortKey.

Thinkable are also solutions without a record-counting and ordering, maybe with something like:

load *, pick(mod(rowno(), 10) + 1, 'High', 'Medium', 'Low', 'High', 'High', 'Low', ...) as X
from Y;

rob_vander
Creator
Creator
Author

@maxgro  thanks. second solution works

rob_vander
Creator
Creator
Author

@Kushal_Chawda This works

rob_vander
Creator
Creator
Author

@marcus_sommer  Thanks. But this is just example. In real scenario I could have million records. So it will be difficult with pick match statements.

marcus_sommer

It's no match() only a pick() which means the rowno() evaluation happens only once and this result picked the referenced value from the list. It should be not slower as the if-loop approaches.

Especially because you have millions of records and all of the approaches are quite simple it wouldn't cost much efforts to test the different suggestions in regard to the run-time within your environment.