Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi All,
I have dataset like below
ID | Date | Value |
1 | 01/01/2024 | 449 |
2 | 02/01/2024 | 76 |
3 | 03/01/2024 | 211 |
4 | 04/01/2024 | 402 |
5 | 05/01/2024 | 197 |
6 | 06/01/2024 | 485 |
7 | 07/01/2024 | 168 |
8 | 08/01/2024 | 55 |
9 | 09/01/2024 | 41 |
10 | 10/01/2024 | 154 |
11 | 11/01/2024 | 391 |
Within this dataset I want to create a Category column in which random values should be populated. There will be fixed three values High,Medium & Low. High values in data should 50%, Medium values should be 30% and Low values should be 20%.
In the above dataset, I have 11 records, so High values should be 50% of 11 which is 6 (rounding), Medium values should be 30% of 11 which is 3 and Low values should be 20% of 11 which is 2
So output should be below. Note that values are not required to be populated as shown as it could be random(anywhere). Only important point is that High values should be 6, Medium=3 & Low=2 based on records and required % of values
ID | Date | Value | Category |
1 | 01/01/2024 | 449 | High |
2 | 02/01/2024 | 76 | High |
3 | 03/01/2024 | 211 | Low |
4 | 04/01/2024 | 402 | High |
5 | 05/01/2024 | 197 | Medium |
6 | 06/01/2024 | 485 | Low |
7 | 07/01/2024 | 168 | High |
8 | 08/01/2024 | 55 | High |
9 | 09/01/2024 | 41 | High |
10 | 10/01/2024 | 154 | Medium |
11 | 11/01/2024 | 391 | Medium |
Note: Although Number of Category values remain same, I want them on different position randomly on every reload.
@rob_vander In that case you can try below
Data:
LOAD *,
Rand() as RandomKey
INLINE [
ID Date Value
1 01/01/2024 449
2 02/01/2024 76
3 03/01/2024 211
4 04/01/2024 402
5 05/01/2024 197
6 06/01/2024 485
7 07/01/2024 168
8 08/01/2024 55
9 09/01/2024 41
10 10/01/2024 154
11 11/01/2024 100
12 09/01/2024 41
13 10/01/2024 154
14 11/01/2024 100
15 10/01/2024 154
](delimiter is '\t');
let vRecords =NoOfRows('Data');
Let vHighRecords = round(vRecords*0.5);
Let vMediumRecords = round(vRecords*0.3);
Let vLowRecords = round(vRecords*0.2);
Final:
NoConcatenate
Load *,
if(RowNo()<=$(vHighRecords),'High',
if(RowNo()<=$(vHighRecords)+$(vMediumRecords),'Medium',
if(RowNo()<=$(vRecords),'Low'))) as Category
Resident Data
Order by RandomKey;
Drop Table Data;
@rob_vander try below
Data:
LOAD * INLINE [
ID Date Value
1 01/01/2024 449
2 02/01/2024 76
3 03/01/2024 211
4 04/01/2024 402
5 05/01/2024 197
6 06/01/2024 485
7 07/01/2024 168
8 08/01/2024 55
9 09/01/2024 41
10 10/01/2024 154
11 11/01/2024 100
12 09/01/2024 41
13 10/01/2024 154
14 11/01/2024 100
15 10/01/2024 154
](delimiter is '\t');
let vRecords =NoOfRows('Data');
Let vHighRecords = round(vRecords*0.5);
Let vMediumRecords = round(vRecords*0.3);
Let vLowRecords = round(vRecords*0.2);
Final:
NoConcatenate
Load *,
if(RowNo()<=$(vHighRecords),'High',
if(RowNo()<=$(vHighRecords)+$(vMediumRecords),'Medium',
if(RowNo()<=$(vRecords),'Low'))) as Category
Resident Data;
Drop Table Data;
@Kushal_Chawda Thanks for the response. But here category values will be at fixed position although number of values populated are correct. What I want is values should be at different position on each reload. Could you help me on that part? I have updated original post
@rob_vander In that case you can try below
Data:
LOAD *,
Rand() as RandomKey
INLINE [
ID Date Value
1 01/01/2024 449
2 02/01/2024 76
3 03/01/2024 211
4 04/01/2024 402
5 05/01/2024 197
6 06/01/2024 485
7 07/01/2024 168
8 08/01/2024 55
9 09/01/2024 41
10 10/01/2024 154
11 11/01/2024 100
12 09/01/2024 41
13 10/01/2024 154
14 11/01/2024 100
15 10/01/2024 154
](delimiter is '\t');
let vRecords =NoOfRows('Data');
Let vHighRecords = round(vRecords*0.5);
Let vMediumRecords = round(vRecords*0.3);
Let vLowRecords = round(vRecords*0.2);
Final:
NoConcatenate
Load *,
if(RowNo()<=$(vHighRecords),'High',
if(RowNo()<=$(vHighRecords)+$(vMediumRecords),'Medium',
if(RowNo()<=$(vRecords),'Low'))) as Category
Resident Data
Order by RandomKey;
Drop Table Data;
To get a more random assigning of the categories against the record-position you may apply an order by against n fields and/or a specialized created one in the origin load, for example per: rand() as SortKey.
Thinkable are also solutions without a record-counting and ordering, maybe with something like:
load *, pick(mod(rowno(), 10) + 1, 'High', 'Medium', 'Low', 'High', 'High', 'Low', ...) as X
from Y;
@maxgro thanks. second solution works
@Kushal_Chawda This works
@marcus_sommer Thanks. But this is just example. In real scenario I could have million records. So it will be difficult with pick match statements.
It's no match() only a pick() which means the rowno() evaluation happens only once and this result picked the referenced value from the list. It should be not slower as the if-loop approaches.
Especially because you have millions of records and all of the approaches are quite simple it wouldn't cost much efforts to test the different suggestions in regard to the run-time within your environment.