Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
Not applicable

Best way to create a generated column in a load statement based on ranges

I wish to calculate a extra column based on the value of another columen based on ranges.

Example:

TableA

Name  Price

Bob   500

Alice 234

Carl  134

Jane   40

...

...

Labels

Range                   Label

0   <= Price < 100      Low

100 <= Price < 300      Medium

300 <= Price < 1000     Quite High       

...

77000 <= Price < 100000 The Rest

I want to ad an expression in my load script so that I based on the value of Price get a new column called label, so that the resulting table is like this:


TableA

Name  Price label

Bob   500  Quite High

Alice 234  Medium

Carl  134  Medium

Jane   40  Low

...

...

So basically I want do do

LOAD

     Name,

     Price,

     <something with price to map to a label> as label

FROM

.....

Any suggestions on how to do this in a clean and efficient way?

Thanks in advance.

Andreas

1 Solution

Accepted Solutions
Not applicable
Author

Hello Andreas,

A possible solution would be by using the IntervalMatch as below.

Hope this helps.

Ranges:

LOAD * Inline [
Label, MinLimit, MaxLimit
Low, 0, 99
Medium, 100, 299
Quite High, 300, 999
]
;

Data:
LOAD * Inline [
Name, Price
Bob, 500
Alice, 234
Carl, 134
Jane, 0
]
;

Left Join(Data) IntervalMatch (Price) LOAD MinLimit, MaxLimit Resident Ranges;

Left Join(Data)
LOAD MinLimit, MaxLimit, Label Resident Ranges;

DROP Table Ranges;


View solution in original post

8 Replies
Not applicable
Author

Hello Andreas,

A possible solution would be by using the IntervalMatch as below.

Hope this helps.

Ranges:

LOAD * Inline [
Label, MinLimit, MaxLimit
Low, 0, 99
Medium, 100, 299
Quite High, 300, 999
]
;

Data:
LOAD * Inline [
Name, Price
Bob, 500
Alice, 234
Carl, 134
Jane, 0
]
;

Left Join(Data) IntervalMatch (Price) LOAD MinLimit, MaxLimit Resident Ranges;

Left Join(Data)
LOAD MinLimit, MaxLimit, Label Resident Ranges;

DROP Table Ranges;


Not applicable
Author

You can use IntervalMatch to get what you are looking for.  See attached file. Hope it helps!

rwunderlich
Partner Ambassador/MVP
Partner Ambassador/MVP

Here's another approach using MAPPING:

Ranges:

MAPPING LOAD

          Start+IterNo()-1,

          Label

while Start+IterNo()-1 <= Stop

;

LOAD * INLINE [

    Start, Stop, Label

    0, 99, Low

    100, 299, Medium

    300, 1000, Quite High

];

Map Label using Ranges;

Data:

LOAD *, Price as Label INLINE [

    Name, Price

    Bob, 500

    Alice, 234

    Carl, 134

    Jane, 40

];

-Rob

http://robwunderlich.com

Not applicable
Author

Thank you for your reply jp_bakhache.

This approach did indeed work. But the Data table did get two extra columns: MinLimit and MaxLimit, it is possible to avoid adding them to the resulting Data table.

Also: I fail to understand how this works. Is it possible to explain what the two Left Joins in the end actually does?

Not applicable
Author

Thank you for your reply rebeccad.

Unfortunately I can't open the file right now. Is it the same approach as the one from jp_bakhache?

Not applicable
Author

Thak you Rob for your reply.

Unfortunately my real world scenario includes values up to a billion, which would create a Ranges table with a billion rows, which of course is unfeasable. Both from a time and memory perspective.

Not applicable
Author

The IntervalMatch used without a join, will result in an additional table in the schema.

It's used to link the two tables Ranges and Data on field Price between MinLimit and MaxLimit (the usual "join" actually joins tables based on equality of values) and so by using the first join we get the related min and max limits into table Data.

The 2nd join will add the label to table Data based on the two fields MinLimit and MaxLimit.

The result of this will be one final table instead of 2.

As to removing the 2 additional fields from the final table you can add this line in the end:

DROP Fields MinLimit, MaxLimit;

Hope I was able to explain clearly this purpose .

I'd suggest you check also the help manual for IntervalMatch and its different uses. And you can also try different scenarios so you can fully experience the outcomes (using IntervalMatch with/without join, etc...).

Not applicable
Author

No problem.  My suggestion was similar but I dropped the mins and maxes, so you only show the labels.

Data:

LOAD * INLINE [

    Name, Price

    Bob, 500

    Alice, 234

    Carl, 134

    Jane, 40

]
;



Ranges:
LOAD * INLINE [

    Start, Stop, Label

    0, 99, Low

    100, 299, Medium

    300, 1000, Quite High

]
;



IntervalMatch(Price)

LEFT JOIN (Ranges)

LOAD Start, Stop Resident Ranges;

DROP Fields Start, Stop;