Solved: Multi-dimensional regression analysis - Qlik Community

Report Inappropriate Content · ‎2015-05-10

So I've spent the better part of Saturday reading about aggregations in QV, and am beginning to think it may not be possible to reference a current iterated field value in a nested set expression. Any comments are much appreciated.

Input table (simplified, Month 4 ends next week so Sales values are not yet available):

Month	Country	City	Sales	Trend
1	France	Rouen	10	10
1	France	Nice	5	5
2	France	Rouen	11	11
1	Spain	Barcelona	2	2
2	Spain	Barcelona	4	4
1	Spain	Pamplona	20	20
3	Spain	Barcelona	6	6
2	Spain	Pamplona	23	23
3	Spain	Pamplona	26	26
2	France	Nice	10	10
3	France	Rouen	12	12
3	France	Nice	15	15
4	France	Rouen	-	13
4	France	Nice	-	20
4	Spain	Barcelona	-	8
4	Spain	Pamplona	-	29

Goal: Define an expression column (Trend) that computes the linear regression for each <Month, Country, City>.

The Linest_m() & Linest_b() functions appear to be constrained to a single dimension, and it's easy to compute the trend for any one dimension (Month for example) as:

Linest_m(TOTAL Aggr(Sales, Month, Country, City), Month) * Month +

Linest_b(TOTAL Aggr(Sales, Month, Country, City), Month)

Reading up on Set Analysis and Nested Aggregations, I feel a proper solution might look something like this:

Aggr(

(Linest_m({<Country={???}, City={???}>} TOTAL Aggr(Sales, Month, Country, City), Month)*Month +

Linest_b({<Country={???}, City={???}>} TOTAL Aggr(Sales, Month, Country, City), Month)),

Month, Country, City)

Where the ??? would be a reference to the current Country or City iteration value from the outer Aggr expansion.

What am I missing?

-Matt

swuehl · ‎2015-05-10

Matt,

I do get the requested results in a straight table chart with dimensions Country, City, Month and an expression like

Linest_m(TOTAL<Country,City> Aggr(Sales, Country, City, Month), Month) * Month +

Linest_b(TOTAL<Country,City> Aggr(Sales, Country, City, Month), Month)

View solution in original post

Anonymous · ‎2015-05-10

Why not do it in the load script ?

Data :

LOAD Month,

Country,

City,

Sales,

Trend as [Sample Trend]

FROM

[https://community.qlik.com/thread/163359]

(html, codepage is 1252, embedded labels, table is @1);

Months:

Load

recno() as Month

AutoGenerate 12

;

Outer Join ( Months )

Load distinct

Country,

City

resident Data

;

join ( Data )

load

*

resident Months

;

drop table Months ;

Left Join (Data)

Load

City,

linest_m(Sales , Month) as M

resident Data

group by City;

Left Join (Data)

Load

City,

linest_b(Sales , Month) as B

resident Data

group by City;

right Join (Data)

load

* ,

( M * Month ) + B as Trend

resident Data

;

Report Inappropriate Content · ‎2015-05-10

Thanks Bill, there are three reasons I've avoided this so far:

1. My load scripts are primarily oriented toward loading data via SQL, it would be nice to have a clean separation of concerns, with scripts doing data extraction and report objects doing analytics.

2. My load scripts tend to take up to 20 minutes to execute, and we add more queries on a regular basis. Coding and debugging this there would be more labor intensive and time consuming than leveraging report objects.

3. Am still new to QV, and have not had the time yet to study the scripting capability to the extent needed to do something like this.

Is scripting the only option here?

Anonymous · ‎2015-05-10

Matt

I disagree with the ethos of "scripts doing data extraction and report objects doing analytics."

Dashboard object expressions are evaluated dynamically at run time and slow down the End User dashboard performance.

Script transformations are run in batch and do not slow down End User dashboard performance, indeed the opposite as by removing them from the dashboard GUI, the End User dashboard performance will improve.

I would venture to guess for your 20 minute load script run that the vast majority of the time is spent waiting for data to arrive from the database via SQL statements, as databases are normally on discs which have noticeable latency. QlikView scripts multi-thread well at run time, so the additional script processing will probably be done whilst the SQL extract thread is waiting on the disc latency - so it may well not increase your script run time noticeably.

Notwithstanding that it could probably done in a front end expression, but I would avoid doing that.

Report Inappropriate Content · ‎2015-05-10

Good points. It must be possible to separate concerns at the script level as well, it's really the time required to load the data and test changes that I am worried about here. If performance becomes an issue in the UI, we'll look to move more of the analytics to the scripting layer as you suggest. Still curious though whether this problem has a solution at the report object level.

swuehl · ‎2015-05-10

Matt,

I do get the requested results in a straight table chart with dimensions Country, City, Month and an expression like

Linest_m(TOTAL<Country,City> Aggr(Sales, Country, City, Month), Month) * Month +

Linest_b(TOTAL<Country,City> Aggr(Sales, Country, City, Month), Month)

Report Inappropriate Content · ‎2015-05-10

Interesting swuehl, this is the correct answer. One thing I don't understand: What lead me down this path initially was that it seemed necessary to include the first dimension in the <> qualifier as:

Linest_m(TOTAL<Month, Country, City> Aggr(Sales, Country, City, Month), Month) * Month +

Linest_b(TOTAL<Month, Country, City> Aggr(Sales, Country, City, Month), Month)

Which yields a null result. Simply excluding that first dimension from the qualifier as in your example solves the problem. As a new user this is unintuitive, why does including the full description of a record's identity in the qualifier fail?

Thanks!

swuehl · ‎2015-05-10

I think that would translate to something like 'calculate the linear regression (dependent = Sales, independent = Month) per Month, Country and City' and it's hard to do that with singular data points.