Qlik Community

New to QlikView

Discussion board where members can get started with QlikView.

Not applicable

Multi-dimensional regression analysis

So I've spent the better part of Saturday reading about aggregations in QV, and am beginning to think it may not be possible to reference a current iterated field value in a nested set expression. Any comments are much appreciated.

Input table (simplified, Month 4 ends next week so Sales values are not yet available):

MonthCountryCitySalesTrend
1FranceRouen1010
1FranceNice55
2FranceRouen1111
1SpainBarcelona22
2SpainBarcelona44
1SpainPamplona2020
3SpainBarcelona66
2SpainPamplona2323
3SpainPamplona2626
2FranceNice1010
3FranceRouen1212
3FranceNice1515
4FranceRouen-13
4FranceNice-20
4SpainBarcelona-8
4SpainPamplona-29

Goal: Define an expression column (Trend) that computes the linear regression for each <Month, Country, City>.

The Linest_m() & Linest_b() functions appear to be constrained to a single dimension, and it's easy to compute the trend for any one dimension (Month for example) as:

Linest_m(TOTAL Aggr(Sales, Month, Country, City), Month) * Month +

Linest_b(TOTAL Aggr(Sales, Month, Country, City), Month)

Reading up on Set Analysis and Nested Aggregations, I feel a proper solution might look something like this:

Aggr(

     (Linest_m({<Country={???}, City={???}>} TOTAL Aggr(Sales, Month, Country, City), Month)*Month +

     Linest_b({<Country={???}, City={???}>} TOTAL Aggr(Sales, Month, Country, City), Month)),

Month, Country, City)

Where the ??? would be a reference to the current Country or City iteration value from the outer Aggr expansion.

What am I missing?

-Matt

1 Solution

Accepted Solutions
MVP
MVP

Re: Multi-dimensional regression analysis

Matt,

I do get the requested results in a straight table chart with dimensions Country, City, Month and an expression like

Linest_m(TOTAL<Country,City> Aggr(Sales, Country, City, Month), Month) * Month +

Linest_b(TOTAL<Country,City> Aggr(Sales, Country, City, Month), Month)

7 Replies

Re: Multi-dimensional regression analysis

Why not do it in the load script ?

Data :

LOAD Month,

     Country,

     City,

     Sales,

     Trend as [Sample Trend]

FROM

[https://community.qlik.com/thread/163359]

(html, codepage is 1252, embedded labels, table is @1);

Months:

Load

  recno() as Month

AutoGenerate 12

;

Outer Join ( Months )

Load distinct

     Country,

     City

resident Data

;

join ( Data )

load

  *

resident Months

;

drop table Months ;

Left Join (Data)

Load

City,

linest_m(Sales , Month) as M

resident Data

group by City;

Left Join (Data)

Load

City,

linest_b(Sales , Month) as B

resident Data

group by City;

right Join (Data)

load

  * ,

  ( M * Month ) + B as Trend

resident Data

;

Not applicable

Re: Multi-dimensional regression analysis

Thanks Bill, there are three reasons I've avoided this so far:

1. My load scripts are primarily oriented toward loading data via SQL, it would be nice to have a clean separation of concerns, with scripts doing data extraction and report objects doing analytics.

2. My load scripts tend to take up to 20 minutes to execute, and we add more queries on a regular basis. Coding and debugging this there would be more labor intensive and time consuming than leveraging report objects.

3. Am still new to QV, and have not had the time yet to study the scripting capability to the extent needed to do something like this.

Is scripting the only option here?

Re: Multi-dimensional regression analysis

Matt

I disagree with the ethos of "scripts doing data extraction and report objects doing analytics."

Dashboard object expressions are evaluated dynamically at run time and slow down the End User dashboard performance.

Script transformations are run in batch and do not slow down End User dashboard performance, indeed the opposite as by removing them from the dashboard GUI, the End User dashboard performance will improve.

I would venture to guess for your 20 minute load script run that the vast majority of the time is spent waiting for data to arrive from the database via SQL statements, as databases are normally on discs which have noticeable latency. QlikView scripts multi-thread well at run time, so the additional script processing will probably be done whilst the SQL extract thread is waiting on the disc latency - so it may well not increase your script run time noticeably.

Notwithstanding that it could probably done in a front end expression, but I would avoid doing that.

Not applicable

Re: Multi-dimensional regression analysis

Good points. It must be possible to separate concerns at the script level as well, it's really the time required to load the data and test changes that I am worried about here. If performance becomes an issue in the UI, we'll look to move more of the analytics to the scripting layer as you suggest. Still curious though whether this problem has a solution at the report object level.

MVP
MVP

Re: Multi-dimensional regression analysis

Matt,

I do get the requested results in a straight table chart with dimensions Country, City, Month and an expression like

Linest_m(TOTAL<Country,City> Aggr(Sales, Country, City, Month), Month) * Month +

Linest_b(TOTAL<Country,City> Aggr(Sales, Country, City, Month), Month)

Not applicable

Re: Multi-dimensional regression analysis

Interesting swuehl, this is the correct answer. One thing I don't understand: What lead me down this path initially was that it seemed necessary to include the first dimension in the <> qualifier as:

Linest_m(TOTAL<Month, Country, City> Aggr(Sales, Country, City, Month), Month) * Month +

Linest_b(TOTAL<Month, Country, City> Aggr(Sales, Country, City, Month), Month)

Which yields a null result. Simply excluding that first dimension from the qualifier as in your example solves the problem. As a new user this is unintuitive, why does including the full description of a record's identity in the qualifier fail?

Thanks!

MVP
MVP

Re: Multi-dimensional regression analysis

I think that would translate to something like 'calculate the linear regression (dependent = Sales, independent = Month) per Month, Country and City' and it's hard to do that with singular data points.