Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hello everyone,
I want to calculate the correlation of two variables but I know that there exists a timeshift/delay between them.
An example data is the following:
DAY | VAR1 | VAR2 |
01/01/2002 | 1 | 2 |
02/01/2002 | 2 | 3 |
03/01/2002 | 3 | 4 |
04/01/2002 | 4 | 5 |
05/01/2002 | 5 | 4 |
06/01/2002 | 4 | 3 |
07/01/2002 | 3 | 2 |
08/01/2002 | 2 | 1 |
09/01/2002 | 1 | 2 |
10/01/2002 | 2 | 3 |
11/01/2002 | 3 | 4 |
12/01/2002 | 4 | 5 |
13/01/2002 | 5 | 4 |
14/01/2002 | 4 | 3 |
15/01/2002 | 3 | 2 |
16/01/2002 | 2 | 1 |
17/01/2002 | 1 | 2 |
18/01/2002 | 2 | 3 |
19/01/2002 | 3 | 4 |
20/01/2002 | 4 | 5 |
21/01/2002 | 5 | 4 |
22/01/2002 | 4 | 3 |
23/01/2002 | 3 | 2 |
24/01/2002 | 2 | 1 |
25/01/2002 | 1 | 2 |
If I use the traditional correlation, the outcome is 0.67 but i know that if i shift the first var by one day, the correlation will be 1.
Is there a way to calculate the "real" correlation (meaning 1) without knowing the time shift?
Thanks in advance,
Panagiotis
Oh, I see. This sounds like an A/B Testing or maybe a Multi-Armed Bandits if it is an ongoing campaign. Check Data Voyagers, I wrote a couple articles about it, using Qlik.
That sounds like a topic for you @igoralcantara
Regards,
Mark Costa
Read more at Data Voyagers - datavoyagers.net
Follow me on my LinkedIn | Know IPC Global at ipc-global.com
In that case I would use a technique called Cross-Correlation. You can also find the Optimal Shift, the shift that produces the highest correlation.
This is easily done in Python. In Qlik Script is not so simple, but in the UI, there are some ways. From the top of my head, remembering when I did this years ago, you can try something like:
Use Above(VAR1, 1) to shift VAR1 by one day backward.
Calculate the correlation between this shifted VAR1 and VAR2.
Repeat for different shifts until you find the maximum correlation.
This explains what a cross-correlation is:
What Is Cross-Correlation? Definition, How It's Used, and Example
Thanks a lot!
The Above() function works great in both the visualisation for visual overview and the correlation evaluation.
If the time shift is known it is the optimal solution.
As I understand If the time shift is unknown, there is not automatic solution inside Qlik.
We have to either test possible values and see the best result or do calculations before loading to Qlik, either by testing the values or calculating the Cross-Correlation, and feed the ready result to Qlik.
Am I right or is there another workaround?
What bothers me is that the two timeseries may be irrelevant in general but follow the above pattern only for a month. If we select that month as a timespan, the pre-calculated time-shift or Cross-Correlation will not be able to capture the temporal behaviour.
Maybe what you need is a Time Series instead.
What exactly are you trying to achieve? What business question are you answering?
I am looking into data including social media posts information and show viewership.
I want to check if a certain "kind" of posts drives the changes to viewership.
In a simplified example, whenever I post with positive reviews, the show viewership goes up and when I post with negative reviews, the viewership goes down.
But the effect of my reviews is not instant but comes with an unknown delay.
I want to check if there is a connection between various kinds of posts and show viewership, product sales etc.
The effect may be temporal (i.e. a certain campaign) and that is why I would prefer the analysis to be dynamic and not static (in python before loading to Qlik).
Oh, I see. This sounds like an A/B Testing or maybe a Multi-Armed Bandits if it is an ongoing campaign. Check Data Voyagers, I wrote a couple articles about it, using Qlik.
Hi im no champ in stats, but i would say if you see you have a linear relation VAR2 = k × VAR1 + b kind of y=mx+c we use it for linear geometry, if you could figure out constants k and b then you could do something like below:
Data:
LOAD DAY,
VAR1,
VAR2
FROM
[https://community.qlik.com/t5/App-Development/Correlation-with-timeshift/td-p/2488146]
(html, codepage is 1252, embedded labels, table is @1);
// Define constants for k and b
LET k = 2; // Example multiplier
LET b = 1; // Example bias
// Creating Var3 based on the formula
Qualify *;
d:
LOAD
DAY,
VAR1,
VAR2,
$(k) * VAR1 + $(b) AS VAR3 // This calculates Var3 as k * VAR1 + b
RESIDENT Data;
when you dont know the constansts(linear relation) You could reverse engineer to find of constants:
im sure it would be much more complex in real scenarios, but this gives u a basic understanding.
If you want to explore cross-correlation at various lags,
you might need a more complex setup (like creating lagged versions of your variables).
for simplicity:
Lagged Variables (if necessary):
If you were to create lagged versions of your variables (e.g., VAR1 shifted by one time period), you could do something like this in the script:
DataWithLag:
LOAD
DAY,
VAR1,
VAR3,
Previous(d.VAR1) AS d.Lagged_VAR1
RESIDENT d;
Very nice article and explanation of Multi-Armed Bandits.
I see great value in it after deciding the evaluation metric of each post or type of post (analogue to ad in your example).
Great and very analytic answer.
Solving the system of equations to find the time and scale shift is plausible outside Qlik and if that is the case, processing the data with a python library to calculate the actual Cross-Correlation coefficient like in signal theory would cover the real scenario.
Your idea of creating lagged Variables a priori is a way to go if the python processing is out of the question. Like a simplified Cross-Correlation, creating 10 (or more) different lagged versions of VAR1:
VAR1_1 = Above(VAR1, 1)
VAR1_2 = Above(VAR1, 2)
etc
and selecting as the Cross-Correlation approximation the higher correlation of VAR1_i to VAR2