Extracting unique values for aggregation - patient... - Page 2 - Qlik Community

betthisisnttake · ‎2017-07-24

patient_id	Episode Number	diagnosis	total_cost
12a	1x	blah1	1000
12a	2y	blah2	1000
13b	4z	blah1	3000
14c	5x	blah1	2500
14c	6z	blah2	2500
14c	9x	blah3	2500
15d	8y	blah1	1000

Hi there,

I hope I have posted this in the right place. If not let me know.

I have a large file and I have recreated the format in the example above.

Each patient_id is unique in the real world but each patient may have more than one episode number (also unique) and within each episode they may have more than one diagnosis (not unique). The presence of a few diagnosis/episodes for each patient means that the unique patient_ID and the total_costs for each patient appear more than once in the flat file.

I need to create a measure that extracts the total cost for each unique patient_id so that I can create an overall average total cost per patient as a KPI. I then need to use this measure in a chart with diagnosis as the dimension so i can then see the average total cost per patient associated with each diagnosis (or other fields I have in the real table).

So basically i need to sum the total cost for unique episode numbers and then sum these answers for the unique patient id's

When I just use avg(total_cost) - the results are way to high

Apologies in advance if there is something simple I am missing here but I am struggling with this. As a first step I've tried

sum(aggr(max([Episode Number]), total_cost)/count(aggr(max([Episode Number]),distinct [Episode Number]))) in an effort to start by pick out the unique episode ids using max and use aggr to create a temp table and then sum the total costs over this. I am then trying to divide by the number of unique Episode Numbers by counting them. This expression is not throwing an error but it is giving me a blank chart.

On top of this, I just cannot get to the next level and figure out how to work the unique patient_id's into this expression so the average is actually based on these.

Any help much appreciated

kindest regards,

Louise

effinty2112 · ‎2017-07-25

Hi Louise,

I thought the Episode Numbers were unique. Why the emphasis then on distinct episode numbers ?

If unique then there would be a single cost associated with each one, no?

cheers

Andrew

betthisisnttake · ‎2017-07-25

Hi kushal -thanks for your time on this,

In the real file there is potential for more than one diagnosis being associated with each episode for each patient (that is the only relevance of diagnosis in this question). Because there can be more than one diagnosis associated with an episode, it means that the same episode number might appear more than once for a patient_id. And because the same patient might have more than one episode the same patient_id might appear more than once in the file. (even though both these numbers are unique identifiers of patients and their episodes in the real world)

Bearing all that in mind I need the average cost per patient

The sample calculation would be as follows.....

total cost for patient 12a is calculated by adding the total costs for each of his episode numbers making sure to only pull out costs for unique episode numbers (1x + 2y) =1000+1000 = €2000, in my real file they may not be unique because of the diagnosis codes.
total cost for patient 13b is €3000 (he only has one episode number 4z)
total cost for patient 14c is calculated by adding the total costs for each of his episode numbers, making sure to only pull out total costs for unique episode numbers (5x, 6z, 9x) = 2500 +2500+2500 = €7500
total cost for patient 15d is €1000 (he only has one episode number 8y)

the average cost per patient is calculated by summing the total cost for each unique patient (12a=€2000, 13b=€3000,14c=€7500,15f=€1000) and dividing by the number of unique patients (4)

(2000+3000+7500+1000)/4 = 3375

As opposed to just averaging all the costs

1000+1000+3000+2500+2500+2500+1000/7 = 1928.57

I need to make sure that the overall average reflects the fact that patient 14c actually has a total cost of €7500 as opposed to being calculated on the assumption that there are 3 patients with lower costs because the expression pulls out the same patient number 3 times for the average. The uniqueness is important. This becomes important as I will also need to adjust the expression from calculating average to calculating median for other purposes.

Apologies for long message and if I am missing something obvious.

Louise

betthisisnttake · ‎2017-07-25

Hi Andrew,

I think your answer for average total cost per patient is right but I am puzzled as to how it is being calculated as there is no part of the expression that is summing the total costs for each unique episode_id. The only relevance of diagnosis is that I may choose it as a dimension in a chart to get the values for each - so don't worry about that if constructing the expression.

If helpful, I've attached a copy of my answer to Kushal here where he requested what I would like the result to be and how I would like it calculated. It is a bit long winded but I am trying to ensure all relevant information is present.

all time & thoughts much appreciated.

Louise

Hi kushal -thanks for your time on this,

In the real file there is potential for more than one diagnosis being associated with each episode for each patient (that is the only relevance of diagnosis in this question). Because there can be more than one diagnosis associated with an episode, it means that the same episode number might appear more than once for a patient_id. And because the same patient might have more than one episode the same patient_id might appear more than once in the file. (even though both these numbers are unique identifiers of patients and their episodes in the real world)

Bearing all that in mind about the underlying data table structure, I still need to extract the average cost per patient

The sample calculation would be as follows.....

total cost for patient 12a is calculated by adding the total costs for each of his episode numbers making sure to only pull out costs for unique episode numbers (1x + 2y) =1000+1000 = €2000, in my real file they may not be unique because of the diagnosis codes.
total cost for patient 13b is €3000 (he only has one episode number 4z)
total cost for patient 14c is calculated by adding the total costs for each of his episode numbers, making sure to only pull out total costs for unique episode numbers (5x, 6z, 9x) = 2500 +2500+2500 = €7500
total cost for patient 15d is €1000 (he only has one episode number 8y)

the average cost per patient is calculated by summing the total cost for each unique patient (12a=€2000, 13b=€3000,14c=€7500,15f=€1000) and dividing by the number of unique patients (4)

(2000+3000+7500+1000)/4 = 3375

As opposed to just averaging all the costs

1000+1000+3000+2500+2500+2500+1000/7 = 1928.57

I need to make sure that the overall average reflects the fact that patient 14c actually has a total cost of €7500 as opposed to being calculated on the assumption that there are 3 patients with lower costs because the expression pulls out the same patient number 3 times for the average. The uniqueness is important. This becomes important as I will also need to adjust the expression from calculating average to calculating median for other purposes.

Apologies for long message and if I am missing something obvious.

Louise

Kushal_Chawda · ‎2017-07-25

you can try something like below

Data:

LOAD * Inline [

patient_id Episode Number diagnosis total_cost

12a 1x blah1 1000

12a 2y blah2 1000

13b 4z blah1 3000

14c 5x blah1 2500

14c 6z blah2 2500

14c 9x blah3 2500

15d 8y blah1 1000 ] (delimiter is '\t');

Left Join(Data)

LOAD patient_id,

sum(total_cost) as TotalCostPatientWise

Resident Data

Group by patient_id;

Create straight table

Dimension:

diagnosis

Expression:

=sum(DISTINCT TotalCostPatientWise)/Count(DISTINCT total patient_id)

Kushal_Chawda · ‎2017-07-25

effinty2112 · ‎2017-07-25

Hi Louise,

The expression I used

avg(Aggr(sum(total_cost),patient_id,diagnosis))

is equivalent to

avg(Aggr(sum(total_cost),[Episode Number]))

because, if I understand correctly, each episode is a patient_id, diagnosis pair.

patient_id	diagnosis	Episode Number	sum(total_cost)	sum(total_cost)/count([Episode Number])
12a	blah1	1x	1000	1000
	blah2	2y	1000	1000
	Total		2000	1000
13b	blah1	4z	3000	3000
13b	Total		3000	3000
14c	blah1	5x	2500	2500
	blah2	6z	2500	2500
	blah3	9x	2500	2500
	Total		7500	2500
15d	blah1	8y	1000	1000
15d	Total		1000	1000
Total			13500	1928.5714285714

We can take out dimension Episode Number and get the same result

patient_id	diagnosis	sum(total_cost)	sum(total_cost)/count([Episode Number])
12a	blah1	1000	1000
	blah2	1000	1000
	Total	2000	1000
13b	blah1	3000	3000
13b	Total	3000	3000
14c	blah1	2500	2500
	blah2	2500	2500
	blah3	2500	2500
	Total	7500	2500
15d	blah1	1000	1000
15d	Total	1000	1000
Total		13500	1928.5714285714

Or have Episode Number only

Episode Number	sum(total_cost)	sum(total_cost)/count([Episode Number])
1x	1000	1000
2y	1000	1000
4z	3000	3000
5x	2500	2500
6z	2500	2500
8y	1000	1000
9x	2500	2500
Total	13500	1928.5714285714

Hope this helps.

Andrew

neelamsaroha157 · ‎2017-07-25

Script:

Tab1:

LOAD * INLINE [

patient_id, Episode Number, diagnosis, total_cost

12a, 1x, blah1, 1000

12a, 2y, blah2, 1000

13b, 4z, blah1, 3000

14c, 5x, blah1, 2500

14c, 6z, blah2, 2500

14c, 9x, blah3, 2500

15d, 8y, blah1, 1000

];

NoConcatenate

LOAD patient_id,

Sum(total_cost) as NewCost

Resident Tab1

Group by patient_id

;

Expression:

Avg(NewCost)

neelamsaroha157 · ‎2017-07-25

or if you want original values at Episode level then you can change expression as

If(RowNo() =0, Avg(NewCost), Sum(total_cost))

betthisisnttake · ‎2017-07-26

Just to update - none of these ideas got me exactly what I needed, however they have provided me with couple of jump off points which have allowed me to combine a few ideas and get further than I was before. It looks like it can all be done via expressions and not involve load scripts and I am confident I can get it to work. For the moment I have to go do a Qlik Sense June 2017 server upgrade so I can't work on it now for a while. I will revert with the working expressions and close the question when I get back to it again. Many thanks for the insights and time, much appreciated.

Extracting unique values for aggregation - patient data