seckenrode
Contributor II

Allowing Aggregated View of Data without Access to Underlying Row Level Data

Hello,

We are currently planning the implementation of a few embedded analytics features that require building visualizations with aggregated views of data we have in Qlik, while keeping the row-level data behind those views inaccessible to users, both in standard operation and against side-channel attacks.  We're running into some roadblocks with how Qlik embedding and section access work in this context and wanted to ask how others approach this problem.

For the purposes of discussion here, you can imagine we have a data model that is something like:

  • respondent with demographic attributes
  • region associations for that respondent
  • survey responses for that respondent
  • application users
  • application user<>region associations

Our basic requirements are:

  • Users should be able to see aggregated views of responses (e.g. averages, sums, etc.)
  • Users should be able to filter and pivot those responses based on respondent attributes (e.g. pivot by gender, filter by two or three other demographic fields)
  • If the filters reduce the dataset below a certain size threshold (e.g. fewer than 5 respondents remain after filtering), the aggregated values should be hidden (see the measure sketch after this list)
  • Some users can see detailed information about the respondents in their region but should only see restricted aggregated data across all regions for comparison & benchmarking purposes.
  • Filtering & pivoting of data should be near real time (<1s responses ideally)
  • New data is loaded multiple times a day and load times should remain low (<5min/app ideally)
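
For the thresholding requirement, the kind of measure we have in mind is roughly the following (just a sketch; RespondentID and ResponseScore are placeholders rather than our actual field names):

    // Hide the aggregate whenever the current selection covers fewer than 5 respondents
    If(Count(DISTINCT RespondentID) < 5,
       Null(),
       Avg(ResponseScore))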

There are a few issues that we've encountered:

  1. In order to generate the aggregated metrics on the fly, we need to grant users access to the row-level response data. We could implement the confidentiality thresholding in our measure definitions. However, given that users have access to the row-level data and Qlik integrations use a client-side websocket-based connection to get data, a nefarious user could get at the underlying row-level data by using that websocket connection to build a custom chart with the raw data.  This is true both for custom enigma-based integrations and for higher-order integrations using iframes or visualizations, since they all open a websocket in user space.
  2. Supporting comparisons to a larger dataset with rich filtering means we could not use section access for restricting the more detailed data.  This means we'd have to implement custom security & section access rules in our application in ways that cannot be bypassed by a nefarious user editing client-side logic.
  3. To support filtering by arbitrary demographic combinations, a pre-aggregated data model in Qlik would need to compute potentially tens of billions of combinations (we can have upwards of 100 unique filters that can be applied in combination).  This seems prohibitively expensive to compute at load time and does not let us leverage much of the power of Qlik for dynamic reporting.

Approaches we've been investigating include proxying the websocket connection through a server we own, to add permission & filtering logic that ensures users are not making ad hoc queries against the data, or making compromises on our product requirements to better match Qlik's security story.  Neither of those seems particularly great.

Has anyone encountered this type of security/privacy requirement in their Qlik implementation? How have you approached those challenges?

Thank you!

9 Replies
marcus_sommer

I think I would try to go with a classical section access approach, using a Fact Table with Mixed Granularity - Qlik Community - 1468238. This means the various totals are aggregated within the script and added to the detail data via concatenation. An extra field is also added that records whether a row is a total or a detail record, and this field is then used as part of the reduction key in the section access. The total-users will only get the totals and the detail-users only the details, so both are restricted to a certain data set. All aggregations within the UI should work as they would on a normal dataset, without needing extra measures.
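
As a rough script sketch of the principle (table and field names are only examples, not your model):

    // Detail records, flagged and given a reduction key for section access
    Facts:
    LOAD *,
         'Detail' as RecordType,
         'DETAIL|' & Upper(Region) as REDUCTIONKEY;
    LOAD * INLINE [
    RespondentID, Region, Gender, Score
    r1, region1, M, 4
    r2, region1, F, 5
    r3, region2, F, 3
    ];

    // Totals aggregated within the script and concatenated onto the same table
    Concatenate (Facts)
    LOAD Null() as RespondentID,
         Region,
         Gender,
         Sum(Score) as Score,
         'Total' as RecordType,
         'TOTAL' as REDUCTIONKEY
    RESIDENT Facts
    GROUP BY Region, Gender;

    // Section access then reduces on REDUCTIONKEY: detail users get their
    // DETAIL|<region> values, total-only users get just TOTAL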

Of course it will increase the number of records, but assuming an average aggregation rate of about 90% across the records it would result in only about 10% more records, without significantly increasing the distinct number of field values, and should therefore have no essential impact on the UI performance.

On the reload side it is different, because aggregations, especially across a larger number of grouping fields, are quite heavy transformations. But within a well-designed multi-layer incremental loading approach it might be solvable within the available refresh time.

Another, or an additional, approach could be to Mask or de-identify data for certain users using S... - Qlik Community - 1496092 and/or to duplicate the relevant fields and use OMIT to control which one is displayed for the user. Restrictions that switch between showing details, masked data or totals depending on the number of possible values are quite difficult to solve and will probably also need some pre-counting in extra flag fields.
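
A minimal sketch of how OMIT sits in the section access (users, fields and values are placeholders; in the doubled-fields variant the OMIT column would name either the clear or the masked version of the field):

    Section Access;
    LOAD * INLINE [
    ACCESS, USERID, REDUCTIONKEY, OMIT
    USER, DOMAIN\DETAILUSER, DETAIL|REGION1,
    USER, DOMAIN\TOTALUSER, TOTAL, RESPONDENTID
    ];
    Section Application;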

seckenrode
Contributor II
Author

Thanks Marcus,

The Fact Table with Mixed Granularity seems aligned with some of the pre-aggregated data model solutions.  I might be missing something about how that approach avoids expanding rapidly with the number of distinct field-value combinations.  When we've considered this approach before, we've been worried about that expansion.

For example, a row of the raw data might look like:

Region     Gender   Cohort   Language   Question 1   Question 2
Region 1   M        2023     English    Answer 1     Answer 5
Region 1   F        2024     Spanish    Answer 3     Answer 6

 

Based on reading the article, it seems like we'd probably make a data model that uses a generic key for region, to allow us to have both the row-level data and the data for all regions (or subsets of regions).  The question columns would be converted to question-answer counts (e.g. Question 1 - Answer 1 = 1 for the first row, while the aggregate row would have the sum of responses in that bucket).

However, given that Gender, Cohort, and Language are disjoint filters that could be applied at either level, my understanding is that we would still have to create an aggregate row for each combination of those field values (e.g. M|2023|English, M|2023|Spanish, F|2023|English, etc.) in the totals table.  For some of our standard use cases, we have 19 different demographic fields with between 3 and 100 values each; our demo data, for example, contains 60,035,076,351,590,400 possible combinations of filters that our users can choose from.

Does that match with your understanding of how we would apply the mixed granularity approach or are there other clever things we can do with generic keys to reduce that possible filter space?

 

We considered the Mask or De-identify data for certain users using Section Access approach, but it doesn't fit our needs since users would still be able to get access to the individual row-level data.  This data in Qlik doesn't include identifying information other than the demographic columns, but we consider the combination of demographic columns associated with a response to be sufficient information for a knowledgeable user to identify respondents, which is a scenario we'd like to avoid.

Thanks!

marcus_sommer

Completely independently of any section access, I would tend to transform this crosstable structure into a (normal) stream data structure. This means the filter attributes would be merged into two columns, one for the filter attribute and one for the filter value, and the same would happen with questions and answers:

The Crosstable Load - Qlik Community - 1468083

With that, the number of different attributes/questions between the various surveys becomes quite irrelevant.

This will change your quite wide table into a narrow and long table, which may have slight disadvantages in the storage / RAM size of the application, but the UI performance and especially the usability (in the data model as well as the UI) will usually be significantly better.
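
As a small sketch of the principle (field names are only examples):

    // Wide survey table: one column per question
    Wide:
    LOAD * INLINE [
    RespondentID, Region, Gender, Question 1, Question 2
    r1, region1, M, Answer 1, Answer 5
    r2, region1, F, Answer 3, Answer 6
    ];

    // Turn the question columns into a long Question/Answer structure
    Long:
    CrossTable (Question, Answer, 3)
    LOAD RespondentID, Region, Gender, [Question 1], [Question 2]
    RESIDENT Wide;

    DROP TABLE Wide;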

The crosstable transformation might not be the first step; for example, if any kind of counting/summing/scoring of the answers is in place, it would be sensible to apply it while the information is still at record level. It could also be helpful to create some categorizing/grouping of the filter attributes and the question types, to simplify selections as well as to provide them as object dimensions. By applying more extended solutions like The As-Of Table - Qlik Community - 1466130 you may also enable overlapping categories.

With such a transformation the above-mentioned aggregations and the mixed granularity approach should be possible. I suggest that you start this crosstable transformation within a parallel test project, skipping the section access requirements and most of the regular ETL and UI approaches (no layer logic, just generator + data model + report in one application, and no normal UI, just a few tables and list boxes), to play with it and then increase the logic and requirements step by step.

seckenrode
Contributor II
Author

Sorry for the delay!  The holiday over here in the US meant I hadn't had significant time to dig back into this.  I appreciate the answers here.  We've done some more thinking on our integration side and I think we can make some compromises on the detail of the aggregate rows (e.g. we don't need to support all filter combinations) that make our problem a bit nicer.  We're now just wrestling with how best to approach the higher-level comparison problem.

I've dug into the mixed granularity approach a bit and I think it makes some sense.  There are some quirks I'm still wrestling with, but I think they're less related to the mixed granularity table and more related to things like how we're modeling hierarchies of respondents, buildings, regions, etc. in our data model.

It also seems like, given the mixed access you'd have in the mixed granularity table, the measures might get complicated, unless I'm misunderstanding things.  For example, if we have a mixed granularity table with something like:

Generic Key                     Question 1 Answer 1 Count   Question 1 Answer 2 Count
respondent1|building1|region1   1                           0
respondent2|building1|region1   0                           1
respondent3|building2|region1   1                           0
*|building1|region1             1                           1
*|building2|region1             1                           0
*|*|region1                     2                           1

 

If I'm a user with access to building 1, section access should limit me to only the following data:

Generic Key                     Question 1 Answer 1 Count   Question 1 Answer 2 Count
respondent1|building1|region1   1                           0
respondent2|building1|region1   0                           1
*|building1|region1             1                           1
*|*|region1                     2                           1

 

Given that table has duplicated data in the 3rd and 4th rows (e.g. respondents 1 and 2 are counted in both), if we were trying to get answer counts for the region as a whole, we'd essentially need to craft a measure that only looks at the *|*|region1 row, rather than a generic measure like Sum({<Region = {'region1'}>} [Question 1 Answer 1 Count]) over the table.
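
For instance, assuming we also loaded a flag field such as RecordType with a value like 'RegionTotal' on the pre-aggregated rows (that flag is hypothetical, it isn't in the table above), the region benchmark measure would have to be pinned to those rows explicitly:

    // Only read the pre-aggregated region rows so the detail and building rows
    // aren't double counted (RecordType is a hypothetical flag field)
    Sum({< RecordType = {'RegionTotal'}, Region = {'region1'} >} [Question 1 Answer 1 Count])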

I think this is because the data in our fact table isn't disjoint, whereas in the Mixed Granularity example it is. If the rows are disjoint you don't end up with overlapping data across them, and therefore no double counting.

We couldn't use section access to remove the generic *|*|region1 row, since these users have access to a building in that region but not to all buildings; they need to see the overall region data without individual building breakdowns for the other buildings.  We'd have to calculate something like *|!building1|region1, but there are also many combinations of that data...

I've been playing with some simpler approaches that might be more constructive (e.g. a building average table and a region average table), but it means we wouldn't necessarily be able to share the same measure across different levels of reporting; we'd have to change some of our report design to know which table to look at based on some other application context.  Not the end of the world, but something we were hoping to avoid.  C'est la vie.

Very open to other avenues to explore here!

marcus_sommer

Reading your answer, I'm not sure whether you have tried to resolve the crosstable structures of the dimension fields and of the measure fields into a "normal" data structure?

The target idea behind my suggestions is to get a data set in which no complex generic keys are needed; instead, the key is just a numeric integer, and the dimension areas like buildings and regions, as well as the hierarchy level, are handled by categorizing fields. This will end in a quite long fact table as well as a quite long section access table, but the key and the relationship would be very simple.

The crosstable transform of the fact table is not difficult, nor is adding multiple aggregated sub-sets to it afterwards (with a larger data set it could become quite heavy processing, but with appropriate incremental approaches the final run times should be sufficient). Deriving the section access table from the given access information and matching it to the facts is the complete opposite and might become very complex.

Further, I think it's important not to mix up the different requirements within the section access and within the data model. This means the authorization of a record and/or a column should be independent of how the fields are used as dimensions and selections and how they are grouped/categorized. Also, within the section access itself, the normal user-related field-value reduction is a different concern from providing aggregated data, and different again from masking certain data, and should therefore be developed separately and merged in the end.

seckenrode
Contributor II
Author

Thanks Marcus! I think I looked past the crosstable suggestion thinking it was just for the demographic fields, which are actually already represented as long tables instead of the wide versions I posted before.

I'm not sure I'm completely following this bit of your suggestion:


The target idea behind my suggestions is to get a data set in which no complex generic keys are needed; instead, the key is just a numeric integer, and the dimension areas like buildings and regions, as well as the hierarchy level, are handled by categorizing fields.

Are you suggesting a similar model with multiple versions of the fact tables for the different dimensions?  Something like:

respondent_counts

respondent_key   question   answer   count
respondent1      1          1        1
respondent2      1          2        1
respondent3      1          1        1

 

building_counts

building_key   question   answer   count
building1      1          1        1
building1      1          2        1
building2      1          1        1

 

region_counts

region_key   question   answer   count
region1      1          1        2
region1      1          2        1

 

Thanks for all your responses here, I greatly appreciate your patience as I wrap my head around some of these different modeling approaches; this is my first major project modeling more complicated data in Qlik for analysis and I'm learning a lot as I dig in!

marcus_sommer

It was meant more like this, because multiple keys aren't needed:

key           question   answer   count
respondent1   1          1        1
respondent2   1          2        1
respondent3   1          1        1

 

key         question   answer   count
building1   1          1        1
building1   1          2        1
building2   1          1        1

 

key       question   answer   count
region1   1          1        2
region1   1          2        1

seckenrode
Contributor II
Author

Great, thanks for the clarification. Since there might be overlap in the keys for the different levels (e.g. respondent_key, building_key, and region_key are currently set up as different dimension tables for the different objects), we'd still need to generate a new key for the `key` column that combines type & key, which then, if I'm understanding this all correctly, means we'd still need a dimensional link table to get back to the original objects, right?

 

e.g. something like: LOAD AutoNumber('respondent' & respondent_key) as key, respondent_key, Null() as building_key, Null() as region_key
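
To make sure I'm describing the same thing, here's a rough sketch of what I'm imagining, using the example tables from above (purely illustrative, not our real model):

    // Hypothetical: one concatenated counts table keyed by a type-qualified surrogate key
    Counts:
    LOAD AutoNumber('respondent|' & respondent_key) as [key], question, answer, [count]
    RESIDENT respondent_counts;

    Concatenate (Counts)
    LOAD AutoNumber('building|' & building_key) as [key], question, answer, [count]
    RESIDENT building_counts;

    Concatenate (Counts)
    LOAD AutoNumber('region|' & region_key) as [key], question, answer, [count]
    RESIDENT region_counts;

    // ...plus some kind of dimensional link table mapping key back to the
    // original respondent/building/region records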

marcus_sommer

I'm not sure which kind of dimensional model would be suitable for your requirements, but as hinted above I wouldn't tend towards many key fields (on single fields and/or with combined fields, with autonumber() or not); I'd rather regard such an approach as a worst-case scenario, if no other ideas can be successfully implemented.

At first I would try to keep it as simple as possible and just use recno() as the key between the facts and the dimensions. Applying recno() as RecNo within the source loads, before doing any crosstable transformations, provides the information about which data belong together.
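
A small sketch of the recno() idea (file and field names are only examples):

    // Tag each source record before any crosstable transformation
    Source:
    LOAD RecNo() as RecNo,
         respondent_key, building_key, region_key,
         [Question 1], [Question 2]
    FROM [lib://Data/survey_responses.qvd] (qvd);   // placeholder path

    // After the crosstable transformation, RecNo still shows which answers and
    // which dimension values belong together
    Answers:
    CrossTable (Question, Answer, 1)
    LOAD RecNo, [Question 1], [Question 2]
    RESIDENT Source;

    // Keep the dimension values in Source and drop the wide question columns there,
    // so RecNo acts as the simple numeric key between facts and dimensions
    DROP FIELDS [Question 1], [Question 2] FROM Source;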

Even if this doesn't work at once, I wouldn't discard such an approach too early but rather try to adjust and extend it to support a solution. One direction might be not only to use a (multi-source) fact table with different granularity but also to apply a similar approach to the dimension table. This means the dimension table is created by several concatenate statements, adding one dimension area to the next, resulting in a combined ID + DimensionType + Source key field which may act as a kind of link table.