JGMDataAnalysis
Creator III

Counting Simultaneous Tasks in Charts

Hi! Could you tell me if there is a chart aggregation function that allows me to count simultaneous tasks?
I'm currently generating the "Datetimes" table and then using the Aggr function in the measure.
The problem is that "Datetimes" has more than 40 million records, and performance degrades considerably when using the Aggr function.
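Roughly, the idea is the following sketch; the actual script is in the screenshots below, and the Tasks table plus the TaskStartTime/TaskEndTime field names here are only placeholders:

// Expand each task into one row per second it was running
Datetimes:
Load
    TaskID,
    Timestamp(TaskStartTime + (IterNo() - 1) / 86400) as TaskDateTime
Resident Tasks
While TaskStartTime + (IterNo() - 1) / 86400 <= TaskEndTime;

The chart measure is then along the lines of:

Max(Aggr(Count(Distinct TaskID), TaskDateTime))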

[screenshot: JGMDataAnalysis_4-1700165793624.png]

[screenshot: JGMDataAnalysis_5-1700165809097.png]

Is there a more efficient way to obtain the same result?

@rwunderlich, @marcus_sommer, @sunny_talwar

14 Replies
marcus_sommer

The first question is what the timestamp table is actually needed for. Is the time part really essential? If it is, could the timestamp be separated into dates and times? And do you need to keep this information in an extra table, or could the from-to period information be joined/mapped onto the facts and the intermediate table be dropped again?

The general idea is to move the essential logic into the data model and optimize it there, instead of trying to optimize the UI performance.

JGMDataAnalysis
Creator III
Author

Thank you very much, @marcus_sommer.
I split the TaskDateTime field into TaskDate and TaskTime. This modification gave a noticeable improvement in performance, but I am still not satisfied with the result.
I then tried joining the timestamp table to the fact table, but in that case I saw no difference, so I decided to keep them separate.
As a basic premise, I need to calculate the concurrency of tasks based on the filters applied by users.
To perform this calculation I must consider the tasks that were running in each second, even though in the charts I will use year, month, date or hour as dimensions.
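The split itself is simple; roughly (a sketch, assuming the per-second table is first loaded under the placeholder name Datetimes_tmp):

// Split the timestamp so the symbol tables stay small:
// one date field plus one time field with at most 86,400 values
Datetimes:
Load
    TaskID,
    Date(Floor(TaskDateTime))            as TaskDate,
    Time(Frac(TaskDateTime), 'hh:mm:ss') as TaskTime
Resident Datetimes_tmp;
Drop Table Datetimes_tmp;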

 
vincent_ardiet_
Specialist
Specialist

You can create another table and take advantage of the way Qlik deals with cardinality:

refTask:
Load
    FieldValue('TaskID', RecNo()) as ID#,       // link field: must have the same name as the task key in your data model so the tables associate
    FieldValue('TaskID', RecNo()) as TaskIDKey  // exactly one row per distinct TaskID value
AutoGenerate FieldValueCount('TaskID');

Then, if you count TaskIDKey, you don't need the Distinct, so the aggregation runs on multiple threads and should perform better.
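For example, assuming the field names line up so that refTask associates with the Datetimes table, the chart measure could become something like:

Max(Aggr(Count(TaskIDKey), TaskDate, TaskTime))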

JGMDataAnalysis
Creator III
Author

Thanks, @vincent_ardiet_.
I had never considered the optimization you propose because some time ago I read an @hic article explaining that the performance difference between Count and Count Distinct only existed in older versions.

https://community.qlik.com/t5/Design/A-Myth-About-Count-distinct/ba-p/1476053

However, I could do some tests.

vincent_ardiet_
Specialist

Indeed, I'm also curious to know the results.

marcus_sommer

Some more questions: does the fact table or the resolved DateTime table have the 40 M records? And does separating dates and times then lead to expressions like:

max(aggr(count(distinct TaskID), TaskDate, TaskTime))

or maybe:

max(aggr(count(distinct TaskID), Year, Month, Date, Hour))

Beside this, the question is whether an aggr() is really needed in a table chart with the dimensions Year + Month + Date + Hour, because a plain count(distinct TaskID) would already return the number of distinct tasks within the applied dimensions. In a KPI object it might be different, since there it is a single aggregation, while the table chart will perform it for each dimension value.

If minutes and seconds aren't relevant at the dimension level they might be skipped in the aggr(), and/or you could work with multiple DateTime and/or Time fields at reduced granularity: a complete time field with 86,400 values, one without seconds with 1,440 values, and the next already the hour with just 24 values, the last maybe also combined with the date, i.e. date + hour. Of course the extra fields will increase the size of the data model, but choosing the appropriate one in the relevant UI expressions may improve the performance.
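As a sketch of what such extra fields might look like in the script (the field names are only examples):

// Add coarser time fields to the existing per-second table so the UI
// can use the coarsest field that is still correct for the expression.
Left Join (Datetimes)
Load Distinct
    TaskDate,
    TaskTime,
    Time(Floor(TaskTime, 1/1440), 'hh:mm')                       as TaskTimeMinute,  // up to 1,440 values
    Hour(TaskTime)                                               as TaskHour,        // 24 values
    Timestamp(TaskDate + Floor(TaskTime, 1/24), 'YYYY-MM-DD hh') as TaskDateHour     // date + hour
Resident Datetimes;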

The rest of the data model also has an impact on performance. The best compromise with regard to simplicity, maintainability, effort and performance is usually a star-schema model. Link-table models in particular should be avoided, especially if UI performance is the bottleneck.

JGMDataAnalysis
Creator III
Author

Thanks again, Marcus.
First, I should provide more information about the real context of the data. What I called "tasks" in the example are in reality "actions" associated with the Provisioning System of a telecommunications company. These "actions" last a few seconds, except those executed through a CRON, which can last several hours.
Under these circumstances, I understand that I am obliged to consider the data down to the second in order to correctly calculate the number of concurrent "actions", regardless of whether the charts group up to the hour.

Answering the questions:
- The resolved DateTime table has the 40 M records
- max(aggr(count(distinct TaskID), TaskDate, TaskTime))

Data Model Viewer:

[screenshot: JGMDataAnalysis_0-1700581830853.png]

If for the following example I used the measure max(aggr(count(distinct TaskID), Year, Month, Date, Hour)), I would get 2 as a result, which would not be correct because at no point was more than one task running simultaneously.

[screenshot: JGMDataAnalysis_1-1700582045114.png]
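In other words, something like this pattern (hypothetical timestamps, not my real data):

// Two actions in the same hour that never overlap
Tasks:
Load * Inline [
TaskID,TaskStart,TaskEnd
A,2023-11-21 10:05:00,2023-11-21 10:05:30
B,2023-11-21 10:40:00,2023-11-21 10:40:10
];
// At hour granularity count(distinct TaskID) returns 2,
// but at no single second are both actions running, so the true peak concurrency is 1.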

Am I misunderstanding what you conveyed to me or, even worse, how Aggr works?

 
vincent_ardiet_
Specialist

If you are sure to have one event for each second, you could also try sorting the table by ActionID and time and using Previous() to check whether the same task was running 1 second before; create a flag with 0 in that case, else 1. In your UI you can then sum this flag.
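Something along these lines (a sketch, assuming the per-second Datetimes table with a resolved TaskDateTime timestamp):

// Order by task and second, then flag only the first second of each run
FlaggedDatetimes:
Load
    TaskID,
    TaskDateTime,
    If(Previous(TaskID) = TaskID
         and Round((TaskDateTime - Previous(TaskDateTime)) * 86400) = 1,
       0, 1) as TaskStartFlag
Resident Datetimes
Order By TaskID, TaskDateTime;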

JGMDataAnalysis
Creator III
Author

In simple terms, I have tasks that start and end. Based on this, I should be able to calculate the maximum number of tasks that ran simultaneously in a given year, month, week, date or date + hour, according to the selections made by the user. For that reason, I consider it impossible to precompute a flag, because I cannot know in advance which records will come into play.

It helps me to think of it as one-second partitions that tasks fall into according to their duration.

[screenshot: JGMDataAnalysis_0-1700587162785.png]