YoussefBelloum
Champion

Base RAM for ETL apps

Hi everyone, 

I hope you're doing well.

 

To prepare a migration to Qlik Cloud and to calculate the sizing for the machines that will host the Qlik Data Gateways, I would like to calculate or retrieve the data volume (Base RAM) loaded by the ETL applications. The problem is that in all the monitoring applications, as well as in the QMC, Qlik completely loses track of it because of the "drop table" statements placed in the middle or at the end of the load script.
 
Do you have any idea if there is a way to perform this calculation?
 
Thank you.
 
-Youssef
9 Replies
lachlanwcast
Partner - Creator

One option to see the 'total' RAM usage during a reload is to set up a Perfmon data collector set, log everything to a file and then manually work out the peak RAM usage between the start and end of the reload.

https://techcommunity.microsoft.com/blog/askperf/windows-performance-monitor-overview/375481

YoussefBelloum
Champion
Author

Thank you @lachlanwcast 

I like your suggestion. I'll give it a try.

I'll leave this thread open to see if there are other solutions.

-Youssef

PrashantSangle

You will get this information in the Qlik Cloud Readiness app.

Great dreamer's dreams never fulfilled, they are always transcended.
Please appreciate our Qlik community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved 🙂
YoussefBelloum
Champion
Author

@PrashantSangle 
Nope, all ETL apps (apps that include drop tables) show "zero" for the used Base RAM, in the Readiness app and in all the other monitoring apps out there.

Thanks anyway.

-Youssef

marcus_sommer

You may derive the Base RAM by reading the information from the stored files with filesize(), and/or by temporarily storing (and then dropping) every load step as a file and fetching its size, too.
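For example, a minimal sketch of that idea - the table name Orders and the lib://Temp folder connection are just placeholders for one of your tables and a staging location:

// Sketch only: store the table, read its size on disk, then drop it
STORE Orders INTO [lib://Temp/Orders.qvd] (qvd);

TableSizes:
LOAD
    'Orders'                          as TableName,
    FileSize('lib://Temp/Orders.qvd') as StoredBytes
AutoGenerate 1;

DROP TABLE Orders;

The QVD size on disk is not exactly the in-memory Base RAM, but it is built from the same symbol tables and record pointers, so it should give a usable approximation per load step.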

Another approach might be to read the meta data of the tables by looping over all fields and calculating fieldvaluecount() * avg() bytes per value, plus the size of the table index per noofrows(). The results would never be exact, but suitable to get an order of magnitude.
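A rough sketch of this second idea - the average bytes per value and the bytes per pointer are just assumptions to tune, not measured values, and keep in mind that fieldvaluecount() counts the distinct values of a field across the whole data model:

// Sketch only: estimate the size of every loaded table from its meta data
LET vAvgBytesPerValue = 8;   // assumed average bytes per distinct value
LET vBytesPerPointer  = 4;   // assumed bytes per field pointer per row
LET vTableCount = NoOfTables();

FOR t = 0 TO $(vTableCount) - 1
    LET vTable = TableName($(t));
    LET vBytes = 0;

    // symbol tables: distinct values per field * assumed average value size
    FOR f = 1 TO NoOfFields('$(vTable)')
        LET vField = FieldName($(f), '$(vTable)');
        LET vBytes = $(vBytes) + FieldValueCount('$(vField)') * $(vAvgBytesPerValue);
    NEXT f

    // table index: one pointer per field per row
    LET vBytes = $(vBytes) + NoOfRows('$(vTable)') * NoOfFields('$(vTable)') * $(vBytesPerPointer);

    TRACE $(vTable): ~$(vBytes) bytes;
NEXT t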

Both methods aren't without some effort, whereby the first one is rather simple and the second quite complex. But it won't be mandatory to apply them against everything, just the few heaviest applications - I assume you already know which ones they are.

Beside the above, be aware that the RAM consumption during any joins/aggregations/crosstable/... might be much bigger than the final output data. To fetch this kind of information you will need a look from the outside at the events of the ETL, as mentioned by @lachlanwcast (you may also look for usable information within the performance logs of the QS server).

YoussefBelloum
Champion
Author

Thank you @marcus_sommer 

Very interesting approaches.

I'll take both and give them a try.

The RAM consumption during joins/aggregations/crosstables etc. I think I will not need, because it will not be handled by the resources of the machines that will host the Qlik Data Gateway; it will instead use the resources of the Qlik Cloud tenant.

My need is basically to calculate the amount of on-prem data uploaded to the tenant, in order to size the Data Gateway machines correctly.

-Youssef

marcus_sommer

This means that mainly the sources of the raw data define the sizing, and that the question should rather be answered by the SQL/storage system. This might be further impacted by any incremental approaches or data compression, whereby such measurements would also take some effort.

YoussefBelloum
Champion
Author

@marcus_sommer 

It would help if it was the exact same size as the raw data, but in every SQL script there is an explicit SELECT statement, not really a SELECT *.

They are selecting specific columns and applying filters and aggregations, which reduces the amount of data compared to the raw/total data.
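For example, a typical load in these apps looks roughly like this (table, column and filter names are made up), so only a fraction of the raw table is actually transferred:

// Illustrative only: explicit columns, a filter and an aggregation instead of SELECT *
Sales:
LOAD
    OrderDate,
    CustomerID,
    TotalAmount;
SQL SELECT
    OrderDate,
    CustomerID,
    SUM(Amount) AS TotalAmount
FROM dbo.SalesLines
WHERE OrderDate >= '2023-01-01'
GROUP BY OrderDate, CustomerID;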

marcus_sommer

I think the most pragmatic approach would be to look manually for the 5 biggest loads and do a simple calculation of:

((F1 * DataType) + (F2 * DataType) + (F3 * DataType) + ....) * Records

whereby Records is then an approximate assumption of the % reduction from the filter conditions and aggregations.
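For example, with made-up numbers: four kept fields at roughly 8 bytes each, and 50 million source rows of which about 30% survive the filters/aggregations, gives (8 + 8 + 8 + 8) * 50,000,000 * 0.3 ≈ 480 MB for that load.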

It should be enough in regard to the max. peak, because the load tasks are very likely scheduled in defined chains, since the number of parallel tasks is restricted by the performance of the database and the network, and there might also be various dependencies in regard to the task order.