YoussefBelloum
Champion

Base RAM for ETL apps

Hi everyone, 

I hope you're doing well.

 

To prepare a migration to Qlik Cloud and to calculate the sizing for the machines that will host the Qlik Data Gateways, I would like to calculate or retrieve the data volume (Base RAM) loaded by the ETL applications. The problem is that in all the monitoring applications, as well as in the QMC, Qlik completely loses track of it because of the "drop table" statements placed in the middle or at the end of the load script.
 
Do you have any idea if there is a way to perform this calculation?
 
Thank you.
 
-Youssef
9 Replies
lachlanwcast
Partner - Creator

One option to see the 'total' RAM usage during a reload is to set up a Perfmon data collector set, log everything to a file and then manually work out the peak RAM usage between the start and end of the reload.

https://techcommunity.microsoft.com/blog/askperf/windows-performance-monitor-overview/375481

YoussefBelloum
Champion
Author

Thank you @lachlanwcast 

I like your suggestion. I'll give it a try.

I'll leave this thread open to see if there are other solutions.

-Youssef

PrashantSangle

You will get this information in the Qlik Cloud Readiness app.

Great dreamer's dreams never fulfilled, they are always transcended.
Please appreciate our Qlik community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved 🙂
YoussefBelloum
Champion
Author

@PrashantSangle 
Nope, all ETL apps (apps that include drop tables) show "zero" for the used Base RAM, in the Readiness app and in all the other monitoring apps out there.

Thanks anyway.

-Youssef

marcus_sommer

You may derive the Base RAM by reading the information from the stored files with filesize(), and/or by temporarily storing (and then dropping) every load step as a file and fetching its size, too.
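For example, a minimal sketch of that idea - the table name Orders and the lib://Temp folder connection are just placeholders for one of your tables and a staging location:

// Sketch only: store the table, read its size on disk, then drop it
STORE Orders INTO [lib://Temp/Orders.qvd] (qvd);

TableSizes:
LOAD
    'Orders'                          as TableName,
    FileSize('lib://Temp/Orders.qvd') as StoredBytes
AutoGenerate 1;

DROP TABLE Orders;

The QVD size on disk is not exactly the in-memory Base RAM, but it is built from the same symbol tables and record pointers, so it should give a usable approximation per load step.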

Another approach might be to read the meta data of the tables by looping over all fields and calculating fieldvaluecount() * avg() bytes per value, plus the size of the table index per noofrows(). The results would never be exact, but suitable to get an order of magnitude.
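A rough sketch of this second idea - the average bytes per value and the bytes per pointer are just assumptions to tune, not measured values, and keep in mind that fieldvaluecount() counts the distinct values of a field across the whole data model:

// Sketch only: estimate the size of every loaded table from its meta data
LET vAvgBytesPerValue = 8;   // assumed average bytes per distinct value
LET vBytesPerPointer  = 4;   // assumed bytes per field pointer per row
LET vTableCount = NoOfTables();

FOR t = 0 TO $(vTableCount) - 1
    LET vTable = TableName($(t));
    LET vBytes = 0;

    // symbol tables: distinct values per field * assumed average value size
    FOR f = 1 TO NoOfFields('$(vTable)')
        LET vField = FieldName($(f), '$(vTable)');
        LET vBytes = $(vBytes) + FieldValueCount('$(vField)') * $(vAvgBytesPerValue);
    NEXT f

    // table index: one pointer per field per row
    LET vBytes = $(vBytes) + NoOfRows('$(vTable)') * NoOfFields('$(vTable)') * $(vBytesPerPointer);

    TRACE $(vTable): ~$(vBytes) bytes;
NEXT t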

Both methods aren't without some effort, whereby the first one is rather simple and the second quite complex. But it won't be mandatory to apply them against everything, just the few heaviest applications - I assume you already know which ones they are.

Beside the above, be aware that the RAM consumption during any joins/aggregations/crosstable/... might be much bigger than the final output data. To fetch this kind of information you will need a look from the outside at the events of the ETL, as mentioned by @lachlanwcast (you may also look for usable information within the performance logs of the QS server).

YoussefBelloum
Champion
Author

Thank you @marcus_sommer 

Very interesting approaches.

I'll take both and give them a try.

The RAM consumption during joins/aggregations/crosstables etc. I think I will not need, because it will not be handled by the resources of the machines that will host the Qlik Data Gateway; it will instead use the resources of the Qlik Cloud tenant.

My need is basically to calculate the amount of on-prem data uploaded to the tenant, in order to size the Data Gateway machines correctly.

-Youssef

marcus_sommer

This means that mainly the sources of the raw data define the sizing, and that the question should rather be answered by the SQL/storage system. This might be further impacted by any incremental approaches or data compression, whereby such measurements would also take some effort.

YoussefBelloum
Champion
Author

@marcus_sommer 

It would help if it was the exact same size as the raw data, but in every SQL script there is an explicit SELECT statement, not really a SELECT *.

They are selecting specific columns and applying filters and aggregations, which reduces the amount of data compared to the raw/total data.
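For example, a typical load in these apps looks roughly like this (table, column and filter names are made up), so only a fraction of the raw table is actually transferred:

// Illustrative only: explicit columns, a filter and an aggregation instead of SELECT *
Sales:
LOAD
    OrderDate,
    CustomerID,
    TotalAmount;
SQL SELECT
    OrderDate,
    CustomerID,
    SUM(Amount) AS TotalAmount
FROM dbo.SalesLines
WHERE OrderDate >= '2023-01-01'
GROUP BY OrderDate, CustomerID;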

marcus_sommer

I think the most pragmatic approach would be to look manually for the 5 biggest loads and do a simple calculation of:

((F1 * DataType) + (F2 * DataType) + (F3 * DataType) + ....) * Records

whereby Records is then an approximate assumption of the % reduction from the filter conditions and aggregations.
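For example, with made-up numbers: four kept fields at roughly 8 bytes each, and 50 million source rows of which about 30% survive the filters/aggregations, gives (8 + 8 + 8 + 8) * 50,000,000 * 0.3 ≈ 480 MB for that load.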

It should be enough in regard to the max. peak, because the load tasks are very likely scheduled in defined chains, since the number of parallel tasks is restricted by the performance of the database and the network, and there might also be various dependencies in regard to the task order.