Solved: Re: Qlik Catalog Data Load Record Limit - Qlik Community

kathysun · ‎2021-10-20

Hi all,

I'm quite new to Qlik Catalog and am currently exploring ways to limit the record count for data loads of registered entities.

In my understanding, registered entities only load a sample of the data. Can anyone clarify how Qlik Catalog decides how much data it ingests during a data load? And is it possible for an admin to set or limit the amount of data loaded?

Thank you.

Christopher_Ortega · ‎2021-10-20

Hi,

Qlik Catalog will load everything that is in a file or table for the purposes of generating a profile and a sample of the data. The profile is computed over the full table, and the sample is drawn from the full table as well.

There is a property called src.file.glob which contains the query (or file location) for the data to be extracted. By default, it includes no WHERE clause or LIMIT statement.

One could modify this property to include a LIMIT, putting a cap on what gets profiled. Please note that in this case the profile might not be truly representative, as not all values are taken into consideration for minimums, maximums, and frequency distributions.
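To illustrate that caveat, here is a minimal Python sketch. It does not use any Qlik Catalog API; the `profile` helper and the generated table are hypothetical stand-ins, and slicing the first 1,000 rows is only an approximation of what a LIMIT in the query would return. It shows how a profile built on a limited extract can miss the true minimum and maximum.

```python
import random

def profile(values):
    """Return a tiny profile: row count, minimum, and maximum."""
    return {"count": len(values), "min": min(values), "max": max(values)}

random.seed(0)
# Hypothetical column of 100,000 values standing in for a source table.
table = [random.randint(1, 1_000_000) for _ in range(100_000)]
# Plant the true extremes late in the table, where a LIMIT would not reach.
table[-1] = 0
table[-2] = 2_000_000

full = profile(table)             # profile over every row
limited = profile(table[:1_000])  # roughly what "LIMIT 1000" would return

print("full:   ", full)
print("limited:", limited)
```

The limited profile reports a narrower value range than the full one, which is the loss of representativeness described above.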

kathysun · ‎2021-10-21

Hey Christopher,

Thank you so much for your reply. I think the accuracy of the summary statistics (sum, max, min) is quite important. The purpose of this ask is that I am currently trying to expose all tables in the org, and some of the tables might have far too many rows. All entities are registered entities, with only a sample of the data loaded for users to explore. With tables having millions and millions of rows, even 10% can still cause unnecessary resource consumption and processing load. I'm curious: for data loading, is it possible to load only a fixed number of rows set by an admin?

Also, following up on the accuracy of the summary statistics: even if the entities are set to be registered, the values are still calculated based on all rows, right (rather than only on the loaded sample data)?

Thank you!

Christopher_Ortega · ‎2021-10-21

Hi Kathy,

Thanks for your explanation of your use case. I'd like to clarify one thing you've mentioned a couple of times. The sample is independent of the profile. The profile is generated off whatever data is loaded. Concurrently a sample is retained for Catalog users. These are independent - the profile doesn't depend on the sample.

Let's take your example of a table with 500 million rows. With no changes, all 500 million rows will be brought into the Catalog for profiling and, in the process, sampled. The profile is computed on the full data set and a sample of 1% (by default) is retained, so in this case your sample would be about 5 million rows. The data used to profile (500 million rows) is then removed. This leaves you with a fully representative profile and a sample data set for viewing/use within the Catalog. Again, to clarify, the profile is on everything returned by the query in the property src.file.glob.
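As a rough mental model of that behaviour, the Python sketch below profiles every row in a single pass while keeping each row in the sample with a fixed probability. This is an assumption about the general technique, not Qlik Catalog's actual implementation; the `load` function and its parameters are hypothetical names chosen for illustration.

```python
import random

def load(rows, sampling_probability=0.01, seed=None):
    """Profile every row; independently retain each row with the given probability."""
    rng = random.Random(seed)
    count = 0
    lo = hi = None
    sample = []
    for value in rows:
        # The profile sees every row, regardless of sampling.
        count += 1
        lo = value if lo is None or value < lo else lo
        hi = value if hi is None or value > hi else hi
        # The sample keeps only a random fraction of the rows.
        if rng.random() < sampling_probability:
            sample.append(value)
    return {"count": count, "min": lo, "max": hi}, sample

# Hypothetical 200,000-row table; only the sample is kept afterwards.
profile, sample = load(range(1, 200_001), sampling_probability=0.01, seed=42)
print(profile)
print(len(sample))  # close to 2,000, varying with the seed
```

The profile reflects all rows even though only about 1% of them end up in the retained sample.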

If you are concerned that even a 1% sample is too much to retain (and I agree that 5 million records is still quite a lot), you can configure this.

There is a property that you can set on the entity called record.sampling.probability. By default this is 0.01. You can, however, set it lower if you'd like. Let's say that I only wanted to retain a sample of around 50,000 records from that 500 million row table. I might then set the property to 0.0001 to achieve this.
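The arithmetic behind that choice can be sketched in a few lines of Python. The helper names are hypothetical, and the result is an expected sample size under the assumption that each record is retained independently with the configured probability, so the actual count would vary slightly.

```python
def sampling_probability(total_rows, target_sample_rows):
    """Probability to configure so the expected sample is about target_sample_rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    # A probability cannot exceed 1.0 (i.e. retain every row).
    return min(1.0, target_sample_rows / total_rows)

def expected_sample_rows(total_rows, probability):
    """Expected number of retained rows for a given probability."""
    return total_rows * probability

# Default of 0.01 on a 500 million row table: about 5 million rows.
print(expected_sample_rows(500_000_000, 0.01))

# Target of about 50,000 rows on the same table.
p = sampling_probability(500_000_000, 50_000)
print(p)
```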

https://help.qlik.com/en-US/catalog/August2021/Content/QlikCatalog/Discover/Sample_Data_and_Data_Loa...

I hope this helps.

Chris

kathysun · ‎2021-10-22

Thank you so much Christopher. This exactly answers all my questions!

Christopher_Ortega · ‎2021-10-22

You are welcome. I'm happy to help.

One thing I didn't mention, but which you should also be aware of: properties such as this can be inherited. What I mean by that is you can set the property on the source rather than on the entity, and all entities within that source that do not have the property set will inherit it. Of course, if an entity has its own value for the property, that value will be used.
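The lookup order described here (entity first, then source, then the product default) can be sketched in Python as follows. This only illustrates the precedence rule; the `resolve_property` function and the dictionaries are hypothetical and are not part of Qlik Catalog.

```python
def resolve_property(name, entity_props, source_props, default=None):
    """Return the entity's value if set, else the source's value, else the default."""
    if name in entity_props:
        return entity_props[name]
    if name in source_props:
        return source_props[name]
    return default

# Hypothetical property sets for one source and two of its entities.
source = {"record.sampling.probability": 0.001}
entity_inheriting = {}
entity_overriding = {"record.sampling.probability": 0.0001}

key = "record.sampling.probability"
inherited = resolve_property(key, entity_inheriting, source, default=0.01)
overridden = resolve_property(key, entity_overriding, source, default=0.01)
fallback = resolve_property(key, {}, {}, default=0.01)

print(inherited, overridden, fallback)
```

Here the first entity picks up the source's value, the second keeps its own, and with nothing set anywhere the default of 0.01 applies.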

Good luck!