Re: Pulling data from Postgresql DB - Question ab... - Qlik Community

EliGohar · ‎2024-07-17

Hi,

We work with Postgresql DB to retrieve data into Qlik Sense.
We identified (in the attached example - retrieving 4 million records) that the query at the SQL level ends in around 40 seconds, while another process takes 4 minutes.

20240717T110810.305+0300 0024 LIB CONNECT TO 'PostgreSQL_localhost'
20240717T110811.328+0300 Connected.
20240717T110811.330+0300 0028 SQL SELECT *
20240717T110811.330+0300 0029 FROM table1
20240717T110811.330+0300 0030 WHERE "timestamp" > '2024-06-27 20:04:08'
20240717T110855.187+0300 43 fields found: field1, field2, field3, field4, field5, field6, field7, field8, field9, field10, field11, field12, field13, field14, field15, field16, field17, field18, field19, field20, field21, field22, field23, field24, field25, field26, field27, field28, field29, field30, field31, field32, field33, field34, field35, field36, field37, field38, field39, field40, field41, field42, timestamp
20240717T111256.239+0300 4,002,000 lines fetched
20240717T111257.239+0300 Execution finished.

Is it possible to get an explanation of what exactly happens after the query ends on Qlik side?

What exactly is done in the fetching phase that takes 4 minutes?

Is it possible to debug the matter using logs?

Thanks,

Eli.

marcus_sommer · ‎2024-07-17

I'm not sure that this interpretation of the document-log is correct. I could imagine that it means that the data-base needs the 40 seconds to start the return of the first records which are then finished after further 4 minutes.

You may a bit playing with this thought by changing the number of records with an adjusted where-clause to load against yesterday and/or the yearstart or ...

Vishal_Gupta · ‎2024-07-17

@EliGohar not sure what you mean by 'another process' but it seems that once the tables/records are extracted, Qlik might be indexing those records which might add to the overall extraction time.

Also agree with @marcus_sommer , interpretation seems incorrect.

If a post helps to resolve your issue, please accept it as a SOLUTION and leave a LIKE!

EliGohar · ‎2024-07-18

@Vishal_Gupta - By indexing do you mean search index using CreateSearchIndexOnReload parameter? because we disabled it in the script.

@marcus_sommer - I may be wrong in my interpretation, but I saw that on the Postgresql side the query that Qlik sends through the script ends in 40 seconds. So what exactly happens next? Is there a way to improve the retrieval speed?

Thanks!

marcus_sommer · ‎2024-07-18

Like above hinted I assume that the 40 seconds is the query on the data-base side and the other time is the network-transfer per driver to Qlik and it's loading into the system- and data-tables.

I suggest to apply the mentioned adjustments to the where-clause and/or also some playing with 1 / 3 / 10 / n fields - there should be some insightful pattern ...

Assuming that really all fields and records are mandatory needed you may consider to apply any incremental logic and/or to divide the load into n loads of n fields. There are more possibilities but the effects are rather small.

Levi_Turner · ‎2024-07-19

To start out with, Qlik doesn't have the data when it reports "43 fields found: ". Here's a debug look into what is happening under the hood on each side with default settings:

The brief version of this is that Qlik will report the fields found in the first batch of information from PostgreSQL then continue to process the stream of data from PostgreSQL.

As background, PostgreSQL's ODBC driver (and Qlik's specific implementation) supports 4 modes:

Retrieve Entire Result Set in Memory
Use Multiple Statements
Single Row Mode
Use Declare / Fetch

Reference docs: https://docs.posit.co/drivers/1.7.0/pdf/Simba%20PostgreSQL%20ODBC%20Driver%20Install%20and%20Configu...

Ages ago I did some basic testing with these results:

The default behavior of the driver is to use a Use Declare / Fetch mode with a cache or page size of 200 rows of data.

Retrieve Entire Result Set In Memory, as the name implies will queue up a batch of data from the database and deliver it en masse to the Qlik Engine. We can see that the memory usage is relatively high compared to the Single Row and Use Declare / Fetch methods. It seems to be the fastest option to be sure, but only slightly in this relatively simple test.

Use Multiple Statements behaves a bit unexpectedly to my eyes. The memory usage is highest and the reload time is near the worst. It isn’t obvious how this mode is particularly relevant to a BI use case as our Engine will only issue a single SELECT statement so our queries will never come in multiples to take advantage of the theoretical benefits here.

Single Row Mode implies that the entire result set is stored in memory by PostgreSQL and the driver accepts a single row at a time. In terms of performance, it is slower than the fastest mode but not significantly so.

Notes:

This is a synthetic test where Qlik Sense Desktop is connecting to a local PostgreSQL instance. My guess is that the Use Declare / Fetch (and the default setup) would perform worse in terms of performance when more realistic networking is involved as the batches are rendered on the PostgreSQL side and then seem to be delivered to our ODBC Connector.
This was a very simple data set so I suspect the memory usage by the connector will show greater differences with more realistic data. I expect the Single Row Mode to be the lowest in terms of resource usage as the ODBC connector only needs to process a single row of data vs. the Use Declare / Fetch method (batches of multiple records) or the Entire Result Set In Memory method which likely would falter at scale in many environments including Qlik SaaS

How to implement: Set the key value pairs in the Advanced section of the connector config. Example setting SingleRowMode

EliGohar

Hi Levi,

Thanks for the detailed answer!

Our data source in Postgres DB is a complex view, each record in view involved some data calculations: concatenation of strings, and some math actions.

So when using batch fetching (by cursor) it takes a long time.

When we tried to use SingleRowMode - the total time reduced to ~1.5min, but we see that the stress moved to db (1.2min is the fetching from DB)

Do you have any idea how to improve this scenario?

Thanks again,

Eli.

Levi_Turner

Before delving into the technical, I would recommend assessing what you are trying to achieve. Via the advanced parameters, you have increased the throughput to Qlik, but at a cost of undesired strain on the database.

How often do you plan to reload this application? A 4 minute reload for an application which will be reloaded during business hours is often acceptable.
When does the source data get updated? If the underlying source data is updated via a CDC process every hour, then doing a reload sub-hourly serves no purpose.
What is the variant of source data? Insert only? Insert + update? Insert, update, and delete?

Let’s presume that you want to have as fast a reload as possible simultaneously lowering the stress on the underlying data source as much as possible. The standard approach in Qlik is to use a technique called an incremental load:

The simple description of this is that you will:

Perform an initial bulk load from the source and store this data to disk as a QVD file
Subsequent loads will load historical data from this QVD, interrogate the field which is used in a WHERE clause in the SQL load to filter to the latest records.

There are a number of articles across Community on this concept.

EliGohar

@Levi_Turner We already implemented incremental load. Data is loaded incrementally every 5 minutes (Insert only). The incremental loads insert about 100K new records (maximum), so they are done quite quickly.
Our interest is with the initial loading, which as mentioned can include millions of records, and therefore the purpose of the post - to understand if something can be improved in the initial loading.

Eli.

Pulling data from Postgresql DB - Question about performance

General Question