Perf problem with data discovery - Qlik Community

nicolas_martin · ‎2017-07-21

Hello,

I have a VERY huge table stored in a Vertica instance (really fast for SELECT queries).

Basicly, I have the following fields:

- date

- store

- product

- price

Where I have for each store for each date for each product, a price.

This table is approximatively 150 GB.

I've loaded this table in QV (12.1) with a script like

my_table:

DIRECT QUERY

DIMENSION

year,

date,

store,

product

MEASURE

price

FROM my_table;

In QlikView, I've made a chart with dimension "Store" and "Product", and expression:

avg(price)

If I filter on "year = 2017", this charts takes minutess to compute.

In another chart, with the expression:

avg(

aggr(

if(product = 'My Favotite', avg(price))
, Store, Product)
)

to view only the prices of "My Favorite" product (something I would have done with a Set Analysis).

This chart also takes minutes (maybe 10) to compute.

In a query tool, when I submit the following query:

SELECT
store,
product,
avg(price)
FROM my_table
WHERE year = 2017
GROUP BY store, product;

it takes 2 seconds to give me the result.

And

SELECT
store,
product,
avg(price)
FROM my_table
WHERE year = 2017
AND product = 'My Favorite'
GROUP BY store, product;

less than 1 second.

I would like to understand why there is such a difference.

I didn't find where to set the parameter that creates a SQL.log and view the exact query.

Report Inappropriate Content · ‎2017-07-21

Hi Nicolas,

I tried using direct discovery a few years back (on 11.2). I found so many limitations that it was basically unusable in all but the most basic of scenarios. It's a real shame, as it would be an exceptionally useful capability to have (had it been implemented properly).

I can't remember the specifics (sorry, it was quite a few years ago now), but I found an enormous overhead similar to what you're experiencing in many cases. I ran some traces at the time and I believe QV was running some "select distinct" queries on all the dimension fields on the script reload (which due to the underlying queries we had this was very inefficient for the specific use case I was prototyping). From memory, I think it would run a select distinct using the entire query core, but then just using a select distinct for every single field marked as a dimension.

So if you have

select

dima

,dimb

,dimc

,metric

from

[some really complex query]

qv would run 3 queries at reload time

select distinct dima

from

[some really complex query]

select distinct dimb

from

[some really complex query]

select distinct dimc

from

[some really complex query]

There was no way to tell QV to avoid the complex query and just pick up the distinct list of dimension values from another more performant table/query. This is just one example of the types of things I found. Not specifically related to your problem by the sound of it, but just for background.

For your specific problems, I think your best bet would be to run some DB traces to understand where the issue is - you might be able to work around it depending on the specifics of your environment and use case, but expect it to be an uphill journey.

Regards,

Graeme