1 Reply Latest reply: Jul 21, 2017 6:35 AM by Graeme Smith RSS

    Perf problem with data discovery

    Nicolas MARTIN

      Hello,

       

      I have a VERY huge table stored in a Vertica instance (really fast for SELECT queries).

      Basicly, I have the following fields:

      - date

      - store

      - product

      - price

       

      Where I have for each store for each date for each product, a price.

      This table is approximatively 150 GB.

       

      I've loaded this table in QV (12.1) with a script like

      my_table:
      DIRECT QUERY
      DIMENSION
      year,
      date,
      store,
      product
      MEASURE
      price
      FROM my_table;
      

       

       

      In QlikView, I've made a chart with dimension "Store" and "Product", and expression:

      avg(price)
      

       

      If I filter on "year = 2017", this charts takes minutess to compute.

       

      In another chart, with the expression:

      avg(
      
      aggr(
      
      if(product = 'My Favotite', avg(price))
      , Store, Product)
      )
      
      

       

      to view only the prices of "My Favorite" product (something I would have done with a Set Analysis).

      This chart also takes minutes (maybe 10) to compute.

       

       

      In a query tool, when I submit the following query:

      SELECT
      store,
      product,
      avg(price)
      FROM my_table
      WHERE year = 2017
      GROUP BY store, product;
      

       

      it takes 2 seconds to give me the result.

       

      And

      SELECT
      store,
      product,
      avg(price)
      FROM my_table
      WHERE year = 2017
      AND product = 'My Favorite'
      GROUP BY store, product;
      

      less than 1 second.

       

       

      I would like to understand why there is such a difference.

       

      I didn't find where to set the parameter that creates a SQL.log and view the exact query.

        • Re: Perf problem with data discovery
          Graeme Smith

          Hi Nicolas,

           

          I tried using direct discovery a few years back (on 11.2).  I found so many limitations that it was basically unusable in all but the most basic of scenarios.  It's a real shame, as it would be an exceptionally useful capability to have (had it been implemented properly). 

           

          I can't remember the specifics (sorry, it was quite a few years ago now), but I found an enormous overhead similar to what you're experiencing in many cases.  I ran some traces at the time and I believe QV was running some "select distinct" queries on all the dimension fields on the script reload (which due to the underlying queries we had this was very inefficient for the specific use case I was prototyping).   From memory, I think it would run a select distinct using the entire query core, but then just using a select distinct for every single field marked as a dimension. 

           

          So if you have

          select

               dima

               ,dimb

               ,dimc

               ,metric

          from

               [some really complex query]

           

           

          qv would run 3 queries at reload time

           

           

          select distinct dima

          from

               [some really complex query]

           

          select distinct dimb

          from

               [some really complex query]

           

          select distinct dimc

          from

               [some really complex query]

           

          There was no way to tell QV to avoid the complex query and just pick up the distinct list of dimension values from another more performant table/query.  This is just one example of the types of things I found.  Not specifically related to your problem by the sound of it, but just for background.

           

          For your specific problems, I think your best bet would be to run some DB traces to understand where the issue is - you might be able to work around it depending on the specifics of your environment and use case, but expect it to be an uphill journey.

           

          Regards,

           

          Graeme