2 Replies Latest reply: Oct 24, 2013 7:29 AM by Joe L RSS

    Direct Discovery: What is DIRECT SELECT doing in this example?

    Joe L

      I created a QlikView document using direct discovery to reference a large fact table for revenue (19 Billion rows). I am also referencing two in-memory tables from QlikView: a time dimension (12 rows of data) and a customer dimension (about 770K rows of data). The load script is below. I am associating the revenue information with the customer and time dimension information using the CUST_ACCT_ID and REVENUE_YR_MNTH_NBR, respectively.


      I am seeing behavior I don't understand. When I load the script I first see this:


      Connecting to Provider=OraOLEDB.Oracle.1;Persist Security Info=True;User ID=XXXX;Data Source=XXXXXX;Extended Properties=""

      Connected

      CUSTOMER_DIM << CUSTOMER_DIM (qvd optimized) 770,990 lines fetched

      TimeDimension << INL3347 12 lines fetched

      DirectTable << INL3358 0 lines fetched

      REVENUE_FACT.CUST_ACCT_ID (fetching)


      It sits there for a loooooong time like 25-30 minutes before it does anything. I understand QV has to buildi a memory map and I understand performance is based on source DB system. Given this behavior, it seems that the memory map is not built using the dimension tables (as I thought) but rather using the actual data in the FACT table. Is that the case? If the REVENUE data is going to be retrieved later *BASED* on the dimensions/selections, WHY does QV "fetch" data from the REVENUE table? Why does QV have to fetch anything from the fact table at script/data load time?

       

      It is likely that I may be missing a fundamental concept with direct discovery or I just plain did something wrong. Can someone explain what I am seeing and hopefully set me straight?


      OLEDB CONNECT32 TO [Provider=OraOLEDB.Oracle.1;Persist Security Info=True;User ID=XXXX;Data Source=XXXXXXX;Extended Properties=""] (XPassword is xxxxxxxxxxxxxxxxxxx);

      SET DirectConnectionMax=10;

      CUSTOMER_DIM:
      LOAD
           CUST_ACCT_ID,

           ADDR_CITY_NM,
           ADDR_CNTRY_CD,
           ADDR_ST_CD,
           ADDR_ZIP_CD,
           CUST_FULL_NM
      FROM
      CUSTOMER_DIM.qvd
      (qvd);


      TimeDimension:
      Load * Inline [
      REVENUE_YR_MNTH_NBR
      201301
      201302
      201303
      201304
      201305
      201306
      201307
      201308
      201309
      201310
      201311
      201312
      ];


      REVENUE:
      DIRECT SELECT
          CUST_ACCT_ID,
          REVENUE_YR_MNTH_NBR
      FROM REVENUE_FACT;

        • Re: Direct Discovery: What is DIRECT SELECT doing in this example?
          Bill Markham

          Joe

           

          I am also trying to understand what Direct Discovery does under the bonnet.

           

          Empirically I reckon that this from your script :

           

          REVENUE:

          DIRECT SELECT

              CUST_ACCT_ID,

              REVENUE_YR_MNTH_NBR

          FROM REVENUE_FACT;

           

          Actually executes this SQL against your database :

           

          SELECT DISTINCT

              CUST_ACCT_ID,

              REVENUE_YR_MNTH_NBR

          FROM REVENUE_FACT;

           

          Would you be able to execute that directly against your database, i.e. not via QlikView, and advise if that also takes a loooooong time like 25-30 minutes and also how many rows it returns.



          Best Regards,     Bill

            • Re: Direct Discovery: What is DIRECT SELECT doing in this example?
              Joe L

              I spoke to some folks from QlikView. They explained that indeed the Direct Discovery statement I have in my QVW will generate a select distinct SQL call to the DB for the dimension items I've defined. That said, I talked to our DBA to determine how the DB was setup for this large fact table (19B row). As per the available documentation, QlikView performance is dependent on the performance of the DB.

               

              The DBA explained the DB table is partioned based on the REVENUE_YR_MNTH_NBR (it's an Oracle DB) with sub-partitions on another dimension (which I wasn't using) and that I should make sure whatever query I come up with leverages availble partions. The DBA went into some detail about the built-in parallelism Oracle has for performance/optimization/etc.

               

              After talking with the DBA, I added an additional dimension and filter condition to my Direct Query statement in an attempt to leverage the sub-partitions when QV actually runs a query. I also constrained the data to a (much) smaller date rage. It is still slow (in my opinion), but after running queries manually against the source DB, I think this is as fast as I can go.

               

              I plan to gradually widen the filter critera an determine how far I can offload data association to Direct Discovery and if it's worth using Direct Discovery or if I should find an alternate approach using aggregated data and document chaining (perhaps in combination with Direct Discovery) or something like that.

               

              I don't have anything working yet, but I wanted to post an update based on what I know thus far. I plan to update this thread as I learn more.