Qlik Community

QlikView App Development

Discussion Board for collaboration related to QlikView App Development.

Announcements
QlikView Fans! We’d love to hear from you.
Share your QlikView feedback with the product team… Click here to participate in our 5-minute survey.
Rules, plus terms and conditions, can be found here.
jlongoria
New Contributor III

Direct Discovery: What is DIRECT SELECT doing in this example?

I created a QlikView document using direct discovery to reference a large fact table for revenue (19 Billion rows). I am also referencing two in-memory tables from QlikView: a time dimension (12 rows of data) and a customer dimension (about 770K rows of data). The load script is below. I am associating the revenue information with the customer and time dimension information using the CUST_ACCT_ID and REVENUE_YR_MNTH_NBR, respectively.


I am seeing behavior I don't understand. When I load the script I first see this:


Connecting to Provider=OraOLEDB.Oracle.1;Persist Security Info=True;User ID=XXXX;Data Source=XXXXXX;Extended Properties=""

Connected

CUSTOMER_DIM << CUSTOMER_DIM (qvd optimized) 770,990 lines fetched

TimeDimension << INL3347 12 lines fetched

DirectTable << INL3358 0 lines fetched

REVENUE_FACT.CUST_ACCT_ID (fetching)


It sits there for a loooooong time like 25-30 minutes before it does anything. I understand QV has to buildi a memory map and I understand performance is based on source DB system. Given this behavior, it seems that the memory map is not built using the dimension tables (as I thought) but rather using the actual data in the FACT table. Is that the case? If the REVENUE data is going to be retrieved later *BASED* on the dimensions/selections, WHY does QV "fetch" data from the REVENUE table? Why does QV have to fetch anything from the fact table at script/data load time?

It is likely that I may be missing a fundamental concept with direct discovery or I just plain did something wrong. Can someone explain what I am seeing and hopefully set me straight?


OLEDB CONNECT32 TO [Provider=OraOLEDB.Oracle.1;Persist Security Info=True;User ID=XXXX;Data Source=XXXXXXX;Extended Properties=""] (XPassword is xxxxxxxxxxxxxxxxxxx);

SET DirectConnectionMax=10;

CUSTOMER_DIM:
LOAD
     CUST_ACCT_ID,

     ADDR_CITY_NM,
     ADDR_CNTRY_CD,
     ADDR_ST_CD,
     ADDR_ZIP_CD,
     CUST_FULL_NM
FROM
CUSTOMER_DIM.qvd
(qvd);


TimeDimension:
Load * Inline [
REVENUE_YR_MNTH_NBR
201301
201302
201303
201304
201305
201306
201307
201308
201309
201310
201311
201312
];


REVENUE:
DIRECT SELECT
    CUST_ACCT_ID,
    REVENUE_YR_MNTH_NBR
FROM REVENUE_FACT;

2 Replies

Re: Direct Discovery: What is DIRECT SELECT doing in this example?

Joe

I am also trying to understand what Direct Discovery does under the bonnet.

Empirically I reckon that this from your script :

REVENUE:

DIRECT SELECT

    CUST_ACCT_ID,

    REVENUE_YR_MNTH_NBR

FROM REVENUE_FACT;

Actually executes this SQL against your database :

SELECT DISTINCT

    CUST_ACCT_ID,

    REVENUE_YR_MNTH_NBR

FROM REVENUE_FACT;

Would you be able to execute that directly against your database, i.e. not via QlikView, and advise if that also takes a loooooong time like 25-30 minutes and also how many rows it returns.



Best Regards,     Bill

jlongoria
New Contributor III

Re: Direct Discovery: What is DIRECT SELECT doing in this example?

I spoke to some folks from QlikView. They explained that indeed the Direct Discovery statement I have in my QVW will generate a select distinct SQL call to the DB for the dimension items I've defined. That said, I talked to our DBA to determine how the DB was setup for this large fact table (19B row). As per the available documentation, QlikView performance is dependent on the performance of the DB.

The DBA explained the DB table is partioned based on the REVENUE_YR_MNTH_NBR (it's an Oracle DB) with sub-partitions on another dimension (which I wasn't using) and that I should make sure whatever query I come up with leverages availble partions. The DBA went into some detail about the built-in parallelism Oracle has for performance/optimization/etc.

After talking with the DBA, I added an additional dimension and filter condition to my Direct Query statement in an attempt to leverage the sub-partitions when QV actually runs a query. I also constrained the data to a (much) smaller date rage. It is still slow (in my opinion), but after running queries manually against the source DB, I think this is as fast as I can go.

I plan to gradually widen the filter critera an determine how far I can offload data association to Direct Discovery and if it's worth using Direct Discovery or if I should find an alternate approach using aggregated data and document chaining (perhaps in combination with Direct Discovery) or something like that.

I don't have anything working yet, but I wanted to post an update based on what I know thus far. I plan to update this thread as I learn more.

Community Browser