jlongoria
Creator

Direct Discovery: What is DIRECT SELECT doing in this example?

I created a QlikView document that uses Direct Discovery to reference a large fact table for revenue (19 billion rows). I am also referencing two in-memory QlikView tables: a time dimension (12 rows) and a customer dimension (about 770K rows). The load script is below. I am associating the revenue information with the customer and time dimensions using CUST_ACCT_ID and REVENUE_YR_MNTH_NBR, respectively.


I am seeing behavior I don't understand. When I reload the script, I first see this:


Connecting to Provider=OraOLEDB.Oracle.1;Persist Security Info=True;User ID=XXXX;Data Source=XXXXXX;Extended Properties=""

Connected

CUSTOMER_DIM << CUSTOMER_DIM (qvd optimized) 770,990 lines fetched

TimeDimension << INL3347 12 lines fetched

DirectTable << INL3358 0 lines fetched

REVENUE_FACT.CUST_ACCT_ID (fetching)


It sits there for a loooooong time, like 25-30 minutes, before it does anything. I understand QV has to build a memory map, and I understand performance depends on the source DB system. Given this behavior, it seems the memory map is not built using the dimension tables (as I thought), but rather using the actual data in the FACT table. Is that the case? If the REVENUE data is going to be retrieved later *BASED* on the dimensions/selections, why does QV "fetch" data from the REVENUE table? Why does QV have to fetch anything from the fact table at script/data-load time?

I may be missing a fundamental concept with Direct Discovery, or I just plain did something wrong. Can someone explain what I am seeing and hopefully set me straight?


OLEDB CONNECT32 TO [Provider=OraOLEDB.Oracle.1;Persist Security Info=True;User ID=XXXX;Data Source=XXXXXXX;Extended Properties=""] (XPassword is xxxxxxxxxxxxxxxxxxx);

SET DirectConnectionMax=10;

CUSTOMER_DIM:
LOAD
     CUST_ACCT_ID,
     ADDR_CITY_NM,
     ADDR_CNTRY_CD,
     ADDR_ST_CD,
     ADDR_ZIP_CD,
     CUST_FULL_NM
FROM
CUSTOMER_DIM.qvd
(qvd);


TimeDimension:
Load * Inline [
REVENUE_YR_MNTH_NBR
201301
201302
201303
201304
201305
201306
201307
201308
201309
201310
201311
201312
];


REVENUE:
DIRECT SELECT
    CUST_ACCT_ID,
    REVENUE_YR_MNTH_NBR
FROM REVENUE_FACT;

2 Replies
Anonymous
Not applicable

Joe

I am also trying to understand what Direct Discovery does under the bonnet.

Empirically, I reckon that this part of your script:

REVENUE:
DIRECT SELECT
    CUST_ACCT_ID,
    REVENUE_YR_MNTH_NBR
FROM REVENUE_FACT;

actually executes this SQL against your database:

SELECT DISTINCT
    CUST_ACCT_ID,
    REVENUE_YR_MNTH_NBR
FROM REVENUE_FACT;

Would you be able to execute that directly against your database (i.e. not via QlikView) and advise whether it also takes a loooooong time, like 25-30 minutes, and how many rows it returns?
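
For example, something along these lines run in your Oracle client should give you both the timing and the row count. This is just a sketch, reusing the table and column names from your script and wrapping the same query in a count:

-- Sketch only: counts the distinct CUST_ACCT_ID / REVENUE_YR_MNTH_NBR combinations
-- that the DIRECT SELECT appears to be fetching at reload time.
SELECT COUNT(*) AS distinct_combinations
FROM (
    SELECT DISTINCT
        CUST_ACCT_ID,
        REVENUE_YR_MNTH_NBR
    FROM REVENUE_FACT
);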



Best Regards,
Bill

jlongoria
Creator
Author

I spoke to some folks from QlikView. They explained that the Direct Discovery statement I have in my QVW will indeed generate a SELECT DISTINCT SQL call to the DB for the dimension items I've defined. With that confirmed, I talked to our DBA to find out how the DB was set up for this large fact table (19 billion rows), since, per the available documentation, QlikView's Direct Discovery performance depends on the performance of the DB.

The DBA explained that the DB table is partitioned on REVENUE_YR_MNTH_NBR (it's an Oracle DB), with sub-partitions on another dimension (which I wasn't using), and that I should make sure whatever query I come up with leverages the available partitions. The DBA also went into some detail about the built-in parallelism Oracle has for performance/optimization/etc.

After talking with the DBA, I added an additional dimension and a filter condition to my Direct Query statement to try to leverage the sub-partitions when QV actually runs a query, and I also constrained the data to a (much) smaller date range, roughly along the lines of the sketch below. It is still slow (in my opinion), but after running queries manually against the source DB, I think this is about as fast as it can go.
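
The sketch is illustrative only: SUB_PART_DIM_CD stands in for the real sub-partition column the DBA pointed me at, and the date bounds are just an example of the narrower range I used.

// SUB_PART_DIM_CD and the date bounds below are placeholders, not the real values.
REVENUE:
DIRECT SELECT
    CUST_ACCT_ID,
    REVENUE_YR_MNTH_NBR,
    SUB_PART_DIM_CD
FROM REVENUE_FACT
WHERE REVENUE_YR_MNTH_NBR >= 201310 AND REVENUE_YR_MNTH_NBR <= 201312;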

I plan to gradually widen the filter criteria and determine how far I can offload data association to Direct Discovery, and whether it's worth using Direct Discovery at all or whether I should find an alternate approach using aggregated data and document chaining (perhaps in combination with Direct Discovery) or something like that.
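
For the aggregated-data idea, I'm picturing something like the following instead of (or alongside) the DIRECT statement. This is only a sketch, and it assumes a REVENUE_AMT measure column that is not in my current script:

// Sketch only: REVENUE_AMT is an assumed measure column; the real column name may differ.
// The GROUP BY collapses the 19B-row fact table to one row per customer per month.
REVENUE_AGG:
LOAD
    CUST_ACCT_ID,
    REVENUE_YR_MNTH_NBR,
    REVENUE_AMT;
SQL SELECT
    CUST_ACCT_ID,
    REVENUE_YR_MNTH_NBR,
    SUM(REVENUE_AMT) AS REVENUE_AMT
FROM REVENUE_FACT
GROUP BY CUST_ACCT_ID, REVENUE_YR_MNTH_NBR;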

I don't have anything working yet, but I wanted to post an update based on what I know thus far. I plan to update this thread as I learn more.