Direct Discovery: What is DIRECT SELECT doing in this example?
I created a QlikView document using direct discovery to reference a large fact table for revenue (19 Billion rows). I am also referencing two in-memory tables from QlikView: a time dimension (12 rows of data) and a customer dimension (about 770K rows of data). The load script is below. I am associating the revenue information with the customer and time dimension information using the CUST_ACCT_ID and REVENUE_YR_MNTH_NBR, respectively.
I am seeing behavior I don't understand. When I load the script I first see this:
Connecting to Provider=OraOLEDB.Oracle.1;Persist Security Info=True;User ID=XXXX;Data Source=XXXXXX;Extended Properties=""
It sits there for a loooooong time like 25-30 minutes before it does anything. I understand QV has to buildi a memory map and I understand performance is based on source DB system. Given this behavior, it seems that the memory map is not built using the dimension tables (as I thought) but rather using the actual data in the FACT table. Is that the case? If the REVENUE data is going to be retrieved later *BASED* on the dimensions/selections, WHY does QV "fetch" data from the REVENUE table? Why does QV have to fetch anything from the fact table at script/data load time?
It is likely that I may be missing a fundamental concept with direct discovery or I just plain did something wrong. Can someone explain what I am seeing and hopefully set me straight?
OLEDB CONNECT32 TO [Provider=OraOLEDB.Oracle.1;Persist Security Info=True;User ID=XXXX;Data Source=XXXXXXX;Extended Properties=""] (XPassword is xxxxxxxxxxxxxxxxxxx);
CUSTOMER_DIM: LOAD CUST_ACCT_ID,
ADDR_CITY_NM, ADDR_CNTRY_CD, ADDR_ST_CD, ADDR_ZIP_CD, CUST_FULL_NM FROM CUSTOMER_DIM.qvd (qvd);
Re: Direct Discovery: What is DIRECT SELECT doing in this example?
I spoke to some folks from QlikView. They explained that indeed the Direct Discovery statement I have in my QVW will generate a select distinct SQL call to the DB for the dimension items I've defined. That said, I talked to our DBA to determine how the DB was setup for this large fact table (19B row). As per the available documentation, QlikView performance is dependent on the performance of the DB.
The DBA explained the DB table is partioned based on the REVENUE_YR_MNTH_NBR (it's an Oracle DB) with sub-partitions on another dimension (which I wasn't using) and that I should make sure whatever query I come up with leverages availble partions. The DBA went into some detail about the built-in parallelism Oracle has for performance/optimization/etc.
After talking with the DBA, I added an additional dimension and filter condition to my Direct Query statement in an attempt to leverage the sub-partitions when QV actually runs a query. I also constrained the data to a (much) smaller date rage. It is still slow (in my opinion), but after running queries manually against the source DB, I think this is as fast as I can go.
I plan to gradually widen the filter critera an determine how far I can offload data association to Direct Discovery and if it's worth using Direct Discovery or if I should find an alternate approach using aggregated data and document chaining (perhaps in combination with Direct Discovery) or something like that.
I don't have anything working yet, but I wanted to post an update based on what I know thus far. I plan to update this thread as I learn more.