Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hello,
Am I correct in understanding that if I use AutoNumber on a key field, for instance
AutoNumber(ProductID&CustomerID)
that the symbol table for the Key Field "ProductID&CustomerID" would look like this?
Pointer Value
1
2
that is, that the Value is empty because a look-up is not needed? Or would the symbol table look like this?
Pointer Value
1 1
2 2
Also is there a way to view the symbol tables? Thank you.
The distribution of the data into data-tables which contain only fields with bit-stuffed pointer and are associated with these pointers to the appropriate system-tables which contain only the distinct field-values is not really a compression else a quite efficient way to store data. The more redundant the data are the bigger are the benefits against linear storage-logic like the classical sql or csv.
On top may come a zip-like compression - at least in QlikView there is an appropriate setting - which could significantly reduce the storage-size but usually are the disadvantages of increasing the CPU consumption and storing/opening times greater as the saved storage.
Further possible are measurements like sorting and indexing the data and/or storing only min/max + offset values by incrementing-data or using the pointer as value and/or applying any kind of partitioning but it will also add complexity and increasing the CPU consumption.
In other words the native storage-logic of Qlik which becomes especially obvious by loading qvd's optimized or performing a binary load - which is a direct data-transfer from the storage into the RAM with nearly no data-processing - is quite near the maximum in regard to the performance.
I missed all the fun on this question...
I'll reflect on your last question - "If you have 1M rows and 100 distinct values for "Category", how does the symbol table help compress this field? "
The way Qlik does de-duplication and internal indexing of the data, with the Symbols table and the Index table, helps reduce the size of duplicate values. Instead of storing 1M rows with a 20-character string (for example), you only store 100 distinct values, and then 1M rows with a pointer - in this case it's a 7-bit pointer. 7 bits take approximately 20+ times less memory than 20 characters. So, roughly 95% compression.
Keep in mind, that this de-duplication happens automatically behind the scene, and you don't need to do anything for it.
Your decision could pertain to the question "Does Category belong in the Fact table, or in a Product Dimension table?". Typically, Category is an attribute that belongs in the Products hierarchy. So, usually you'd have a choice of keeping the field in the Products table or moving it to the Fact table. And the answer to this question is "It depends".
From the memory consumption perspective, a Category field stored in a Products Dimensional table will require less memory, because the Index table for Products is much shorted than the Index table for a Facts table. However, performance of specific objects and measures could differ in this case. Sometimes charts work faster when their Dimensions are stored in the Fact, and sometimes they don't. These issues need to be tested and verified in each particular data set.
Cheers,
Oleg
Thanks for your comment. Yes this is what I thought would happen. The "Category" was a bad example, I just wanted an example of a dimension that does not belong in a dimension table.
In this case like I think was mentioned in this thread, the dimension should be in the fact table, there is no need to have a dimension table if there are no attributes for it. Also calculation wise, it should be better since you have it in the same table.
I can also add that when you are using AutoNumber in the load statements for different key fields like
AutoNumber(Customer) as Customer,
AutoNumber(Product) as Product
In this case you will not get sequential integers in the individual fields, and thereby you would still get symbol tables for these fields is my understanding. Which can be good to keep in mind. You can solve this by using the second parameter of the AutoNumber function like
AutoNumber(Product, ‘Product’) as Product
AutoNumber(Customer, ‘Customer’) as Customer
However from my testing it seems that this is taken cared of automatically using the AutoNumber fieldlist like @rwunderlich suggested
AutoNumber Product, Customer;