Hi Tanel, thanks for sharing the tests. I'm also surprised by the RowNo() performance.
One advantage of Hash() over AutoNumber[Hash]() is that it can be used in incremental loads; the disadvantage is that it creates a string, not a number.
Another advantage of AutoNumber() is that it creates a sequential integer.
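To make the difference concrete, here is a rough load-script sketch (the table and field names are invented for illustration, not taken from the tests above):

```
// Hash128() builds a repeatable string key from the business fields.
// Because the result depends only on the input values, the same key
// comes out in every load - which is what makes it incremental-safe.
Orders:
LOAD
    Hash128(ItemID, WarehouseID, OrderDate) AS %OrderKey_Hash,
    *
FROM Orders.qvd (qvd);

// AutoNumber() instead maps each distinct value to a compact sequential
// integer, but the numbering is only consistent within one execution.
LEFT JOIN (Orders)
LOAD
    %OrderKey_Hash,
    AutoNumber(%OrderKey_Hash) AS %OrderKey_Int
RESIDENT Orders;
```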
(Waiting for the experts' thoughts on these tests.)
Thank you for cataloging the key field generation techniques and evaluating their comparative performance; this is a useful collection that will serve as a future reference.
There's an additional element that could be added when evaluating the technique options, and that is lineage. Very often there is a need to return to the source data itself and view the before/after effects of loading and transformation. This analysis could note which techniques can be made to correlate back to the actual matching rows in the input source(s).
A few examples:
The RecNo() feature is keyed to match back to a specific row in a source spreadsheet. (However, this exercise becomes more complicated, and requires a little more rigor, when the spreadsheet header or body of data is allowed to contain varying numbers of fully empty rows.)
The input comes from a series of input files, and row tracing now must track not only the input row within the file, but also which of many possible input files contributed the row. (For example, the second of a series of spreadsheet inputs might concatenate the 51st row onto the QlikView table, but that 51st row originates from the 1st row of the 2nd input spreadsheet. Which row ID is correct?)
FileID + RecNo()
The input pulls from multiple tabs across multiple spreadsheets (qualifying hierarchy layers keep being added until you have a near parallel of Database.Schema.Table.Row):
FileID + TabID + RecNo()
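As a hedged sketch of the file/row pattern above (the FOR EACH wrapper, path, and field names are assumptions for illustration only; a tab ID would be folded in the same way when looping over sheet names):

```
// Build a lineage key recording which file and which row each record
// came from, so it can be traced back to the original source.
FOR EACH vFile IN FileList('..\Sources\*.xlsx')
    Data:
    LOAD
        FileName()                 AS SourceFile,
        RecNo()                    AS SourceRow,
        FileName() & '|' & RecNo() AS %LineageKey,
        *
    FROM [$(vFile)] (ooxml, embedded labels, table is Sheet1);
NEXT vFile
```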
And then there are cases where the flat files grow to such a size (multiple GBs) that most text editors can no longer even open the file for manual perusal of the rows in question. In this case, the RecNo() keying needs to be precise enough that other tools, capable of extracting specific rows from a large file, can find the match and pull it out for verification.
(AutoNumber will definitely produce a "leaner" key, but it won't help when matching raw data to post-scripting results for verification, whereas incorporating verbatim elements of the original row into the keying system helps ensure you're dealing with the correct match. Also, AutoNumber resets on a load-by-load basis, which very often creates confusion when it is stored to QVD and attempts are made to "reuse" an older AutoNumber. Techniques that build keys from static elements of the original row should remain consistently identifiable across the full life of the application, and very often that is worth a trade-off in performance.)
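A short sketch of the QVD-reuse pitfall described above (table/field names invented for illustration):

```
// Run 1: AutoNumber() assigns 1, 2, 3... in the order keys are first
// seen during this execution, and the result is stored to a QVD.
History:
LOAD
    AutoNumber(OrderRef) AS %Key,
    *
FROM NewOrders.csv (txt, utf8, embedded labels, delimiter is ',');
STORE History INTO History.qvd (qvd);

// Run 2 (a later reload): the internal AutoNumber lookup starts empty
// again, so the same OrderRef values may receive different integers,
// and freshly generated %Key values no longer line up with the ones
// already stored in History.qvd. A key built from static row elements
// (e.g. Hash128(OrderRef)) would not have this problem.
```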
In my personal experience, designs have often started with "inefficient" keying systems that ensure the raw-data-to-row-ID correlation is sound; then, after the ETL sequence is tested and trustworthy, those inefficient keys are replaced with better-performing ones in the final design.
The paramount attribute of these latter keying systems hinges not so much on performance as on the necessity of being able to complete the audit trail from source to end usage.
Hope you don't mind the added suggestion. Thank you for putting this together; I will definitely make use of it. ~E
Good points, for sure.
I agree that load speed shouldn't be the primary concern for app development.
Actually, as long as performance isn't critical, I prefer to keep full text keys like ItemID_WarehouseID_YYMMDD.
The benefits are obvious:
- easy to understand
- easy to debug
- can be reused across datamodels
- can be decoded back to the original values
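A minimal sketch of such a readable key (the delimiter, field names, and Date() format are assumptions for illustration):

```
// Build a human-readable composite key from the business fields.
Stock:
LOAD
    ItemID & '_' & WarehouseID & '_' & Date(TransDate, 'YYMMDD') AS %StockKey,
    *
FROM Stock.qvd (qvd);

// ...and decode it back to its parts later, e.g. while debugging:
Decoded:
LOAD
    %StockKey,
    SubField(%StockKey, '_', 1) AS ItemID_Decoded,
    SubField(%StockKey, '_', 2) AS WarehouseID_Decoded,
    SubField(%StockKey, '_', 3) AS YYMMDD_Decoded
RESIDENT Stock;
```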