Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi, so we have the following bit of code in our ETL script that has some fairly dense conditional logic. I was just wondering if anyone knew any optimisation techniques I could apply to try and optimise this? This is by fair the slowest part of our load and it would be great if I could speed it up a bit.
LOAD *,
ApplyMap('LegalEntityRegion',[TLegal Entity],'remap') as [LegalEntity Region],
IF(Match(
[TTrade Source System],'A','B', 'C')>0,'V',
IF(isnull(LOOKUP('Flow Type','PlatformMap',TPlatform,'FlowType')),[TPlatform Description],
LOOKUP('Flow Type','PlatformMap',TPlatform,'FlowType')
)
)
as [Vol Type],
IF([TLocal Ccy Region] = 'G10','G10',
IF([TTrade Source System]='B',
IF(WildMatch([TPrimary CCY],'*CNY*','*CNH*','*CNT*')>0,'RMB','EM') ,
IF(WildMatch([TCurrency Pair],'*CNY*','*CNH*','*CNT*')>0,'RMB','EM')
)
)
as [Ccy Group],
IF([TTrade Source System] = 'C','Client',
pick(wildMatch([TTeam Name],'Traders','Back to Back','Broker','Senior Management','Treasury - Egypt','Treasury - Kuwait','N/A','Non Users','Non Users - Research Website','Origination','*Non_Sales*', '*'), 'nonClient','nonClient','nonClient','nonClient','nonClient','nonClient','nonClient','nonClient','nonClient','nonClient','nonClient', 'Client') )
as [Client Flag]
Resident tmpAllocationsSource
Hi Richard,
Try implementing the script attached. I've broken down all conditionals into their smallest variations. It should hopefully speed up your processing significantly. I'm going to make an educated guess that your wider script could be further optimized as well based on the 'tmp' in the table name of tmpAllocationsSource. If you're processing this table several times over and if it's some form of fact table, then further optimization can likely take place to reduce down the number of times you process these records. In essence, aim to have only ONE load step loading the bulk of your fact data, join on any remaining attributes and avoid conditional logic as far as possible as you're likely incurring logic processing that may have only a handful of variation of output results across 100's of millions of records.
I hope it helps, kindly post some feedback either way.
-Jonas
From my experience, I let the SQL handle this kind of heavy nested IFs by using CASE or in WHERE clause of the SQL script.
Just my 2 cents.
Yeah, ideally this would all be done in the database, but unfortunately that isn't an option to us at the moment.
A couple of thoughts without picking it apart in detail. Lookup is notoriously slow. Try to replace that section with mapping.
How about handling all those WildMatch() functions in the Where clause of the load script? This way, Qlik will pull the necessary data from the source.
It's difficult to say if it could be speed up significantly. An general recommendation isn't to load heavy loadings per resident load else per load from a qvd but mostly the differences aren't big. More potential should be have a replacing from the lookups() with applymap() by [Vol Type] and possibly by [Ccy Group] too respectively instead of:
...
IF(WildMatch([TPrimary CCY],'*CNY*','*CNH*','*CNT*')>0,'RMB','EM') ,
IF(WildMatch([TCurrency Pair],'*CNY*','*CNH*','*CNT*')>0,'RMB','EM')
...
this one:
...
IF(WildMatch([TPrimary CCY],'*CNY*','*CNH*','*CNT*')>0 or
WildMatch([TCurrency Pair],'*CNY*','*CNH*','*CNT*')>0,'RMB','EM')
...
Alos in your wildcard * might be potential if there are fields with a high cardinality which could be removed or splitted (like a timestamp into dates and times).
- Marcus
It's difficult to say without further analysis on where the bottleneck is.
I'd do some performance test with:
See what has the most impact on performance and act on that.
Mapping instead of lookup seems very likely, as mentioned by Rob and Marcus above.
For the wildmatches, do you have some sort of "master dimension" table for fields like TPrimary or TTeam Name? If so, and the number of distinct values is reasonably low, you might create a mapping table before the load, with all the possible values and the "calculated" result that they need, like this:
Mapping map_1:
LOAD [TPrimary CCY] as Code,IF(WildMatch([TPrimary CCY],'*CNY*','*CNH*','*CNT*')>0,'RMB','EM') as description
resident Dim_TPrimary_CCY
the use a mapping instead of a wildmatch on the full-records load.
Hi Richard,
Try implementing the script attached. I've broken down all conditionals into their smallest variations. It should hopefully speed up your processing significantly. I'm going to make an educated guess that your wider script could be further optimized as well based on the 'tmp' in the table name of tmpAllocationsSource. If you're processing this table several times over and if it's some form of fact table, then further optimization can likely take place to reduce down the number of times you process these records. In essence, aim to have only ONE load step loading the bulk of your fact data, join on any remaining attributes and avoid conditional logic as far as possible as you're likely incurring logic processing that may have only a handful of variation of output results across 100's of millions of records.
I hope it helps, kindly post some feedback either way.
-Jonas
Wow, Jonas thanks! This is immense! There's a lot to pour over here, so I'm going to have to take some time to digest it. Immediately though I can see you've reduced the working set by saving off and working with distinct values, which is already a really great idea, Thank you so much for your time!