Peek() function slows my script - is there a workaround?
I have 2.5m rows and it takes around 5 minutes to load the following script. Is it a normal situation with peek() ?
if(Peek('Client')='Client',numsum(peek('PurchaseNo',-1),1),1) as PurchaseNo,//Присваиваю номера покупкам в разрезе Партнеров
if(peek('Client')='Client',date-peek('date'),0) as DaysBetweenPurchases, //рассчитываю кол-во дней между повторными покупками в разрезе Партнеров,
if(peek('Client')='Client',(date-peek('date'))/30,0) as MonthsBetweenPurchases
Order By Client asc, date asc;
Peek will be slow. To speed things up you might look at chunking up the data so that you don't have to re-parse all of it every time and then do optimised loads from QVD. The challenge is always working out what you can chunk on.
In my experience peek() itself isn't so slow that it couldn't be applied on larger datasets. Therefore I go with Sunny to check if the applied combination of statements (distinct, order by, if, peek) may delay the execution significantly. This means not only just removing one or two of them and noticing that the run-time is lesser and therefore the cause of the delaying must be the remaing statements else only the combination of all the transformations may cause this effect - especially if the load didn't run multi-threaded anymore else it's forced to a single-threaded execution it will increase the run-times quite heavily.
To check this you may try to do the distinct load a step before maybe even with the necessary ordering - which might be even applied another step before and/or the ordering is reduced to a single field maybe by any rowno() approach or by combining the needed fields preferably not as strings else as a number, for example with Client as the integer-value and date becomes the frac-part.
To insert one or two inbetween steps within the load-order may not necessary lead to a longer run-times at all epecially if it was intentionally designed in this way. This includes especially the suggestion from Steve to slice your data by applying any incremental approaches.
Of course to re-think and (partly) re-design the whole load could be become quite expensive and not appropriate for this case but you would also learn a lot so that the real benefits come on later projects.
Nevertheless here a few direct suggestions which may have also an impact on the run-time:
- replace all peek() unless peek('PurchaseNo') with previous() - replace numsum() with rangesum() - making the Client check only once: - maybe within a preceeding approach - first: -(Client = previous(Client)) as Flag - second: rangemin(rangesum(peek('PurchaseNo'), 1) * Flag, 1) as PurchaseNo - third: DaysBetweenPurchases / 30 as ... here might be also a mapping possible