Hello Everyone,
I have been stuck on a scenario for the last few days. The scenario is as follows.
I have a table with the following fields:
Product, StartDate, EndDate, Stock
And a master calendar, which is not linked to the above table and contains the following fields:
Cal_Date, Cal_Month, Cal_Year, Cal_Quarter, Cal_Week
The data is populated in this table only when there is a transaction. For example, my opening stock is 100. On 1st Jan 2011 I had a sales transaction of 20 units of a product, then a transaction on 5th Jan 2011 of 10 units, and then a transaction on 7th Jan 2011 of 20 units. So on 8th Jan 2011 my data will look like this:
Product, StartDate, EndDate, Stock
ABC, 01/01/2011, 04/01/2011, 80
ABC, 05/01/2011, 06/01/2011, 70
ABC, 07/01/2011, 01/01/2099, 50
The date 01/01/2099 indicates that there are no transactions after 7th Jan 2011.
Now I can get the stock as of 3rd Jan 2011 with the following expression.
Sum({<StartDate = {"<=$(=max(Cal_Date))"},EndDate = {">=$(=Max(Cal_Date))"}>}Stock)
This works fine for a single date selection. The problem occurs when I need a chart with Month as the dimension, showing the stock as of the end date of each month.
Meaning if the month is Jan, I need the stock as of 31st Jan 2011, and so on for the other months.
I am attaching the sample table as an Excel file.
I believe that I will get a solution for this problem.
Thank you all.
Regards,
Kaushik Solanki
Hi Kaushik,
as a starting point only... please find attached a proposal using interval match.
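In case the attachment isn't available, a minimal sketch of the idea (not the actual attachment; the StockRanges table name is a placeholder) would be:
// Link each Cal_Date from the master calendar to the [StartDate, EndDate] range it falls into.
Bridge:
INTERVALMATCH (Cal_Date)
LOAD DISTINCT StartDate, EndDate
RESIDENT StockRanges;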
Seasonal greetings,
Stefan
Another idea:
Use Sku and monthend(Start_Date) as dimension and
=firstsortedvalue(Stock, -Start_Date)
as expression.
Regards,
Stefan
Hi Stefan,
I appreciate the work and logic you put into this. But the problem is that, with the same data structure I described, I have almost 250 million records for 2 years, and if I use the interval match function the table will grow enormously. So I need a way to achieve the same result with less RAM overhead, which is why I was thinking of a set analysis solution, so that I don't need to add records and increase the RAM.
One more thing: I can't include SKU as a dimension, because I have over 6 million SKUs. Also, I need the stock for every month end, with Month as the dimension and Stock as the expression.
Please share if you have any other idea to achieve this.
Regards,
Kaushik Solanki
Consider the attached. At heart it's the same intervalmatch as Stefan's solution, but there are a lot of optimizations that I think will allow it to work better for your 250 million row table.
Now yes, in a sense, the new table I added grows as the cube of the number of possible dates. In practice, I think your number of rows will be drastically lower than that. I suggest trying it on your real data set. There's nothing like testing to let you know for sure. I can't guarantee it, but I suspect you'll be surprised at how little RAM it actually adds.
You could also do this with a synthetic key as in Stefan's solution. I don't know for sure that removing it will help, but I suspected that the intervalmatch might otherwise directly refer to the table with 250 million rows during the load, taking similar time to a resident load, which is to say a whole lot of time. Since the intervalmatch here refers only to fields in what should be much smaller tables, I'm guessing this will be faster.
Script below for people using personal edition.
Data:
LOAD
SKU
,Stock
,Sales
,num(StartDate) & num(EndDate) as DateRange // This needs to be added to the QVD for performance.
// We don't load StartDate and EndDate here, but they remain in the inline load to
// show that they're still part of the underlying QVD.
INLINE [
SKU StartDate EndDate Stock Sales
1 1/1/2011 1/4/2011 1 100
1 1/5/2011 1/5/2011 3 200
1 1/6/2011 1/15/2011 6 300
1 1/16/2011 1/30/2011 8 400
1 1/31/2011 2/8/2011 10 500
1 2/9/2011 2/16/2011 7 600
1 2/17/2011 2/18/2011 15 700
1 2/19/2011 3/20/2011 20 800
1 3/21/2011 3/25/2011 32 900
1 3/26/2011 3/31/2011 33 1000
2 1/1/2011 1/10/2011 16 1100
2 1/11/2011 1/16/2011 20 1200
2 1/17/2011 1/17/2011 22 1300
2 1/18/2011 1/20/2011 25 1400
2 1/21/2011 1/28/2011 28 1500
2 1/29/2011 2/5/2011 30 1600
2 2/6/2011 2/7/2011 34 1700
2 2/8/2011 2/14/2011 36 1800
2 2/15/2011 2/24/2011 40 1900
2 2/25/2011 3/31/2011 41 2000
] (delimiter is ' ');
// The AUTOGENERATE loads and the DateRange field allow us to load unique values and ranges
// without ever referring back to the original table. Since the original table is
// 250 million rows, this should save a huge amount of time in the script.
DateLinkage:
LOAD *
,date( left(DateRange,5)) as StartDate
,date(right(DateRange,5)) as EndDate
;
LOAD num(fieldvalue('DateRange',recno())) as DateRange
AUTOGENERATE fieldvaluecount('DateRange');
Min:
LOAD min(Date) as MinDate;
LOAD num(fieldvalue('StartDate',recno())) as Date
AUTOGENERATE fieldvaluecount('StartDate');
LET vMinDate = peek('MinDate');
DROP TABLE Min;
Max:
LOAD max(Date) as MaxDate;
LOAD num(fieldvalue('EndDate',recno())) as Date
AUTOGENERATE fieldvaluecount('EndDate');
LET vMaxDate = peek('MaxDate');
DROP TABLE Max;
Calendar:
LOAD *,if(Date=Month,Month) as MonthEnd;
LOAD *,date(daystart(monthend(Date)),'MMM YYYY') as Month;
LOAD date($(vMinDate)+recno()-1) as Date
AUTOGENERATE $(vMaxDate)-$(vMinDate)+1;
LEFT JOIN (DateLinkage)
INTERVALMATCH (Date)
LOAD
StartDate
,EndDate
RESIDENT DateLinkage;
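Once this model is loaded, a chart for your requirement could use Month as the dimension and something like the expression below (this part is only a sketch, not necessarily what's in the attached example):
// MonthEnd is populated only on the last day of each month, so restricting to it
// limits the linkage to month-end dates and sums the stock level in effect on each one.
Sum({<MonthEnd = {"*"}>} Stock)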
Hello John,
Thank you for your valuable support. I have looked at the solution you suggested, but one thing that bothers me is the DateLinkage table.
Let me explain in detail what I am thinking.
The actual table contains around 35 million records for 3 months when the data is in date-range form, i.e. with StartDate and EndDate. But if we convert that into a one-row-per-day form, i.e. the DateLinkage table in your example, it grows to about 770 million rows, which will increase the application size and may run out of memory.
So I am thinking of another solution, such as set analysis, where I get the maximum date according to the dimension (Year, Month or Week), compare it with StartDate and EndDate, and pick up the respective stock. Something like the set analysis expression below:
Sum({<StartDate = {"<=$(=max(Cal_Date))"},EndDate = {">=$(=Max(Cal_Date))"}>}Stock)
Meaning I need to get the output you have shown in your example, but with the help of set analysis. The reason is that it will not add any rows; it may increase the processing time, which is OK.
I hope you understand my actual problem. If not, let me know and give me one more chance to explain it in detail.
Thanks & Regards,
Kaushik Solanki
The intervalmatch solution will not grow to 770 million rows for 35 million rows of input for a 3 month time span.
First, notice that the new linkage table doesn't have the SKU in it. It depends ONLY on the date ranges. It has one row for every date in every date range. That means there's a hard and fast limit on the number of rows in the linkage table. I'm guessing it's on the order of 100,000 rows for three months even if you have 50 billion rows in your original table. I could probably calculate it, but I suspect an example will be more convincing.
OK, I built and attached an example. It produces a maximum of 125,580 rows for 3 months of data no matter how many rows are in the original table. I ran the example with over 100 million rows just to make sure I wasn't confused. As an aside, even at 100 million rows, it took only a single second to build the date linkage table on my machine, so script performance even on a huge table should be a non-issue.
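If you're curious where that maximum comes from: with a 90-day window (Jan through Mar 2011), a date range of length L days can start in 91 - L positions and contributes L linkage rows, so the worst case is the sum over L = 1..90 of (91 - L) * L = 91 * 4,095 - 247,065 = 125,580 rows, regardless of how many fact rows share those ranges.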
Now, we can have MUCH larger maximum sizes if your date ranges can be much longer and you still have all possible date ranges. For about 10 years of data, the maximum rows in the linkage table would be around ten billion I think, which would be a serious problem. But your real data should be MUCH better behaved than that if I understand it. I don't anticipate you'll have any problems in practice.
My suggestion is that you try it on your real data and find out. Seeing how big the new table ACTUALLY IS should be faster and more informative than debating me on how big we each THINK the table will be. I could certainly be wrong, particularly if your data doesn't look like I suspect it looks, but there's no point in debate when testing is so simple.
If testing shows the new table is too big, another data model solution would be storing stock changes on each record, and building an AsOf table linking each AsOfDate to that Date and all previous Dates. Instead of summing stock for a Date, you'd then sum stock changes for an AsOfDate. I don't think chart performance would be as good, which is what I'm typically trying to optimize, but it would be a much smaller linkage table if testing shows too many rows in the other type of linkage table.
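In rough script terms, that alternative might look something like this (just a sketch; the Calendar table name and the StockChange field are assumptions):
// Build an AsOf table: one row per (AsOfDate, Date) pair with Date <= AsOfDate,
// so summing stock CHANGES for an AsOfDate gives the stock level as of that date.
TmpAsOf:
LOAD DISTINCT Date AS AsOfDate RESIDENT Calendar;
JOIN (TmpAsOf) LOAD DISTINCT Date RESIDENT Calendar; // no shared field, so this is a cartesian product
AsOfTable:
NOCONCATENATE LOAD AsOfDate, Date RESIDENT TmpAsOf WHERE Date <= AsOfDate;
DROP TABLE TmpAsOf;
// Chart: month of AsOfDate as dimension, Sum(StockChange) as expression.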
You COULD use set analysis, but it's not as simple as you may be thinking. Sets are evaluated for the chart as a whole, not for rows on the chart. The only efficient way around this limitation that I know of is rather complicated. But set analysis per row is detailed in the following document if you really want to pursue it:
I am not sure whether I understood the problem correctly. The following is a trivial solution which will work if the period between the start and end dates does not contain more than one month-end date:
Data:
LOAD *
,if(DayStart(MonthEnd(StartDate)) <= EndDate, DayStart(MonthEnd(StartDate)), Null()) as MonthEndDate // month-end date falling inside the range, if any (<= so a range ending exactly on a month-end is included)
,if(DayStart(MonthEnd(StartDate)) <= EndDate, Stock, Null()) as MonthEndStock; // stock level in effect on that month-end
LOAD * INLINE [
SKU,StartDate,EndDate,Stock,Sales
1,1/1/2011,1/4/2011,1,100
1,1/5/2011,1/5/2011,3,200
1,1/6/2011,1/15/2011,6,300
1,1/16/2011,1/30/2011,8,400
1,1/31/2011,2/8/2011,10,500
1,2/9/2011,2/16/2011,7,600
1,2/17/2011,2/18/2011,15,700
1,2/19/2011,3/20/2011,20,800
1,3/21/2011,3/25/2011,32,900
1,3/26/2011,3/31/2011,33,1000
2,1/1/2011,1/10/2011,16,1100
2,1/11/2011,1/16/2011,20,1200
2,1/17/2011,1/17/2011,22,1300
2,1/18/2011,1/20/2011,25,1400
2,1/21/2011,1/28/2011,28,1500
2,1/29/2011,2/5/2011,30,1600
2,2/6/2011,2/7/2011,34,1700
2,2/8/2011,2/14/2011,36,1800
2,2/15/2011,2/24/2011,40,1900
2,2/25/2011,3/31/2011,41,2000
];
I hope this helps you arrive at a script that will work even when there are multiple month-end dates in the no-transaction period between the start and end dates.
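For example, a chart could then use
Date(MonthStart(MonthEndDate), 'MMM YYYY')
as the dimension and
=Sum(MonthEndStock)
as the expression (field names as in the script above; the month formatting is just illustrative).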
Hi John,
I agree with you, there is no point in debating on what we think.
I will certainly try your solution on real data and update you on the performance of the application.
Thank you very much for your valuable time and support.
Have a nice Christmas and Happy New Year.
Thanks & Regards,
Kaushik Solanki