I need to calculate YTD totals for this year and previous year. Fiscal year runs from october 2010 -->september 2011.
I created a SQL view to calculate YTD totals which works fine but due to performance issues, I am trying to convert it into a ETL. SQL is similar to as below:
select a.yr, a.Dt, a.x1, a.x2, sum(b.cnt) as ytd_total from Fact_tbl a join Fact_tbl b
on a.yr = b.yr and
a.x1 =b.x1 and
a.x2=b.x2 and
a.dt >= b.dt
join time_dim c
on a.yr=c.yr
and a.dt=c.date
and (a.dt between c.year_start_date and c.year_end_date)
group by a.yr, a.Dt, a.x1, a.x2 )
So, in my ETL I am using a tmap and doing a inner join between 2 fact_tbl which have data for current year and previous year. Also, I am doing a inner join between the lookup flow of fact_tbl and time_dim. After this I am having a hard time implementing the below:
1. condition to check if Dt is between the year start & end dates.
I have got a filter on the input file in tmap which verifies that the Dt field is a date after the Year_start_date and before Year_end_date. I feel this condition is not working.
2. Calculating ytd_total.
I am using this calculation for ytd_total.. row9.Dt.after(row8.Dt)|| row9.Dt.equals(row8.Dt)? (row8.cnt+ row9.cnt): 0
I dont think this is the right way to calculate the totals as I need to do a group by too.
Please let me know if you have any thoughts on doing this?
Regards,
What are the settings for each of the lookups in tMap? In particular, the Match Model in both cases should be All Matches (as well as the Join Model being Inner Join).
Attached are the screen shots of the flow. Image 1 is the ETL flow Image 2 is the tmap inputs in which I verify that the Dt field is a date after the Year_start_date and before Year_end_date Image 3 is the tmap outputs in which I am doing a calculation for ytd_total in it. I tried using tAggregateRow too, but it just gives me the total for that day but not Year to date totals.
Difference between your SQL and tMap are:
SQL tMap
1. Joined op1 to op2 on YR, X1 and X2 Joined on all of YR, X1, X2, X3 and most importantly DT
2. Condition for tm is on op1.DT Condition is on op2.DT
3. Between tm.YEAR_START_DATE and Condition is exclusive of tm.YEAR_START_DATE and
tm.YEAR_END_DATE is inclusive tm.YEAR_END_DATE
4. Summing op2.cnt Summing op1.cnt+op2.cnt
(For efficiency, you could move the comparison between op1.DT and op2.DT to be a filter on op2 and use either !op1.DT.before(op2.DT) or op1.DT.compareTo(op2.DT)>=0.)
Then you need tAggregateRow to group by op4.YR, op4.DT, op4.X1 and op4.X2 and sum op4.cnt.
Hello Anut,
you should also be careful when calculating YTD with joins.
Normally you group by year and month of the transaction, but say you have a transaction for January (amount = 100) and one for April (amount = 150).
Your query would give you :
YTD January = 100
YTD April = 250
All the other YTD values (feb, mar) are not calculated and eventually assumed as zero, which is wrong.
There is a SQL workaround for this, which is to create a cross product join with a table with the 12 moth, but performances are badly affected as you can imagine.
For this reason in ETL different techniques must be applied.
We developed a component to do that, but it's not released to the public yet, it is based on a java library (which we produced) that does not use joins to do the calculations, instead it uses a memory buffer and it scans all the periods.
See in the attach, as an example month 2007 08 did not have values, still YTD is calculated
BR
Francesco
Thanks for the reply.
In the ETL before, I am doing a cross join between fact & time dimension table. Doing so, for each "group by" combination I am creating a record for every date. so my data looks similart to this.
yr dt x1 x2 cnt
2010 10/1/2009 toy1 ca 1
2010 10/2/2009 toy1 ca 8
2010 10/3/2009 toy1 ca 0
2010 10/4/2009 toy1 ca 0
' ' '
' ' '
' ' '
2010 9/28/2010 toy1 ca 0
2010 9/29/2010 toy1 ca 0
2010 9/30/2010 toy1 ca 5
On this data, I am doing a self join.
Using SQL, I get the values as below:
yr dt x1 x2 cnt
2010 10/1/2009 toy1 ca 1
2010 10/2/2009 toy1 ca 9
2010 10/3/2009 toy1 ca 9
2010 10/4/2009 toy1 ca 9
' ' '
' ' '
2010 9/28/2010 toy1 ca 9
2010 9/29/2010 toy1 ca 9
2010 9/30/2010 toy1 ca 14
Using taggregateRow, I get the values as in input only which is my issue.
yr dt x1 x2 cnt
2010 10/1/2009 toy1 ca 1
2010 10/2/2009 toy1 ca 8
2010 10/3/2009 toy1 ca 0
2010 10/4/2009 toy1 ca 0
' ' '
' ' '
2010 9/28/2010 toy1 ca 0
2010 9/29/2010 toy1 ca 0
2010 9/30/2010 toy1 ca 5
As per the example screen shot, I understand the new component will help me in acheiving what I want to. Not sure, if this can be waited until the availability of the component. Can you provide me the techniques I can use??
Please let me know.
Thanks again for your time.
Have you fixed your tMap to correct the difference I pointed out? Put up a new screenshot and one of tAggregateRow. Are skipped dates a problem with your data? It's easy enough to solve this with existing Talend components if need be.
attached are the screenshots with modifications of differences you pointed out.
skipped dates is not my problem. as mentioned in earlier example, issue is data from taggregateRow is similar to my input data . I am not getting a upto date total.
Thanks for your time.
You still have joined op1 to op2 on all of YR, X1, X2, X3 and DT. Of course you are not getting a total across multiple op2 DTs since you are joining on the date i.e. op1.DT=op2.DT. You also still have your condition for tm excluding YEAR_START_DATE and YEAR_END_DATE, whereas in your SQL "Between" would include them.