Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
I've an ETL job which processes a raw file into a MySQL DB. While doing some analysis, I realized that I need data for all the days in a month and my data feed only has values for weekdays.
I need a way to embedd records for the missing days, with data from the last day. e.g. for Saturday and Sunday, I'd want to use data from Friday. For missing Wednesday, I can use data from Tuesday.
Visually this is what I am looking at:
| Day | Old Table | New Table |
| Monday | Record 1 | Record 1 |
| Tuesday | Record 2 | Record 2 |
| Wednesday | Record 3 | Record 3 |
| Thursday | Record 4 | Record 4 |
| Friday | Record 5 | Record 5 |
| Saturday | Record 5 | |
| Sunday | Record 5 | |
| Monday | Record 8 | Record 8 |
| Tuesday | Record 9 | Record 9 |
| Wednesday | Record 10 | Record 10 |
| Thursday | Record 11 | Record 11 |
| Friday | Record 12 | Record 12 |
| Saturday | Record 12 | |
| Sunday | Record 12 | |
| Monday | Record 15 | Record 15 |
Could someone recommend me a way?
My vague thoughts are converging around working with three datasets:
1) Temp data set with all the date values in it (say start of year till end of year)
2) Original Data which contains missing records (dtst 1)
3) New data, which has all the values (dtst 2)
I don't know how to but if I could loop through all the values in the temp dataset (dtst 1), pull value against the date from the original data and push it in the new dataset (dtst 2). If I don't identify a value in dtst 1, I use the last row fetched from the dtst 1 and push it in the dtst 2. At the end of the loop execution, I'll have data for all the dates in my dtst 2, which I can later use for my analysis.
Hi,
One possible simple solution is as below.
a) Read the data set containing date values for full year (you can use multiple methods to generate this data set. For simplicity, lets assume, its a file). This will be your main flow.
b) Use a tMap with left outer join and read the existing input file (dtst 1) as lookup. Join based on the date value and the output will be records with values for some dates and null for some other dates
c) Pass the data to tSortrow to arrange the records according to date.
d) Pass the data to tFlowtoIterate to send the records one at a time
e) Connect a tJavarow and read the value coming for each date and store in a context variable. In the next loop, the value for the next day will be cross checked whetehr it is null. If it is null, the value will be replaced with the value in context variable (from previous day).
f) Write the output of tJavarow to a tfileoutputdelimited in append mode
Please note that this is just one way of doing the process. If the answer has helped you, could you please mark the topic as solution provided? Kudos are also welcome.
Warm Regards,
Nikhil Thampi