Skip to main content
Announcements
Accelerate Your Success: Fuel your data and AI journey with the right services, delivered by our experts. Learn More
cancel
Showing results for 
Search instead for 
Did you mean: 
Anonymous
Not applicable

tmap the same table twice / data wharehousing insert recommendations

Good afternoon,
I've been searching everywhere for the past two days without success and decided to post my question here.
I'm building a data wharehouse for a course and I'm having some trouble.
The facts are flights which have a time, airplane, origin_airport, dest_airport, origin_climate and dest_climate dimensions.
I'm having trouble with mapping. I've been duplicating the csv input for the climate and airport data as I need it twice in the mapping but I'm not sure this is the correct way because in the tmap output I'm also using two tables for climate and airports. And after that I can't join them and remove the duplicates as they come from the same "input"(or similar word) and tUnite doesn't allow that.
It doesn't need to happen all in the same tMap, I just thought it would be simpler/quicker this way. The output of the tMap operation preferably would be a table of flights, a table of all airports(origin and dest with/without duplicates) and a table of all climate (same dest and origin together).
On another unrelated question, how should I proceed with populating the data warehouse? Should I populate all dimensions and add the facts later or try and insert each fact and if its dimension value doesn't exist, I add them first (to get the foreign keys) and then add the fact. I don't which way is better and also which components to use (one which allows the primary key to be the result of a sql query?)
Any help with these two problems would be great.
Thanks in advance.
Labels (2)
5 Replies
bkar81
Contributor III
Contributor III

Have you tried using tHashInput and tHashOutput components? By using these components, you can use the same input multiple times. Please check it out.
Its better to load the dimensions first and then load the fact as it is unnecessary to refer to the dimension and facts multiple times.
Anonymous
Not applicable
Author

bkar81, thanks for your help but I need a little more. Here's a screenshot of the current setup: http://imgur.com/7RUP3sO (not allowed live urls yet).
Both AirportsDest and AirportsOrigin are the same exact file because in the tMap I need one FK of the aiports to be in the Destination column and another one in the Origin column but I can't make it an Or operation (also wouldn't make sense since I need to know the airport for both origin and destination for each flight). Currently I'm mapping them to their own tables in the tMap (one for origin airports and another for destination airports) but like I said, I need them in the same table (without duplicates preferably).
Your solution for tHashMap didn't seem to work. I added a tHashOutput after both airport tables and gave them both the same name but in the tHashInput, when selecting the source, it would show me both tHashOuputs and force me to chose one of them, missing all the other data.
Thanks for your help so far.
Anonymous
Not applicable
Author

Hi,
It seems that you have only one input file and want to separate this input to fact and dimensions. If this is the case, your logic should work... Can you please upload the screenshot of tMap by clicking on Post Reply and upload your image file for tMap?.
Vaibhav
willm1
Creator
Creator

Good afternoon,
I've been searching everywhere for the past two days without success and decided to post my question here.
I'm building a data wharehouse for a course and I'm having some trouble.
The facts are flights which have a time, airplane, origin_airport, dest_airport, origin_climate and dest_climate dimensions.

I would approach the design of such a datamart a bit differently. All airports should be in one dimension, regardless of whether they're origin or destination. As you rightly said, you're struggling because an airport can be both at some point or for some flights. This dimension would then be role-playing on the Fact table. You'd have one field on the Fact that that connects to the AirportDim to get the Origin SK and another field on the Fact to get the Destination SK. In this way, you'd load all your airport dimension information in your single AirportDim, and reference it when you're loading the Fact table.
Of course you'd have an AirplaneDim and a TimeDim and DateDim (Dim - Dimension).
As for weather, depending on whether you have one or more observations (wind, temp, cloud cover etc...) you could choose to make this a degenerate dimension in the Fact table (meaning the weather metric is in the Fact table) or in a separate Dimension. It's best to seperate them in a dimension. Again, I'd create a WeatherDim that has for a business key the AirportName or ID, that way when I'm loading, I'm expiring the previous record for an airport and entering a new weather record for it. In effect, weather would be a Type 2 SCD and you can use the Talend component to do this. Similar to Origin and Dest Airport, I'd have 2 SK fields on the Fact table that point to the Origin weather and the DestinationWeather. If you're a fan of snowflaking, you could make the weather dim a snowflake of the AirportDim (I'm just pointing that out - snowflaking is typically frowned upon in the Kimball Methodology).
0683p000009MDoL.png
willm1
Creator
Creator

On another unrelated question, how should I proceed with populating the data warehouse? Should I populate all dimensions and add the facts later or try and insert each fact and if its dimension value doesn't exist, I add them first (to get the foreign keys) and then add the fact. I don't which way is better and also which components to use (one which allows the primary key to be the result of a sql query?)

It's best to load your dimensions first, then do your lookups to your Dimensions when loading your Facts, so you can get your surrogate keys (SKs)....