Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
I am having difficulty with the following use case. In tMap an inner join with an expression filter on a date range on the input side and Catch lookup inner join reject = true on the output side is not getting picked up as an inner join reject.
Can someone recommend a work-around or alternative solution?
The use case is trying capture Late Arriving Dimensions on an Slowly Changing Type-2 dimension table. This requires a date range check to return the correct dim_id at the time of the transaction.
Below is a simplified version of what I need to do.
Inputs:
trx.csv
trx_num |
trx_dim_key |
trx_date |
101 |
aaa |
2015-03-01 |
100 |
aaa |
2017-02-01 |
dim.csv
dim_id |
dim_key |
eff_date |
end_date |
group_name |
87 |
aaa |
2016-01-01 |
2016-12-31 |
domestic-1 |
88 |
aaa |
2017-01-01 |
2017-12-31 |
intl-a |
89 |
aaa |
2018-01-01 |
2018-12-31 |
domestic-2 |
I want the output file to contain each row from trx.csv where there is no row in dim.csv with a matching trx_dim_key = dim.dim_key AND the trx.trx_date is between dim.eff_date and dim.end_date.
Using the sample inputs above, the expected output from Catch lookup inner join reject is true should be:
trx_num |
trx_dim_key |
trx_date |
101 |
aaa |
2015-03-01 |
Instead, Iam getting this:
trx_num |
trx_dim_key |
trx_dt |
101 |
aaa |
2015-03-01 |
101 |
aaa |
2015-03-01 |
101 |
aaa |
2015-03-01 |
100 |
aaa |
2017-02-01 |
100 |
aaa |
2017-02-01 |
I have tried many ways to acomplish this in tMap without success. What I expected to work is the following:
See Screenshots attached.
tMap:
Left (input) side:
Main = trx.csv (trx)
Lookup = dim.csv (dim)
Match Model: All matches
Join Model: Inner Join
Expression Key |
Column |
trx.trx_dim_key |
dim_key |
Expression Filter:
TalendDate.compareDate(trx.trx_date,dim_lookup.eff_date) >= 0 &&
TalendDate.compareDate(trx.trx_date,dim_lookup.end_date) < 0
Middle (variables): empty
Right (output) side:
Catch output reject: false
Catch lookup inner join reject: true
Note that the Match Model is All matches, as the trx_date can be an historic transaction or a recent transaction, so a match model of first or unique will not work in this scenario.
The expected output using the sample inputs above is:
trx_num |
trx_dim_key |
trx_date |
101 |
aaa |
2015-03-01 |
It seems to be ignoring the input Expression Filter as part of the INNER JOIN and that is why I believe this is a bug.
To prove to myself that the input Expression Filter is considered part of the inner join, I modified the above test as follows:
Added another row to trx.csv with trx_dim_key = 'bbb' (which is not in dim.csv)
Removed the join criteria from Expr. Key / Column
Replaced the input Expr. filter with the join criteria: trx.trx_dim_key.equals(dim_lookup.dim_key)
If the input Expr Filter is not considered part of the inner join, then no rows should have been caught by the inner join reject, as there is no join criteria (except what is in the Expr. Filter). But, in this case Talend did consider the Expr. Filter in the inner join logic and rejects the non-matching rows, but when I used a more complex Expr. Filter with the date range logic, it does not reject the non-matching row.
Any thoughts are appreciated.
Thanks.
the results are correct for your current tmap settings. You have a one to many relationship for your join condition. You don't want this. So in tmap change the match model from "unique match" to "first match".
removed duplicate post.
removing duplicate post.
Hello bclstu,
I had some difficulties with dates in join conditions myself, similar to your question also with date ranges. I had tried to use the expression filter on the input side (lookup) as well and it had not worked as expected. (I suspect that this expression filter is applied only once when creating the lookup cache data and not for each main input row. I use "load once" for lookups, not "load at each row".)
So I have moved such expression filters or comparisons to the "var" panel in the middle. I filter their results on the output side on the right.
On the input side, keep the (inner) join on the dim key and the all matches setting. On the output side, do not use "inner join reject", instead filter according to the expression results from the "var" panel.
Hope this is helpful or gives you some ideas to try.
Best regards,
Thomas
Hi Thomas,
Thanks for the reply. I tried submitting the issue to Talend Support as a bug, but after about 1 week the response was that they do not think it is a bug, and, as you said, the input filter is applied to the results of the inner join. Talend's tMap documentation is ambiguous as to how the input filter is applied.
I have tried filtering on the output side, but no luck. I also tried playing around with the 'catch output reject' (true/false) setting, without success.
It's frustrating - if both input sources (trx and dim) are relational tables, it is a fairly basic sql query, something like this:
select * from trx where not EXISTS
(select 1 from dim where trx.trx_dim_key = dim.dim_key
and trx.trx_date between dim.eff_date and dim.end_date);
I did come up with a work-around but it is convoluted and and a drag on performance.
1. Capture the 'good' transactions (key match and in date range) in a temp file via the tMap and and output filter
2. Re-process the trx file and inner join with the saved temp file as the lookup, with capture inner join reject = true.
So the large trx file has to be processed twice. If you have many Type 2 Slowly Changing Dimensions, and need to capture Late Arriving Dimension members for them, this gets unwieldy quickly.
I have asked Talend support to provide an alternate solution. I will post back if I get a good solution.
Hi bclstu,
please see the attached job for an example of how I would solve it. Feel free to adapt it to your needs.
Best regards,
Thomas
Hi Thomas,
Thanks for putting that together. I appreciate it!
First glance, I'm thinking that I can get the unique set of trx_dim_key and trx_date rows from the date_not_ok output and run it through another tMap inner join against the row_ok output and capture inner join rejects. This along with the rejected_dim_key output provides the set of data needed. Still not a clean process (similar to my current work-around solution) . I'm still considering if it will perform better than my current work-around, which re-joins the date_not_ok output against the initial trx file.
Thanks again.
bclstu