Hi all,
Just for an example, consider the following data
I want to apply a LIKE operator on the NAME and CITY columns and an equality operator on the STATE and ZIP columns, so that I get the following output:
Monroe Township, NJ | Monroe Township | NJ | 08831 (i.e. the first occurrence only)
I've tried the tFilterRow component but don't know how to apply it to this requirement. Which component, steps or function should I use to get this result?
Thanks !
This sounds like a job for the tFuzzyMatch component, but only IF you cannot define some logic that would make this A LOT quicker and more accurate.
For example, you gave the following:
Monroe Township, NJ | Monroe Township | NJ |
Now, can you say that everything before the comma should match the second column? If so, just use String manipulation. It makes sense to consider applying some pre-processing rules like this before going down the tFuzzyMatch route.
The problem you have is that you are assuming the "world knowledge" of a human. Computers can't work like that (here is where my AI degree comes into play 🙂 ).
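As a rough illustration of that kind of pre-processing rule, here is a minimal sketch that takes everything before the first comma in NAME and compares it with CITY, ignoring case and surrounding spaces. The class and method names (NameCityCheck, beforeComma, nameMatchesCity) are made up for this example; in a Talend job the same logic could live in a routine or a tMap expression, assuming your data really does follow the "city, state" pattern.

```java
class NameCityCheck {
    // Returns the text before the first comma, trimmed.
    // If there is no comma, the whole (trimmed) value is returned.
    static String beforeComma(String name) {
        int comma = name.indexOf(',');
        String head = (comma >= 0) ? name.substring(0, comma) : name;
        return head.trim();
    }

    // True when the part of NAME before the comma equals CITY,
    // ignoring case and leading/trailing spaces. Nulls never match.
    static boolean nameMatchesCity(String name, String city) {
        if (name == null || city == null) {
            return false;
        }
        return beforeComma(name).equalsIgnoreCase(city.trim());
    }

    public static void main(String[] args) {
        System.out.println(nameMatchesCity("Monroe Township, NJ", "Monroe Township"));
        System.out.println(nameMatchesCity("Monroe Twp, NJ", "Monroe Township"));
    }
}
```

Note that an abbreviation such as "Monroe Twp" does not match here; a rule like this only helps when the variations in your data are predictable.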
Consider the numbers 1 and 11. They are not "like" each other to us or to a computer; they are very different. Now consider 1111111111111111111 and 11111111111111111111. To us (on first inspection) they look "like" each other, until we actually count the 1s. To a computer, they are massively different. Now if we change numbers to text, our brains automatically spot patterns. So the following two lines of text are seen as "the same":
Hello my name is Richard
Hello my name si Richard
We autocorrect (which is both good and bad); a computer won't. To a computer, each line is just a series of bits without any context. That is why "Like" is such a difficult task.
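To make that concrete, here is a small sketch of what a plain comparison sees for the two sentences above. The class and helper names (ExactCompare, differingPositions) are invented for this illustration only.

```java
class ExactCompare {
    // Counts the character positions at which the two strings differ.
    // Any difference in length is added to the count as well.
    static int differingPositions(String a, String b) {
        int shared = Math.min(a.length(), b.length());
        int diff = Math.abs(a.length() - b.length());
        for (int i = 0; i < shared; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                diff++;
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        String first = "Hello my name is Richard";
        String second = "Hello my name si Richard";
        System.out.println(first.equals(second));              // false
        System.out.println(differingPositions(first, second)); // 2
    }
}
```

Two swapped characters are enough for equals to report a mismatch; there is no built-in notion of "close enough".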
It has been solved by many mechanisms, but they are not always very efficient or easy to implement in Data Integration. What I was suggesting was that you look for rules to apply. For example, you could make the Strings uppercase, remove leading and trailing spaces, etc. Once you have done that, you *might* be able to use Java String functionality like "indexOf" (https://docs.oracle.com/javase/7/docs/api/java/lang/String.html).
However, if you cannot apply these rules, you may have to use Fuzzy Matching. This is a clever mechanism, but it requires a lot of work to get it perfect...if you can get it "perfect" at all.
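A minimal sketch of that idea, assuming "LIKE" here means "contains, ignoring case and surrounding spaces". The class and method names (LikeMatch, normalize, like) are hypothetical; only String.trim, String.toUpperCase and String.indexOf come from the Java API.

```java
import java.util.Locale;

class LikeMatch {
    // Normalises a value: null becomes "", spaces are trimmed, text is uppercased.
    static String normalize(String s) {
        return s == null ? "" : s.trim().toUpperCase(Locale.ROOT);
    }

    // Rough equivalent of SQL "haystack LIKE '%needle%'" after normalisation.
    // An empty needle is treated as "no match" rather than "matches everything".
    static boolean like(String haystack, String needle) {
        String n = normalize(needle);
        return !n.isEmpty() && normalize(haystack).indexOf(n) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(like(" monroe township, nj ", "Monroe Township"));
        System.out.println(like("Monroe Twp, NJ", "Monroe Township"));
    }
}
```

The equality checks on STATE and ZIP can then be plain equals calls on the normalised values.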
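To give a feel for what fuzzy matching involves, here is a sketch of Levenshtein edit distance, one common way to measure how far apart two strings are. This is a generic textbook version written for illustration (the EditDistance class is hypothetical), not the code any particular Talend component runs.

```java
class EditDistance {
    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("Hello my name is Richard", "Hello my name si Richard")); // 2
        System.out.println(levenshtein("Monroe Township", "Monroe Twp")); // 5
    }
}
```

The hard part is not computing the distance but deciding what threshold counts as a match: a distance of 2 is a typo in a long sentence, yet it can be a completely different value in a short code.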