PhilHibbs
Creator II

Standardization of equivalent email addresses

Is there a component or library to standardize email addresses according to RFC 5322, e.g. by stripping out comments? For instance, all of these email addresses should be considered the same address, and should be standardized to "bob@domain.com":

 

bob@domain.com

bob(1)@domain.com

(robert)bob@domain.com

b((o)o)ob@domain.com

b(o)o(o)b@domain.com

 

Nested comments mean you can't do this with a single simple regular expression; it needs a proper recursive descent parser. Unless regex can do that, in which case it's beyond my fu.
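
For illustration, nested comments can also be removed without a full parser by tracking parenthesis depth in a single pass. The following is a minimal sketch in plain Java, not a complete RFC 5322 implementation: the class and method names are hypothetical, and it assumes that the address contains no quoted strings and no backslash-escaped parentheses, both of which the RFC allows.

```java
public class CommentStripper {

    /**
     * Removes parenthesised comments, including nested ones, in one pass.
     * Assumption: no quoted strings and no backslash escapes in the input.
     */
    public static String stripComments(String address) {
        StringBuilder out = new StringBuilder(address.length());
        int depth = 0; // how many comments are currently open
        for (int i = 0; i < address.length(); i++) {
            char c = address.charAt(i);
            if (c == '(') {
                depth++;                 // entering a (possibly nested) comment
            } else if (c == ')' && depth > 0) {
                depth--;                 // leaving the innermost open comment
            } else if (depth == 0) {
                out.append(c);           // keep characters outside any comment
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] samples = {
            "bob@domain.com",
            "bob(1)@domain.com",
            "(robert)bob@domain.com",
            "b((o)o)ob@domain.com",
            "b(o)o(o)b@domain.com"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + stripComments(s));
        }
    }
}
```

Note that with this sketch an unclosed "(" causes the rest of the string to be dropped, so malformed addresses would need separate handling.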

5 Replies
Anonymous
Not applicable

Hi,

 

    Could you please use the tMatchGroup component to group email addresses based on your matching rules?


 

 

There are multiple matching algorithms available in this component or you can even create an algorithm of your choice.


 

 

The choice of algorithm depends on your use case, and I would suggest that you check the results from this component for each algorithm to familiarize yourself with them.

 

Warm Regards,

 

Nikhil Thampi

PhilHibbs
Creator II
Author

I can't see anything there that will deal with nested parentheses.

Anonymous
Not applicable

Hi,

 

    You can remove the unnecessary nested parentheses with a replace function in tMap. Since a name will never contain them, you can safely remove them and then apply the matching algorithms.

 

   If the answer has helped you, could you please mark the topic as resolved? Kudos are also welcome 🙂

 

Warm Regards,

 

Nikhil Thampi

PhilHibbs
Creator II
Author

OK, it looks like my only option is to code it manually in Java. I suppose I could write a loop that keeps applying a regex such as \([^\(]*?\), which matches an opening parenthesis followed by the next closing parenthesis with no other opening parenthesis in between, until no match is found.
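
A minimal sketch of that loop in plain Java might look like the following. The class and method names are hypothetical, and it assumes the address contains no quoted strings or backslash-escaped parentheses. Each pass removes the innermost comments, so one more level of nesting is peeled away per iteration.

```java
import java.util.regex.Pattern;

public class RegexCommentLoop {

    // Innermost comment: "(" followed by the nearest ")" with no "(" in between.
    private static final Pattern INNERMOST = Pattern.compile("\\([^\\(]*?\\)");

    /** Repeatedly deletes innermost comments until the string stops changing. */
    public static String stripComments(String address) {
        String previous;
        String current = address;
        do {
            previous = current;
            current = INNERMOST.matcher(previous).replaceAll("");
        } while (!current.equals(previous)); // stop once no match was removed
        return current;
    }

    public static void main(String[] args) {
        // First pass removes "(o)", second pass removes the remaining "(o)".
        System.out.println(stripComments("b((o)o)ob@domain.com"));
    }
}
```

If this were placed in a Talend routine, a tJavaRow could presumably call it along the lines of output_row.email = RegexCommentLoop.stripComments(input_row.email), where the column name email is an assumption.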

 

Is there a Talend component that will apply an expression in a loop, or do I just hard-code it in a tJavaRow?

Anonymous
Not applicable

Hi,

 

    Since you have a lot of data-related issues, the easiest way to handle them will be in a tJavaRow.

 

Warm Regards,

 

Nikhil Thampi