Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
 
					
				
		
Hi,
I have an excel source with 10 columns, I need to read the column user_comment and try to find one or more keywords from a list, and input in a new column called comment_keywords. The keywords list can be any type. I have both excel and csv with comma separator key words. How would be the best approach to do that?
Thanks in advance!
 
					
				
		
Hi,
Could you please share some sample records for your scenario to get better clarity about your use case?
Please add the sample records as text or csv data instead of image so that we can copy and try to do some analysis.
Warm Regards,
Nikhil Thampi
Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved
 
					
				
		
OK, this was a fun problem to look at. I should point out that the method I am demonstrating is not the only way of achieving this, but it works. I am demonstrating without Excel or CSV files (just to make it easier for me to build), but you can add these in. The job that I have created looks like below...
Your Excel is represented by tFixedFlowInput_1 and your list of Strings is represented by tFixedFlowInput_2.
What I am doing is using a tMap with no joins between the Main input and the Lookup. This means that every Main row will be tested against every Lookup row. I then use some logic in tMap variables to identify which Main row is being processed and when the Main row changes, to wipe the list of matched Strings I have found. I also use a Routine (essentially a Java method) to assess whether each word appears in the String. The tMap looks like below....
The order of the tMap variables is very important. The variables are processed from top to bottom and they can keep their values between rows. Each of the variables is described below....
| Expression | Type | Nullable | Variable | Description | 
| Var.ThisText | String | Yes | LastText | This is set to the value of the ThisText variable. For the first row, it will be null as it will be set before ThisText is set. This is used to help identify when we have finished checking for all of the Lookup Words. | 
| row1.Text | String | No | ThisText | This is set by your incoming Text | 
| Var.LastText==null || Var.LastText.compareToIgnoreCase(Var.ThisText)!=0 | Boolean | No | NewText | This is a check to see whether we are now checking a new piece of text. If so, it tells us how to deal with the next variable. | 
| Var.NewText ? routines.TestRoutine.findWordsInString(row1.Text, row2.Words,",") : Var.WordsList + routines.TestRoutine.findWordsInString(row1.Text, row2.Words,",") | String | Yes | WordsList | This is used to create the list of words found. Pay attention to the Var.NextText check. If it is a new Text, it will overwrite the WordList variable. Otherwise it just adds to it. This uses a Routine that I will show below. | 
The routine that I have written for this covers several things. When I was initially testing this, I found that words such as "at" could be found in words like "what", "that", etc. So in order to make sure this doesn't happen, I wrote this to strip out all punctuation and text the words of the sentence one by one. The whole word. The routine method can be seen below. I added it to an existing "TestRoutine" Routine. You will need to create one and add the following code....
	public static String findWordsInString(String data, String word, String separator) {
		String[] wordsInData = data.replaceAll("[^a-zA-Z ]", "").toLowerCase().split("\\s+");
		String returnVal = "";
		
		for(int i=0; i<wordsInData.length; i++) {
			if(wordsInData[i].compareToIgnoreCase(word.toLowerCase())==0) {
				returnVal = word+separator;
				break;
			}
		}
		
		return returnVal;
	} 
Once you have this, the last step is to tidy up the results. If you have 4 Main rows and 10 Lookup rows, this job will return 40 rows. To prevent this I have added a tAggregateRow component which is grouped on "Text" and the function on WordsList is set to Last. This essentially returns the last row for each group of Text.
Let me know how you get on.
