Talend and Amazon Comprehend Integration

TalendSolutionExpert — Tue, 23 Jan 2024 02:35:30 GMT

This article shows how Talend integrates with Amazon Comprehend, a natural language processing (NLP) service from AWS. It is an in-depth guide to using Talend with the dominant language detection and sentiment analysis capabilities of Amazon Comprehend.

The article is a continuation of the Talend AWS Machine Learning integration series. You can read the previous article, Introduction to Talend and Amazon Real-Time Machine Learning, in the Talend Community Knowledge Base (KB).

Environment for Talend and AWS

This article was written using Talend 7.1. However, you can configure earlier versions of Talend with the logic provided to integrate Amazon Comprehend.

Currently, Amazon Comprehend is only available in selected AWS regions. Talend recommends verifying the availability of the service from the AWS Global Infrastructure, Region Table, before creating the overall application architecture.

Practical use cases

This section discusses practical use cases in which Talend, integrated with the Amazon Comprehend service, detects the dominant language of incoming data and performs sentiment analysis on input data.

Automatic multilingual support application

For multinational corporations, or companies working in an operational environment where end users prefer to communicate in their native language, a multilingual support application is ideal. Talend and Amazon Comprehend help to categorize support cases based on the dominant language present in the customer requests.

The diagram above describes the stages of the overall flow. Talend simplifies the application with its graphical application design interface. The stages involved in the flow are:

End users communicate their queries and concerns through a web site in the language of their choice. In the example, queries are in the English, French, German, and Italian languages through various web servers. In the absence of language identification, web servers are usually mapped based on their IP addresses. So, if an English-speaking person would like to raise a ticket from Paris in France, it would typically go to a French Support system.
In the current layout, the data from the web servers is transmitted to various Producer queues where Kafka handles the queue systems.
Talend has in-built components to read and fetch the data from the Kafka queues.
Talend performs the request call to the Amazon Comprehend dominant language detection service by transferring the input text.
Talend receives the response from the Amazon Comprehend language detection service in JSON format.
Talend parses the JSON and identifies the dominant language. If the Amazon Comprehend service has sent multiple languages in the results set, Talend parses the JSON, extracts the JSON values for each language, sorts the data based on the score in descending fashion, and selects the highest-ranking language among the various scores available in the results set. Talend transmits the data to the corresponding consumer Kafka queues based on the dominant language criteria.
Support staff for the corresponding language attend to the customer request, which reaches them automatically, and provide feedback and resolution in the customer's chosen language.
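
The selection and routing stages above can be sketched in plain Java. This is a minimal illustration, not the Job attached to this article: the class name `LanguageRouter`, the topic names, and the fallback topic are all hypothetical, and the language scores are assumed to be already parsed from the response.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LanguageRouter {

    /** One language entry from a detection response: code plus confidence score. */
    static final class LanguageScore {
        final String languageCode;
        final double score;

        LanguageScore(String languageCode, double score) {
            this.languageCode = languageCode;
            this.score = score;
        }
    }

    // Hypothetical topic names; replace them with the consumer queues of your own setup.
    private static final Map<String, String> TOPIC_BY_LANGUAGE = new HashMap<>();
    static {
        TOPIC_BY_LANGUAGE.put("en", "support-english");
        TOPIC_BY_LANGUAGE.put("fr", "support-french");
        TOPIC_BY_LANGUAGE.put("de", "support-german");
        TOPIC_BY_LANGUAGE.put("it", "support-italian");
    }
    private static final String FALLBACK_TOPIC = "support-unrouted";

    /** Returns the entry with the highest score, or null for an empty list. */
    static LanguageScore dominant(List<LanguageScore> candidates) {
        LanguageScore best = null;
        for (LanguageScore candidate : candidates) {
            if (best == null || candidate.score > best.score) {
                best = candidate;
            }
        }
        return best;
    }

    /** Maps the dominant language to a consumer topic, falling back when unknown. */
    static String topicFor(List<LanguageScore> candidates) {
        LanguageScore best = dominant(candidates);
        if (best == null) {
            return FALLBACK_TOPIC;
        }
        String topic = TOPIC_BY_LANGUAGE.get(best.languageCode);
        return topic != null ? topic : FALLBACK_TOPIC;
    }

    public static void main(String[] args) {
        // A mixed English/Spanish request: English has the higher score.
        List<LanguageScore> mixed = Arrays.asList(
                new LanguageScore("es", 0.31),
                new LanguageScore("en", 0.66));
        System.out.println(topicFor(mixed)); // prints "support-english"
    }
}
```

A fallback topic is one way to handle languages that have no dedicated support queue; a real deployment might instead route such requests to a default team.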

Real-time sentiment analysis dashboard

Analyzing customer sentiment is crucial for a company in today’s highly competitive corporate world. Talend, Amazon Comprehend, and Snowflake help to perform real-time sentiment analysis on customer data feeds generated by multiple source systems.

The diagram above depicts the various stages involved in a customer sentiment analysis dashboard. The various steps involved in the flow are:

Customer comments are captured by the company web servers or by feeds from third-party websites.
The incoming data is transmitted to various Producer queues maintained by Kafka.
Talend reads the input data from Kafka using native Kafka components.
Talend processes the inbound data from the Kafka queues and transmits it to Amazon Comprehend as a request for sentiment analysis.
Talend receives the response from Amazon Comprehend sentiment analysis.
Talend parses the JSON response from Amazon Comprehend and evaluates the overall sentiment and the individual scores for positive, negative, neutral, and mixed. The parsed data, along with the input text, is transmitted from Talend to the Snowflake cloud data warehouse using native components.
Once the data is loaded to Snowflake, real-time dashboards showing overall customer sentiments are generated from Snowflake.

Note: The above scenarios are simple illustrations of data flow based solely on language detection and sentiment detection of the input data. Talend recommends applying additional data privacy rules, such as those required by GDPR, on top of the current layout using Talend's easy-to-use graphical interface.

Configure a Talend routine for Amazon Comprehend

Create a Talend user routine by performing the following steps. Both dominant language detection and sentiment analysis are implemented in the same Talend routine as separate Java functions.

Connect to Talend Studio, and create a new routine called AWS_Comprehend that connects to the Amazon Comprehend service to transmit the incoming input text and collect the response back from the Amazon Comprehend service.

Insert the following code into the Talend routine:

package routines;

//Amazon SDK 1.11.438

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.services.comprehend.AmazonComprehend;
import com.amazonaws.services.comprehend.AmazonComprehendClientBuilder;
import com.amazonaws.services.comprehend.model.DetectSentimentRequest;
import com.amazonaws.services.comprehend.model.DetectSentimentResult;
import com.amazonaws.services.comprehend.model.DetectDominantLanguageRequest;
import com.amazonaws.services.comprehend.model.DetectDominantLanguageResult;

import org.apache.commons.logging.LogFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.annotation.JsonView;
import org.apache.http.protocol.HttpRequestExecutor;
import org.apache.http.client.HttpClient;
import org.apache.http.conn.DnsResolver;
import org.joda.time.format.DateTimeFormat;

public class AWS_Comprehend {

    public static String Dominant_Language(String AWS_Access_Key, String AWS_Secret_Key, String AWS_regionName, String input_text) {
        BasicAWSCredentials awsCreds = new BasicAWSCredentials(AWS_Access_Key, AWS_Secret_Key);

        AmazonComprehend comprehendClient = AmazonComprehendClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
                .withRegion(AWS_regionName)
                .build();

        // Call detectDominantLanguage API
        DetectDominantLanguageRequest detectDominantLanguageRequest = new DetectDominantLanguageRequest().withText(input_text);
        DetectDominantLanguageResult detectDominantLanguageResult = comprehendClient.detectDominantLanguage(detectDominantLanguageRequest);

        String response_JSON = detectDominantLanguageResult.getLanguages().toString();
        return response_JSON;
    }

    public static String Sentiment_Detection(String AWS_Access_Key, String AWS_Secret_Key, String AWS_regionName, String input_text, String language_code) {
        BasicAWSCredentials awsCreds = new BasicAWSCredentials(AWS_Access_Key, AWS_Secret_Key);

        AmazonComprehend comprehendClient = AmazonComprehendClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
                .withRegion(AWS_regionName)
                .build();

        // Call Sentiment Detection API
        DetectSentimentRequest detectSentimentRequest = new DetectSentimentRequest().withText(input_text).withLanguageCode(language_code);
        String response_JSON = comprehendClient.detectSentiment(detectSentimentRequest).toString();
        return response_JSON;
    }

}

The Talend routine needs additional JAR files. Install the following JAR files in the routine:
- AWS SDK 1.11.438
- apache.commons.logging
- Jackson core 2.9.7
- Jackson Annotations 2.9.4
- Jackson Databind 2.9.7
- httpcore 4.4.10
- httpclient 4.5.6
- joda-time 2.9.4
Add additional Java libraries to the routine by selecting Edit Routine Libraries.
Select New in the pop-up window to add libraries to the routine.
Select Artifact repository(local m2/nexus), then select Install a new module.
Select the JAR file from the local drive.
Select Detect the module install status to verify whether the module is already installed.
If the JAR file is not installed, the status changes from the error flag to Install a module followed by JAR file name. Click OK to load the JAR file to the routine. Once all the JAR files are installed, click Finish.

The setup activities are complete. The next section shows sample Jobs for the functionalities described in the practical use cases.

For ease of understanding, and to keep the focus on the integration between Talend and Amazon Comprehend, the sample Jobs use text files for input and a tLogRow component for output.

Talend sample Job for dominant language detection

The sample Job, Language_Identifier.zip, attached to this article, reads the data from the input file and transmits each message to the Amazon Comprehend service. The response from the Amazon Comprehend service, in JSON format, is parsed and sorted, and the row with the highest dominant language score for each inbound text record is printed to the console.

The configuration details are as follows:

Create a new Standard Job called Language_Identifier.
The first step in associating the routine with a Talend Job is to add the routine to the newly created Job by selecting Setup routine dependencies.
Add the AWS_Comprehend routine to the User routines section of the pop-up screen, to link the newly created routine to the Talend Job.

Note: You must perform this step for both of the Jobs mentioned in this article.
Review the overall Job flow, shown in the following diagram.
Configure the context variables, as shown below:
The input file for the Job, detect_language_input.txt, attached to this article, contains the phrase "I am very happy today" translated into multiple languages using Google Translate. The last line of the file intentionally mixes English and Spanish words to show how the scoring pattern differs when the input data contains multiple languages.
Configure the tFileInputDelimited component, as shown below:
Use the tMap component to call the Amazon Comprehend service through the Talend routine. Pass the parameters in the order defined by the routine function, as in the following tMap expression:
```
AWS_Comprehend.Dominant_Language(context.AWS_Access_Key, context.AWS_Secret_Key, context.AWS_regionName, row1.input_text)
```
Configure the tMap component layout, as shown below:
The output from the Amazon Comprehend call is a string in JSON format. If there are multiple languages present in input text, the output JSON has a score for each associated language. The language code and corresponding scores are parsed to the variables, as shown below. Leave the columns id and input_text empty because you are going to map them directly from the input flow.
Notice that the score is converted to Double in this stage.
Sort the output data according to the id (in ascending order) and the score (in descending order) columns.
Using a tUniqRow component, pick the first record for each id, which is the record with the maximum score.
The output data from the previous stage has code values for languages. The mapping of code values to the corresponding language names from the Amazon site is in the language_ref_code.txt file, attached to this article. Use this file as a lookup before printing the output results.
The inbound data is joined with the reference file, with Join Model set to Inner Join. The data is passed to the tLogRow component to print the output in the console.
Review the dominant language and the corresponding score for each input text. Note that the score of the last row is different from the other rows because the input sentence is a mix of English and Spanish.
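
The sort, unique, and lookup stages of this Job can be approximated in plain Java as follows. This is only a sketch of the logic, not the Job itself: the class `TopLanguagePerRecord`, the `Row` type, and the sample values are hypothetical, and the rows are assumed to be already parsed into an id, a language code, and a Double score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopLanguagePerRecord {

    /** One parsed row: input record id, detected language code, and its score. */
    static final class Row {
        final int id;
        final String languageCode;
        final double score;

        Row(int id, String languageCode, double score) {
            this.id = id;
            this.languageCode = languageCode;
            this.score = score;
        }
    }

    /** Sorts by id ascending, then score descending, and keeps the first row per id. */
    static List<Row> topPerId(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingInt((Row r) -> r.id)
                .thenComparing(Comparator.comparingDouble((Row r) -> r.score).reversed()));
        Map<Integer, Row> firstPerId = new LinkedHashMap<>();
        for (Row row : sorted) {
            firstPerId.putIfAbsent(row.id, row); // later rows for the same id are dropped
        }
        return new ArrayList<>(firstPerId.values());
    }

    /** Inner join: rows whose code is missing from the lookup are dropped. */
    static List<String> joinWithNames(List<Row> rows, Map<String, String> namesByCode) {
        List<String> joined = new ArrayList<>();
        for (Row row : rows) {
            String name = namesByCode.get(row.languageCode);
            if (name != null) {
                joined.add(row.id + "|" + name + "|" + row.score);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(2, "es", 0.31),
                new Row(1, "en", 0.99),
                new Row(2, "en", 0.66));
        Map<String, String> names = new HashMap<>();
        names.put("en", "English");
        names.put("es", "Spanish");
        for (String line : joinWithNames(topPerId(rows), names)) {
            System.out.println(line);
        }
    }
}
```

Because the join is an inner join, a record whose language code is absent from the reference file disappears from the output, so it is worth keeping the lookup file in step with the languages the service can return.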

In practical scenarios, the output at this stage can be passed to downstream systems by channeling it through different data flows based on the language of each sentence.

Talend sample Job for sentiment analysis

The sample Job, Sentiment_Analysis.zip, attached to this article, extracts the input text from the CSV file and performs a call to the Amazon Comprehend sentiment analysis service. The output from the service is parsed and displayed in the console.

The configuration details are as follows:

Create a new Standard Job called Sentiment_Analysis. Attach the user routine, AWS_Comprehend, to the Job as shown in the previous example. The following diagram shows the overall Job flow:
Each record in the sample text file, sentiment_analysis_input.txt, attached to this article, has an id and an input_text, and the records express different sentiments.
Use a tFileInputDelimited component to configure the input file, as shown below:
Talend calls the Sentiment_Detection function of the AWS_Comprehend routine in the tMap component, as shown below. This transfers the data from Talend to Amazon Comprehend and sends the responses back to the sentiment_results field.
```
AWS_Comprehend.Sentiment_Detection(context.AWS_Access_Key, context.AWS_Secret_Key, context.AWS_regionName, row1.input_text,context.language_code)
```
Configure the tMap component, as shown below.
The output data from the tMap component contains the sentiment analysis results from Amazon Comprehend, but the results are in JSON format. Use the tExtractJSONFields component to parse the overall sentiment of the text and the positive, negative, neutral, and mixed sentiment scores, along with the original input fields, id and input_text.
Notice that the fields holding sentiment scores are converted to the Double data type for further analysis.
Review the data printed in the output console. The overall_sentiment column provides the sentiment of the input text, and the four columns after it provide the individual scores for each sentiment.
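
As an illustration of the further analysis that Double scores make possible, the following sketch converts the score strings and compares them. The class `SentimentScores`, its method names, and the 0.8 review threshold are hypothetical choices for this example and are not part of the attached Job.

```java
class SentimentScores {
    final String overallSentiment;
    final double positive;
    final double negative;
    final double neutral;
    final double mixed;

    SentimentScores(String overallSentiment, String positive, String negative,
                    String neutral, String mixed) {
        this.overallSentiment = overallSentiment;
        // Double.parseDouble also accepts scientific notation such as "9.1E-4".
        this.positive = Double.parseDouble(positive);
        this.negative = Double.parseDouble(negative);
        this.neutral = Double.parseDouble(neutral);
        this.mixed = Double.parseDouble(mixed);
    }

    /** Largest of the four individual scores. */
    double highestScore() {
        return Math.max(Math.max(positive, negative), Math.max(neutral, mixed));
    }

    /** Label of the highest individual score; ties resolve in the order checked. */
    String highestScoringLabel() {
        String label = "POSITIVE";
        double best = positive;
        if (negative > best) { label = "NEGATIVE"; best = negative; }
        if (neutral > best) { label = "NEUTRAL"; best = neutral; }
        if (mixed > best) { label = "MIXED"; }
        return label;
    }

    /** Flags records whose strongest score is below the given threshold. */
    boolean needsReview(double threshold) {
        return highestScore() < threshold;
    }

    public static void main(String[] args) {
        SentimentScores row = new SentimentScores("POSITIVE", "0.93", "9.1E-4", "0.06", "0.009");
        System.out.println(row.highestScoringLabel() + " needsReview=" + row.needsReview(0.8));
    }
}
```

A threshold check of this kind is one way to separate clear-cut records from ambiguous ones before they reach a dashboard; the right threshold depends on the data.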

Threshold limits for data processing

At the time of this writing, Amazon Comprehend can handle 5,000 UTF-8 characters per document. Talend recommends that you always verify the latest limits on the AWS Documentation, Guidelines and Limits page, and that you provide a minimum of 20 characters per input text for best results from the Amazon Comprehend service.
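
A simple pre-check before calling the service might look like the sketch below. The class `ComprehendInputGuard` and its constants are hypothetical. The size check counts UTF-8 bytes rather than characters, which is a deliberately conservative assumption: any text that fits in 5,000 bytes also fits in 5,000 characters. Confirm the current limits in the AWS documentation before relying on these numbers.

```java
import java.nio.charset.StandardCharsets;

class ComprehendInputGuard {

    // Values quoted in this article; verify them against the current AWS limits.
    static final int MAX_UTF8_BYTES = 5000;
    static final int MIN_RECOMMENDED_CHARS = 20;

    /** True when the UTF-8 encoding of the text does not exceed the size limit. */
    static boolean withinSizeLimit(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length <= MAX_UTF8_BYTES;
    }

    /** True when the text has at least the recommended number of characters. */
    static boolean meetsRecommendedLength(String text) {
        return text.codePointCount(0, text.length()) >= MIN_RECOMMENDED_CHARS;
    }

    /** Combines both checks; null input is rejected. */
    static boolean isAcceptable(String text) {
        return text != null && withinSizeLimit(text) && meetsRecommendedLength(text);
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("I am very happy today")); // prints "true"
        System.out.println(isAcceptable("Hello"));                 // prints "false"
    }
}
```

In a Job, a check like this could run in a tMap filter ahead of the routine call so that oversized or very short records are diverted instead of sent to the service.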

Amazon Comprehend dominant language detection is currently available for 100 languages, and Amazon Comprehend sentiment analysis is available for English, French, German, Spanish, Italian, and Portuguese languages. Refer to the AWS Documentation, Languages Supported in Amazon Comprehend page for the latest list.

Conclusion

This article describes use cases for integrating Talend with the Amazon Comprehend service. In real-time scenarios, input data typically arrives through web services or queues rather than through the input files used in the sample Jobs.

Citations

AWS documentation, Amazon Comprehend
