Talend and Amazon Transcribe Integration

TalendSolutionExpert · Jan 22, 2024 9:35:30 PM

Overview

This article depicts the integration of Talend with Amazon Transcribe, an Automatic Speech Recognition (ASR) service that helps customers create speech-to-text capability in their overall application flow.

The article is a continuation of the Talend Amazon Web Service Machine Learning integration series. You can read the previous articles, Introduction to Talend and Amazon Real-Time Machine Learning, Talend and Amazon Comprehend Integration, Talend and Amazon Translate Integration, Talend and Amazon Rekognition Integration, and Talend and Amazon Polly Integration in the Talend Community Knowledge Base.

Environment for Talend and AWS

This article was written using Talend 7.1. However, you can apply the logic provided in this article to earlier versions of Talend.

Currently, Amazon Transcribe is only available in selected AWS regions. Talend recommends verifying the availability of the service from the AWS Global Infrastructure, Region Table, before creating the overall application architecture.

Talend recommends reviewing the Amazon Transcribe list of supported languages on the Transcribing Streaming Audio page.

Practical use case

This section discusses a practical use case where Talend can help in the audio to text conversion of customer interactions by integrating with the Amazon Transcribe service.

Customer audio interaction conversion application

In today's era, customers are more eager to have audio conversations and to experience live customer support. This tendency results in more documentation for call center executives and Interactive Voice Response (IVR) systems since they have to document the audio interaction. Manual documentation of audio interactions is costly and is typically only implemented by companies with a sufficient workforce.

Amazon Transcribe helps to increase the throughput and quality of voice to text conversion of customer audio interactions without any manual interventions. It helps companies of all sizes to migrate to detailed documentation of voice calls.

The diagram above illustrates the various stages in the overall flow, and how Talend helps to simplify the use case with its signature graphical application design interface and data orchestration capabilities. The stages in the flow are:

Audio channels capture the interaction between Customers and IVR systems or Customer service executives, and the interaction is saved as audio files for downstream processing.
Talend extracts the audio files from the landing area.
Talend transfers the audio files, using its Amazon S3 integration components, to Amazon S3 for storage.
Talend fetches the Amazon S3 URL and uses it as a parameter in the Amazon Transcribe service.
Talend performs the call request, to Amazon Transcribe, to process the audio files. The audio file processing is done asynchronously within Amazon Transcribe. So, Talend performs the Job completion check at regular intervals, and once the Job is complete, collects the response from Amazon Transcribe.
The response data from Amazon Transcribe is stored in Amazon S3. Talend extracts the JSON response files to the JobServer.
Talend parses the JSON response files using native parsing components.
The data fetched from the JSON files is transferred to various downstream applications like data warehouses or Big Data systems using specialized Talend Palette components.

Configure a Talend routine for Amazon Transcribe

Create a Talend user routine by performing the following steps.

Connect to Talend Studio, and create a new routine called AWS_Transcribe that connects to the Amazon Transcribe service. The routine contains two functions: one starts a transcription job, and the other checks the status of transcription jobs.

Insert the following code into the Talend routine:

package routines;

//Amazon SDK 1.11.438

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.services.transcribe.AmazonTranscribe;
import com.amazonaws.services.transcribe.AmazonTranscribeClient;
import com.amazonaws.services.transcribe.AmazonTranscribeClientBuilder;
import com.amazonaws.services.transcribe.model.StartTranscriptionJobRequest;
import com.amazonaws.services.transcribe.model.StartTranscriptionJobResult;
import com.amazonaws.services.transcribe.model.ListTranscriptionJobsRequest;
import com.amazonaws.services.transcribe.model.ListTranscriptionJobsResult;
import com.amazonaws.services.transcribe.model.Media;

import org.apache.commons.logging.LogFactory;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.annotation.JsonView;

import org.apache.http.protocol.HttpRequestExecutor;
import org.apache.http.client.HttpClient;
import org.apache.http.conn.DnsResolver;
import org.joda.time.format.DateTimeFormat;


public class AWS_Transcribe {

	public static String StartTranscriptionJob(String AWS_Access_Key,String AWS_Secret_Key, String AWS_regionName, String input_s3_bucket, String input_file,String lang_code, String media_format, String output_s3_bucket, String TranscriptionJobName) 
	{

	// AWS Connection
		
	BasicAWSCredentials awsCreds = new BasicAWSCredentials(AWS_Access_Key,AWS_Secret_Key);
	AmazonTranscribe transcribe = AmazonTranscribeClientBuilder.standard().withCredentials(new AWSStaticCredentialsProvider(awsCreds)).withRegion(AWS_regionName).build();

	String media= "https://s3-"+AWS_regionName+".amazonaws.com/"+input_s3_bucket+"/"+input_file;
	Media input_media = new Media().withMediaFileUri(media);
	
	// Amazon Transcribe: start the transcription job

	StartTranscriptionJobRequest request = new StartTranscriptionJobRequest()
	                                      .withLanguageCode(lang_code)
	                                      .withMediaFormat(media_format)
	                                      .withOutputBucketName(output_s3_bucket)
	                                      .withTranscriptionJobName(TranscriptionJobName)
	                                      .withMedia(input_media);
		                
	StartTranscriptionJobResult result  = transcribe.startTranscriptionJob(request);

	String response_text =result.toString();
	return response_text;
		
	}
	
	public static String ListTranscriptionJob(String AWS_Access_Key,String AWS_Secret_Key, String AWS_regionName, String input_jobname) 
	{

	// AWS Connection
		
	BasicAWSCredentials awsCreds = new BasicAWSCredentials(AWS_Access_Key,AWS_Secret_Key);
	AmazonTranscribe transcribe = AmazonTranscribeClientBuilder.standard().withCredentials(new AWSStaticCredentialsProvider(awsCreds)).withRegion(AWS_regionName).build();
	
	// Amazon Transcribe: list transcription jobs whose names contain input_jobname

	ListTranscriptionJobsRequest request = new ListTranscriptionJobsRequest().withJobNameContains(input_jobname);
                                    		                
	ListTranscriptionJobsResult result  = transcribe.listTranscriptionJobs(request);

	String response_text =result.toString();
	return response_text;
	
	}
		
}

The Talend routine needs additional JAR files. Install the following JAR files in the routine:
- AWS SDK 1.11.438
- apache.commons.logging 1.2.0
- Jackson core 2.9.7
- Jackson Annotations 2.9.0
- Jackson Databind 2.9.7
- httpcore 4.4.10
- httpclient 4.5.6
- joda-time 2.9.4
For more information on how to add Java libraries to a routine, see the Talend and Amazon Comprehend Integration article of the series.

The setup activities for the routine are complete. The next section shows sample Jobs for the functionalities described in the practical use case.

For ease of understanding, and to keep the focus on the integration between Talend and Amazon Transcribe, the sample Job uses multiple audio files as input and a tLogRow component as output.

Talend sample Job for Amazon Transcribe

The message.mp3 and newname.mp3 files, attached to this article, act as input data files for the sample Job. The data from the input files is transmitted to Amazon S3 and from there to the Amazon Transcribe service. The response is captured (in JSON format) and sent back to Amazon S3. Then the response files are imported to the JobServer, parsed, and the corresponding output text is published in the console.

The configuration details are as follows:

Create a new Standard Job called AWS_Transcribe_sample_job, or use the sample Job, AWS_Transcribe_sample_job.zip, attached to this article.
The first stage in associating the routine to a Talend Job is to add the routines to the newly created Job, by selecting Setup routine dependencies.
Add the AWS_Transcribe routine to the User routines section of the pop-up screen, to link the newly created routine to the Talend Job.
Review the overall Job flow, shown in the following diagram:
Configure the context variables, as shown below:
In the PreJob section, fill in the context variables with your S3 connection parameters, then choose the AWS region from the Region pull-down menu.
Configure the tFileList and tFileDelete components to clear any existing files from the JobServer output directory, where JSON files are stored.
In the main Job, configure the tFileList and tS3Put components to transfer the input files to S3.
Using the tS3List (lists the files on Amazon S3) component, send the S3 object details to start the Transcribe Job.

The output from the tS3List component is passed to the tRowGenerator component, which captures the CURRENT_KEY from the tS3List component.

The object name, without a file extension, is parsed, as shown below:
```
Obj_name_ip.s3_object.substring(0,Obj_name_ip.s3_object.indexOf(".")) 
```
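The expression above cuts the key at the first dot and assumes that every key contains one. For a key without a dot, indexOf returns -1 and substring fails. The following is a minimal sketch of a more defensive variant. The class ObjectNames and the method stripExtension are hypothetical names that are not part of the sample Job; the logic could equally live in a user routine.

```java
// Hypothetical helper (not part of the original Job): removes the file
// extension from an S3 object key such as "message.mp3".
final class ObjectNames {

    static String stripExtension(String objectKey) {
        int lastSlash = objectKey.lastIndexOf('/');
        int lastDot = objectKey.lastIndexOf('.');
        // No dot in the final path segment, or only a leading dot:
        // there is no extension to strip.
        if (lastDot <= lastSlash + 1) {
            return objectKey;
        }
        return objectKey.substring(0, lastDot);
    }

    public static void main(String[] args) {
        System.out.println(stripExtension("message.mp3"));      // message
        System.out.println(stripExtension("call.2019.05.mp3")); // call.2019.05
        System.out.println(stripExtension("README"));           // README
    }
}
```

Note that the later tMap step appends context.media_format to rebuild the object name, so keys without an extension would still need separate handling there.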
The data is replicated.
Notice that the first output flow from the tReplicate component is transmitted to the tAggregateRow component to take the total object count.

The count is added to the context.object_count (the initial value is zero).

Notice that in the second output flow from the tReplicate component, the data moves to the tMap component where the object name (with file type) and Transcription Job name are determined.

s3_object            =     Transcribe_ip.s3_object+"."+context.media_format
TranscriptionJobName =     context.input_s3_bucket+"-"+Transcribe_ip.s3_object+"-"+TalendDate.formatDate("yyyyMMddHHmmss",TalendDate.getCurrentDate())
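The job name expression above concatenates the bucket name, the object name, and a timestamp. As a hedged illustration, the sketch below builds the same name in plain Java and additionally replaces characters outside [0-9a-zA-Z._-] and caps the length at 200 characters. Both limits are assumptions based on the documented constraints for Transcribe job names and should be verified against the current AWS documentation. The class TranscriptionJobNames and its build method are hypothetical and not part of the sample Job.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: builds "<bucket>-<object>-<yyyyMMddHHmmss>".
final class TranscriptionJobNames {

    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Assumption: maximum job name length accepted by the service.
    private static final int MAX_LENGTH = 200;

    static String build(String bucket, String objectName, LocalDateTime now) {
        String suffix = "-" + STAMP.format(now);
        // Assumption: only [0-9a-zA-Z._-] is allowed; replace anything else.
        String prefix = (bucket + "-" + objectName)
                .replaceAll("[^0-9a-zA-Z._-]", "_");
        // Trim the prefix, not the timestamp, so names stay distinguishable.
        int maxPrefix = MAX_LENGTH - suffix.length();
        if (prefix.length() > maxPrefix) {
            prefix = prefix.substring(0, maxPrefix);
        }
        return prefix + suffix;
    }

    public static void main(String[] args) {
        System.out.println(build("nikz-transcribe", "message", LocalDateTime.now()));
    }
}
```

Because the Job later fetches the output file as the job name plus ".json", whatever name is produced here has to be the one stored for the later steps.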

Use the tHashOutput component to store the data for later usage.

The output from the tHashOutput component is linked to a tMap component, where the Start Transcribe call is made.

The routine call made is shown below:
```
AWS_Transcribe.StartTranscriptionJob(context.AWS_Access_Key, context.AWS_Secret_Key, context.AWS_regionName, context.input_s3_bucket, start_call_ip.s3_object, context.lang_code, context.media_format, context.output_s3_bucket, start_call_ip.TranscriptionJobName) 
```
The output of the routine call can be transferred to a tLogRow component if required to print the JSON output message. This example uses a dummy tJavaRow component (with no code inside) to ignore the output message.
Using a tLoop component, create a loop to verify the status of the Transcribe Job at regular intervals.

The data iteration starts with a tRowGenerator component that fetches the Transcribe Job status through a routine call.

AWS_Transcribe.ListTranscriptionJob(context.AWS_Access_Key, context.AWS_Secret_Key, context.AWS_regionName, context.input_s3_bucket)

Use the tExtractJSONFields component to parse the output data (in JSON format) as shown in the status column below:
Filter the data to select only records with a COMPLETED status.
Using a tAggregateRow component, aggregate the Jobs with a COMPLETED status.
Assign the final count to the status_count context variable.
Create an If condition that verifies whether the original object count matches the count of successfully completed Transcribe Jobs. If the counts do not match, control passes to a tSleep component that pauses for 30 seconds before the next check.
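The loop described above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the Job's generated code: the class TranscribePolling, the method waitForCompletion, and the maxAttempts parameter are hypothetical. The attempt limit is an addition, because a job that ends with a FAILED status never counts as COMPLETED and an unbounded loop would then never exit. The sketch also uses a greater-than-or-equal comparison instead of strict equality.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of the tLoop / tSleep polling logic.
final class TranscribePolling {

    /**
     * Polls completedCount until it reports at least expectedCount.
     * Returns true on success, false if maxAttempts is exhausted or
     * the waiting thread is interrupted.
     */
    static boolean waitForCompletion(int expectedCount, IntSupplier completedCount,
                                     long sleepMillis, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (completedCount.getAsInt() >= expectedCount) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in for the real status check: one more job completes per poll.
        int[] polls = {0};
        boolean done = waitForCompletion(2, () -> polls[0]++, 10L, 5);
        System.out.println(done); // true
    }
}
```

In the sample Job, the supplier would correspond to the ListTranscriptionJob routine call followed by the COMPLETED filter and count, and sleepMillis would be 30000.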
After all the Transcribe Jobs are complete, fetch the output JSON files containing the text back to the JobServer. The list of objects is held in the tHashInput component, which connects to the tFlowToIterate component to iterate over the data.

Use the tS3Get component to fetch the data to the JobServer.

Key =   ((String)globalMap.get("object_name_ip.TranscriptionJobName"))+".json"
File =  context.transcribe_output_folder+((String)globalMap.get("object_name_ip.TranscriptionJobName"))+".json"
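The File expression above relies on context.transcribe_output_folder ending with a path separator. As a small sketch, the same key and local path can be derived with java.nio.file so that the separator is handled either way. The class TranscriptLocations and its methods are hypothetical names that are not part of the sample Job.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: derives the S3 key and the local target file
// for a transcription job's JSON output.
final class TranscriptLocations {

    // The Job above fetches the output object as "<job name>.json".
    static String s3Key(String jobName) {
        return jobName + ".json";
    }

    // Works whether or not outputFolder ends with a separator.
    static Path localFile(String outputFolder, String jobName) {
        return Paths.get(outputFolder).resolve(s3Key(jobName));
    }

    public static void main(String[] args) {
        System.out.println(s3Key("nikz-transcribe-message-20190531145808"));
        System.out.println(localFile("transcribe_output", "nikz-transcribe-message-20190531145808"));
    }
}
```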

Using the tFileList component, get the output JSON files containing the data.
Use the tFileInputJSON component to parse the JSON files to get the output text data.

Using a tMap component, identify the source of the text data by adding the Job name, as shown below:

The data is passed to the tLogRow component to print the output in the console. The output of the audio files is shown below.

.-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------.
|                                                                                                                                                                                                                                                                 Text_Output                                                                                                                                                                                                                                                                  |
|=------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------=|
|jobname                                    |transcript                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|=------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------=|
|nikz-transcribe-message-20190531145808.json|Here's a little holiday greeting I've been wanting to send to the Mandarin. I just didn't know how to phrase it until now. My name is Tony Stark, and I'm not afraid of you. I know you're a coward, so I've decided that you just died down. I'm gonna come get the body. There's no politics here. It's just good old fashioned revenge. There's no Pentagon. It's just you and me on the off chance you're a man. Here's my home address. 10 8 80 Malibu 800.9265 I'll leave the door unlocked.|
'-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------'

.-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------.
|                                                                                                           Text_Output                                                                                                           |
|=------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------=|
|jobname                                    |transcript                                                                                                                                                                           |
|=------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------=|
|nikz-transcribe-newname-20190531145808.json|Mr Stark? Yeah. Agent Coulson. Oh, yeah. Yeah. The guy from the Strategic Homeland Intervention Enforcement Logistics Division. Get you a new name for that? Yeah, I hear that a lot.|
'-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------'

In practical scenarios, the output at this stage can be passed to downstream systems for further processing and storage.

Threshold limits for data processing

At the time of this writing, Amazon Transcribe can handle 10 Start Transcribe Job requests per second. Talend recommends that you always verify the latest limits on the Amazon Transcribe Limits page of the AWS Documentation.

Note: You can increase the standard limits by submitting the Amazon Transcribe service limits increase form.
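When many audio files are submitted in one run, the start calls may need to be spaced out to stay under the request limit. The following is a minimal sketch of one way to do that, assuming the limit of 10 calls per second quoted above still applies. The class StartJobThrottle is hypothetical and is not part of the sample Job or of the AWS SDK.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: spaces calls evenly so that no more than
// maxCallsPerSecond are issued in any one second.
final class StartJobThrottle {

    private final long minIntervalNanos;
    private long nextAllowedNanos;
    private boolean first = true;

    StartJobThrottle(int maxCallsPerSecond) {
        this.minIntervalNanos = 1_000_000_000L / maxCallsPerSecond;
    }

    // Reserves the next slot and returns how long (in nanoseconds)
    // a caller arriving at nowNanos has to wait before calling.
    synchronized long reserve(long nowNanos) {
        if (first) {
            first = false;
            nextAllowedNanos = nowNanos;
        }
        long wait = Math.max(0L, nextAllowedNanos - nowNanos);
        nextAllowedNanos = Math.max(nowNanos, nextAllowedNanos) + minIntervalNanos;
        return wait;
    }

    // Blocks until the reserved slot is reached.
    void acquire() {
        long wait = reserve(System.nanoTime());
        if (wait > 0) {
            try {
                TimeUnit.NANOSECONDS.sleep(wait);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

A caller would create one StartJobThrottle(10) and invoke acquire() immediately before each AWS_Transcribe.StartTranscriptionJob call.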

Conclusion

This article demonstrated how to integrate Talend with the Amazon Transcribe service. In real-world scenarios, data can flow from multiple source systems, such as batch files, web services, queues, or APIs. Talend can integrate all these diverse source systems with the Amazon Transcribe service in a straightforward way.
