
Automate S3 file processing with Talend Cloud and AWS Lambda
Integrating Amazon S3 with AWS Lambda is a common pattern in the Amazon world, and many examples execute a Lambda function when a file arrives in S3.
This article explains how to use AWS Lambda to execute a Talend Cloud Job.
Content:
- Overview
- Prerequisites
- Configuring Amazon S3
  - Creating a Bucket
  - Creating a Policy and a User
    - Creating a Policy
    - Giving a User Programmatic Access
- Creating and Publishing a Talend Job in Talend Cloud
  - Creating a Job
  - Publishing the Job in Talend Cloud
- AWS Lambda
Overview
- A file is uploaded to an S3 bucket.
- S3 triggers the Lambda function.
- The Lambda function calls a Talend Flow.
- The Talend Flow retrieves the S3 file to process it based on the parameters sent by the Lambda function.
Prerequisites
- A valid AWS account with access to the following:
  - S3
  - Lambda
- A Talend Cloud account or trial account
Configuring Amazon S3
Creating a Bucket
- Sign in to your Amazon account and open the Amazon S3 page.
- Click Create bucket.
- Configure the following fields, then click Next:
  - Bucket name: The bucket name must be unique across all of AWS.
  - Region: Select the region where your bucket resides, in this case, Ireland.
- Keep the default settings, then click Next.
- Keep the default permissions, review the configuration, then click Create bucket.
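If you prefer to script this step, the following is a minimal boto3 sketch of the same bucket creation; the bucket name is a placeholder (yours must be unique), and eu-west-1 is the Ireland region:

```python
# Minimal sketch: create the bucket programmatically with boto3.
# Assumptions: boto3 is installed, AWS credentials are configured locally,
# and "my-talend-connections-bucket" is a placeholder for your unique name.
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")  # Ireland

s3.create_bucket(
    Bucket="my-talend-connections-bucket",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)
```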
Creating a Policy and a User
To access S3 from a remote Job, you need to give a user programmatic access (no access to the S3 console) and create a policy that limits the user/application access to this bucket only.
Creating a Policy
- In the AWS console, navigate to the IAM (Identity and Access Management) page.
- Navigate to the Policies section, then click Create policy.
- Using the visual editor, configure the policy as follows:
  - Service: Select S3.
  - Action: Select GetObject and GetObjectVersion. GetObject allows you to retrieve the file in your Job.
  - Resources: Point to your S3 bucket using its ARN (Amazon Resource Name). The * at the end means all objects in your S3 bucket.
  - Request conditions: Leave as is.
- Click JSON to see your policy in JSON format (a sketch of the generated JSON is shown after these steps).
- Review your policy, then click Create policy.
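Clicking JSON shows a policy similar to the following minimal sketch, assuming a placeholder bucket name (my-talend-connections-bucket):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion"
      ],
      "Resource": "arn:aws:s3:::my-talend-connections-bucket/*"
    }
  ]
}
```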
Giving a User Programmatic Access
- In IAM, navigate to the Users section, then click Add user.
- Enter a User name, select the Programmatic access check box, then click Next: Permissions.
- Select Attach existing policies directly, and choose the policy you created in the previous section.
- Review your settings, then click Create user.
- Your user is now created. Do not forget to download and save the Access and Secret keys.
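If you prefer to script this step as well, here is a minimal boto3 sketch; the user name and policy ARN are placeholders for your own values:

```python
# Minimal sketch: create the programmatic user, attach the policy, and
# generate its access keys with boto3. The user name and policy ARN are
# placeholders for your own values.
import boto3

iam = boto3.client("iam")

iam.create_user(UserName="talend-s3-reader")
iam.attach_user_policy(
    UserName="talend-s3-reader",
    PolicyArn="arn:aws:iam::123456789012:policy/TalendS3ReadPolicy",
)

keys = iam.create_access_key(UserName="talend-s3-reader")
# Save these values now; the secret key cannot be retrieved again later.
print(keys["AccessKey"]["AccessKeyId"])
print(keys["AccessKey"]["SecretAccessKey"])
```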
Creating and Publishing a Talend Job in Talend Cloud
In this section, you learn how to create and publish a Talend Job in Talend Cloud.
Creating a Job
Create a Job that retrieves a file from S3 and displays the data in the console. Of course, a real Job would be more complex.
In Amazon S3, upload a file to test your Job.
- Create a folder and name it connections.
- Create a file, in this example connections_012018.csv, then upload it to the connections folder.
- In Studio, create a new context group called S3Parameters, then click Next.
- Configure the following parameters using the information from your S3 bucket, then click Finish:
  - parameter_accessKey: the access key used by your application to connect to Amazon S3
  - parameter_secretKey: the secret key used by your application to connect to Amazon S3
  - parameter_bucketName: the bucket name on S3
  - parameter_bucketKey: the file key. On S3, there are no real folders, so the full path is considered the file key.
  - parameter_tempFolder: the temporary folder where the file is stored for processing. On a Talend Cloud Engine, it is /tmp/.
- Create a new Job and name it S3Read. The Job is composed of three stages:
  - Create a connection to S3.
  - Get the file from S3.
  - Read the file.
- Configure the tS3Connection component with your region and with the context variables for the Access and Secret keys.
- Configure the tS3Get component to retrieve the file based on the context parameters and store it in the temp folder.
- Configure the tFileInputDelimited component to read the file stored in the temp folder.
- Test the Job locally to confirm that it connects and reads the file correctly (a boto3 equivalent of these three stages, useful for sanity checks, is sketched after these steps).
- Next, upload the Job to Talend Cloud. Navigate to Window > Preferences > Talend > Integration Cloud and configure your access to Talend Cloud.
- Once a connection is established, right-click the Job and select Publish to Cloud.
- Click Finish.
- When the Job has finished uploading, click Open Job Flow.
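As referenced in the local test step, here is a minimal boto3 sketch of the three stages the S3Read Job performs (connect, get the file, read it). All values are placeholders that mirror the S3Parameters context variables; it is only meant for sanity-checking your keys, bucket name, and file key outside Studio:

```python
# Minimal sketch of what the S3Read Job does: connect to S3, download the
# file to a temp folder, then read and display it. All values below are
# placeholders mirroring the S3Parameters context variables.
import csv
import boto3

parameter_accessKey = "AKIA..."                      # your access key
parameter_secretKey = "..."                          # your secret key
parameter_bucketName = "my-talend-connections-bucket"
parameter_bucketKey = "connections/connections_012018.csv"
parameter_tempFolder = "/tmp/"

s3 = boto3.client(
    "s3",
    region_name="eu-west-1",
    aws_access_key_id=parameter_accessKey,
    aws_secret_access_key=parameter_secretKey,
)

local_path = parameter_tempFolder + "connections_012018.csv"
s3.download_file(parameter_bucketName, parameter_bucketKey, local_path)

with open(local_path, newline="") as f:
    for row in csv.reader(f):
        print(row)  # equivalent of displaying the data in the console
```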
Publishing the Job in Talend Cloud
- In Talend Cloud, you can see the required parameters.
- Update the configuration based on your own bucket, then click Save.
- Select your runtime. For this example, use a Cloud Engine.
- Because you configured the Job with an existing file, you can test it by clicking Run Now.
- You will see the content of your file in the log.
- Now, test your Job using a remote call with the Talend Cloud API.
- Confirm that you are using the v1.1 API, then click Authorize.
- Log in using your Talend Cloud account credentials.
- Find the Flow Id. In Talend Cloud > Integration Cloud > Flows, the Flow Id is in the upper left corner of your flow.
- For this example, use the POST /executions operation.
- Create a body with the following (a sketch of the equivalent call from a script is shown after these steps):
  - executable: your Flow Id
  - parameters: all context variables you want to overwrite. In this example, specify the bucket name.
- Scroll down, then click Try it out!
- Review the results.
- Check your flow and notice that a second execution appears.
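The same call can also be scripted. The following is a minimal sketch, assuming the requests library, Basic authentication with your Talend Cloud credentials, and placeholder values for the API endpoint, Flow Id, and bucket name (the exact endpoint URL depends on your Talend Cloud region):

```python
# Minimal sketch: trigger the flow with the Talend Cloud API (v1.1) using
# POST /executions. The endpoint, credentials, Flow Id, and bucket name are
# placeholders; adjust them to your own environment.
import requests

TCLOUD_API_ENDPOINT = "https://api.integrationcloud.talend.com/tic/v1.1"
TCLOUD_USER = "user@example.com"
TCLOUD_PWD = "your-password"
TCLOUD_FLOWID = "your-flow-id"

body = {
    "executable": TCLOUD_FLOWID,
    # Context variables to overwrite, as described above.
    "parameters": {"parameter_bucketName": "my-talend-connections-bucket"},
}

response = requests.post(
    TCLOUD_API_ENDPOINT + "/executions",
    json=body,
    auth=(TCLOUD_USER, TCLOUD_PWD),
)
response.raise_for_status()
print(response.json())  # the response typically includes the execution id
```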
AWS Lambda
At this stage, you have deployed your Job to Talend Cloud and tested a call with the API. Now, create the Lambda function, which is triggered for each new file and calls your Job through the API.
- Connect to your AWS console, and in the Lambda section, select Create a function.
- Give your function a name, and select the Python 3.6 runtime. In the Role section, select Create custom role.
- Create a new Role; this creates a role and a new role policy.
- Review the configuration, then click Create function.
- To create the trigger, select an S3 trigger on the left, under Designer.
- Configure the trigger with your bucket name and a prefix (in this example, the connections folder). Select Enable trigger, then click Add.
- Verify that the new trigger was added.
- Copy the code from the lambda_function.py file attached to this article into the function (a minimal sketch of such a handler is shown after these steps).
- Configure the environment variables:
  - TCLOUD_API_ENDPOINT: the URL used to call the API
  - TCLOUD_USER: a user that has the right to call the API
  - TCLOUD_PWD: the TCLOUD_USER password
  - TCLOUD_FLOWID: the Talend Flow Id of the Job
- Add tags to identify your function.
- Save your function. Now you can add a new file to your folder in S3, and you will see an execution of the Lambda function.
- In Talend Cloud, verify that there is a third execution.
- You will see the content of your file in the log.
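If you do not have the attached lambda_function.py at hand, the following is a minimal sketch of what such a handler can look like. It assumes Basic authentication against the v1.1 API and that the bucket name and file key from the S3 event are passed to the Job as the parameter_bucketName and parameter_bucketKey context variables:

```python
# Minimal sketch of a handler for this trigger (not necessarily identical to
# the attached lambda_function.py): read the S3 event, then call the Talend
# Cloud API to execute the flow, passing the bucket and key as parameters.
# Uses only the standard library, which is available in the Lambda runtime.
import base64
import json
import os
import urllib.parse
import urllib.request


def lambda_handler(event, context):
    # Bucket and key of the file that triggered the function.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    body = json.dumps({
        "executable": os.environ["TCLOUD_FLOWID"],
        # Assumed mapping to the Job's context variables.
        "parameters": {
            "parameter_bucketName": bucket,
            "parameter_bucketKey": key,
        },
    }).encode("utf-8")

    credentials = "{}:{}".format(os.environ["TCLOUD_USER"], os.environ["TCLOUD_PWD"])
    token = base64.b64encode(credentials.encode("utf-8")).decode("utf-8")

    request = urllib.request.Request(
        os.environ["TCLOUD_API_ENDPOINT"] + "/executions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```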
For more information, see the Using AWS Lambda with Amazon S3 page in the AWS documentation.