Automate S3 file processing with Talend Cloud and AWS Lambda

TalendSolutionExpert
Contributor II

Last Update:

May 21, 2024 2:20:00 AM

Updated By:

Sonja_Bauernfeind

Created date:

Apr 1, 2021 6:15:13 AM

Attachments

Triggering an AWS Lambda function when a file arrives in an Amazon S3 bucket is a common integration pattern, and many AWS examples are built on it.

This article explains how to use an S3-triggered Lambda function to execute a Talend Cloud Job.

Overview

  1. A file is uploaded to an S3 bucket.

  2. S3 triggers the Lambda function.

  3. The Lambda function calls a Talend Flow.

  4. The Talend Flow retrieves the S3 file to process it based on the parameters sent by the Lambda function.

0693p000008uDndAAE.png
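
Steps 2 to 4 depend on the Lambda function learning which bucket and file caused the trigger. As a minimal sketch, assuming the standard S3 notification event layout (a `Records` list with `s3.bucket.name` and `s3.object.key`), the bucket and key could be extracted as follows. The function name and the sample bucket name are hypothetical.

```python
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Return a (bucket, key) pair for every record in an S3 notification event."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded (a space becomes '+'), so decode them.
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((bucket, key))
    return pairs


# Hypothetical event, trimmed to the fields used above.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-example-bucket"},
                "object": {"key": "connections/connections_012018.csv"},
            }
        }
    ]
}

print(s3_objects_from_event(sample_event))
```

The resulting pairs are the values the Lambda function would pass to the Talend Flow as parameters.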

Prerequisites

  • A valid AWS account with access to the following:

    • S3

    • Lambda

  • A Talend Cloud account or trial account

 

Configuring Amazon S3

Creating a Bucket

  1. Sign in to your Amazon account and open the Amazon S3 page.

  2. Click Create bucket.

    0693p000008uDU3AAM.png

     

  3. Configure the following fields, then click Next:
    1. Bucket name: The bucket name must be unique across all of AWS, not only within your account.

    2. Region: Select the region where your bucket resides, in this case, Ireland.

    0693p000008uDtUAAU.png

  4. Keep the default settings. Click Next.

    0693p000008uDJAAA2.png

  5. Keep the default permissions. Review the configuration, then click Create bucket.

    0693p000008uDGgAAM.png

 

Creating a Policy and a User

Because the Job accesses S3 remotely, create a user that has programmatic access only (no access to the S3 console), and create a policy that limits that user, and therefore the application, to this bucket.

Creating a Policy

  1. In the AWS console, navigate to the IAM (Identity and Access Management) page.

  2. Navigate to the Policies section, then click Create policy.

    0693p000008uDtjAAE.png

    0693p000008uDW3AAM.png

  3. Using the visual editor, configure the policy as shown below:

    1. Service: Select S3.

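    2. Action: Select GetObject and GetObjectVersion. GetObject allows you to retrieve the file in your Job; GetObjectVersion allows you to retrieve a specific version of the file when versioning is enabled on the bucket.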

    3. Resources: Point to your S3 bucket using ARN (Amazon Resource Name). The * at the end means all objects in your S3 bucket.

    4. Request conditions: Leave as is.

      0693p000008uCxlAAE.png

    5. Click JSON to see your policy in a JSON format, as shown below:

      0693p000008uDttAAE.png

  4. Review your policy, then click Create policy.

    0693p000008uDcDAAU.png
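
The JSON view of the policy is only available as a screenshot. As a rough sketch of what a policy with these settings looks like, the following builds an equivalent policy document from the two actions and the bucket ARN described above. The function name and bucket name are hypothetical, and the policy generated by the visual editor may contain additional fields such as a statement ID.

```python
import json


def read_only_bucket_policy(bucket_name):
    """Build an IAM policy document allowing object reads on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:GetObjectVersion"],
                # The trailing /* covers every object in the bucket.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


# "my-example-bucket" is a placeholder; use your own bucket name.
print(json.dumps(read_only_bucket_policy("my-example-bucket"), indent=2))
```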

Giving a User Programmatic Access

  1. In IAM, navigate to the Users section, then click Add user.

    0693p000008uDczAAE.png

  2. Select a User name, select the Programmatic access check box, then click Next: Permissions.

    0693p000008uDZ6AAM.png

  3. Select Attach existing policies directly, and choose the policy you created in the previous section.

    0693p000008uDs8AAE.png

  4. Review your settings, then click Create user.

    0693p000008uDqSAAU.png

  5. The user is created. Download and save the Access and Secret keys now, because the secret key cannot be displayed again later.

    0693p000008uDswAAE.png

Creating and Publishing a Talend Job in Talend Cloud

In this section, you learn how to create and publish a Talend Job in Talend Cloud.

Creating a Job

Create a Job that retrieves a file from S3, and displays the data in the console. Of course, a real Job will be more complex.

In Amazon S3, upload a file to test your Job.

  1. Create a folder and name it connections.

    0693p000008uBw5AAE.png

  2. Create a file, in this example connections_012018.csv, then upload the file to the connections folder.

    0693p000008uDdXAAU.png

  3. In Studio, create a new context group called S3Parameters, then click Next.

    0693p000008uDmFAAU.png

  4. Configure the following parameters using the information from your S3 bucket, then click Finish:

    • parameter_accessKey: the access key used by your application to connect to Amazon S3

    • parameter_secretKey: the secret key used by your application to connect to Amazon S3

    • parameter_bucketName: the bucket name on S3

    • parameter_bucketKey: the file key (S3 has no real folders, so the full path of the file is its key)

    • parameter_tempFolder: the temporary folder where the file is stored for processing (on a Talend Cloud Engine, use /tmp/)

      0693p000008uDk9AAE.png

  5. Create a new Job, and name it S3Read. The Job is composed of three stages:

    1. Create an object connection to S3.
    2. Get the file from S3.
    3. Read the file.

    0693p000008uDjLAAU.png

  6. Configure the tS3Connection component with the region of your bucket, and use the context variables for the Access and Secret keys.

    0693p000008uDumAAE.png

  7. Configure the tS3Get component to retrieve the file based on the context parameters, and store it in the temp folder.

    0693p000008uDuwAAE.png

  8. Configure the tFileInputDelimited component to read the file stored in the temp folder.

    0693p000008uDv6AAE.png

  9. Test the Job locally to see if it connects and reads the file correctly.

    0693p000008uDuiAAE.png

  10. Next, upload the Job to Talend Cloud. Navigate to Window > Preferences > Talend > Integration Cloud and configure your access to Talend Cloud.

    0693p000008uDozAAE.png

  11. Once a connection is established, right-click the Job and select Publish to Cloud.

    0693p000008uDXpAAM.png

  12. Click Finish.

    0693p000008uDadAAE.png

  13. When the Job has finished uploading, click Open Job Flow.

    0693p000008uDvQAAU.png
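
For readers who think in code, the three stages of the S3Read Job (connect, get the file, read the file) can be approximated in Python. This is an illustrative sketch, not what Talend generates: it assumes the third-party `boto3` package for the S3 calls, a semicolon field separator, and a context dictionary keyed by the parameter names listed above. All function names are hypothetical.

```python
import csv
import os


def local_path(temp_folder, bucket_key):
    """Where the downloaded file lands: the temp folder plus the file name part of the key."""
    return os.path.join(temp_folder, os.path.basename(bucket_key))


def download_from_s3(context, region_name):
    """Stages 1 and 2: connect to S3 and fetch the object into the temp folder."""
    import boto3  # third-party; imported lazily so the helpers work without it

    client = boto3.client(
        "s3",
        region_name=region_name,
        aws_access_key_id=context["parameter_accessKey"],
        aws_secret_access_key=context["parameter_secretKey"],
    )
    target = local_path(context["parameter_tempFolder"], context["parameter_bucketKey"])
    client.download_file(
        context["parameter_bucketName"], context["parameter_bucketKey"], target
    )
    return target


def read_rows(path, delimiter=";"):
    """Stage 3: read the delimited file (the separator is an assumption)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter=delimiter))


print(local_path("/tmp/", "connections/connections_012018.csv"))
```

Keeping only the file name part of the key means the "folder" prefix does not have to exist inside the temp folder.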

Publishing the Job in Talend Cloud

  1. In Talend Cloud, you can see the required parameters.

    0693p000008uDvaAAE.png

  2. Update the configuration based on your own bucket, then click Save.

    0693p000008uDddAAE.png

  3. Select your runtime. This example uses a cloud engine.

    0693p000008uDlvAAE.png

  4. Because the configuration points to an existing file, you can test your Job by clicking Run Now.

    0693p000008uDvpAAE.png

    0693p000008uDgqAAE.png

  5. You will see the content of your file in the log.

    0693p000008uDkAAAU.png

  6. Now, test your Job with a remote call through the Talend Cloud API.

    0693p000008uDw9AAE.png

  7. Confirm that you are using the v1.1 API, then click Authorize.

    0693p000008uDc7AAE.png

  8. Log in using your Talend Cloud account credentials.

    0693p000008uDvbAAE.png

  9. Now, find the Flow Id. In Talend Cloud, navigate to Integration Cloud > Flows and open your flow. The Flow Id is displayed in the upper left corner.

    0693p000008uDffAAE.png

    0693p000008uDjqAAE.png

  10. For this example, use the POST /executions operation.

    0693p000008uDtpAAE.png

  11. Create a body with:

    • executable: your Flow Id

    • parameters: all context variables you want to overwrite. In this example, specify the bucket name.

    0693p000008uDwOAAU.png

  12. Scroll down, then click Try it out!

    0693p000008uDqTAAU.png

  13. Review the results.

    0693p000008uDXLAA2.png

  14. Check your flow and notice that a second execution appears.

    0693p000008uDwYAAU.png
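
The same call can be prepared outside the API page. The sketch below builds, but does not send, a POST request whose body contains the `executable` and `parameters` fields described in step 11. It assumes the API accepts HTTP Basic authentication with your Talend Cloud credentials; the endpoint URL, Flow Id, and credentials shown are placeholders, and the function name is hypothetical.

```python
import base64
import json
import urllib.request


def build_execution_request(endpoint, user, password, flow_id, parameters):
    """Prepare a POST request that asks Talend Cloud to run a flow."""
    body = {"executable": flow_id, "parameters": parameters}
    # Assumption: the API accepts HTTP Basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder values; replace them with your own endpoint, credentials, and Flow Id.
request = build_execution_request(
    "https://talend-api.example.invalid/executions",
    "user",
    "pw",
    "my-flow-id",
    {"parameter_bucketName": "my-example-bucket"},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it. Only the context variables listed under `parameters` are overwritten; the others keep the values saved in the flow configuration.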

AWS Lambda

At this stage, you have deployed your Job to Talend Cloud and tested a call with the API. Now, create the Lambda function, which S3 triggers for each new file and which calls your Job through the API.

  1. Connect to your AWS console, and in the Lambda section, select Create a function.

    0693p000008uDaBAAU.png

  2. Give your function a name. Select the runtime Python 3.6. In the Role section, select Create custom role.

    0693p000008uDwiAAE.png

  3. Create a new role. This creates both the role and a new role policy.

    0693p000008uDwnAAE.png

  4. Review the configuration, then click Create function.

    0693p000008uDwxAAE.png

  5. To create the trigger, select an S3 trigger on the left under Designer.

    0693p000008uDx7AAE.png

  6. Configure the trigger with your bucket name and a prefix (in this example, the connections folder). Select Enable trigger, then click Add.

    0693p000008uDoWAAU.png

  7. Verify that the new trigger was added.

    0693p000008uDxRAAU.png

  8. Copy the function code from the lambda_function.py file attached to this article, and paste it into the function code editor.

    0693p000008uDxbAAE.png

  9. Configure the environment variables:

    • TCLOUD_API_ENDPOINT: URL to call the API

    • TCLOUD_USER: User that has the right to call the API

    • TCLOUD_PWD: the TCLOUD_USER password

    • TCLOUD_FLOWID: Talend Flow Id of the Job

    0693p000008uDvcAAE.png

  10. Add tags to identify your function.

    0693p000008uDgAAAU.png

  11. Save your function. Now you can add a new file to your folder in S3, and you will see an execution of the Lambda function.

    0693p000008uDfFAAU.png

  12. In Talend Cloud, verify there is a third execution.

    0693p000008uDxqAAE.png

  13. You will see the content of your file in the log.

    0693p000008uDBuAAM.png
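
The attached lambda_function.py is not reproduced in this article. The following is an independent sketch of how such a handler could work, not the attached code: it reads the four environment variables from step 9, turns each S3 record into Job parameters, and posts one execution request per file. It assumes HTTP Basic authentication and the parameter names from the S3Parameters context group. The helper names and the `post` and `environ` arguments (added so the handler can be tried without network access) are hypothetical.

```python
import base64
import json
import os
import urllib.request
from urllib.parse import unquote_plus

REQUIRED_VARS = ("TCLOUD_API_ENDPOINT", "TCLOUD_USER", "TCLOUD_PWD", "TCLOUD_FLOWID")


def load_config(environ):
    """Read the required settings and fail early if any is missing."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


def send_post(url, body, user, password):
    """Send the JSON body with Basic authentication (an assumption) and return the reply."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


def lambda_handler(event, context, post=send_post, environ=None):
    """Start one Talend Cloud execution for each file named in the S3 event."""
    config = load_config(os.environ if environ is None else environ)
    results = []
    for record in event.get("Records", []):
        body = {
            "executable": config["TCLOUD_FLOWID"],
            "parameters": {
                "parameter_bucketName": record["s3"]["bucket"]["name"],
                # Keys arrive URL-encoded in the event, so decode them.
                "parameter_bucketKey": unquote_plus(record["s3"]["object"]["key"]),
            },
        }
        results.append(
            post(config["TCLOUD_API_ENDPOINT"], body, config["TCLOUD_USER"], config["TCLOUD_PWD"])
        )
    return results
```

Lambda calls the handler with only `event` and `context`, so the two extra arguments fall back to the real network call and the real environment.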

     

For more information, see the Using AWS Lambda with Amazon S3 page in the AWS documentation.
