Integrating AWS S3 with Lambda is very common in the Amazon world, and many examples show a Lambda function executing upon S3 file arrival.
This article explains how to use AWS Lambda to execute a Talend Cloud Job.
Content:
A file is uploaded to an S3 bucket.
S3 triggers the Lambda function.
The Lambda function calls a Talend Flow.
The Talend Flow retrieves the S3 file to process it based on the parameters sent by the Lambda function.
Prerequisites:
A valid AWS account with access to the following:
S3
Lambda
A Talend Cloud account or trial account
Sign in to your Amazon account and open the Amazon S3 page.
Click Create bucket.
Bucket name: The bucket name must be unique across all of AWS.
Region: Select the region where your bucket resides, in this case, Ireland.
Keep the default settings. Click Next.
Keep the default permissions. Review the configuration, then click Create bucket.
When accessing S3 from a remote Job, you need to give a user programmatic access (with no access to the S3 console), and you need to create a policy that limits the user's or application's access to only this bucket.
In the AWS console, navigate to the IAM (Identity and Access Management) page.
Navigate to the Policies section, then click Create policy.
Using the visual editor, configure the policy as shown below:
Service: Select S3.
Action: Select GetObject and GetObjectVersion. GetObject allows you to retrieve the file in your Job.
Resources: Point to your S3 bucket using ARN (Amazon Resource Name). The * at the end means all objects in your S3 bucket.
Request conditions: Leave as is.
Click JSON to see your policy in a JSON format, as shown below:
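Your policy should look similar to the following sketch (the bucket name in the ARN is a placeholder for your own):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::your-bucket-name/*"
        }
    ]
}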
Review your policy, then click Create policy.
In IAM, navigate to the Users section, then click Add user.
Enter a user name, select the Programmatic access check box, then click Next: Permissions.
Select Attach existing policies directly, and choose the policy you created in the previous section.
Review your settings, then click Create user.
Well done, your user is created. Do not forget to download and save the access and secret keys.
In this section, you learn how to create and publish a Talend Job in Talend Cloud.
Create a Job that retrieves a file from S3, and displays the data in the console. Of course, a real Job will be more complex.
In Amazon S3, upload a file to test your Job.
Create a folder and name it connections.
Create a file, in this example connections_012018.csv, then upload the file to the connections folder.
In Studio, create a new context group called S3Parameters, then click Next.
Configure the following parameters using the information from your S3 bucket, then click Finish:
parameter_accessKey: the access key used by your application to connect to Amazon S3
parameter_secretKey: the secret key used by your application to connect to Amazon S3
parameter_bucketName: the bucket name on S3
parameter_bucketKey: the file key. On S3 there are no real folders, so the full path (for example, connections/connections_012018.csv) is the file key.
parameter_tempFolder: the temporary folder where the file is stored for processing. On a Talend Cloud Engine, this is /tmp/.
Create a new Job, and name it S3Read. The Job is composed of three stages:
Configure the tS3Connection component with a specific region and with the context variables for the access and secret keys.
Configure the tS3Get component to retrieve the file based on the context parameters, and store it in the temp folder.
Configure the tFileInputDelimited component to read the file stored in the temp folder.
Test the Job locally to see if it connects and reads the file correctly.
Next, upload the Job to Talend Cloud. Navigate to Window > Preferences > Talend > Integration Cloud and configure your access to Talend Cloud.
Once a connection is established, right-click the Job and select Publish to Cloud.
Click Finish.
When the Job has finished uploading, click Open Job Flow.
In Talend Cloud, you can see the required parameters.
Update the configuration based on your own bucket, then click Save.
Select your runtime; for this example, use a Cloud Engine.
Because you configured the Job with an existing file, you can test it by clicking Run Now.
You will see the content of your file in the log.
Now, test your Job using a remote call with the Talend Cloud API.
Confirm that you are using v1.1 API, then click Authorize.
Log in using your Talend Cloud account credentials.
Now, find the Flow Id. In Talend Cloud, navigate to Integration Cloud > Flows; the Flow Id is in the upper left corner of your flow.
For this example, use the POST /executions operation.
Create a body with the following (a sample body follows this list):
executable: your Flow Id
parameters: all context variables you want to overwrite. In this example, specify the bucket name.
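For example, with placeholder values for the Flow Id and bucket name:

{
    "executable": "5a1b2c3d4e5f6a7b8c9d0e1f",
    "parameters": {
        "parameter_bucketName": "my-connections-bucket"
    }
}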
Scroll down, then click Try it out!
Review the results.
Check your flow and notice that a second execution appears.
At this stage, you have deployed your Job to Talend Cloud and tested a call with the API. Now, create the Lambda function, which is triggered for each new file and calls your Job through the API.
Connect to your AWS console, and in the Lambda section, select Create a function.
Give your function a name. Select the runtime Python 3.6. In the Role section, select Create custom role.
Create a new role; AWS creates the role along with a new role policy.
Review the configuration, then click Create function.
To create the trigger, select an S3 trigger on the left under Designer.
Configure the trigger with your bucket name and a prefix (in this example, the connections folder). Select Enable trigger, then click Add.
Verify that the new trigger was added.
Copy the code from the function in the lambda_function.py file attached to this article. A sketch of such a function is shown after the list of environment variables below.
Configure the environment variables:
TCLOUD_API_ENDPOINT: URL to call the API
TCLOUD_USER: User that has the right to call the API
TCLOUD_PWD: the TCLOUD_USER password
TCLOUD_FLOWID: Talend Flow Id of the Job
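The attached lambda_function.py remains the reference implementation. As an illustration only, here is a minimal sketch of such a handler, assuming the POST /executions operation used earlier, basic authentication, and the context parameter names from the S3Parameters context group:

# Minimal sketch only; see the attached lambda_function.py for the actual code.
import base64
import json
import os
import urllib.request

def lambda_handler(event, context):
    # Extract the bucket and file key from the S3 event that triggered the function.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Build the execution request: run the flow, overwriting its context
    # parameters with the bucket and key of the file that just arrived.
    body = json.dumps({
        "executable": os.environ["TCLOUD_FLOWID"],
        "parameters": {
            "parameter_bucketName": bucket,
            "parameter_bucketKey": key,
        },
    }).encode("utf-8")

    # Basic authentication with the Talend Cloud user credentials.
    credentials = os.environ["TCLOUD_USER"] + ":" + os.environ["TCLOUD_PWD"]
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")

    request = urllib.request.Request(
        os.environ["TCLOUD_API_ENDPOINT"] + "/executions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))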
Add tags to identify your function.
Save your function. Now you can add a new file to your folder in S3, and you will see an execution of the Lambda function. You can upload the file from the S3 console, or script the upload as in the sketch below.
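For instance, a minimal boto3 sketch (the local file name and bucket name are placeholders):

import boto3

# Uploading a new file under the connections/ prefix fires the S3 trigger,
# which runs the Lambda function and, in turn, the Talend Flow.
s3 = boto3.client("s3")  # uses your configured AWS credentials
s3.upload_file(
    "connections_022018.csv",              # local file (placeholder name)
    "my-connections-bucket",               # your bucket name
    "connections/connections_022018.csv",  # the file key: prefix + file name
)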
In Talend Cloud, verify there is a third execution.
You will see the content of your file in the log.
For more information, see the Using AWS Lambda with Amazon S3 page in the AWS documentation.
When trying to execute a Task, the execution fails with the following error:
Exceeded the limit of deployment attempts: you have reached the limit of Flow deployments on the engine.
You have reached the maximum number of allowed concurrent flows (running Jobs). The limit is set by your license or, where a Remote Engine is used to run the flows, by your configuration.
A Cloud Engine can only run up to three flows at the same time (that is, three concurrent flows).
A Remote Engine can only run up to three flows at the same time by default.
So, when you look at your flow execution history in Talend Cloud and see this error message for some flows that attempted to execute, it means that you already have three flows running, so there are no open execution slots for the new flow. The flow that cannot run is rescheduled and remains in the queue. When one of the currently running Jobs finishes, the next flow in the queue runs.
If a flow still cannot be executed after several rescheduling attempts (that is, no execution slots opened up, or other flows were earlier in the queue), the flow moves into an error state and no longer attempts to run.
You can modify or remove the limit by editing the configuration of your Remote Engine, for example:
max.deployed.flows=3
remote.engine.pre.authorized.key =
remote.engine.name = dev_remote_engine_1
remote.engine.description = Cool remote engine for dev 1
Be more agile, get more value from data, foster greater collaboration, and enable data users to become more effective by moving to Talend Data Fabric in the cloud.
As a Talend user, you recognize the value of data to your business. Organizations that use data and analytics to drive business strategy adapt to change quickly and develop insights that generate new value. They harvest data to improve productivity, make faster and more accurate decisions, and reduce costs. They become more innovative and competitive, discover and deploy new business models more effectively, and foster better engagement with customers, employees, and partners.
Accomplishing all this, however, is not easy. The rising pace of business and the increasing complexity of the data landscape, with more data, more users, more applications, more environments (on premises, cloud, and hybrid), and more regulation, make it harder for organizations to have complete, clean, and trusted data they can rely on. No wonder 60% of companies have unreliable data health. Data workers in some organizations spend two-thirds of their time searching for and preparing data rather than using it to make decisions and run the business.
You’re already ahead of the pack, because you rely on Talend to help you find trust amidst this data chaos and deliver data that is complete, clean, uncompromised, and readily available across the organization.
You can do even more by moving to Talend in the cloud.
Download the full document from this article.