
Talend DI Job execution in AWS Lambda using Apache Airflow
Overview
This article shows you how to leverage Apache Airflow to orchestrate, schedule, and execute Talend Data Integration (DI) Jobs in an AWS Lambda environment.
Environment
- Apache Airflow 1.10.2
- AWS Lambda
- Amazon API Gateway
- Amazon CloudWatch Logs
- Nexus 3.9
- WinSCP 5.15
- PuTTY
Prerequisites
- Apache Airflow installed on a server (follow the instructions in Installing Apache Airflow on Ubuntu/AWS).
- Python 2.7 installed on the Airflow server.
- A valid Amazon Web Services (AWS) account.
- Access to the Nexus server from AWS Lambda.
- Talend 7.x Jobs published to the Nexus repository. (For more information on how to set up a CI/CD pipeline to publish Talend Jobs to Nexus, see Configuring Jenkins to build and deploy project items in the Talend Help Center.)
- Access to the Setup_files.zip file (attached to this article).
Process flow
- Develop Talend DI Jobs using Talend Studio.
- Publish the DI Jobs to the Nexus repository using the Talend CI/CD module.
- Execute the Directed Acyclic Graph (DAG) in Apache Airflow:
- The DAG calls Amazon API Gateway, which triggers the AWS Lambda function.
- The Lambda function downloads the Talend DI Job executable binaries from the Nexus repository.
- The Lambda function executes the downloaded Job binaries.
- The Lambda function returns a response to the API call and marks the task as completed in Apache Airflow.
Configuration and execution
Configuring AWS Lambda
- Log in to your AWS account and open the AWS Lambda service.
- Click the Create function button.
- Select Author from scratch, then enter a Function name. Select Python 3.6 from the Runtime drop-down menu.
- Under Permissions, select Create a new role with basic Lambda permissions from the Execution role drop-down menu. Click Create function.
- After the function is created, select API Gateway from the Add triggers dialog box on the left to add a trigger to the function. For more information, see the Amazon API Gateway page.
- When the API Gateway is added, a Configuration required warning message appears. Click the API Gateway tile to configure the trigger details. In the Configure triggers section, select Create a new API from the API drop-down menu and Open from the Security drop-down menu. Click Add.
- Click the Save button in the upper right corner.
- Select the API Gateway tile and review the Details section of the Open API.
- Copy the code from the lambda_function_code.py file (located in Setup_files.zip) into lambda_function.py in the Function Code window (see the sketch after this list for an idea of what such a handler looks like).
- Create a new file named download_job.sh and save it under the lambda_function folder. Copy the code from the download_job_code.sh file (located in Setup_files.zip) into the new file you created.
- In the same window, scroll down to Basic settings and increase the Memory (MB) setting to a reasonable amount, in this case 2048 MB. Click Save.
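
The exact code ships in Setup_files.zip, but as a point of reference, here is a minimal sketch of what a handler of this shape might look like. The Nexus URL, artifact path, and launcher name are hypothetical placeholders; the real implementation takes them from download_job.sh.

```python
# Minimal sketch only -- the real code is in lambda_function_code.py from
# Setup_files.zip. The Nexus URL, artifact path, and launcher name below are
# hypothetical placeholders.
import json
import os
import subprocess
import urllib.request
import zipfile

NEXUS_URL = "http://nexus.example.com:8081/repository/releases"      # placeholder
JOB_ARTIFACT = "com/example/jobs/sample_job/0.1/sample_job-0.1.zip"  # placeholder


def lambda_handler(event, context):
    # /tmp is the only writable location in the Lambda runtime.
    archive = "/tmp/job.zip"
    job_dir = "/tmp/job"

    # Download the published Talend Job binaries from the Nexus repository.
    urllib.request.urlretrieve("{}/{}".format(NEXUS_URL, JOB_ARTIFACT), archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(job_dir)

    # Talend builds ship a <job>_run.sh launcher; make it executable and run it.
    launcher = os.path.join(job_dir, "sample_job", "sample_job_run.sh")
    os.chmod(launcher, 0o755)
    result = subprocess.run(
        [launcher],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # Python 3.6 has no text=/capture_output= yet
        timeout=840,              # stay under the configured Lambda timeout
    )

    # Report the Job's exit status back through API Gateway to Airflow.
    return {
        "statusCode": 200 if result.returncode == 0 else 500,
        "body": json.dumps({"stdout": result.stdout, "stderr": result.stderr}),
    }
```

The 840-second timeout in the sketch assumes you have also raised the function timeout under Basic settings, alongside the memory increase described above.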
Configuring Apache Airflow
- Log in to the Airflow Web UI.
- Navigate to Admin > Connections and create a new connection. Enter aws_api in the Conn Id field and leave the Conn Type field empty. In the Host field, add the host address of the API endpoint you created in the Configuring AWS Lambda section. Click Save.
- Edit the lambda_DAG_call_template.py file and assign values to the variables. Make sure to provide the http_conn_id and endpoint values in the SimpleHttpOperator calls; in this case, the http_conn_id is aws_api. (See the example DAG after this list.)
- The DAG template provided is configured to be triggered externally. If you plan to schedule the task, update the schedule_interval parameter with values based on your scheduling requirements. For more information on values, see the Apache Airflow documentation: DAG Runs.
- Rename the updated file and place it in the dags folder under the AIRFLOW_HOME folder.
- The Airflow webserver picks up the file and creates a DAG task in the Airflow Console, under the DAGs tab. Note: If the DAG is not visible in the user interface under the DAGs tab, restart the Airflow webserver and scheduler.
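
The actual template is in Setup_files.zip; the following is only a sketch of what such a DAG might look like on Airflow 1.10.2. The DAG id and endpoint path are hypothetical placeholders, and http_conn_id matches the aws_api connection created above.

```python
# Sketch of a lambda_DAG_call_template.py-style DAG for Airflow 1.10.x; the
# dag_id and endpoint below are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2021, 1, 1),
}

# schedule_interval=None means the DAG runs only when triggered externally;
# replace it with a cron expression or preset to schedule the Job.
dag = DAG(
    dag_id="talend_lambda_job",
    default_args=default_args,
    schedule_interval=None,
    catchup=False,
)

# http_conn_id must match the aws_api Connection created in the Airflow UI;
# endpoint is the stage/resource path of your API Gateway URL.
call_lambda = SimpleHttpOperator(
    task_id="run_talend_job",
    http_conn_id="aws_api",
    endpoint="default/my-talend-function",
    method="POST",
    data="{}",
    headers={"Content-Type": "application/json"},
    response_check=lambda response: response.status_code == 200,
    dag=dag,
)
```

With schedule_interval=None, the DAG runs only when triggered manually from the UI or with the airflow trigger_dag command.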
Executing the Job and reviewing the logs
- To execute the Talend Job, toggle the DAG to On and run the Airflow task you created to trigger the AWS Lambda function.
- Monitor the task execution on the Airflow Web UI.
- Review the Job logs in the Amazon CloudWatch Logs service. Open the CloudWatch service and select Logs from the menu on the left. Select your Lambda function's log group and open the log. Logs can also be fetched programmatically, as sketched below.
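
If you prefer to pull the logs programmatically rather than through the console, a small boto3 sketch like the following can work; the function name is a hypothetical placeholder.

```python
# Sketch for reading the Job's Lambda logs with boto3; the function name is a
# hypothetical placeholder.
import boto3

logs = boto3.client("logs")

# Lambda writes to a log group named /aws/lambda/<function-name>.
response = logs.filter_log_events(
    logGroupName="/aws/lambda/my-talend-function",
    limit=50,
)

for event in response["events"]:
    print(event["message"].rstrip())
```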
Conclusion
In this article, you learned how to execute Talend DI Jobs in AWS Lambda and how to use Apache Airflow to schedule the Jobs, an approach that can be extended to more complex orchestration and scheduling plans.