This article shows you how to containerize, schedule, and provision your Talend ETL Jobs on serverless platforms like Amazon EKS, leveraging Apache Airflow.
The article covers:
Docker: an open-source platform used to create, deploy, and run applications using containers
Kubernetes: an open-source system for automating the deployment, scaling, and operation of application containers across clusters of hosts
Apache Airflow: an open-source platform to programmatically author, schedule, orchestrate, and monitor workflows
Note: For information on installing Talend Software modules, see Talend Data Integration Installation Guide for Linux on the Talend Help Center.
The following sections guide you through the steps required to install Airflow.
Note: In this article, Airflow is installed with Docker on an EC2 instance. However, you can also install Airflow on containerized platforms like ECS or Kubernetes.
Create a Kubernetes cluster following the instructions in Getting Started with the AWS Management Console, and execute the steps to:
After completing all of the steps above, ensure that you have:
Verify that the worker nodes have joined the cluster successfully by using the kubectl tool.
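For example, you can list the nodes and check that each one reports a Ready status:

kubectl get nodes
# Each worker node should appear with STATUS "Ready"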
Kubernetes supports several Authentication strategies and Authorization modes.
Amazon EKS uses IAM to provide authentication to your Kubernetes cluster through the AWS IAM Authenticator for Kubernetes. For authorization, EKS relies on native Kubernetes Role-Based Access Control (RBAC).
Note: The security choices made in this article are based on what Amazon EKS supported at the time of writing. See the Amazon EKS documentation for the latest security options.
In this article, the client application, Airflow, authenticates with the EKS API server using webhook tokens and performs tasks on the EKS cluster based on the system:masters permissions assigned to the k8s_role. Create an IAM k8s_role and an EKS RBAC configuration by performing the following steps.
From the AWS Management Console, create an IAM role, for example, k8s_role.
Note: In the attachments, you'll find a sample k8s_role_policy for the role that is assumed by client applications (for example, Airflow running on an EC2 instance) so that they can access other AWS services such as EKS, ECR, IAM, and ECS. Your EKS cluster administrator can create this role with the appropriate permissions.
Update the aws-auth ConfigMap to map the newly created k8s_role by following the AWS instructions for applying the aws-auth ConfigMap to your cluster. An example is available in the aws-auth-cm.yaml file (attached to this article). Configure the mapRoles and rolearn elements with your settings.
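After editing aws-auth-cm.yaml, you can apply and verify the ConfigMap with kubectl, for example:

kubectl apply -f aws-auth-cm.yaml
# Confirm that the rolearn mapping for k8s_role is present
kubectl describe configmap -n kube-system aws-auth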
The Airflow EC2 instance is now configured to assume the IAM k8s_role and to use the RBAC Authorization.
Before starting with the installation process, you need to understand all the components in Airflow.
Note: As shown in the Architecture diagram, install all of the Airflow components, Airflow EC2 instance, RDS Database, and Redis in the same VPC.
Create an RDS PostgreSQL database as backend.
From the AWS Management Console, create a PostgreSQL database instance.
Note: You can install any database backend that is supported by the SQLAlchemy library.
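If you prefer scripting over the console, a minimal AWS CLI sketch for creating the PostgreSQL instance might look like the following; the instance identifier, instance class, storage size, and credentials are illustrative placeholders, not values from this article's attachments:

aws rds create-db-instance \
  --db-instance-identifier airflow-backend \
  --db-instance-class db.t2.medium \
  --engine postgres \
  --allocated-storage 20 \
  --master-username airflow \
  --master-user-password <choose_a_password>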
Create a Redis Cluster as Celery backend.
From the AWS Management Console, create an ElastiCache cluster with the Redis engine.
Note: Airflow uses messaging techniques to scale out the number of workers; see Scaling Out with Celery. Redis is an open-source, in-memory data structure store used as a database, cache, and message broker.
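Similarly, a hedged CLI sketch for creating the Redis cluster; the cluster id and node type are illustrative:

aws elasticache create-cache-cluster \
  --cache-cluster-id airflow-redis \
  --engine redis \
  --cache-node-type cache.t2.micro \
  --num-cache-nodes 1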
Launch an EC2 Ubuntu Linux 18.04 instance (for example, t2.xlarge) and assign the k8s_role created in the previous step.
SSH to the Airflow EC2 instance and create the talend user:
sudo adduser talend
# Add the talend user to the sudo group
sudo usermod -aG sudo talend
Install Docker:
# Install Docker from the official Docker repository
sudo apt update
sudo apt install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"
sudo apt update
apt-cache policy docker-ce
sudo apt install docker-ce
# Verify that the Docker daemon is running
sudo systemctl status docker
# Add the talend user to the docker group
sudo usermod -aG docker talend
Install Docker Compose:
sudo curl -L "https://github.com/docker/compose/releases/download/1.23.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose --version
Install AWS CLI using python3 and pip3:
# Reference: AWS documentation
# https://docs.aws.amazon.com/cli/latest/userguide/install-linux.html#install-linux-pip
curl -O https://bootstrap.pypa.io/get-pip.py
sudo apt-get install python3-distutils
python3 get-pip.py --user
export PATH=~/.local/bin:$PATH
source ~/.bashrc
pip3 --version
pip3 install awscli --upgrade --user
Note: By default, python3 is installed on an EC2 Ubuntu instance.
Configure AWS CLI:
aws configure
AWS Access Key ID [None]: provide_access_key
AWS Secret Access Key [None]: provide_secret_key
Default region name [None]: provide_region, for example eu-central-1
Default output format [None]: json
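To confirm that the CLI is configured correctly and is using the expected credentials, you can check the caller identity:

aws sts get-caller-identity
# The output should show the account and ARN associated with your credentials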
Copy the Airflow setup files from the preparation_files.zip (attached to this article):
mkdir ~/airflow
sudo apt-get install zip
# Copy the preparation files attached to this article,
# or download them using wget
wget https://community.talend.com/yutwg22796/attachments/yutwg22796/trb_develop/411/2/preparation_files.zip
Configure the environment variables required for Airflow in the entrypoint.sh script file.
Configure Airflow to communicate with its other core components, namely the database and ElastiCache.
Set the Database connection parameters and Message broker URL settings either in the Airflow configuration file, for example, airflow.cfg or as environment variables.
Note: The Dockerfile (attached in this article) executes the entrypoint.sh script and environmental variables are used to set the connection parameters.
cd ~/airflow/script
vi entrypoint.sh
# Update the REDIS and POSTGRES connection parameters and save the file
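For reference, connection settings of this kind typically resolve to Airflow environment variables of the following form; the host names and credentials below are placeholders, not the exact contents of the attached entrypoint.sh:

# PostgreSQL metadata database (SQLAlchemy connection string)
export AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow:<password>@<rds-endpoint>:5432/airflow"
# Redis message broker and Celery result backend
export AIRFLOW__CELERY__BROKER_URL="redis://<elasticache-endpoint>:6379/1"
export AIRFLOW__CELERY__RESULT_BACKEND="db+postgresql://airflow:<password>@<rds-endpoint>:5432/airflow"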
Copy your kubeconfig file to the Airflow config folder.
Note: Use the kubeconfig file that was generated when you created the EKS cluster.
cd ~/airflow/config
# Copy your EKS cluster kubeconfig file into this folder,
# then verify the file is present using the ls command
ls ~/airflow/config/config
# Modify the "exec" command in your config file to use aws-iam-authenticator.
# Note: aws-iam-authenticator is used here to generate an identity token.
# The token is generated based on your AWS CLI credentials.
# For an example, see the syntax in the sample_config_file attachment.
# The exec command in the file should look like this:
exec: {apiVersion: client.authentication.k8s.io/v1alpha1, args: [token, '-i', REPLACE_WITH_YOUR_KUBERNETE_CLUSTER_NAME], command: aws-iam-authenticator}
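To check that token generation works with your AWS CLI credentials, you can run aws-iam-authenticator directly; replace the cluster name with your own:

aws-iam-authenticator token -i <your_eks_cluster_name>
# A JSON ExecCredential object containing a token should be printed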
Execute the docker build command and create an Airflow docker image:
cd ~/airflow
docker build -t xxx/docker-airflow-aws-eks:1.10.2 .
# Note: Replace the placeholder xxx with your Docker user
Launch Airflow services using docker-compose:
# Launch all the Airflow components: webserver, scheduler, worker, and flower
docker-compose -f docker-compose-CeleryExecutor.yml up -d
Verify the running containers.
# List the running containers with the ps command
docker ps
Congratulations, you’ve installed Airflow.
Log in to the Airflow Web UI.
To access the Airflow Web UI, type the following URL into your browser.
http://AIRFLOW_EC2_INSTANCE_PUBLIC_IP:8080/admin
Download and import the demo Job, LOCAL.7z, (attached to this article) into Studio.
Notice that the tMap_1 Job uses a tFixedFlowInput component to generate 100 records, and that the tMap component converts the content of the txt column to uppercase and multiplies the number column by 10.
Run the Job. It should output 100 records to your console.
Using Talend Studio, publish the Job to ECR.
Create a repository in ECR.
In Studio, right-click the Job and select Publish, then select Docker Image from the Export Type pull-down menu. Click Next.
Fill in the ECR repository details. Click Finish.
Note: Ensure the Registry + Image name matches the Repo URI created in Step 1.
Confirm that the Job is published to the repository with version 0.1.0.
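If you want to script or verify these steps from the command line, a hedged sketch is shown below; the repository name is illustrative and should match the repository you created in Step 1:

# Create the ECR repository (equivalent to Step 1)
aws ecr create-repository --repository-name talend/tmap_1
# After publishing from Studio, confirm that the 0.1.0 image tag is present
aws ecr describe-images --repository-name talend/tmap_1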
For information on publishing Jobs using CI and with Jenkins pipeline, see the Containerization and orchestration of Talend microservices with Docker and Kubernetes article, in the Talend Community Knowledge Base (KB).
In Airflow, a Directed Acyclic Graph (DAG) is a collection of all the tasks you want to run, organized in a way that reflects the relationship between the tasks and their dependencies.
In the preparation_files.zip file, open the Job_Tmap_1.py file to see the sample DAG, DAG_tmap_1.
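If you do not have the attachment at hand, the following Python sketch illustrates the general shape of such a DAG in Airflow 1.10.x, pairing a DummyOperator with a KubernetesPodOperator. The image URI, kubeconfig path, schedule, and task names are illustrative assumptions, not the exact contents of Job_Tmap_1.py:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

default_args = {"owner": "airflow", "start_date": datetime(2019, 1, 1)}

# DAG that runs the containerized Talend Job on a schedule
dag = DAG("DAG_tmap_1", default_args=default_args,
          schedule_interval="@daily", catchup=False)

dummy_task = DummyOperator(task_id="dummy_task_tmap1", dag=dag)

# Launches a Pod on the EKS cluster from the Job image published to ECR
run_tmap1 = KubernetesPodOperator(
    namespace="default",
    image="<aws_account_id>.dkr.ecr.<region>.amazonaws.com/talend/tmap_1:0.1.0",  # placeholder image URI
    name="tmap1",
    task_id="kubernetes_operator_task_tmap_1",
    config_file="/usr/local/airflow/config/config",  # kubeconfig copied earlier (path is an assumption)
    in_cluster=False,
    get_logs=True,
    dag=dag,
)

dummy_task >> run_tmap1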
Build an Airflow Docker image using the sample DAG in the dags folder.
docker build -t user/docker-airflow-aws-eks:1.10.2 .
Start Airflow using docker-compose.
docker-compose -f docker-compose-CeleryExecutor.yml up -d
Log in to the Airflow Web UI and notice that a new DAG is created automatically. Trigger the DAG manually.
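Alternatively, you can trigger the DAG from the command line inside the webserver container; the container name below is a placeholder, check docker ps for the actual name:

docker exec -it <airflow_webserver_container> airflow trigger_dag DAG_tmap_1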
In DAG_tmap_1, notice that two tasks, dummy_task_tmap1 and kubernetes_operator_task_tmap_1, were created.
Review the task execution logs for kubernetes_operator_task_tmap_1.
Verify the status and logs of the tmap1 Pod using Kubectl.
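For example:

# List Pods and check the status of the tmap1 Pod
kubectl get pods
# Inspect the Job output from the Pod (use the Pod name reported by kubectl get pods)
kubectl logs tmap1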
Open the Kubernetes dashboard. Notice that a new Pod is created for the tmap1 task, and that it terminates after a successful execution.
In this article, you leveraged the Docker build feature of Talend Studio to build containers, published Jobs to ECR, provisioned Talend Jobs to scalable platforms like Kubernetes, and learned about the benefits of containerization and serverless platforms. You also learned how to install Airflow with Docker, create scalable workers using the CeleryExecutor, and use the scheduling and integration features of Airflow.