Provisioning and executing Talend ETL Jobs on Amazon EKS using Airflow


This article shows you how to containerize, schedule, and provision your Talend ETL Jobs on serverless platforms like Amazon EKS by leveraging Apache Airflow.

The article covers:

  • Docker: an open-source platform used to create, deploy, and run applications in containers

  • Kubernetes: an open-source system for automating the deployment, scaling, and operation of application containers across clusters of hosts

  • Apache Airflow: an open-source platform to programmatically author, schedule, orchestrate, and monitor workflows


Prerequisites

  • Familiarity with Docker, Kubernetes, Python, and Airflow
  • Talend 7.x installed
  • Knowledge of Amazon Web Services (AWS) such as EC2, ECR, EKS, RDS, and Elasticache
  • Administrator-level access to create an EKS cluster and privileges to launch AWS services
  • Internet access to download required files and to access the required services on AWS
  • Download and extract the preparation_files.zip file (attached to this article)

Note: For information on installing Talend Software modules, see Talend Data Integration Installation Guide for Linux on the Talend Help Center.

 

Objectives and logical architecture

  1. Install Apache Airflow and its core components.
  2. Import the demo Job into Talend Studio.
  3. Publish the Job to Amazon ECR.
  4. Create a DAG in Airflow and provision the tasks to Kubernetes.

    0693p000008uLukAAE.jpg

     

Installation

The following sections guide you through the steps required to install Airflow.

Note: In this article, Airflow is installed with Docker on an EC2 instance. However, you can also install Airflow on containerized platforms like ECS or Kubernetes.

Creating and configuring an Amazon EKS cluster

  1. Create a Kubernetes cluster following the instructions in Getting Started with the AWS Management Console, and execute the steps to:

    1. Create your Amazon EKS Service Role.
    2. Create your Amazon EKS Cluster VPC.
    3. Install and configure Kubectl for Amazon EKS.
    4. Create your Amazon EKS Cluster.
    5. Create a Kubeconfig file.
    6. Launch and configure Amazon EKS worker nodes.
    7. Install the Kubernetes dashboard.

  2. After execution of all the steps above, ensure that you have:

    1. Created an EKS cluster and configured the worker nodes.
    2. Created the Kubeconfig file with master privileges.
    3. Downloaded the Kubectl command-line tool for querying the EKS Cluster.

  3. Verify that the worker nodes have joined the cluster successfully by using the Kubectl tool.

    0693p000008uLj9AAE.jpg

 

EKS Authentication and Authorization

Kubernetes supports several Authentication strategies and Authorization modes.

Amazon EKS uses IAM to provide authentication to your Kubernetes cluster through the AWS IAM Authenticator for Kubernetes. For authorization, EKS relies on native Kubernetes Role-Based Access Control (RBAC).

Note: The security choices made in this article are based on what Amazon EKS supported at the time of writing. See the AWS EKS documentation for the latest security options.

In this article, the client application, Airflow, authenticates with the EKS API server using webhook tokens and performs tasks on the EKS cluster based on the system:masters permissions assigned to the k8s_role. Create an IAM k8s_role and an EKS RBAC configuration by performing the following steps.

  1. From the AWS Management Console, create an IAM role, for example, k8s_role.

    0693p000008uLupAAE.jpg

  2. Attach a policy to the k8s_role, for example, k8s_role_policy.

    Note: In the attachments, you'll find a sample k8s_role_policy. Client applications, for example, Airflow running on an EC2 instance, assume the k8s_role that carries this policy so that they can access other AWS services such as EKS, ECR, IAM, and ECS. Your EKS cluster administrator can create this role with the appropriate permissions.

    0693p000008uLuuAAE.jpg

  3. Update the aws-auth ConfigMap and map the newly created k8s_role by following the AWS instructions on applying the aws-auth ConfigMap to your cluster. An example is available in the aws-auth-cm.yaml file (attached to this article), and a minimal sketch is shown at the end of this section. Configure the mapRoles and rolearn elements (highlighted in red below) using your settings.

    0693p000008uLvEAAU.png

    The Airflow EC2 instance is now configured to assume the IAM k8s_role and to use the RBAC Authorization.

     

Installing Airflow with Docker

Before starting with the installation process, you need to understand all the components in Airflow.

  • Webserver: accepts HTTP requests and lets the user manually trigger runs and monitor the execution status of all the tasks scheduled in a Directed Acyclic Graph (DAG)
  • Scheduler: responsible for scheduling, monitoring, and triggering the tasks of the pipelines according to their dependencies
  • Database: stores all the metadata, such as the DAGs and their execution status
  • Workers: listen to one or more task queues and execute the actual logic of the tasks

Note: As shown in the architecture diagram, install all of the Airflow components (the Airflow EC2 instance, the RDS database, and Redis) in the same VPC.

  1. Create an RDS PostgreSQL database as backend.

    1. From the AWS Management Console, create a PostgreSQL database instance.

      0693p000008uLqtAAE.jpg

      Note: You can install any database backend that is supported by the SQLAlchemy library.

    2. Open the Security group. Edit Inbound rules and provide access to Airflow.

      0693p000008uLvJAAU.jpg

  2. Create a Redis Cluster as Celery backend.

    1. From the AWS Management Console, create an Elasticache cluster with Redis engine.

      0693p000008uLvYAAU.jpg

      Note: Airflow uses messaging techniques to scale out the number of workers; see Scaling Out with Celery. Redis is an open-source in-memory data structure store, used as a database, cache, and message broker.

    2. Open the Security group. Edit Inbound rules and provide access to Airflow.

      0693p000008uLvdAAE.jpg

 

Installing Airflow on an EC2 instance

  1. Launch an EC2 Ubuntu Linux 18.04 instance (for example, t2.xlarge) and assign it the k8s_role created in the previous step.

  2. Open the Security group. Edit Inbound rules and provide access to RDS and Redis services.

    0693p000008uLviAAE.jpg

  3. SSH to the Airflow EC2 instance and create the talend user:

    sudo adduser talend
    ##add talend user to sudoers
    sudo usermod -aG sudo talend
  4. Install Docker:

    sudo apt update
    ##Install docker from the official docker repository.
    sudo apt install apt-transport-https ca-certificates curl software-properties-common
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"
    sudo apt update
    apt-cache policy docker-ce
    sudo apt install docker-ce
    ## Verify that the docker daemon is running
    sudo systemctl status docker
    ## Add the talend user to the docker group
    sudo usermod -aG docker talend
  5. Install Docker Compose:

    sudo curl -L "https://github.com/docker/compose/releases/download/1.23.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
    sudo chmod +x /usr/local/bin/docker-compose
    docker-compose --version
  6. Install AWS CLI using python3 and pip3:

    # Reference: AWS documentation
    # https://docs.aws.amazon.com/cli/latest/userguide/install-linux.html#install-linux-pip
    curl -O https://bootstrap.pypa.io/get-pip.py
    sudo apt-get install python3-distutils 
    python3 get-pip.py --user
    export PATH=~/.local/bin:$PATH
    source ~/.bashrc
    pip3 --version
    pip3 install awscli --upgrade --user

    Note: By default, python3 is installed on an EC2 Ubuntu instance.

  7. Configure AWS CLI:

    aws configure
    AWS Access Key ID [None]: provide_access_key
    AWS Secret Access Key [None]: provide_secret_key
    Default region name [None]: provide_region, e.g., eu-central-1
    Default output format [None]: json
  8. Copy the Airflow setup files from the preparation_files.zip (attached to this article):

    mkdir ~/airflow
    sudo apt-get install zip unzip
    # Download the preparation files attached to this article, for example with wget
    wget https://community.talend.com/yutwg22796/attachments/yutwg22796/trb_develop/411/2/preparation_files.zip
    # Extract the archive so that its contents (the script, config, and dags folders) end up under ~/airflow
    unzip preparation_files.zip -d ~/airflow

    0693p000008uLvnAAE.jpg

     

  9. Configure the environment variables required for Airflow in the entrypoint.sh script file.

  10. Configure Airflow to communicate with its other core components, namely Database and Elasticache.

  11. Set the Database connection parameters and Message broker URL settings either in the Airflow configuration file, for example, airflow.cfg or as environment variables.

    Note: The Dockerfile (attached to this article) executes the entrypoint.sh script, and environment variables are used to set the connection parameters.

    cd ~/airflow/script
    vi entrypoint.sh
    ##Update REDIS & POSTGRES connection parameters and save the file

    0693p000008uLubAAE.jpg

  12. Copy your kubeconfig file to the Airflow config folder.

    Note: Use the kubeconfig file that was generated when you created the EKS cluster.

    cd ~/airflow/config
    # Copy your EKS cluster kubeconfig file into this folder
    # After copying, verify that the file is present using the ls command
    ls ~/airflow/config/config
    # Modify the "exec" command in your config file to use aws-iam-authenticator.
    # aws-iam-authenticator generates an identity token based on your awscli credentials.
    # For an example of the syntax, see the attached sample_config_file.
    # Your exec command in the file should look like this:
    exec: {apiVersion: client.authentication.k8s.io/v1alpha1, args: [token, '-i', REPLACE_WITH_YOUR_KUBERNETE_CLUSTER_NAME], command: aws-iam-authenticator}
  13. Execute the docker build command and create an Airflow docker image:

    cd ~/airflow
    # Replace the placeholder xxx with your Docker user name
    docker build -t xxx/docker-airflow-aws-eks:1.10.2 .
  14. Launch Airflow services using docker-compose:

    # Execute docker-compose to launch all the Airflow components: webserver, scheduler, worker, and flower
    docker-compose -f docker-compose-CeleryExecutor.yml up -d
  15. Verify the running containers.

    #List the running containers with the ps command
    docker ps

    Congratulations, you’ve installed Airflow.

  16. Log in to the Airflow Web UI.

    To access the Airflow Web UI, type the following URL into your browser.

    http://AIRFLOW_EC2_INSTANCE_PUBLIC_IP:8080/admin

    0693p000008uLvsAAE.jpg

     

Importing the demo Job in Talend Studio

  1. Download and import the demo Job, LOCAL.7z (attached to this article), into Studio.

    0693p000008uLvxAAE.jpg

  2. Notice that the tMap_1 Job uses a tFixedFlowInput component to generate 100 records, and that the tMap component converts the content of the txt column to uppercase and multiplies the number column by 10.

  3. Run the Job. It should output 100 records to your console.

    0693p000008uLw2AAE.jpg

     

Publishing the Job to Amazon ECR

Using Talend Studio, publish the Job to ECR.

  1. Create a repository in ECR.

    0693p000008uLw7AAE.jpg

  2. In Studio, right-click the Job and select Publish, then select Docker Image from the Export Type pull-down menu. Click Next.

  3. Fill in the ECR repository details. Click Finish.

    0693p000008uLqPAAU.jpg

    Note: Ensure the Registry + Image name matches the Repo URI created in Step 1.

  4. Confirm that the Job is published to the repository with version 0.1.0.

    0693p000008uLwCAAU.jpg

    For information on publishing Jobs using CI with a Jenkins pipeline, see the Containerization and orchestration of Talend microservices with Docker and Kubernetes article in the Talend Community Knowledge Base (KB).

     

Creating a DAG and provisioning tasks to Kubernetes

In Airflow, a Directed Acyclic Graph (DAG) is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.

  1. In the preparation_files.zip file, open the Job_Tmap_1.py file to see the sample DAG, DAG_tmap_1.

    0693p000008uLwMAAU.jpg

  2. Build an Airflow Docker image that includes the sample DAG from the dags folder.

    docker build -t user/docker-airflow-aws-eks:1.10.2 .
  3. Start Airflow using docker-compose.

     docker-compose -f docker-compose-CeleryExecutor.yml up -d
  4. Log in to the Airflow Web UI and notice that a new DAG is created automatically. Trigger the DAG manually.

    0693p000008uLwWAAU.jpg

  5. In DAG_tmap_1, notice that two tasks, dummy_task_tmap1 and kubernetes_operator_task_tmap_1, were created.

    0693p000008uLmrAAE.jpg

  6. Review the task execution logs for kubernetes_operator_task_tmap_1.

    0693p000008uLx0AAE.jpg

  7. Verify the status and logs of the tmap1 Pod using Kubectl.

    0693p000008uLvFAAU.jpg

  8. Open the Kubernetes dashboard. Notice that a new Pod is created for the tmap1 task and that the Pod is terminated after a successful execution.

    0693p000008uLx5AAE.jpg

     

Conclusion

In this article, you leveraged the Docker build feature of Talend Studio to build containers, published Jobs to ECR, provisioned Talend Jobs to scalable platforms like Kubernetes, and saw the benefits of containerization and serverless platforms. You also learned how to install Airflow with Docker, create scalable workers using the CeleryExecutor, and explored the scheduling and integration features of Airflow.
