Using Talend Studio 6.4.1 with Cloudera CDH 5.11

Author: TalendSolutionExpert
Created date: Apr 1, 2021 6:00:39 AM
Last Update: Jan 22, 2024 9:35:30 PM, by Jamie_Gregory

Problem Description

Talend Studio 6.4.1 supports CDH 5.10 but does not support CDH 5.11. CDH 5.12 is supported in the Talend Winter ’18 release, which leaves a gap in the officially compatible CDH versions: in Studio 6.5, this winter's release, you will find CDH 5.12 in the list of distributions, but not CDH 5.11.

 

Root Cause

CDH 5.12 adds numerous features beyond CDH 5.11. Talend is working with all Hadoop platform providers on a generic way of connecting to any version of a cluster, which will address these skipped versions and more.

 

Environment

  • New single-node Cloudera Hadoop cluster, installed on a Red Hat 7.4 AWS AMI using Cloudera Manager 5.11.

    If you need to install an older version of CDH, see this link: Installing Older version of CDH.

  • Core CDH services, plus Kafka 0.10.0 and Spark 2.1 (Spark 1.6 is installed by default).

    To reproduce this setup, see this link for Kafka: Install Kafka on CDH and this link for Spark 2: Install Spark2 on CDH.

  • TAC and a JobServer are installed on the cluster so that Spark Jobs can be fully tested on YARN.
  • Talend Studio 6.4.1 is installed on a Windows AWS machine, in the same AWS environment as the Red Hat 7.4 cluster.

 

Solution

 

Create a Hadoop cluster connection

  1. In Talend Studio, from the Repository, right-click Hadoop Cluster and select Create Hadoop Cluster.


  2. The best practice when you want to connect to an unsupported version of a cluster is to find the closest supported version, and then manually specify the Hadoop services.

     Select Custom - unsupported for the distribution, then choose Cloudera as the base distribution and CDH5.10 YARN mode as the version. (A client-side sketch of the service endpoints such a connection relies on follows this list.)


  3. From there, you can replace the individual libraries with the appropriate versions of the JARs.


    The video tutorial How to add an unsupported Hadoop distribution to Talend Studio describes how to find and replace a JAR (it was made for HDP, but the approach is reusable for CDH).

  4. You may not have to go through the custom unsupported connection at all: you can select the CDH5.10 distribution, use the Cloudera Manager service to retrieve the cluster configuration, and keep the CDH5.10 libraries.


  5. Next, paste the hostname of your cluster and fetch the Hadoop components.


  6. Specify a user name and check the services.


  7. In the Repository metadata, under Hadoop Cluster, you should see the connection you just created.

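Before building Jobs, it can help to confirm that the cluster is reachable from the Studio machine with a plain Hadoop client. The sketch below is hand-written Java, not code generated by Studio; the host name is a placeholder, and 8020/8032 are simply the CDH defaults for the NameNode and the YARN ResourceManager.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClusterSmokeTest {
    public static void main(String[] args) throws Exception {
        // Typical CDH service endpoints; the host name is a placeholder
        // for your own cluster (8020 and 8032 are the CDH defaults for
        // the NameNode and the YARN ResourceManager).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://cdh-node.example.com:8020");
        conf.set("yarn.resourcemanager.address", "cdh-node.example.com:8032");

        // Listing the HDFS root is a quick way to confirm connectivity
        // before building Jobs on top of the connection.
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus s : fs.listStatus(new Path("/"))) {
            System.out.println(s.getPath());
        }
        fs.close();
    }
}
```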

 

Standard Jobs on the Cluster

  1. Testing started with a very simple Job that puts data into HDFS using a tHDFSPut component together with a tHDFSConnection component; the Job executed successfully.


  2. Another Job loaded the same file into two HBase tables: one storing the raw data as-is, and another where blank values were replaced with null in order to leverage HBase's sparse data storage (see the sketch after this list).


  3. The same was done with Hive, and all Jobs executed successfully.
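The blank-to-null replacement in step 2 maps directly onto how the HBase client works: a cell that is never written costs nothing on disk. Below is a minimal hand-written sketch against the HBase 1.x client API (the generation shipped with CDH 5.11), roughly equivalent to what the Job achieves; the table name, column family, quorum host, and sample record are all hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SparseHBaseLoad {
    public static void main(String[] args) throws Exception {
        // ZooKeeper quorum host is a placeholder for your cluster.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "cdh-node.example.com");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customers"))) {
            // One sample record: row key, then name, phone (blank), city.
            String[] row = {"id42", "Jane", "", "Paris"};
            String[] columns = {"name", "phone", "city"};

            Put put = new Put(Bytes.toBytes(row[0]));
            for (int i = 0; i < columns.length; i++) {
                String value = row[i + 1];
                // Blank values are simply never written: an absent cell
                // costs nothing in HBase, which is the sparse-data benefit
                // the second table in the Job relies on.
                if (value != null && !value.isEmpty()) {
                    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(columns[i]),
                            Bytes.toBytes(value));
                }
            }
            table.put(put);
        }
    }
}
```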

 

Big Data Batch Job with MapReduce

  1. Testing of Big Data Batch Jobs started with a very simple one: a map task that filters the data set and a reduce task that aggregates the filtered result (a plain MapReduce equivalent is sketched after this list).


  2. A more complex Job was tested, using a tMap component.

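For reference, here is roughly what such a one-map, one-reduce Job boils down to when written by hand against the MapReduce API. The semicolon-separated schema, the field positions, and the "FR" filter value are invented for illustration; Talend Studio generates its own, more elaborate code.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FilterAggregate {

    // Map task: keep only rows whose third field is "FR",
    // then emit (city, 1) for the aggregation step.
    public static class FilterMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(";");
            if (fields.length > 2 && "FR".equals(fields[2])) {
                ctx.write(new Text(fields[1]), ONE);
            }
        }
    }

    // Reduce task: count the filtered rows per key.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) sum += v.get();
            ctx.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "filter-aggregate");
        job.setJarByClass(FilterAggregate.class);
        job.setMapperClass(FilterMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```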

 

Big Data Batch Job with Spark

Spark 2.1 must be installed manually using Cloudera Manager (with the Spark 2 CSD and parcel) in order to test both Spark 1.6 and Spark 2.1 Jobs.

  1. Convert your first MapReduce Job to a Spark Job.
  2. Add a tHDFSConnection for the storage.
  3. Set up the Spark configuration of the Job to use the machine where Talend Studio is installed as the Spark driver (the driver machine must be on the same private network as the cluster so that private IPs resolve within the VPC); a driver-side configuration sketch follows this list.


  4. A more complex Job read from an HBase table and applied heavier aggregations.


  5. Finally, the Spark machine learning library (MLlib) was tested using the Naïve Bayes classification algorithm on the famous Iris data set (a standalone MLlib sketch also follows this list).

    1. Creating the model with the training dataset.


    2. Scoring the test dataset with the predictive model.


    3. The accuracy of the predictive model is quite good; only one of the predictions was wrong.


  6. The Spark Jobs were tested with both versions of Spark, and they executed successfully.
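For step 3, the driver-side setup Studio performs corresponds roughly to the hand-written sketch below. The driver IP is a placeholder, and the sketch assumes HADOOP_CONF_DIR points at the cluster's client configuration; it is an illustration of client deploy mode, not the code Studio generates.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class YarnClientDriver {
    public static void main(String[] args) {
        // In client deploy mode, the machine running this code is the Spark
        // driver; the YARN executors connect back to spark.driver.host,
        // which is why the Studio machine must share the VPC's private network.
        SparkConf conf = new SparkConf()
                .setAppName("studio-driver-test")
                .setMaster("yarn")                       // "yarn-client" on Spark 1.6
                .set("spark.submit.deployMode", "client")
                .set("spark.driver.host", "10.0.0.15");  // placeholder: Studio machine's private IP
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println("defaultParallelism = " + sc.defaultParallelism());
        }
    }
}
```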
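For step 5, a standalone equivalent of the Naïve Bayes test can be written against the MLlib RDD API available in both Spark 1.6 and 2.1. The HDFS path, the CSV layout (four numeric features followed by a numeric class label), and the 70/30 split are assumptions for illustration.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

public class IrisNaiveBayes {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("iris-naive-bayes"));

        // Iris rows as "sepalLen,sepalWid,petalLen,petalWid,label";
        // the path is a placeholder for wherever the data set sits in HDFS,
        // and the label is assumed to be numeric (0, 1, or 2).
        JavaRDD<LabeledPoint> data = sc.textFile("hdfs:///data/iris.csv")
                .map(line -> {
                    String[] p = line.split(",");
                    double[] features = new double[4];
                    for (int i = 0; i < 4; i++) features[i] = Double.parseDouble(p[i]);
                    return new LabeledPoint(Double.parseDouble(p[4]), Vectors.dense(features));
                });

        // 70/30 split: train the model, then score the held-out rows.
        JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3}, 42L);
        NaiveBayesModel model = NaiveBayes.train(splits[0].rdd(), 1.0);

        double accuracy = splits[1]
                .filter(pt -> model.predict(pt.features()) == pt.label())
                .count() / (double) splits[1].count();
        System.out.println("Test accuracy: " + accuracy);
        sc.stop();
    }
}
```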

 

Big Data Streaming Job with Spark

  1. Spark Streaming Jobs were tested using Kafka: one Job produces and sends messages to a topic, and another Job consumes those messages.

    Note: Be sure to install a version of Kafka that is supported by Talend Studio 6.4.1; in this case, version 0.10.0.

  2. One Job produces and sends a message to a Kafka topic.


  3. Another Job consumes those messages and shows them in a log table.


  4. Everything worked well (a plain Kafka client equivalent of this producer/consumer pair is sketched below).

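Outside of Studio, the same round trip can be reproduced with the plain Kafka 0.10.0 client API. A minimal sketch; the broker address, topic name, and group id below are placeholders.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaRoundTrip {
    public static void main(String[] args) {
        String broker = "cdh-node.example.com:9092"; // placeholder broker address
        String topic = "talend_test";                // placeholder topic name

        // Producer side: the equivalent of the Job that sends messages.
        Properties p = new Properties();
        p.put("bootstrap.servers", broker);
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>(topic, "hello from Studio"));
        }

        // Consumer side: the equivalent of the Job that logs messages.
        Properties c = new Properties();
        c.put("bootstrap.servers", broker);
        c.put("group.id", "talend_test_group");      // placeholder group id
        c.put("auto.offset.reset", "earliest");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(Collections.singletonList(topic));
            ConsumerRecords<String, String> records = consumer.poll(5000);
            for (ConsumerRecord<String, String> r : records) {
                System.out.println(r.offset() + ": " + r.value());
            }
        }
    }
}
```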

 

Conclusion

Even though CDH 5.11 is not officially supported in Talend Studio 6.4.1, everything went well and the configuration was straightforward, so you can still use your favorite data integration tool for the main critical tasks and more. Not every single detail was tested, but overall no compatibility issues arose.

 

You are more than welcome to share your experiences, and any issues you might be facing with this setup, in the comments section.
