Using Hive components in MapR Spark Jobs to read Hive MapR-DB tables

TalendSolutionExpert — Fri, 09 Feb 2024 19:06:24 GMT

Overview

This article explains how to use Talend Hive components in MapR Spark Batch Jobs to read from Hive MapR-DB tables. Because MapR lets you query MapR-DB tables through a Hive View, this article also covers how to set up Talend Jobs to read from a Hive View of a MapR-DB table.

Environment

Talend Studio 6.5.1
MapR 6.0.1

Prerequisites

Setting up MapR

Set up the MapR Client 6.0.1 on the system that runs your Job so that it can connect to your MapR cluster. For more information on setting up a MapR Client, see the Installing the MapR Client page in the MapR 6.1 documentation.
After setting up the MapR Client, generate a MapR ticket (for example, with the maprlogin utility) that your Job can use to communicate with the cluster.
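
Before running a Job, it can help to confirm that a ticket file is actually in place. The following is a minimal sketch, not part of the Talend setup itself. It assumes the usual MapR client behavior on Linux: the ticket is read from the path in the MAPR_TICKETFILE_LOCATION environment variable if that is set, and otherwise from /tmp/maprticket_<uid>. The function names are hypothetical.

```python
import os


def ticket_path(env=None, uid=None):
    """Return the path where the MapR client is assumed to look for a ticket."""
    env = os.environ if env is None else env
    override = env.get("MAPR_TICKETFILE_LOCATION")
    if override:
        return override
    if uid is None:
        uid = os.getuid()  # Linux/macOS only
    # Assumed default ticket location on Linux clients.
    return f"/tmp/maprticket_{uid}"


def has_ticket(path):
    """True if a non-empty ticket file exists at the given path."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


if __name__ == "__main__":
    path = ticket_path()
    print(path, "found" if has_ticket(path) else "missing")
```

This check only tells you that a file exists. It does not verify that the ticket is still valid or that it belongs to the right cluster.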

Setting up Studio

Ensure that your Studio can reach all of the cluster nodes, and that the nodes can reach back to your Studio, as described in the Spark Security documentation. This is required because Talend runs Spark in YARN client mode, in which the Spark driver starts on the same machine the Job is run from.
Configure the Hadoop Cluster connection in the Studio metadata:
1. Right-click Hadoop Cluster, then click Create Hadoop Cluster.
2. Select the distribution and version of your Hadoop cluster, then select Import configuration from local files. Click Next.
3. Ensure your system has a local copy of the hive-site.xml, mapred-site.xml, and yarn-site.xml files to import into the Hadoop metadata wizard.
4. Import the cluster configuration files.
5. Notice that after the configuration files are imported, not all of the information on the next screen is populated, and the wizard warns you that the Resource Manager needs to be specified. This is because the configuration files contain no specific hostnames for the Resource Manager and CLDB nodes. The files are still needed later in this article, though, because they contain properties that help with Resource Manager HA.
6. To fully utilize the CLDB and Resource Manager HA, complete the remaining fields in the wizard.
7. Once the cluster information is populated, click Check Services to ensure that Studio can connect successfully to the cluster.
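
To see why the wizard cannot fill in the Resource Manager field in step 5, you can inspect the imported files yourself. The sketch below parses a Hadoop-style site file and checks whether it names a Resource Manager host. The sample XML content is made up for illustration and the helper names are hypothetical. The two property names checked are standard YARN settings, and your cluster may use others.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the standard Hadoop *-site.xml layout.
SAMPLE_YARN_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
"""


def read_site_properties(xml_text):
    """Return the name/value pairs of a Hadoop site file as a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def has_resource_manager_host(props):
    """True if the file pins the Resource Manager to an explicit host."""
    keys = ("yarn.resourcemanager.hostname", "yarn.resourcemanager.address")
    return any(props.get(key) for key in keys)


if __name__ == "__main__":
    props = read_site_properties(SAMPLE_YARN_SITE)
    print("explicit Resource Manager host:", has_resource_manager_host(props))
```

To check a real file, read its contents (for example, with open("yarn-site.xml").read()) and pass the text to read_site_properties.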

Building the Job

Right-click Job Designs, click Create Big Data Batch Job, then give it a name.
From the Hadoop Cluster connection you created earlier, drag the HDFS connection to the canvas, then choose to add it as a tHDFSConfiguration component. Notice that the component is populated right away, and that in the Run tab, the Spark Configuration information is completed for you. This information tells the Job how to communicate with Spark.
Again using the Hadoop Cluster connection you created earlier, drag the Hive connection to the canvas, then choose to add it as a tHiveConfiguration component.
Add one tLibraryLoad component for each of the following libraries. The Hive components use these libraries to retrieve the data from the Hive View of your MapR-DB table:

hbase-common-1.1.8-mapr-1710.jar
hbase-client-1.1.8-mapr-1710.jar
hbase-server-1.1.8-mapr-1710.jar
hbase-spark-1.1.8-mapr-1710.jar
hbase-protocol-1.1.1-mapr-1710.jar
hive-hbase-handler-2.1.1-mapr-1710.jar
mapr-hbase-6.0.1-mapr.jar
maprdb-6.0.1-mapr.jar
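
Because a missing library typically only surfaces when the Job runs, it can be worth checking up front that every JAR is available. The following sketch compares the list above against the JAR files found in a directory. The helper names are hypothetical, and the directory you scan is an assumption: use wherever your MapR Client or cluster copy keeps these files.

```python
from pathlib import Path

# Library names copied from the list above.
REQUIRED_JARS = [
    "hbase-common-1.1.8-mapr-1710.jar",
    "hbase-client-1.1.8-mapr-1710.jar",
    "hbase-server-1.1.8-mapr-1710.jar",
    "hbase-spark-1.1.8-mapr-1710.jar",
    "hbase-protocol-1.1.1-mapr-1710.jar",
    "hive-hbase-handler-2.1.1-mapr-1710.jar",
    "mapr-hbase-6.0.1-mapr.jar",
    "maprdb-6.0.1-mapr.jar",
]


def jars_in(lib_dir):
    """Return the file names of all JARs directly inside lib_dir."""
    return [path.name for path in Path(lib_dir).glob("*.jar")]


def missing_jars(available_names, required=REQUIRED_JARS):
    """Return the required JARs that are not in available_names, in list order."""
    present = set(available_names)
    return [jar for jar in required if jar not in present]


if __name__ == "__main__":
    # "." is a placeholder; point this at your own library directory.
    for jar in missing_jars(jars_in(".")):
        print("missing:", jar)
```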

Add a tHiveInput component and configure it to read from the Hive View of your MapR-DB table.
Configure this component to output the values of the table to a tLogRow to ensure you can successfully read the table.
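
This article does not show how the Hive View itself is defined. As an illustration only, the sketch below builds the kind of HiveQL statement commonly used to map a MapR-DB table into Hive through the HBase storage handler, which is what the hive-hbase-handler library provides. The table name, MapR-DB path, and column mapping are hypothetical, and your view may be defined differently.

```python
def hive_mapping_ddl(hive_table, maprdb_path, key_column, columns):
    """Build a CREATE EXTERNAL TABLE statement for a MapR-DB table.

    key_column: (hive_column, hive_type) for the row key.
    columns: list of (hive_column, hive_type, "family:qualifier").
    """
    key_name, key_type = key_column
    col_defs = [f"{key_name} {key_type}"]
    col_defs += [f"{name} {hive_type}" for name, hive_type, _ in columns]
    # ":key" maps the first Hive column to the row key.
    mapping = ",".join([":key"] + [target for _, _, target in columns])
    column_list = ", ".join(col_defs)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({column_list})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{maprdb_path}')"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(hive_mapping_ddl(
        "customers_view",
        "/apps/customers",
        ("id", "string"),
        [("name", "string", "info:name")],
    ))
```

With a definition along these lines in place, the tHiveInput component can read the Hive table by name like any other Hive table.
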
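The complete Job then contains the tHDFSConfiguration and tHiveConfiguration components, one tLibraryLoad component per library, and a tHiveInput component connected to a tLogRow.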

Running the Job

Run the Job to confirm that it connects to the Hive View and can read the MapR-DB table data.

Additional notes

The same Job design will work for MapR 5.2.0 and above.

You can utilize MapR 6.0.1 in Talend 6.5.1 through a patch, available from Talend Support, that adds it as a supported version.
