dipanjan93
Contributor

Significance of Job Result Folder and Deployment Blob in tHiveConnection with Microsoft HDInsight Distribution

Hi Guys,

What is the significance of the Job result folder and the Deployment Blob in tHiveConnection with the Microsoft HDInsight distribution?

Essentially, I'm getting the error below while accessing a sample Hive table in the HDInsight cluster:

 Starting job test at 12:00 02/06/2021.

[statistics] connecting to socket on port 4065

[statistics] connected

[ERROR]: org.talend.bigdata.launcher.fs.AzureFileSystem - failed to create directory 'tmp'

com.microsoft.azure.storage.StorageException: The requested operation is not allowed in the current state of the entity.

at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:87)

at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:305)

at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:196)

at com.microsoft.azure.storage.blob.CloudBlockBlob.uploadFullBlob(CloudBlockBlob.java:1035)

at com.microsoft.azure.storage.blob.CloudBlockBlob.upload(CloudBlockBlob.java:864)

at com.microsoft.azure.storage.blob.CloudBlockBlob.upload(CloudBlockBlob.java:743)

at com.microsoft.azure.storage.blob.CloudBlockBlob.upload(CloudBlockBlob.java:712)

at org.talend.bigdata.launcher.fs.AzureFileSystem.mkdir(AzureFileSystem.java:141)

at org.talend.bigdata.launcher.webhcat.QueryJob.sendFiles(QueryJob.java:88)

at test_0_1.test.tHiveInput_2_OutProcess(test.java:965)

at test_0_1.test.tHiveConnection_3Process(test.java:702)

at test_0_1.test.runJobInTOS(test.java:2126)

at test_0_1.test.main(test.java:1774)

[ERROR]: org.talend.bigdata.launcher.fs.AzureFileSystem - failed to copy file 'test_1.hive'

com.microsoft.azure.storage.StorageException: The specified blob does not exist.

at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:87)

at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:305)

at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:196)

at com.microsoft.azure.storage.blob.CloudBlockBlob.uploadFullBlob(CloudBlockBlob.java:1035)

at com.microsoft.azure.storage.blob.CloudBlockBlob.upload(CloudBlockBlob.java:864)

at com.microsoft.azure.storage.blob.CloudBlockBlob.upload(CloudBlockBlob.java:743)

at com.microsoft.azure.storage.blob.CloudBlockBlob.upload(CloudBlockBlob.java:712)

at org.talend.bigdata.launcher.fs.AzureFileSystem.copyFromLocal(AzureFileSystem.java:173)

at org.talend.bigdata.launcher.webhcat.QueryJob.sendFiles(QueryJob.java:95)

at test_0_1.test.tHiveInput_2_OutProcess(test.java:965)

at test_0_1.test.tHiveConnection_3Process(test.java:702)

at test_0_1.test.runJobInTOS(test.java:2126)

at test_0_1.test.main(test.java:1774)

 

Job test ended at 12:00 02/06/2021. 

 

I tried providing the table file location as well as a temporary blob location, and tested with both ADLS Gen2 and Blob storage, but no luck.

Additionally, in the Windows Azure Configuration I tried the SAS key, the Blob endpoint and the File endpoint, but couldn't get through.

Could you please let me know where exactly I'm going wrong? Let me know if I'm missing something here.

 

 

5 Replies
dipanjan93
Contributor
Author

@Nikhil Thampi / @Richard Hall / @Shicong Hong / @Fred Trebuchet / @Francois Denis - Could you please help here?

Anonymous
Not applicable

@Dipanjan Mallick​ I'm afraid this is not something I am an expert in and am not set up to try this out. Let me ask around the team.

dipanjan93
Contributor
Author

Awesome! Thanks for all the support.

Anonymous
Not applicable

@Dipanjan Mallick The Hive components, when used with Azure HDInsight, don't make the traditional JDBC connection on the backend the way they do with Hive on a Hadoop cluster. For HDInsight it is different: the components have to use WebHCat for connectivity, per Microsoft's best practices. Here is the document from the Microsoft side:

https://docs.microsoft.com/en-us/azure/hdinsight/hadoop/apache-hadoop-use-hive-curl
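To make that concrete, here is a minimal Python sketch of a WebHCat (Templeton) submission, following the curl example in the Microsoft document above; the cluster name, credentials, table and statusdir below are placeholders, not values from your job:

import requests

# Placeholder values - substitute your own HDInsight cluster name and HTTP login.
CLUSTER = "mycluster"
USER = "admin"
PASSWORD = "cluster-login-password"

# HDInsight exposes WebHCat (Templeton) on the cluster gateway.
url = f"https://{CLUSTER}.azurehdinsight.net/templeton/v1/hive"

# 'execute' carries the HiveQL; 'statusdir' is the storage directory where
# WebHCat writes the job's stdout/stderr/exit code.
payload = {
    "user.name": USER,
    "execute": "SELECT * FROM sample_table LIMIT 10;",
    "statusdir": "/example/rest",
}

response = requests.post(url, auth=(USER, PASSWORD), data=payload)
response.raise_for_status()
print(response.json())  # WebHCat returns a job id, e.g. {"id": "job_..."}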

 

As you will see in that documentation, the WebHCat connectivity is API-based, and that is what introduces the two locations you see in the component's configuration. Here is also a good article from the Hive wiki describing how WebHCat works:

 

https://cwiki.apache.org/confluence/display/Hive/WebHCat+UsingWebHCat

 

This last document mentions the following:

Data and code that are used by HCatalog's REST resources must first be placed in Hadoop. When placing files into HDFS is required you can use whatever method is most convenient for you. We suggest WebHDFS since it provides a REST interface for moving files into and out of HDFS.

 

In the case of HDInsight, instead of HDFS/WebHDFS this is going to be the Azure storage that you attach by default to the HDInsight cluster you create. More specifically, here is the role that those two locations play:

 

In the Deployment Blob field, enter the location in which you want to store the current Job and its dependent libraries in this Azure Storage account.

 

In the Job result folder field, enter the location in which you want to store the execution result of a Job in the Azure Storage to be used.

 

Based on the two documents and the explanations above, the Deployment Blob is where the Talend Hive Job will upload the dependencies needed for WebHCat to run the code, and the Job result folder is the location in which the execution result will be stored.
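That dependency upload is exactly the step failing in your stack trace (the launcher's AzureFileSystem calls CloudBlockBlob.upload). Purely as an illustration of what happens behind the scenes, here is a hedged sketch of staging a file under a Deployment Blob path with the azure-storage-blob Python SDK; the connection string, container and path are placeholders, and the Talend launcher itself uses the older Java storage SDK rather than this library:

from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

# Placeholder connection details for the storage account attached to the cluster.
conn_str = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=mystorageaccount;"
    "AccountKey=<key>;"
    "EndpointSuffix=core.windows.net"
)
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("mycontainer")

# Stage a job dependency under the Deployment Blob path, e.g. 'talend/deployment/'.
with open("test_1.hive", "rb") as data:
    container.upload_blob(name="talend/deployment/test_1.hive", data=data, overwrite=True)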

 

Also, if you look into the Microsoft documentation, it describes the directories used like this:

 

statusdir - The directory that the status for this job is written to.

Once the state of the job has changed to SUCCEEDED, you can retrieve the results of the job from Azure Blob storage. The statusdir parameter passed with the query contains the location of the output file; in this case, /example/rest. This address stores the output in the example/curl directory in the cluster's default storage.
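In other words, once a job reports SUCCEEDED, reading the result back is just another blob download from that statusdir in the cluster's default storage. Again a rough sketch with placeholder names, mirroring the upload example above:

from azure.storage.blob import BlobServiceClient

# Same placeholder storage account and container as in the upload sketch above.
conn_str = "<placeholder connection string>"
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("mycontainer")

# WebHCat writes stdout, stderr and the exit code under the statusdir passed with the query.
stdout_blob = container.get_blob_client("example/rest/stdout")
print(stdout_blob.download_blob().readall().decode("utf-8"))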

 

To conclude, in the Hive component you will need to define the cluster storage information (Blob or ADLS Gen2) and then the paths in that storage that you want to be leveraged.
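Just to make the two fields concrete, hypothetical values could look something like this (illustrative placeholders only, not official syntax; both paths live in the storage account and container you declare in the Windows Azure Configuration):

Deployment Blob: talend/deployment (the Job and its dependent libraries are uploaded here)
Job result folder: talend/jobresults (WebHCat's statusdir output ends up here)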


dipanjan93
Contributor
Contributor
Author

@Petros Nomikos - Thanks, mate, for the detailed explanation. However, I provided the ADLS Gen2 information in the config section and am still facing the same issue. Could you please confirm whether I can point the Deployment Blob/Job result folder to any folder in the ADLS Gen2 account where my table resides, or whether I can use another storage account too?