Anonymous
Not applicable

Need suggestion on best practice: Big Data Batch Job (with Spark) or Standard Job with tSqoop (with MapReduce)?

Hi expert,

 

First of all, I haven't seen a main topic for Big Data discussion like in your old forum; only the BD sandbox is currently available. So I decided to ask here.

Can any of you suggest which approach we should choose for data ingestion from an RDBMS into HDFS/Hive?
I've been considering these two options; please let me know which is better (or whether there is a better way):

1. Standard Job: tSqoopImport --(OnComponentOk)--> tHiveLoad
OR
2. Big Data Batch Job (Spark): tXXXInput (an RDBMS such as Oracle/MSSQL/etc.) --(Main)--> tFileOutputDelimited (written to HDFS) --> load into Hive from HDFS (roughly the PySpark sketch below)

Or does anyone have a better solution altogether?
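For context, here is a minimal PySpark sketch of what option 2 amounts to (all connection details, table names, and paths are hypothetical, and the Oracle JDBC driver is assumed to be on the classpath). Note that Spark can also write straight into a Hive table with saveAsTable, which would remove the intermediate delimited file from the flow:

```
from pyspark.sql import SparkSession

# Spark session with Hive support, so tables land in the Hive metastore.
spark = (SparkSession.builder
         .appName("rdbms-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# tXXXInput equivalent: read the source table over JDBC.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
          .option("dbtable", "SALES.ORDERS")
          .option("user", "etl_user")
          .option("password", "***")
          .load())

# tFileOutputDelimited equivalent: stage the rows as delimited files on HDFS ...
orders.write.mode("overwrite").option("sep", "|").csv("hdfs:///staging/orders")

# ... or skip the staging step and write directly into a Hive table
# (assumes a "staging" database already exists in the metastore).
orders.write.mode("overwrite").saveAsTable("staging.orders")
```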

Huge thanks

2 Replies
Anonymous
Not applicable

Hi,

You can import data from an RDBMS into Hive using Sqoop alone, without tHiveLoad: Sqoop's Hive import option creates the Hive table and loads the data as part of the import.
Please take a look at the related scenario in the component reference: TalendHelpCenter:tSqoopImport
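For reference, the equivalent standalone Sqoop command looks roughly like this (a sketch only; the connection string, credentials, and table names are hypothetical). The key option is --hive-import, which makes Sqoop create and load the Hive table in one step, with no separate tHiveLoad needed:

```
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username etl_user \
  --password-file /user/etl/.sqoop_password \
  --table ORDERS \
  --hive-import \
  --hive-table staging.orders \
  --num-mappers 4
```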
Best regards
Sabrina

Anonymous
Not applicable
Author

Hi @xdshi, thanks for the reply.

Yes, I'm currently using tSqoopImport to bring the data into HDFS, and since the destination is Hive, I then use tHiveLoad.

My main confusion is that, for the same scenario (RDBMS source to Hive), if I'm not mistaken it is also possible to use a Big Data Batch Job, which performs the task on the Spark framework. The component flow would be more or less as in option 2 above.

So, back to the question: which would be faster (or which is the best practice)?
Using a Standard Job for ingestion from the RDBMS,
or a Big Data Batch Job on the Spark framework?
Correct me if I'm wrong.

Thanks