Anonymous
Not applicable

read big bz2 in spark (Big Data Batch)

I'm trying to read a big bz2 file in a Spark batch job (the file is in HDFS). I noticed that the Spark job
is not splitting the file and is using only one executor to read the whole thing, which takes more than an hour.
The component I'm using to read the file is tFileInputDelimited in a Big Data Batch job.
I analyzed the generated code and found that the minPartitions argument of ctx.hadoopRDD
is not being used.
I'm wondering if there is any way to specify the number of partitions, so that more executors are
used and the time to read the bz2 file is reduced.
Thanks.
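For context, bzip2 is one of the Hadoop compression codecs that is splittable, so in plain Spark a single .bz2 file can be read by several tasks if a partition hint is passed through to the underlying hadoopRDD call. Below is a minimal PySpark sketch of the two usual approaches; the HDFS path and the partition count of 64 are hypothetical, and this is not the code Talend generates, just an illustration of the mechanism in question. It needs a running Spark environment to execute.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-bz2").getOrCreate()
sc = spark.sparkContext

# Hypothetical HDFS path; bzip2 is a splittable codec, so
# Hadoop's TextInputFormat can split a single .bz2 file.
path = "hdfs:///data/big_file.csv.bz2"

# Approach 1: ask for a minimum number of input partitions up front.
# textFile forwards this value as the minPartitions argument of the
# underlying hadoopRDD call, so the read itself is parallelized.
rdd = sc.textFile(path, minPartitions=64)

# Approach 2: repartition after the read. This does not speed up the
# read stage itself (and it adds a shuffle), but it lets all later
# stages run on many executors even if the read used only one task.
rdd = rdd.repartition(64)
```

Approach 1 is the one that would correspond to the unused minPartitions argument mentioned above; approach 2 is only a workaround for the downstream stages.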
1 Reply
Anonymous
Not applicable
Author

Hi,
Could you please indicate the build version you are using? What does your Spark job look like? Could you please post a screenshot of your workflow to the forum?
Best regards
Sabrina