I'm trying to read a big bz2 file in a Spark batch job (the file is in HDFS). I noticed that the Spark job is not splitting the file and is using only one executor to read the whole thing, which takes more than an hour!
The component I'm using to read the file is tFileInputDelimited in a Big Data Batch job.
I analyzed the generated code and found that the minPartitions argument of ctx.hadoopRDD is not being used.
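To illustrate what I mean, here is a minimal standalone sketch, not the Talend-generated code: the class name, the HDFS path, the TextInputFormat choice, and the partition count 64 are all placeholders I picked, but the hadoopRDD overload with a minPartitions argument is the one I'd expect the generated code to call:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class Bz2ReadSketch {
    public static void main(String[] args) {
        JavaSparkContext ctx =
                new JavaSparkContext(new SparkConf().setAppName("bz2-read-sketch"));

        JobConf jobConf = new JobConf();
        // Placeholder path; substitute the real HDFS location.
        FileInputFormat.setInputPaths(jobConf, "hdfs:///path/to/big_file.bz2");

        // The last argument is minPartitions. Hadoop's bzip2 codec is
        // splittable (unlike gzip), so asking for e.g. 64 partitions lets
        // Spark divide the file across many tasks/executors instead of
        // reading it with a single one.
        JavaPairRDD<LongWritable, Text> lines = ctx.hadoopRDD(
                jobConf, TextInputFormat.class, LongWritable.class, Text.class, 64);

        System.out.println("partitions: " + lines.getNumPartitions());
        ctx.stop();
    }
}
```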
I'm wondering if there is any way to specify the number of partitions, so that more executors are used and the time to read the bz2 file goes down. The generic Spark-level workarounds I have in mind are sketched below.
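Again only a sketch (this reuses the ctx from the snippet above, needs import org.apache.spark.api.java.JavaRDD, and the path and the count 64 are still placeholders):

```java
// Workaround 1: textFile also has an overload taking minPartitions directly.
JavaRDD<String> lines = ctx.textFile("hdfs:///path/to/big_file.bz2", 64);

// Workaround 2: if the source really cannot be split at read time,
// repartitioning right after the read at least parallelizes everything
// downstream of it.
JavaRDD<String> redistributed = lines.repartition(64);
```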
Thanks.
Hi,

Could you please indicate the build version you are using? What does your Spark job look like? Could you please post a screenshot of your workflow to the forum?

Best regards
Sabrina