Anonymous

Spark Streaming creating multiple HDFS directories

Hi All,

 

I have a Spark Streaming job consuming messages from a MapR stream. I am writing the messages to an HDFS location, and a separate batch process then picks them up from there.

The problem is that every micro-batch (I have set a 2-minute batch interval) in the streaming job creates a separate directory in HDFS named with a timestamp value. I am not sure how to merge all the files for a particular day so that the merged output can be fed to my end-of-day batch job for processing.

 

Can anybody please help here?
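One workaround (a sketch, not from the thread): since each micro-batch writes a directory named with its timestamp, an end-of-day step can collect all batch directories whose timestamp falls on the target day and concatenate their part files into one input for the batch job. The directory layout (`<root>/<epoch_millis>/part-*`) and the helper name `merge_day` below are assumptions for illustration:

```python
import shutil
from datetime import datetime
from pathlib import Path


def merge_day(stream_root: str, day: str, out_file: str) -> int:
    """Concatenate part files from every per-batch directory whose
    epoch-millis name falls on `day` (YYYY-MM-DD, local time).
    Assumed layout: <stream_root>/<epoch_millis>/part-*.
    Returns the number of part files merged."""
    merged = 0
    with open(out_file, "wb") as out:
        for batch_dir in sorted(Path(stream_root).iterdir()):
            if not batch_dir.is_dir():
                continue
            batch_day = datetime.fromtimestamp(
                int(batch_dir.name) / 1000).strftime("%Y-%m-%d")
            if batch_day != day:
                continue
            for part in sorted(batch_dir.glob("part-*")):
                with open(part, "rb") as src:
                    shutil.copyfileobj(src, out)
                merged += 1
    return merged
```

This local sketch only illustrates the grouping logic; on a real cluster you would do the same thing through the HDFS client, e.g. `hdfs dfs -getmerge <dirs> <local_file>` or a loop over `hdfs dfs -cat`.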

 

1 Reply
Anonymous
Author

You can use the tHiveOutput component to append to the same directory; under the hood it uses the DataFrame append write mode in a partitioned manner. That's one way I can think of.