J_Ruiz
Contributor II

Workaround for java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateParser in tFileOutputDelimited

Greetings. I am developing a Spark Batch job (it is a subjob of a Standard Job) in Talend Big Data 7.3.1 R2020-09, with remote connections to a TAC and to a CDH 6.3.2 cluster (Spark 2.4.0). I launch it through the JobServer on the YARN cluster installed in Cloudera.

[Attached screenshot: 0695b00000N1MIkAAN.png]

It works fine, most of the time. However, sometimes I stumble upon this error:

org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
    at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:656)
    at ... (private stuff)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
Aborting TaskSet 5.0 because task 6 (partition 6) cannot run anywhere due to node and executor blacklist.
Most recent failure: Lost task 4.2 in stage 5.0 (TID 225, slaaeizeba13.enelint.global, executor 4): java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateParser; local class incompatible: stream classdesc serialVersionUID = 2, local class serialVersionUID = 3
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1843)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

The error seems to come from the tFileOutputDelimited I have in the job. I've been looking for a solution, but most posts on the internet are about a Zeppelin installation instead.

WHAT I'VE TRIED:

  • Changing the CSV compression format.
  • Setting a custom Spark serializer in the Spark configuration (see the sketch below).
  • Passing the data through a Parquet file beforehand (that is the current state of the job).

None of these made any apparent difference.
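
For reference, the serializer attempt boiled down to something like the following. This is only a plain-Spark sketch: in the Studio the setting goes in as a key/value pair among the Spark configuration's advanced properties, and the property name is the standard Spark one, nothing Talend-specific.

import org.apache.spark.SparkConf;

public class SerializerConfigSketch {
    // Equivalent of adding spark.serializer = org.apache.spark.serializer.KryoSerializer
    // as an advanced property. Spark still uses its built-in Java serializer to
    // deserialize the task binaries themselves (which is where the
    // InvalidClassException above is thrown), which is probably why this change
    // made no difference for me.
    public static SparkConf withKryo(SparkConf conf) {
        return conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    }
}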

From what I've read, patch R2020-10 includes a fix for this, since it shades the conflicting library (hive-exec). However, the issue apparently reappeared in later patches, so I can't commit confidently to upgrading to a specific patch level, because that would affect my entire working team.
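
For what it's worth, a quick check like the one below (just an illustrative snippet, not something already in the job) should confirm whether the driver and the executors really see different commons-lang3 versions: it prints which jar supplies FastDateParser on the current JVM and which serialVersionUID it declares, so running it on both sides and comparing the output (2 vs. 3, as in the stack trace above) shows where the mismatch comes from.

// Illustrative diagnostic only, e.g. pasted into a tJava (driver side) and a
// tJavaRow (executor side): print the jar that provides FastDateParser on this
// JVM and the serialVersionUID it declares.
try {
    Class<?> clazz = Class.forName("org.apache.commons.lang3.time.FastDateParser");
    java.io.ObjectStreamClass osc = java.io.ObjectStreamClass.lookup(clazz);
    System.out.println("FastDateParser loaded from: "
            + clazz.getProtectionDomain().getCodeSource().getLocation());
    System.out.println("serialVersionUID: "
            + (osc != null ? String.valueOf(osc.getSerialVersionUID()) : "n/a"));
} catch (ClassNotFoundException e) {
    System.out.println("commons-lang3 not found on the classpath: " + e.getMessage());
}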

Is there a workaround for this particular issue that does not put too much pressure on the parent Standard Job?

Thank you in advance.

2 Replies
Anonymous
Not applicable

Hello,

 

We did check on our side, and it appears the only solution at this time would be to re-design the job to use Spark with Parquet files for those outputs, instead of writing regular delimited files to HDFS (right now the provided job does not use Spark RDDs / Datasets for that step). In that design, Hive can then create tables on top of the Parquet files.
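
In plain Spark terms, the suggested re-design would look roughly like the sketch below. This is only an illustration: the class name, dataset, and target path are made up, and in the Studio the same thing would normally be done with a Parquet output component such as tFileOutputParquet rather than hand-written code.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class ParquetOutputSketch {
    // Writes the job's result as Parquet on HDFS so that Hive can later create
    // a table on top of it. The output path is a placeholder.
    public static void write(Dataset<Row> result) {
        result.write()
              .mode("overwrite")
              .parquet("hdfs:///data/output/result_parquet");
    }
}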

 

This may not be acceptable for a number of reasons, which we do understand. Ideally, we would recommend creating a support ticket and requesting the latest patch from Talend, the R2022-01 Monthly Patch. It contains that fix, along with all the new features and fixes from the previous monthly patches.

 

Do note that this will require your team to rebuild all jobs; that is necessary for the new code to be applied.

 

R2020-10 Patch Notes: https://help.talend.com/r/n2bsauB_Hr_Q9lJ_EJhj2A/E7pyrjELhypGWWqmdviWqw

 

R2022-01 Patch Notes: https://help.talend.com/r/dRSJ9~EAyTy01QgTh1JEuQ/ODB0uX0UJU2utsqxC77j4w

 

Thanks

J_Ruiz
Contributor II
Author

Hello rpooya, thanks for answering.

 

Sadly, using Spark RDDs and Parquet files is not an option in this scenario: we are not using these CSVs to create tables in Hive; they are the final output, the desired result of the job.

 

What I am considering instead is replacing the tFileOutputDelimited with pure Java code: writing a routine and calling it from the Spark Batch job through a tJavaRow. A rough sketch of what I have in mind is below.
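
This is only a minimal sketch of the idea, with made-up names (DelimitedHdfsWriter, writeLine, the separator and path arguments are all illustrative); a real version would also need to open the HDFS stream once per partition or per tJavaRow instance instead of once per row.

package routines;

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DelimitedHdfsWriter {

    // Appends one delimited line to the target HDFS file. Reopening the file
    // for every row keeps the sketch short but would be far too slow in practice.
    public static void writeLine(String hdfsUri, String targetPath,
                                 String fieldSeparator, String[] fields) throws IOException {
        Configuration conf = new Configuration();
        // newInstance avoids closing the JVM-wide cached FileSystem object.
        try (FileSystem fs = FileSystem.newInstance(URI.create(hdfsUri), conf)) {
            Path path = new Path(targetPath);
            // append() may not be allowed by every cluster configuration.
            FSDataOutputStream raw = fs.exists(path) ? fs.append(path) : fs.create(path);
            try (BufferedWriter out = new BufferedWriter(
                    new OutputStreamWriter(raw, StandardCharsets.UTF_8))) {
                out.write(String.join(fieldSeparator, fields));
                out.newLine();
            }
        }
    }
}

From the tJavaRow it would then be called as something like routines.DelimitedHdfsWriter.writeLine(hdfsUri, targetPath, ";", new String[]{input_row.field1, input_row.field2}), with the actual column names of the flow.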

 

Thanks again for replying.