Hi,
I am running a Talend Big Data job on Spark, but it has failed due to a null pointer issue, and I could not figure out the root cause.
Please refer to the log below for the issue:
org.apache.spark.SparkException: Job aborted due to stage failure: java.lang.RuntimeException: java.lang.NullPointerException
at org.talend.bigdata.dataflow.functions.FlatMapperIterator.hasNext(FlatMapperIterator.java:75)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1195)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1195)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1195)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1277)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1203)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1183)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I am doing a right outer join operation using tSQLRow, where I suspect the issue lies, but I don't know the root cause.
Thanks
The reported stack trace in a Talend Big Data Spark job almost always means that Spark is iterating over rows from a tSQLRow join result, and one of the incoming records has a null value in a field that the join or a subsequent mapping expects to be non-null.
Because Talend's FlatMapperIterator (the mechanism that loops through a set of records) is just a wrapper that reads and transforms Spark RDD/DataFrame rows, the NullPointerException comes from your job logic, not from Spark itself.
Why This Happens in the tSQLRow Component with an Outer Join Query:
When you run a query like:
SELECT a.col1, b.col2
FROM tableA a
LEFT OUTER JOIN tableB b
ON a.key = b.key
- For rows in tableA that have no match in tableB, all b.* fields will be null.
- If any later component (or even the same tSQLRow) tries to access those fields without null checks (e.g., calling .toString(), arithmetic, trimming), you get an NPE.
In Talend's Spark wrapper, FlatMapperIterator.hasNext() is often where it blows up, because Talend's generated Java code directly calls getters on nullable objects.
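As an illustration (this is not Talend's actual generated code, just a sketch of the same failure mode), the unmatched side of a LEFT OUTER JOIN carries null fields, and any unguarded method call on them throws:

```java
import java.util.HashMap;
import java.util.Map;

public class OuterJoinNpeDemo {
    // Simulates the result of a LEFT OUTER JOIN where tableA's row
    // found no match in tableB: all b.* fields come back null.
    static Map<String, String> joinRow() {
        Map<String, String> row = new HashMap<>();
        row.put("a_col1", "value");
        row.put("b_col2", null); // unmatched right side of the join
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> row = joinRow();

        // Unguarded access, similar to what generated code may do
        // when the column is assumed non-null:
        try {
            String trimmed = row.get("b_col2").trim(); // NPE here
            System.out.println(trimmed);
        } catch (NullPointerException e) {
            System.out.println("NPE on unmatched join row");
        }

        // Guarded access avoids the failure:
        String col2 = row.get("b_col2");
        String safe = (col2 != null) ? col2.trim() : "";
        System.out.println("safe value: '" + safe + "'");
    }
}
```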
Possible Steps to Fix This Issue:
1. Check the schema definition in the Talend job
- For the outer join result, all fields from the nullable side of the join must be set as nullable in the Talend schema.
- In the tSQLRow output schema editor, make sure “Nullable” is checked for all columns that can be null from the join.
2. Use IFNULL / COALESCE in your SQL
Example:
SELECT
a.col1,
COALESCE(b.col2, '') AS col2,
COALESCE(b.amount, 0) AS amount
FROM tableA a
LEFT OUTER JOIN tableB b
ON a.key = b.key
This avoids nulls reaching downstream components.
3. Add Null Checks Before Processing
If you’re passing join results into a tMap or another Spark component:
- Use Talend expressions like:
row2.col2 != null ? row2.col2 : ""
to safely handle null values.
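The ternary expression above is plain Java, so its behavior can be sketched outside Talend. The helper names below (`safeString`, `safeAmount`) are illustrative, not Talend APIs:

```java
import java.util.Objects;

public class NullSafeExpr {
    // Mirrors the tMap-style expression: row2.col2 != null ? row2.col2 : ""
    static String safeString(String col2) {
        return col2 != null ? col2 : "";
    }

    // Same pattern for a numeric column, defaulting to 0
    static int safeAmount(Integer amount) {
        return amount != null ? amount : 0;
    }

    public static void main(String[] args) {
        System.out.println("'" + safeString(null) + "'");     // ''
        System.out.println(safeAmount(null));                 // 0
        // Standard-library equivalent for strings:
        System.out.println(Objects.toString(null, "n/a"));    // n/a
    }
}
```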
4. Watch for Generated Code Assumptions:
- Talend’s Big Data Spark jobs generate Scala/Java that may assume a column is never null if “Nullable” is unchecked in the schema.
- If unchecked, Talend calls .toString() or primitive conversions without guards → instant NPE.
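A common variant of this failure is auto-unboxing: if a column is mapped to a primitive type (e.g. int) because "Nullable" is unchecked, assigning a null boxed value throws immediately. This sketch reproduces that mechanism in plain Java; `safeUnbox` is an illustrative helper, not a Talend API:

```java
public class UnboxingNpeDemo {
    // Guarded unboxing: keep the boxed type and supply a default.
    static int safeUnbox(Integer v) {
        return v != null ? v : 0;
    }

    public static void main(String[] args) {
        Integer amount = null; // null column from the outer join

        try {
            // Auto-unboxing a null Integer into a primitive int throws,
            // analogous to a schema column wrongly marked non-nullable.
            int value = amount; // NullPointerException here
            System.out.println(value);
        } catch (NullPointerException e) {
            System.out.println("unboxing null throws NPE");
        }

        System.out.println(safeUnbox(amount)); // 0
    }
}
```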
5. Additional Investigation: Debug the Null Source
If you’re unsure which column is null:
- Temporarily change the join to an inner join — if the job runs, the nulls are from the missing side.
- Or, in tSQLRow, add a CASE WHEN b.key IS NULL THEN 'MISSING' ELSE 'OK' END column to trace null origins.
- When using input components such as tFileInputDelimited, also check the job data flow to confirm that the records passed to tSQLRow are properly formatted and comply with the Spark version requirements.
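When hunting for the null column, a small scan over the record's fields can pinpoint it. This is a generic sketch (the record is modeled as a Map here; Talend rows are generated classes, so adapt accordingly):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NullFieldFinder {
    // Returns the names of all fields that are null in a record,
    // to identify which join column is triggering the NPE.
    static List<String> nullFields(Map<String, Object> row) {
        List<String> nulls = new ArrayList<>();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            if (e.getValue() == null) {
                nulls.add(e.getKey());
            }
        }
        return nulls;
    }

    public static void main(String[] args) {
        // Example record shaped like an unmatched LEFT OUTER JOIN result:
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("a_col1", "x");
        row.put("b_col2", null);
        row.put("b_amount", null);
        System.out.println(nullFields(row)); // [b_col2, b_amount]
    }
}
```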