Spark SQL query with a lot of small tables under broadcast threshold leading to 'java.lang.OutOfMemoryError'
Problem Description
Running a Spark SQL query in a Big Data Spark Job that involves many small tables under the broadcast threshold may fail with the following exception in the execution log:
Exception in thread "broadcast-hash-join-4" java.lang.OutOfMemoryError: GC overhead limit exceeded at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:535) at org.apache.spark.sql.execution.joins.UnsafeHashedRelation$.apply(HashedRelation.scala:403) at org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:128) at at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ... Exception in thread "broadcast-hash-join-1" java.lang.OutOfMemoryError: Java heap space at com.esotericsoftware.kryo.util.IdentityObjectIntMap.resize(IdentityObjectIntMap.java:410) at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:106)
Root Cause
This is an Apache Spark issue; see https://issues.apache.org/jira/browse/SPARK-12358 for details. Tables whose estimated size is below spark.sql.autoBroadcastJoinThreshold are broadcast for the join, and when a query contains many such tables the broadcast hash joins can exhaust the available heap.
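For background, the following is a minimal sketch (Spark 2.x API, hypothetical data and names, not part of the original Job) showing how the threshold influences the join strategy Spark SQL chooses. The default threshold is 10 MB; setting it to -1 disables automatic broadcasting, so the plan falls back to a shuffle-based join such as SortMergeJoin.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastJoinPlanDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastJoinPlanDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val small = Seq((1, "a"), (2, "b")).toDF("id", "label") // well under the threshold
    val large = spark.range(0, 1000000).toDF("id")

    // With the default threshold (10 MB), the plan uses BroadcastHashJoin.
    large.join(small, "id").explain()

    // Disabling auto-broadcast makes Spark pick a shuffle-based join instead.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    large.join(small, "id").explain()

    spark.stop()
  }
}
```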
Solution
To resolve this issue, perform the following steps:
- On your Spark Job, select the Spark Configuration tab.
- In the Advanced properties section, add the spark.sql.autoBroadcastJoinThreshold property and set its value to -1 (an equivalent programmatic setting is sketched after these steps).
- Regenerate the Job in TAC.
- Run the Job again.
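The Advanced property is, in effect, the same key/value pair passed to the Spark configuration. Outside the Studio, the setting can also be applied when building the session; the sketch below uses the Spark 2.x API and a hypothetical application name.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MyBigDataJob")                               // hypothetical app name
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")  // disable automatic broadcast joins
  .getOrCreate()
```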