
TalendSolutionExpert
Contributor II

Spark SQL query with a lot of small tables under broadcast threshold leading to 'java.lang.OutOfMemoryError'

Last Update: Jan 22, 2024 9:35:30 PM
Updated By: Jamie_Gregory
Created date: Apr 1, 2021 6:23:03 AM

Problem Description

Running a Spark SQL query in a Big Data Spark Job that joins many small tables under the broadcast threshold may fail with the following exception in the execution log:

Exception in thread "broadcast-hash-join-4" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:535)
at org.apache.spark.sql.execution.joins.UnsafeHashedRelation$.apply(HashedRelation.scala:403)
at org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:128)
at ...
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

...
Exception in thread "broadcast-hash-join-1" java.lang.OutOfMemoryError: Java heap space
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.resize(IdentityObjectIntMap.java:410)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:106)
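
For illustration, a job of the following shape can hit this failure: every lookup table fits well under the default 10 MB broadcast threshold, so Spark plans a broadcast hash join for each one and builds the broadcast relations concurrently on the driver (the "broadcast-hash-join-N" threads in the log above). A minimal sketch, with hypothetical table names such as fact and dim_1 through dim_20:

import org.apache.spark.sql.SparkSession

object ManySmallBroadcastJoins {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("many-small-broadcast-joins").getOrCreate()
    import spark.implicits._

    // Fact table: large enough that Spark will not broadcast it.
    (1 to 1000000).map(Tuple1(_)).toDF("id").createOrReplaceTempView("fact")

    // Many small lookup tables, each far below the default 10 MB
    // spark.sql.autoBroadcastJoinThreshold, so every join against
    // them is planned as a broadcast hash join.
    (1 to 20).foreach { i =>
      (1 to 1000).map(k => (k, s"value_${i}_$k"))
        .toDF("id", s"attr_$i")
        .createOrReplaceTempView(s"dim_$i")
    }

    // A single query joining all the small tables: the broadcast
    // relations are built concurrently in "broadcast-hash-join-N"
    // threads on the driver, which is where the error above is thrown.
    val joins = (1 to 20).map(i => s"JOIN dim_$i d$i ON f.id = d$i.id").mkString(" ")
    spark.sql(s"SELECT COUNT(*) AS cnt FROM fact f $joins").show()
  }
}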

 

Root Cause

This is an Apache Spark issue; see https://issues.apache.org/jira/browse/SPARK-12358 for details. With many tables under the broadcast threshold, the driver builds the in-memory broadcast relations for all of the joins (the "broadcast-hash-join-N" threads in the log above) and can exhaust its heap.

Solution

To resolve this issue, perform the following steps:
  1. On your Spark Job, select the Spark Configuration tab.

  2. In the Advanced properties section, add the parameter "spark.sql.autoBroadcastJoinThreshold" and set its value to "-1", which disables automatic broadcast joins (see the sketch after this list).

  3. Regenerate the Job in Talend Administration Center (TAC).

  4. Run the Job again.
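
Outside the Studio UI, step 2 corresponds to the standard Spark SQL configuration property. A minimal sketch of the equivalent setting in plain Spark code (the application name is illustrative):

import org.apache.spark.sql.SparkSession

// Setting the threshold to -1 disables automatic broadcast joins,
// so Spark uses shuffle-based joins instead of building in-memory
// broadcast relations on the driver.
// The same effect via spark-submit: --conf spark.sql.autoBroadcastJoinThreshold=-1
val spark = SparkSession.builder()
  .appName("disable-auto-broadcast") // illustrative
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .getOrCreate()

// The property can also be changed on an existing session at runtime:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

The trade-off is that joins that previously ran as broadcast hash joins now shuffle both sides, which is typically slower but avoids building all of the broadcast relations in driver memory at once.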
