Spark SQL query with a lot of small tables under broadcast threshold leading to 'java.lang.OutOfMemoryError'
Problem Description
Running a Spark SQL query in a Big Data Spark Job that involves many small tables under the broadcast threshold may fail with the following exception in the execution log:
Exception in thread "broadcast-hash-join-4" java.lang.OutOfMemoryError: GC overhead limit exceeded at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:535) at org.apache.spark.sql.execution.joins.UnsafeHashedRelation$.apply(HashedRelation.scala:403) at org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:128) at at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ... Exception in thread "broadcast-hash-join-1" java.lang.OutOfMemoryError: Java heap space at com.esotericsoftware.kryo.util.IdentityObjectIntMap.resize(IdentityObjectIntMap.java:410) at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:106)
Root Cause
This is an Apache Spark issue; see https://issues.apache.org/jira/browse/SPARK-12358 for details. Tables whose estimated size is below spark.sql.autoBroadcastJoinThreshold are broadcast for the join, and when a query contains many such tables the broadcast hash joins can exhaust the available heap.
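For background, the following is a minimal sketch (Spark 2.x API, hypothetical data and names, not part of the original Job) showing how the threshold influences the join strategy Spark SQL chooses. The default threshold is 10 MB; setting it to -1 disables automatic broadcasting, so the plan falls back to a shuffle-based join such as SortMergeJoin.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastJoinPlanDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastJoinPlanDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val small = Seq((1, "a"), (2, "b")).toDF("id", "label") // well under the threshold
    val large = spark.range(0, 1000000).toDF("id")

    // With the default threshold (10 MB), the plan uses BroadcastHashJoin.
    large.join(small, "id").explain()

    // Disabling auto-broadcast makes Spark pick a shuffle-based join instead.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    large.join(small, "id").explain()

    spark.stop()
  }
}
```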
Solution
To resolve this issue, perform the following steps:
- On your Spark Job, select the Spark Configuration tab.
- In the Advanced properties section, add the spark.sql.autoBroadcastJoinThreshold property and set its value to -1 (an equivalent programmatic setting is sketched after these steps).
- Regenerate the Job in TAC.
- Run the Job again.
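The Advanced property is, in effect, the same key/value pair passed to the Spark configuration. Outside the Studio, the setting can also be applied when building the session; the sketch below uses the Spark 2.x API and a hypothetical application name.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MyBigDataJob")                               // hypothetical app name
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")  // disable automatic broadcast joins
  .getOrCreate()
```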