<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Split a spark streaming job due to Java 65535 bytes limit in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/Split-a-spark-streaming-job-due-to-Java-65535-bytes-limit/m-p/2347378#M114645</link>
    <description>&lt;P&gt;Hello&lt;/P&gt;&lt;P&gt;We have a Spark Streaming Talend job that consumes events in JSON format from Kafka and writes them to Hive. The input is a large JSON document with 500+ attributes, so the method generated for the subjob exceeds Java's 65535-byte limit.&lt;/P&gt;&lt;P&gt;I understand that the best way to work around this is to split the subjob, but that is not possible with a streaming job. Are there any suggestions/pointers for working around this?&lt;/P&gt;&lt;P&gt;We have the following flexibility, if any of this helps:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Split the single Hive table into two with a common key, so we can join data from the two tables when needed.&lt;/LI&gt;&lt;LI&gt;It is not necessary to maintain the order of the events when persisting to Hive.&lt;/LI&gt;&lt;LI&gt;Have the events sent as Avro instead of JSON (not tried, but we should be able to do that).&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;Radhika&lt;/P&gt;</description>
    <pubDate>Fri, 15 Nov 2024 21:44:13 GMT</pubDate>
    <dc:creator>vradhik</dc:creator>
    <dc:date>2024-11-15T21:44:13Z</dc:date>
    <item>
      <title>Split a spark streaming job due to Java 65535 bytes limit</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Split-a-spark-streaming-job-due-to-Java-65535-bytes-limit/m-p/2347378#M114645</link>
      <description>&lt;P&gt;Hello&lt;/P&gt;&lt;P&gt;We have a Spark Streaming Talend job that consumes events in JSON format from Kafka and writes them to Hive. The input is a large JSON document with 500+ attributes, so the method generated for the subjob exceeds Java's 65535-byte limit.&lt;/P&gt;&lt;P&gt;I understand that the best way to work around this is to split the subjob, but that is not possible with a streaming job. Are there any suggestions/pointers for working around this?&lt;/P&gt;&lt;P&gt;We have the following flexibility, if any of this helps:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Split the single Hive table into two with a common key, so we can join data from the two tables when needed.&lt;/LI&gt;&lt;LI&gt;It is not necessary to maintain the order of the events when persisting to Hive.&lt;/LI&gt;&lt;LI&gt;Have the events sent as Avro instead of JSON (not tried, but we should be able to do that).&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;Radhika&lt;/P&gt;</description>
      <pubDate>Fri, 15 Nov 2024 21:44:13 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Split-a-spark-streaming-job-due-to-Java-65535-bytes-limit/m-p/2347378#M114645</guid>
      <dc:creator>vradhik</dc:creator>
      <dc:date>2024-11-15T21:44:13Z</dc:date>
    </item>
    <item>
      <title>Re: Split a spark streaming job due to Java 65535 bytes limit</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Split-a-spark-streaming-job-due-to-Java-65535-bytes-limit/m-p/2347379#M114646</link>
      <description>&lt;P&gt;Hi&lt;/P&gt;&lt;P&gt;Take a look at these KB articles about the Java 65535 bytes limit error:&lt;/P&gt;&lt;P&gt;https://community.talend.com/s/article/Exceeding-the-Java-bytes-limit-1Z1UZ&lt;/P&gt;&lt;P&gt;https://community.talend.com/s/article/Building-a-Job-with-one-tExtractPositionalFields-component-fails-with-the-error-The-code-of-method-is-exceeding-the-bytes-limit-17gnl&lt;/P&gt;&lt;P&gt;https://community.talend.com/s/article/tMSSqlInput-Process-Map-String-Object-is-exceeding-the-bytes-limit-InMpE&lt;/P&gt;&lt;P&gt;The workaround is to optimize the Job so that the final generated code of each subjob is smaller. Try the following:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Minimize the number of components in the subjob.&lt;/LI&gt;&lt;LI&gt;Divide the subjob into several subjobs.&lt;/LI&gt;&lt;LI&gt;Reduce the number of columns.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;In your case, I think option 1 from your list (splitting the single Hive table into two) may be a solution worth trying.&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;Shong&lt;/P&gt;</description>
      <pubDate>Tue, 06 Jun 2023 02:59:48 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Split-a-spark-streaming-job-due-to-Java-65535-bytes-limit/m-p/2347379#M114646</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-06-06T02:59:48Z</dc:date>
    </item>
  </channel>
</rss>

