NakulanR
Partner - Contributor III

Databricks cluster resizing causes 503 error in Replicate

Hi Support,

We are seeing an issue where Replicate reports an error with the following message when writing to Databricks (Delta): "RetCode: SQL_ERROR SqlState: 08S01 NativeError: 124 Message: [Simba][Hardy] (124) A 503 response was returned but no Retry-After header was provided. Original error: Unknown".

The timestamps of these errors match up to timestamps on the Databricks side when the Databricks cluster was being resized as a result of auto-scaling. However, the Databricks (Delta) endpoint limitations don't have any mention of auto-scaling/cluster resizing being unsupported.

Is this a known issue when using the Databricks (Delta) endpoint with auto-scaling enabled? If so, is there a workaround that can be implemented in Replicate to prevent the error occurring when the Databricks cluster is being resized?

Thanks,

Nak


Accepted Solutions
SachinB
Support

Hello @NakulanR ,

If the connection issues in Databricks are due to auto-scaling, you can increase the wait period for executions by setting the internal parameters loadTimeout and executeTimeout/CDCTimeout to 10 times their current values. This gives Replicate more time to wait while the cluster is being resized, which helps prevent timeouts during scaling operations.
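
As a rough illustration of the arithmetic only, the sketch below multiplies a set of timeout values by 10. The parameter names come from this thread; the starting values are placeholders and are not Replicate defaults, so check the values currently configured on your endpoint first. The helper `scale_timeouts` is hypothetical and is not part of Replicate.

```python
# Hypothetical current values in seconds; replace them with the values
# configured on your own endpoint. These are NOT Replicate defaults.
current_timeouts = {
    "loadTimeout": 600,
    "executeTimeout": 60,
    "CDCTimeout": 60,
}


def scale_timeouts(timeouts, factor=10):
    """Return a new dict with every timeout multiplied by factor."""
    return {name: value * factor for name, value in timeouts.items()}


new_timeouts = scale_timeouts(current_timeouts)
for name, value in new_timeouts.items():
    print(f"{name}: {current_timeouts[name]} -> {value}")
```

The resulting values still have to be entered as internal parameters on the endpoint in Replicate; the script only works out the numbers.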

Hope this helps.

Regards,

Sachin B


4 Replies
SushilKumar
Support

Hello @NakulanR 

I hope the link below helps.

https://docs.databricks.com/api/workspace/clusters/resize

Regards,
Sushil Kumar

SachinB
Support

Hello @NakulanR ,

Thanks for contacting Qlik community forum.

The error message "A 503 response was returned but no Retry-After header was provided" means that the target server was temporarily unavailable and did not say how long the client should wait before retrying. This could be due to a number of reasons, such as the server being overloaded or under maintenance.
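
To show why the missing Retry-After header matters, here is a minimal sketch of how a generic HTTP client might choose a wait time after a 503. This is a common retry pattern and is not a description of how the Simba driver or Replicate behaves internally; the function name `retry_delay` and its default values are assumptions made for illustration.

```python
import random


def retry_delay(retry_after, attempt, base=1.0, cap=60.0):
    """Pick how long to wait (in seconds) before retrying after a 503.

    retry_after: the Retry-After header value as a string of seconds,
                 or None when the server did not send the header.
    attempt:     zero-based retry counter.
    """
    if retry_after is not None:
        try:
            # The server told us how long to wait, so honour it.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date or malformed value; fall back to backoff
    # No usable header: exponential backoff with full jitter, capped.
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

With a header present, the client simply waits the requested time. Without one, it has to guess, which is why a 503 during a cluster resize can surface as an error instead of a clean retry.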

Can you confirm that there are no connection-related issues with Databricks? For example, try uploading a CSV file from another server.

Can you try pinging the Databricks server from the Replicate server to see whether it is reachable?
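
If ping (ICMP) is blocked on your network, a TCP connection test is a reasonable substitute. The sketch below assumes the Databricks endpoint listens on port 443 and that Python is available on the Replicate server; the helper `can_reach` and the hostname placeholder are illustrative only.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Placeholder hostname; substitute your own Databricks server hostname.
# print(can_reach("<your-databricks-host>"))
```

A successful result only shows that the network path is open at that moment. It does not rule out a 503 while the cluster is being resized.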

Here is an explanation of the error. It was returned by the Databricks cluster, so it needs to be verified by the Databricks team.

https://community.databricks.com/t5/data-engineering/how-to-fix-intermittent-503-errors-in-10-4-lts/...

Regards,

Sachin B



NakulanR
Partner - Contributor III
Author

Hi Sachin,

The error appears and as a result the endpoint gets disconnected. A few minutes later the endpoint gets reconnected on its own, and the task is back up and running. We are able to determine that this occurs when the auto-scaling resizes the cluster. Testing the connection normally to Databricks yields a successful test connection.
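
For anyone who wants to check the same correlation on their own system, here is a minimal sketch that flags error timestamps falling close to a resize event. The timestamps are made up for illustration, the two-minute window is an arbitrary assumption, and `errors_near_resizes` is a hypothetical helper; in practice the inputs would come from the Replicate task log and the Databricks cluster event log.

```python
from datetime import datetime, timedelta


def errors_near_resizes(error_times, resize_times, window=timedelta(minutes=2)):
    """Return the error timestamps that fall within window of any resize event."""
    return [
        err
        for err in error_times
        if any(abs(err - rz) <= window for rz in resize_times)
    ]


# Made-up timestamps for illustration only.
errors = [datetime(2024, 1, 1, 10, 5, 30), datetime(2024, 1, 1, 14, 0, 0)]
resizes = [datetime(2024, 1, 1, 10, 5, 0)]
print(errors_near_resizes(errors, resizes))
```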

If this is occurring as a result of some sort of timeout disconnect on Databricks whilst the auto-scaling is happening, would using the loadTimeout or executeTimeout internal parameters be of any use? Or is there a Databricks-specific internal parameter that can be used?

Regards,

Nak
