Solved: Databricks cluster resizing causes 503 error in Re... - Qlik Community

NakulanR · ‎2024-05-22

Hi Support,

We are seeing an issue where Replicate reports an error with the following message when writing to Databricks (Delta): "RetCode: SQL_ERROR SqlState: 08S01 NativeError: 124 Message: [Simba][Hardy] (124) A 503 response was returned but no Retry-After header was provided. Original error: Unknown".

The timestamps of these errors match up to timestamps on the Databricks side when the Databricks cluster was being resized as a result of auto-scaling. However, the Databricks (Delta) endpoint limitations don't have any mention of auto-scaling/cluster resizing being unsupported.
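For anyone who wants to check this kind of correlation themselves, here is a minimal Python sketch that pairs error timestamps with resize-event timestamps falling within a short window. The function name, the 2-minute window, and the sample timestamps are all assumptions made up for illustration; they are not taken from our logs.

```python
from datetime import datetime, timedelta

def errors_near_resizes(error_times, resize_times, window=timedelta(minutes=2)):
    """Return (error, resize) pairs whose timestamps are within `window` of each other."""
    matches = []
    for err in error_times:
        for rs in resize_times:
            if abs(err - rs) <= window:
                matches.append((err, rs))
                break  # one matching resize event is enough for this error
    return matches

# Made-up timestamps for illustration only; not taken from the thread.
errors = [datetime(2024, 5, 21, 10, 15, 30), datetime(2024, 5, 21, 14, 2, 5)]
resizes = [datetime(2024, 5, 21, 10, 14, 50), datetime(2024, 5, 21, 18, 0, 0)]

matched = errors_near_resizes(errors, resizes)
print(f"{len(matched)} of {len(errors)} errors occurred within 2 minutes of a resize")
```

In practice the two lists would come from the Replicate task log and the Databricks cluster event log, parsed into datetime values in the same time zone.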

Is this a known issue when using the Databricks (Delta) endpoint with auto-scaling enabled? If so, is there a workaround that can be implemented in Replicate to prevent the error occurring when the Databricks cluster is being resized?

Thanks,

Nak

SushilKumar · ‎2024-05-22

Hello @NakulanR

I hope the link below helps.

https://docs.databricks.com/api/workspace/clusters/resize

Regards,
Sushil Kumar

SachinB · ‎2024-05-22

Hello @NakulanR ,

Thank you for contacting the Qlik Community forum.

The error message "A 503 response was returned but no Retry-After header was provided" means that the target server was temporarily unavailable and did not tell the client how long to wait before retrying. This could be due to a number of reasons, such as the server being overloaded or under maintenance.
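To illustrate what a client generally has to do in this situation, here is a minimal Python sketch of retrying with exponential backoff when a 503 arrives without a Retry-After value. This is not how Replicate or the Simba driver behaves internally; ServiceUnavailable, call_with_retry, and flaky_write are hypothetical names used only for this example.

```python
import time

class ServiceUnavailable(Exception):
    """Hypothetical exception standing in for an HTTP 503 response."""
    def __init__(self, retry_after=None):
        super().__init__("503 Service Unavailable")
        self.retry_after = retry_after  # seconds, or None if the header was absent

def call_with_retry(operation, max_attempts=5, base_delay=1.0,
                    max_delay=60.0, sleep=time.sleep):
    """Call `operation`, retrying on ServiceUnavailable up to `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ServiceUnavailable as exc:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            if exc.retry_after is not None:
                delay = exc.retry_after  # server told us how long to wait
            else:
                # No Retry-After: fall back to capped exponential backoff.
                delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay)

# Hypothetical operation that fails twice and then succeeds.
attempts = {"count": 0}

def flaky_write():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ServiceUnavailable()
    return "ok"

delays = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky_write, sleep=delays.append)
print(result, delays)
```

Passing sleep=delays.append keeps the demo instant; leaving the default makes the function really wait between attempts.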

Can you confirm that there are no connection-related issues with your Databricks environment, for example by uploading a CSV file from another server?

Can you also try pinging the Databricks server from the Replicate server to see whether it is reachable?
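If ICMP ping is blocked on your network, a TCP connection test can serve the same purpose. Below is a minimal Python sketch, assuming the endpoint is reached over HTTPS on port 443; can_reach is a hypothetical helper name and the hostname in the comment is a placeholder.

```python
import socket

def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False

# Placeholder hostname; substitute your own workspace's server hostname.
# print(can_reach("your-workspace.cloud.databricks.com"))
```

Run it from the Replicate server. Note that a True result only shows the port is reachable at that moment, so it will not catch a brief outage during a resize.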

Here is an explanation of the error. The error was returned by the Databricks cluster, so it needs to be verified by the Databricks team.

https://community.databricks.com/t5/data-engineering/how-to-fix-intermittent-503-errors-in-10-4-lts/...

Regards,

Sachin B

NakulanR · ‎2024-05-22

Hi Sachin,

The error appears and as a result the endpoint gets disconnected. A few minutes later the endpoint gets reconnected on its own, and the task is back up and running. We are able to determine that this occurs when the auto-scaling resizes the cluster. Testing the connection normally to Databricks yields a successful test connection.

If this is occurring as a result of some sort of timeout disconnect on Databricks whilst the auto-scaling is happening, would using the loadTimeout or executeTimeout internal parameters be of any use? Or is there a Databricks specific internal parameter that can be used?

Regards,

Nak

SachinB · ‎2024-05-22

Hello @NakulanR ,

If the connection issues in Databricks are due to auto-scaling, you can increase the wait period for executions by setting the internal parameters loadTimeout and executeTimeout/CDCTimeout to 10 times their current values. This adjustment helps prevent timeouts during scaling operations.

Hope this helps.

Regards,

Sachin B

NakulanR · ‎2025-02-04

Hi @SachinB,

We're still seeing the same 503 error when using Databricks as a target. The error occurs and then Replicate recovers by itself a few minutes later. This is happening during the full load so the full load needs to be started from the beginning each time.

On the Databricks end, the compute loses a node due to spot instance termination; however, this disconnect is temporary. The loadTimeout and executeTimeout values have been increased to 10x their original values as suggested.

Is there another parameter that can be used to ensure that the full load isn't terminated and doesn't need to be started again?

Regards,

Nak
