Hi all, we're pushing parquet files to ADLS with the tAzureAdlsGen2Output component.
When we process a large input file or table and the job runs for more than an hour, it fails with a 401. It seems we are hitting this limitation: https://docs.microsoft.com/en-us/azure/databricks/kb/data-sources/job-fails-adls-hour
Has anyone faced a similar issue? How can we work around it? Ideally, the Talend component would refresh the passthrough token by itself.
We have to use Azure AD because of security constraints =/
Hello,
Are you using the default max batch size on the tAzureAdlsGen2Output component? Is it possible to tune the component's performance so that the job doesn't take more than an hour?
Best regards
Sabrina
Hey, I already tried tweaking the Max Batch Size and got it up to ~200000 before running into "request too big" errors. But the subjob still needs more than an hour =(
Hello,
Would you mind posting your job design screenshots on the community? They will help us get more details about your current situation.
Please mask your sensitive data as well.
Best regards
Sabrina
Sure, here are the screenshots, with the error logs at the end. (The error always happens after one hour.)
I get the same behavior with large CSV files as input, or with large DB tables.
Thanks!
[INFO ] 10:33:02 org.apache.parquet.hadoop.InternalParquetRecordWriter- Flushing mem columnStore to file. allocated memory: 2664135
[INFO ] 10:33:27 org.apache.parquet.hadoop.InternalParquetRecordWriter- Flushing mem columnStore to file. allocated memory: 2411351
[INFO ] 10:33:56 org.apache.parquet.hadoop.InternalParquetRecordWriter- Flushing mem columnStore to file. allocated memory: 2223920
[INFO ] 10:34:22 org.apache.parquet.hadoop.InternalParquetRecordWriter- Flushing mem columnStore to file. allocated memory: 2369118
[INFO ] 10:34:48 org.apache.parquet.hadoop.InternalParquetRecordWriter- Flushing mem columnStore to file. allocated memory: 2419499
[ERROR] 10:34:49 org.talend.components.adlsgen2.service.AdlsGen2Service- [handleResponse] InvalidAuthenticationInfo [401]: Authentication information is not given in the correct format. Check the value of Authorization header..
[ERROR] 10:34:49 org.talend.components.adlsgen2.output.AdlsGen2Output- [afterGroup] InvalidAuthenticationInfo [401]: Authentication information is not given in the correct format. Check the value of Authorization header..
[FATAL] 10:34:49 JobX- tAzureAdlsGen2Output_1 (org.talend.components.adlsgen2.runtime.AdlsGen2RuntimeException) InvalidAuthenticationInfo [401]: Authentication information is not given in the correct format. Check the value of Authorization header..
org.talend.sdk.component.api.exception.ComponentException: (org.talend.components.adlsgen2.runtime.AdlsGen2RuntimeException) InvalidAuthenticationInfo [401]: Authentication information is not given in the correct format. Check the value of Authorization header..
Hello,
The lifetime of an Azure AD passthrough token is one hour. When a command sent to the cluster takes longer than one hour and accesses an ADLS resource after the one-hour mark, it fails. This is a known issue.
As far as we know, it is not possible to increase the lifetime of an Azure AD passthrough token: the token is retrieved by the Azure Databricks replicated principal, and you cannot edit its properties.
Could you please try to restructure your job so that no single command takes longer than an hour to complete? If that is hard to guarantee, a possible alternative is sketched below.
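One direction some users take is to authenticate with an Azure AD service principal (client-credentials flow) instead of passthrough, and fetch a fresh token before each chunk of work, for example from a tJava step. What follows is only a minimal sketch using MSAL4J, with hypothetical placeholder values for the tenant, client ID, and secret; it is not something the tAzureAdlsGen2Output component does for you, and a service principal may or may not be acceptable under your security constraints.

import java.util.Collections;
import com.microsoft.aad.msal4j.ClientCredentialFactory;
import com.microsoft.aad.msal4j.ClientCredentialParameters;
import com.microsoft.aad.msal4j.ConfidentialClientApplication;

public class AdlsTokenHelper {
    // Hypothetical placeholders -- replace with your own service principal details.
    private static final String TENANT_ID = "<tenant-id>";
    private static final String CLIENT_ID = "<client-id>";
    private static final String CLIENT_SECRET = "<client-secret>";

    // Returns a fresh OAuth2 access token scoped to Azure Storage (ADLS Gen2).
    public static String freshToken() throws Exception {
        ConfidentialClientApplication app = ConfidentialClientApplication
                .builder(CLIENT_ID, ClientCredentialFactory.createFromSecret(CLIENT_SECRET))
                .authority("https://login.microsoftonline.com/" + TENANT_ID)
                .build();
        ClientCredentialParameters params = ClientCredentialParameters
                .builder(Collections.singleton("https://storage.azure.com/.default"))
                .build();
        // acquireToken returns a CompletableFuture; block here for simplicity.
        return app.acquireToken(params).get().accessToken();
    }
}

Calling freshToken() at the start of each chunk keeps every upload inside the token's one-hour window. With AD passthrough, however, the only option remains keeping each command under an hour.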
Best regards
Sabrina