topic Re: Efficient Processing and Handling of API Calls for Large Name Dataset in Talend Studio

Efficient Processing and Handling of API Calls for Large Name Dataset

Artemis_Mercury — Fri, 15 Nov 2024 21:29:11 GMT

Hello,

I have a dataset containing 900,000 name records that I need to utilize as parameters for individual API GET calls. Unfortunately, the API only allows one name record per call, making bulk processing impractical.

I experimented with a smaller subset of 5,000 names, and it took around 15 minutes to complete the processing. To achieve this, I employed the tFlowToIterate component. This component facilitates the selection of one name at a time, which is then stored in a context variable and subsequently used as an input parameter for the API call.

If I were to extend this approach to the entire 900,000-name dataset, the processing time would extend to approximately 60 hours. My goal is to distribute this processing time over a period of 7 days.
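For planning the split, a rough back-of-the-envelope calculation may help. The sketch below is plain Java, not Talend-generated code; the class and method names are made up for illustration, and it assumes throughput scales linearly from the 5,000-name sample, which may not hold for a real API. Scaled that way, the 15-minute sample works out to about 45 hours for the full dataset, so the 60-hour figure can be read as including some margin.

```java
// Hypothetical helper for sizing daily runs; not part of Talend.
class BatchPlan {

    // Names per daily run, rounded up so every name is covered within `days` runs.
    static long namesPerDay(long totalNames, int days) {
        return (totalNames + days - 1) / days;
    }

    // Estimated minutes for `names` calls, scaled linearly from a measured sample.
    static double estimatedMinutes(long names, long sampleNames, double sampleMinutes) {
        return names * sampleMinutes / sampleNames;
    }

    public static void main(String[] args) {
        long perDay = namesPerDay(900_000, 7);
        double totalHours = estimatedMinutes(900_000, 5_000, 15.0) / 60.0;
        double dailyHours = estimatedMinutes(perDay, 5_000, 15.0) / 60.0;
        System.out.println("Names per day:   " + perDay);
        System.out.println("Total hours:     " + totalHours);
        System.out.println("Hours per day:   " + dailyHours);
    }
}
```

With these inputs, each of the 7 daily runs would cover 128,572 names and take roughly 6.4 hours at the sampled rate.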

Additionally, I am seeking guidance on how to handle potential API failures. It would be beneficial to have a strategy in place to identify the names associated with failed API calls, allowing for their reprocessing at a later time.

I appreciate any insights or suggestions you may have regarding an optimized job design for this scenario.

Thank you!

Re: Efficient Processing and Handling of API Calls for Large Name Dataset

anselmopeixoto — Tue, 15 Aug 2023 13:31:59 GMT

Hi @Lilian Ortiz Costa

I suggest you use tRESTClient to make the API call and get both the Response and Error output rows from it. You can then use those outputs to identify the source record and update its status. This way, you can filter the source dataset at the start of each Job run to pick up only the records that weren't processed successfully in previous executions.

I also suggest you keep the "Die on error" option enabled on tRESTClient and use the OnComponentError trigger starting from this same component to identify fatal errors and also update the source record status.
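The status-tracking idea above can be sketched in plain Java. This is only a model of the logic, not Talend code: in a real Job the status would live in the source table or file, and the API call would be made by tRESTClient. The `StatusTracker` class, its `Status` values, and the `apiCall` function (which returns an HTTP status code) are all hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical in-memory model of "one status per source record".
class StatusTracker {
    enum Status { PENDING, OK, FAILED }

    private final Map<String, Status> statusByName = new LinkedHashMap<>();

    StatusTracker(Collection<String> names) {
        for (String name : names) {
            statusByName.put(name, Status.PENDING);
        }
    }

    // Filter applied at Job start: everything not yet processed successfully.
    List<String> toProcess() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Status> e : statusByName.entrySet()) {
            if (e.getValue() != Status.OK) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    // One Job run: call the API for each outstanding name and record the outcome.
    void run(Function<String, Integer> apiCall) {
        for (String name : toProcess()) {
            Status result;
            try {
                int code = apiCall.apply(name);
                // A 2xx code plays the role of the Response row, anything else the Error row.
                result = (code >= 200 && code < 300) ? Status.OK : Status.FAILED;
            } catch (RuntimeException ex) {
                // Fatal errors (timeouts, connection failures) are also marked for retry.
                result = Status.FAILED;
            }
            statusByName.put(name, result);
        }
    }
}
```

Calling `run` again later retries only the names still returned by `toProcess()`, so successful records are never sent to the API twice.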

Re: Efficient Processing and Handling of API Calls for Large Name Dataset

Anonymous — Wed, 16 Aug 2023 03:04:34 GMT

@Lilian Ortiz Costa, loading 900,000 name records into memory and iterating over them one by one will consume a lot of memory. I suggest trying the following:

1- Split the data into smaller subsets, e.g. 5,000 names per file, and then iterate over each file.

data source --main--> tFileOutputDelimited

| OnSubjobOk

tFileList --iterate--> tFileInputDelimited --main--> tFlowToIterate --iterate--> tREST (or tRESTClient) --> out

// In the Advanced settings panel of tFileOutputDelimited, check the 'split output to several files' box.

2- Allocate more memory to the job execution.

3- Enable parallel execution when using tFlowToIterate to iterate over the names and call the API.
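To illustrate points 1 and 3 outside of Talend, here is a minimal plain-Java sketch of the same idea: split the names into fixed-size chunks, then call the API for each chunk with a small pool of worker threads. The `ChunkedCaller` class, the `apiCall` function (returning an HTTP status code), and the thread count are assumptions made for this example; in a Job, the chunking would be done by tFileOutputDelimited and the parallelism by the iterate link settings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical helper; models "split into files" and "parallel iterate".
class ChunkedCaller {

    // Split names into consecutive chunks of at most chunkSize entries.
    static List<List<String>> chunks(List<String> names, int chunkSize) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < names.size(); i += chunkSize) {
            out.add(names.subList(i, Math.min(i + chunkSize, names.size())));
        }
        return out;
    }

    // Call the API for every name in one chunk using `threads` workers.
    // Returns the names whose call failed, so they can be retried later.
    static List<String> processChunk(List<String> chunk, int threads,
                                     Function<String, Integer> apiCall) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<String> failed = Collections.synchronizedList(new ArrayList<>());
        for (String name : chunk) {
            pool.execute(() -> {
                try {
                    int code = apiCall.apply(name);
                    if (code < 200 || code >= 300) {
                        failed.add(name);
                    }
                } catch (RuntimeException ex) {
                    failed.add(name);
                }
            });
        }
        pool.shutdown();
        try {
            // Wait for all submitted calls in this chunk to finish.
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return failed;
    }
}
```

Only one chunk is held in memory by the workers at a time, and the number of threads should stay within whatever rate limit the API enforces.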