<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Load data from on-prem data source to AWS s3 partition in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331855#M100795</link>
    <description>&lt;P&gt;Hello Balázs,&lt;/P&gt;&lt;P&gt;Thank you! I tried your solution and it works &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Now I will try to add the month and day partitions.&lt;/P&gt;&lt;P&gt;Thank you,&lt;/P&gt;&lt;P&gt;Federico&lt;/P&gt;</description>
    <pubDate>Tue, 15 Nov 2022 10:04:58 GMT</pubDate>
    <dc:creator>ffanali0804</dc:creator>
    <dc:date>2022-11-15T10:04:58Z</dc:date>
    <item>
      <title>Load data from on-prem data source to AWS s3 partition</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331853#M100793</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;My team and I are building our company's data lake.&lt;/P&gt;&lt;P&gt;The approach so far has been to copy each table from the source in full and save it as a single Parquet file on S3.&lt;/P&gt;&lt;P&gt;Instead of saving everything in one file, we would like to partition the data by one or more columns.&lt;/P&gt;&lt;P&gt;Can you tell me whether this is possible with Talend, and how?&lt;/P&gt;&lt;P&gt;Thanks a lot.&lt;/P&gt;&lt;P&gt;F&lt;/P&gt;</description>
      <pubDate>Fri, 15 Nov 2024 22:22:56 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331853#M100793</guid>
      <dc:creator>ffanali0804</dc:creator>
      <dc:date>2024-11-15T22:22:56Z</dc:date>
    </item>
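The hive-style layout the question is after (one folder per column value, e.g. year=2022/month=11/) can be sketched with stdlib Python alone. The column names and rows below are hypothetical; in practice each group would be written as its own Parquet file under the corresponding S3 prefix:

```python
from collections import defaultdict

def partition_rows(rows, columns):
    """Group rows into hive-style partition paths like 'year=2022/month=11'."""
    partitions = defaultdict(list)
    for row in rows:
        # Build the partition folder from the chosen columns, in order.
        path = "/".join(f"{c}={row[c]}" for c in columns)
        partitions[path].append(row)
    return dict(partitions)

rows = [
    {"year": 2022, "month": 11, "amount": 10},
    {"year": 2022, "month": 12, "amount": 20},
    {"year": 2022, "month": 11, "amount": 30},
]
# Each key is one partition folder; each value holds the rows to write there.
parts = partition_rows(rows, ["year", "month"])
```
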
    <item>
      <title>Re: Load data from on-prem data source to AWS s3 partition</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331854#M100794</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I'd take a slightly different approach.&lt;/P&gt;&lt;P&gt;We can generate Parquet files using: https://help.talend.com/r/en-US/8.0/parquet/tfileoutputparquet-standard-properties&lt;/P&gt;&lt;P&gt;We can upload them to S3 using the tS3Put component, and you can specify any path you wish; this path can contain the partition name.&lt;/P&gt;&lt;P&gt;Let's assume you want to partition by year (I'll use 2022). Your data loader job would need a filter for the given year, and your tS3Put would upload the Parquet file(s) to that folder: /year=2022/&lt;/P&gt;&lt;P&gt;You'd need an orchestrator job that calls the data loader via a tRunJob. In the orchestrator you'd run:&lt;/P&gt;&lt;P&gt;select extract(year from myDate) from myTable group by 1&lt;/P&gt;&lt;P&gt;This value is then passed to the child job.&lt;/P&gt;&lt;P&gt;On the tFlowToIterate link you can also enable parallel execution, allowing multiple threads to extract data from the database and speeding up the operation.&lt;/P&gt;&lt;P&gt;So, job 1:&lt;/P&gt;&lt;P&gt;DBInput -&amp;gt; tFlowToIterate -&amp;gt; tRunJob (to list the partitions)&lt;/P&gt;&lt;P&gt;Job 2:&lt;/P&gt;&lt;P&gt;DBInput -&amp;gt; tFileOutputParquet + tS3Put (to extract one partition and upload it to S3)&lt;/P&gt;&lt;P&gt;Cheers,&lt;/P&gt;&lt;P&gt;Balázs&lt;/P&gt;</description>
      <pubDate>Mon, 14 Nov 2022 13:04:20 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331854#M100794</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-11-14T13:04:20Z</dc:date>
    </item>
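A minimal sketch of the two-job pattern from the reply above: the orchestrator lists distinct years (standing in for the GROUP BY query) and hands each one to a child job (standing in for tRunJob). The database query and the tS3Put upload are replaced by plain-Python stand-ins, and the bucket and table names are hypothetical:

```python
import datetime

def list_partitions(dates):
    """Orchestrator step: emulates SELECT extract(year FROM myDate) ... GROUP BY 1."""
    return sorted({d.year for d in dates})

def child_job(year):
    """Child step: would filter rows for `year`, write Parquet, and upload
    under /year=<year>/. Here it only returns the hypothetical S3 key."""
    return f"s3://my-bucket/myTable/year={year}/part-0.parquet"

dates = [datetime.date(2021, 5, 1), datetime.date(2022, 3, 2), datetime.date(2022, 7, 9)]
# One child-job invocation per distinct year found by the orchestrator.
keys = [child_job(y) for y in list_partitions(dates)]
```

In the real jobs the iteration over years is what tFlowToIterate provides, and enabling parallel execution on that link runs several `child_job` calls concurrently.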
    <item>
      <title>Re: Load data from on-prem data source to AWS s3 partition</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331855#M100795</link>
      <description>&lt;P&gt;Hello Balázs,&lt;/P&gt;&lt;P&gt;Thank you! I tried your solution and it works &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Now I will try to add the month and day partitions.&lt;/P&gt;&lt;P&gt;Thank you,&lt;/P&gt;&lt;P&gt;Federico&lt;/P&gt;</description>
      <pubDate>Tue, 15 Nov 2022 10:04:58 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331855#M100795</guid>
      <dc:creator>ffanali0804</dc:creator>
      <dc:date>2022-11-15T10:04:58Z</dc:date>
    </item>
    <item>
      <title>Re: Load data from on-prem data source to AWS s3 partition</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331856#M100796</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;Not necessarily. For example, you could:&lt;/P&gt;&lt;P&gt;SELECT 'myDate = ' || myDate::date AS filter, 'year=' || extract(year from myDate) || '/month=' || extract(month from myDate) || '/day=' || extract(day from myDate) AS path, count(*) AS expected_rows&lt;/P&gt;&lt;P&gt;FROM myTable&lt;/P&gt;&lt;P&gt;GROUP BY 1, 2&lt;/P&gt;&lt;P&gt;ORDER BY 3 DESC&lt;/P&gt;&lt;P&gt;So the same two-job approach is enough; just tune your parameters.&lt;/P&gt;&lt;P&gt;This gives you the filter expression and path that you can pass to the child job. (Processing the largest partitions first will reduce the overall runtime when multiple threads are used.)&lt;/P&gt;</description>
      <pubDate>Tue, 15 Nov 2022 11:12:30 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331856#M100796</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-11-15T11:12:30Z</dc:date>
    </item>
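The query above yields one (filter, path, expected_rows) row per distinct day, biggest first. A plain-Python stand-in for the same plan, assuming the hive-style path convention used throughout the thread (`myDate`, the filter syntax, and the counts are illustrative):

```python
import datetime
from collections import Counter

def partition_plan(dates):
    """One (filter, path, expected_rows) tuple per distinct day, mirroring
    the GROUP BY query; sorted by row count so the biggest extracts start first."""
    counts = Counter(dates)
    plan = [
        (f"myDate = '{d.isoformat()}'",                  # filter for the child job
         f"year={d.year}/month={d.month}/day={d.day}",   # S3 partition path
         n)                                              # expected_rows
        for d, n in counts.items()
    ]
    plan.sort(key=lambda row: row[2], reverse=True)
    return plan

dates = [datetime.date(2022, 11, 15)] * 3 + [datetime.date(2022, 11, 14)]
plan = partition_plan(dates)
```
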
    <item>
      <title>Re: Load data from on-prem data source to AWS s3 partition</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331857#M100797</link>
      <description>&lt;P&gt;Thank you for your valuable help.&lt;/P&gt;&lt;P&gt;I am currently importing several files from different on-prem tables into S3. Every day I load each whole table into one file. I then use these files to aggregate the data into a new Parquet file, which I load into a Redshift table that Tableau reads.&lt;/P&gt;&lt;P&gt;By partitioning the data this way, however, I could no longer download a single file but N files, a solution I would rule out.&lt;/P&gt;&lt;P&gt;At this point the only solution I can think of is to append the partition for the day just loaded to a Redshift table and aggregate the data from the Redshift tables rather than from the Parquet files.&lt;/P&gt;&lt;P&gt;Can you confirm this, or do you think there is a better solution?&lt;/P&gt;&lt;P&gt;Thank you so much,&lt;/P&gt;&lt;P&gt;Federico&lt;/P&gt;</description>
      <pubDate>Tue, 15 Nov 2022 17:45:02 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331857#M100797</guid>
      <dc:creator>ffanali0804</dc:creator>
      <dc:date>2022-11-15T17:45:02Z</dc:date>
    </item>
    <item>
      <title>Re: Load data from on-prem data source to AWS s3 partition</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331858#M100798</link>
      <description>&lt;P&gt;Hello @Balazs Gunics,&lt;/P&gt;&lt;P&gt;I have a new question about this post! &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Is there a solution to read the entire contents of a partitioned bucket in Talend Data Integration?&lt;/P&gt;&lt;P&gt;Thank you,&lt;/P&gt;&lt;P&gt;Federico&lt;/P&gt;</description>
      <pubDate>Mon, 12 Dec 2022 15:53:26 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331858#M100798</guid>
      <dc:creator>ffanali0804</dc:creator>
      <dc:date>2022-12-12T15:53:26Z</dc:date>
    </item>
    <item>
      <title>Re: Load data from on-prem data source to AWS s3 partition</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331859#M100799</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I don't recall seeing this request for Studio; however, for other components it does exist.&lt;/P&gt;&lt;P&gt;Components built on the new Talend Component Kit framework do have an Input available (take a look at the tAzureADLSGen2 components). That Input basically combines a StorageList + StorageGet + FileInput into one component.&lt;/P&gt;&lt;P&gt;The Get component combines a StorageList + StorageGet and is able to download "folders".&lt;/P&gt;&lt;P&gt;The Input streams the content of the file to Studio.&lt;/P&gt;&lt;P&gt;For ADLS Gen2 we have a feature request, TDI-48801, that asks for the following functionality:&lt;/P&gt;&lt;P&gt;&lt;I&gt;An "include subdirectory" checkbox has been added to the tAzureAdlsGen2Get component.&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;It would be useful to add the same functionality to the tAzureAdlsGen2Input component as well.&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;Use case:&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;Partitioned tables are often stored this way:&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;table/year=2022/month=11/day=25/&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;We could easily load every partition by doing a recursive list.&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;This requires the schemas to be compatible; if they are, we can load all the files from multiple folders.&lt;/I&gt;&lt;/P&gt;&lt;P&gt;I believe you're looking for the same "include subdirectory" option for S3 as well. Unfortunately, that will only be possible once we rewrite the entire component family.&lt;/P&gt;&lt;P&gt;As I believe you're using the enterprise version, feel free to raise a support case to have this feature request tracked, and feel free to provide a link back to this discussion.&lt;/P&gt;&lt;P&gt;For now, I think the best you can do is tS3List + tS3Get into a local folder and then load the data from there.&lt;/P&gt;</description>
      <pubDate>Mon, 12 Dec 2022 16:30:54 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331859#M100799</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-12-12T16:30:54Z</dc:date>
    </item>
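The workaround suggested above (tS3List + tS3Get into a local folder) amounts to a recursive prefix listing plus a per-key download. A stdlib-only sketch of the key-to-local-path mapping such a job would produce; the bucket keys and local root are hypothetical:

```python
def plan_downloads(keys, prefix, local_root):
    """Recursive 'include subdirectory' listing: map every object key under
    `prefix` to a local path that preserves the partition folders.
    Keys ending in '/' are folder markers and are skipped."""
    return {
        k: f"{local_root.rstrip('/')}/{k[len(prefix):]}"
        for k in keys
        if k.startswith(prefix) and not k.endswith("/")
    }

# Hypothetical listing of a partitioned "table" in the bucket.
keys = [
    "myTable/year=2022/month=11/day=25/part-0.parquet",
    "myTable/year=2022/month=12/day=01/part-0.parquet",
    "otherTable/part-0.parquet",
]
plan = plan_downloads(keys, "myTable/", "/tmp/myTable")
```

Each entry in `plan` corresponds to one tS3Get invocation; reading the downloaded files back only works if the schemas across partitions are compatible, as the reply notes.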
    <item>
      <title>Re: Load data from on-prem data source to AWS s3 partition</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331860#M100800</link>
      <description>&lt;P&gt;Thank you so much @Balazs Gunics.&lt;/P&gt;&lt;P&gt;So I cannot point to the S3 bucket containing the partitions and read all the files at once, as AWS Athena would?&lt;/P&gt;&lt;P&gt;Using tS3List + tS3Get would mean downloading all the files locally and then reading them iteratively to create a single file... I don't think that's very efficient.&lt;/P&gt;&lt;P&gt;Would using Talend Big Data solve this problem?&lt;/P&gt;&lt;P&gt;Thank you,&lt;/P&gt;&lt;P&gt;Federico&lt;/P&gt;</description>
      <pubDate>Mon, 12 Dec 2022 17:06:15 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331860#M100800</guid>
      <dc:creator>ffanali0804</dc:creator>
      <dc:date>2022-12-12T17:06:15Z</dc:date>
    </item>
  </channel>
</rss>

