<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Load data from on-prem data source to AWS s3 partition in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331855#M100795</link>
    <description>&lt;P&gt;Hello Balázs,&lt;/P&gt;&lt;P&gt;Thank you! I tried your solution and it works &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Now I will try to add the month and day partitions.&lt;/P&gt;&lt;P&gt;Thank you,&lt;/P&gt;&lt;P&gt;Federico&lt;/P&gt;</description>
    <pubDate>Tue, 15 Nov 2022 10:04:58 GMT</pubDate>
    <dc:creator>ffanali0804</dc:creator>
    <dc:date>2022-11-15T10:04:58Z</dc:date>
    <item>
      <title>Load data from on-prem data source to AWS s3 partition</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331853#M100793</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;My team and I are building our company's data lake.&lt;/P&gt;&lt;P&gt;The approach so far has been to copy each table from the source in full and save it as a single Parquet file on S3.&lt;/P&gt;&lt;P&gt;Instead of saving everything in one file, we would like to partition the data by one or more columns.&lt;/P&gt;&lt;P&gt;Can you tell me whether this is possible with Talend, and how?&lt;/P&gt;&lt;P&gt;Thanks a lot.&lt;/P&gt;&lt;P&gt;F&lt;/P&gt;</description>
      <pubDate>Fri, 15 Nov 2024 22:22:56 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331853#M100793</guid>
      <dc:creator>ffanali0804</dc:creator>
      <dc:date>2024-11-15T22:22:56Z</dc:date>
    </item>
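The hive-style layout the question is after (one folder per column value, e.g. year=2022/month=11/) can be sketched with stdlib Python alone. The column names and rows below are hypothetical; in practice each group would be written as its own Parquet file under the corresponding S3 prefix:

```python
from collections import defaultdict

def partition_rows(rows, columns):
    """Group rows into hive-style partition paths like 'year=2022/month=11'."""
    partitions = defaultdict(list)
    for row in rows:
        # Build the partition folder from the chosen columns, in order.
        path = "/".join(f"{c}={row[c]}" for c in columns)
        partitions[path].append(row)
    return dict(partitions)

rows = [
    {"year": 2022, "month": 11, "amount": 10},
    {"year": 2022, "month": 12, "amount": 20},
    {"year": 2022, "month": 11, "amount": 30},
]
# Each key is one partition folder; each value holds the rows to write there.
parts = partition_rows(rows, ["year", "month"])
```
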
    <item>
      <title>Re: Load data from on-prem data source to AWS s3 partition</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331854#M100794</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I'd take a slightly different approach.&lt;/P&gt;&lt;P&gt;We can generate Parquet files using: https://help.talend.com/r/en-US/8.0/parquet/tfileoutputparquet-standard-properties&lt;/P&gt;&lt;P&gt;We can upload them to S3 using the tS3Put component, and you can specify any path you wish; this path can contain the partition name.&lt;/P&gt;&lt;P&gt;Let's assume you want to partition by year (I'll use 2022). Your data loader job would need a filter for the given year, and your tS3Put would upload the Parquet file(s) to that folder: /year=2022/&lt;/P&gt;&lt;P&gt;You'd need an orchestrator job that calls the data loader via a tRunJob. In the orchestrator you'd run:&lt;/P&gt;&lt;P&gt;select extract(year from myDate) from myTable group by 1&lt;/P&gt;&lt;P&gt;This value is then passed to the child job.&lt;/P&gt;&lt;P&gt;On the tFlowToIterate link you can also enable parallel execution, allowing multiple threads to extract data from the database and speeding up the operation.&lt;/P&gt;&lt;P&gt;So, job 1:&lt;/P&gt;&lt;P&gt;DBInput -&amp;gt; tFlowToIterate -&amp;gt; tRunJob (to list the partitions)&lt;/P&gt;&lt;P&gt;Job 2:&lt;/P&gt;&lt;P&gt;DBInput -&amp;gt; tFileOutputParquet + tS3Put (to extract one partition and upload it to S3)&lt;/P&gt;&lt;P&gt;Cheers,&lt;/P&gt;&lt;P&gt;Balázs&lt;/P&gt;</description>
      <pubDate>Mon, 14 Nov 2022 13:04:20 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331854#M100794</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-11-14T13:04:20Z</dc:date>
    </item>
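A minimal sketch of the two-job pattern from the reply above: the orchestrator lists distinct years (standing in for the GROUP BY query) and hands each one to a child job (standing in for tRunJob). The database query and the tS3Put upload are replaced by plain-Python stand-ins, and the bucket and table names are hypothetical:

```python
import datetime

def list_partitions(dates):
    """Orchestrator step: emulates SELECT extract(year FROM myDate) ... GROUP BY 1."""
    return sorted({d.year for d in dates})

def child_job(year):
    """Child step: would filter rows for `year`, write Parquet, and upload
    under /year=<year>/. Here it only returns the hypothetical S3 key."""
    return f"s3://my-bucket/myTable/year={year}/part-0.parquet"

dates = [datetime.date(2021, 5, 1), datetime.date(2022, 3, 2), datetime.date(2022, 7, 9)]
# One child-job invocation per distinct year found by the orchestrator.
keys = [child_job(y) for y in list_partitions(dates)]
```

In the real jobs the iteration over years is what tFlowToIterate provides, and enabling parallel execution on that link runs several `child_job` calls concurrently.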
    <item>
      <title>Re: Load data from on-prem data source to AWS s3 partition</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331855#M100795</link>
      <description>&lt;P&gt;Hello Balázs,&lt;/P&gt;&lt;P&gt;Thank you! I tried your solution and it works &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Now I will try to add the month and day partitions.&lt;/P&gt;&lt;P&gt;Thank you,&lt;/P&gt;&lt;P&gt;Federico&lt;/P&gt;</description>
      <pubDate>Tue, 15 Nov 2022 10:04:58 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331855#M100795</guid>
      <dc:creator>ffanali0804</dc:creator>
      <dc:date>2022-11-15T10:04:58Z</dc:date>
    </item>
    <item>
      <title>Re: Load data from on-prem data source to AWS s3 partition</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331856#M100796</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;Not necessarily. For example, you could:&lt;/P&gt;&lt;P&gt;SELECT 'myDate = ' || myDate::date AS filter, 'year=' || extract(year from myDate) || '/month=' || extract(month from myDate) || '/day=' || extract(day from myDate) AS path, count(*) AS expected_rows&lt;/P&gt;&lt;P&gt;FROM myTable&lt;/P&gt;&lt;P&gt;GROUP BY 1, 2&lt;/P&gt;&lt;P&gt;ORDER BY 3 DESC&lt;/P&gt;&lt;P&gt;So the same two-job approach is enough; just tune your parameters.&lt;/P&gt;&lt;P&gt;This gives you the filter expression and path that you can pass to the child job. (Processing the largest partitions first will reduce the overall runtime when multiple threads are used.)&lt;/P&gt;</description>
      <pubDate>Tue, 15 Nov 2022 11:12:30 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331856#M100796</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-11-15T11:12:30Z</dc:date>
    </item>
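The query above yields one (filter, path, expected_rows) row per distinct day, biggest first. A plain-Python stand-in for the same plan, assuming the hive-style path convention used throughout the thread (`myDate`, the filter syntax, and the counts are illustrative):

```python
import datetime
from collections import Counter

def partition_plan(dates):
    """One (filter, path, expected_rows) tuple per distinct day, mirroring
    the GROUP BY query; sorted by row count so the biggest extracts start first."""
    counts = Counter(dates)
    plan = [
        (f"myDate = '{d.isoformat()}'",                  # filter for the child job
         f"year={d.year}/month={d.month}/day={d.day}",   # S3 partition path
         n)                                              # expected_rows
        for d, n in counts.items()
    ]
    plan.sort(key=lambda row: row[2], reverse=True)
    return plan

dates = [datetime.date(2022, 11, 15)] * 3 + [datetime.date(2022, 11, 14)]
plan = partition_plan(dates)
```
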
    <item>
      <title>Re: Load data from on-prem data source to AWS s3 partition</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331857#M100797</link>
      <description>&lt;P&gt;Thank you for your valuable help.&lt;/P&gt;&lt;P&gt;I am currently importing several files from different on-prem tables into S3. Every day I load each whole table into one file. I then use these files to aggregate the data into a new Parquet file, which I load into a Redshift table that Tableau reads.&lt;/P&gt;&lt;P&gt;By partitioning the data this way, however, I could no longer download a single file but N files, a solution I would rule out.&lt;/P&gt;&lt;P&gt;At this point the only solution I can think of is to append the partition for the day just loaded to a Redshift table and aggregate the data from the Redshift tables rather than from the Parquet files.&lt;/P&gt;&lt;P&gt;Can you confirm this, or do you think there is a better solution?&lt;/P&gt;&lt;P&gt;Thank you so much,&lt;/P&gt;&lt;P&gt;Federico&lt;/P&gt;</description>
      <pubDate>Tue, 15 Nov 2022 17:45:02 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331857#M100797</guid>
      <dc:creator>ffanali0804</dc:creator>
      <dc:date>2022-11-15T17:45:02Z</dc:date>
    </item>
    <item>
      <title>Re: Load data from on-prem data source to AWS s3 partition</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331858#M100798</link>
      <description>&lt;P&gt;Hello @Balazs Gunics,&lt;/P&gt;&lt;P&gt;I have a new question about this post! &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Is there a solution to read the entire contents of a partitioned bucket in Talend Data Integration?&lt;/P&gt;&lt;P&gt;Thank you,&lt;/P&gt;&lt;P&gt;Federico&lt;/P&gt;</description>
      <pubDate>Mon, 12 Dec 2022 15:53:26 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331858#M100798</guid>
      <dc:creator>ffanali0804</dc:creator>
      <dc:date>2022-12-12T15:53:26Z</dc:date>
    </item>
    <item>
      <title>Re: Load data from on-prem data source to AWS s3 partition</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331859#M100799</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I don't recall seeing this request for Studio; however, for other components it does exist.&lt;/P&gt;&lt;P&gt;Components built on the new Talend Component Kit framework do have an Input available (take a look at the tAzureADLSGen2 components). That Input basically combines a StorageList + StorageGet + FileInput into one component.&lt;/P&gt;&lt;P&gt;The Get component combines a StorageList + StorageGet and is able to download "folders".&lt;/P&gt;&lt;P&gt;The Input streams the content of the file to Studio.&lt;/P&gt;&lt;P&gt;For ADLS Gen2 we have a feature request, TDI-48801, that asks for the following functionality:&lt;/P&gt;&lt;P&gt;&lt;I&gt;An "include subdirectory" checkbox has been added to the tAzureAdlsGen2Get component.&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;It would be useful to add the same functionality to the tAzureAdlsGen2Input component as well.&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;Use case:&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;Partitioned tables are often stored this way:&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;table/year=2022/month=11/day=25/&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;We could easily load every partition by doing a recursive list.&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;This requires the schemas to be compatible; if they are, we can load all the files from multiple folders.&lt;/I&gt;&lt;/P&gt;&lt;P&gt;I believe you're looking for the same "include subdirectory" option for S3 as well. Unfortunately, that will only be possible once we rewrite the entire component family.&lt;/P&gt;&lt;P&gt;As I believe you're using the enterprise version, feel free to raise a support case to have this feature request tracked, and feel free to provide a link back to this discussion.&lt;/P&gt;&lt;P&gt;For now, I think the best you can do is tS3List + tS3Get into a local folder and then load the data from there.&lt;/P&gt;</description>
      <pubDate>Mon, 12 Dec 2022 16:30:54 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331859#M100799</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-12-12T16:30:54Z</dc:date>
    </item>
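The workaround suggested above (tS3List + tS3Get into a local folder) amounts to a recursive prefix listing plus a per-key download. A stdlib-only sketch of the key-to-local-path mapping such a job would produce; the bucket keys and local root are hypothetical:

```python
def plan_downloads(keys, prefix, local_root):
    """Recursive 'include subdirectory' listing: map every object key under
    `prefix` to a local path that preserves the partition folders.
    Keys ending in '/' are folder markers and are skipped."""
    return {
        k: f"{local_root.rstrip('/')}/{k[len(prefix):]}"
        for k in keys
        if k.startswith(prefix) and not k.endswith("/")
    }

# Hypothetical listing of a partitioned "table" in the bucket.
keys = [
    "myTable/year=2022/month=11/day=25/part-0.parquet",
    "myTable/year=2022/month=12/day=01/part-0.parquet",
    "otherTable/part-0.parquet",
]
plan = plan_downloads(keys, "myTable/", "/tmp/myTable")
```

Each entry in `plan` corresponds to one tS3Get invocation; reading the downloaded files back only works if the schemas across partitions are compatible, as the reply notes.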
    <item>
      <title>Re: Load data from on-prem data source to AWS s3 partition</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331860#M100800</link>
      <description>&lt;P&gt;Thank you so much @Balazs Gunics.&lt;/P&gt;&lt;P&gt;So I cannot point to the S3 bucket containing the partitions and read all the files at once, as AWS Athena would?&lt;/P&gt;&lt;P&gt;Using tS3List + tS3Get would mean downloading all the files locally and then reading them iteratively to create a single file... I don't think that's very efficient.&lt;/P&gt;&lt;P&gt;Would using Talend Big Data solve this problem?&lt;/P&gt;&lt;P&gt;Thank you,&lt;/P&gt;&lt;P&gt;Federico&lt;/P&gt;</description>
      <pubDate>Mon, 12 Dec 2022 17:06:15 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Load-data-from-on-prem-data-source-to-AWS-s3-partition/m-p/2331860#M100800</guid>
      <dc:creator>ffanali0804</dc:creator>
      <dc:date>2022-12-12T17:06:15Z</dc:date>
    </item>
  </channel>
</rss>

