Anonymous
Not applicable

Integration with Amazon S3

Hello,
I am using Amazon's cloud-based services EC2 (Elastic Compute Cloud), Redshift, and S3 (Simple Storage Service) to host a data warehouse. To move files from FTP into Redshift, I am using Talend load jobs that work in three steps:
i) Fetch the data from FTP and download it to EC2
ii) Upload the data from EC2 to S3
iii) Load the data from S3 into the Redshift tables
While executing the Talend load job, I found that step i) runs normally and the files are downloaded to EC2 successfully, but step ii) fails. The error log reports that "a region must be specified --region or specifying the region in a configuration file or as an environment variable."
The detailed error log is below:
A region must be specified --region or specifying the region
in a configuration file or as an environment variable.
Alternately, an endpoint can be specified with --endpoint-url
Exception in component tRedshiftRow_1
org.postgresql.util.PSQLException: ERROR: The specified S3 prefix 'Acxiom/adv_oli_20140910.dat' does not exist
  Detail:
  -----------------------------------------------
  error:  The specified S3 prefix 'Acxiom/adv_oli_20140910.dat' does not exist
  code:      8001
  context:  
  query:     226570
  location:  s3_utility.cpp:545
  process:   padbmaster
  -----------------------------------------------
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2096)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1829)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:510)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:372)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:364)
    at talend_jobs.dup_adv_oli_load_with_error_catching_0_1.Dup_adv_oli_Load_with_error_catching.tRedshiftRow_1Process(Dup_adv_oli_Load_with_error_catching.java:2313)
    at talend_jobs.dup_adv_oli_load_with_error_catching_0_1.Dup_adv_oli_Load_with_error_catching.tSystem_1Process(Dup_adv_oli_Load_with_error_catching.java:2199)
    at talend_jobs.dup_adv_oli_load_with_error_catching_0_1.Dup_adv_oli_Load_with_error_catching.tSystem_5Process(Dup_adv_oli_Load_with_error_catching.java:2008)
    at talend_jobs.dup_adv_oli_load_with_error_catching_0_1.Dup_adv_oli_Load_with_error_catching.runJobInTOS(Dup_adv_oli_Load_with_error_catching.java:3442)
    at talend_jobs.dup_adv_oli_load_with_error_catching_0_1.Dup_adv_oli_Load_with_error_catching.main(Dup_adv_oli_Load_with_error_catching.java:3290)
adv_oli_20140910.dat is the name of the file to be transferred from EC2 to S3, and
talend_jobs.dup_adv_oli_load_with_error_catching_0_1 is the name of the Talend 'Load' Job.
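For reference, the message itself says the region can come from --region, a configuration file, or an environment variable. Below is a minimal standalone sketch of the upload step with the region pinned through the environment, assuming the copy is the AWS CLI invoked from Java the way tSystem would shell it out; the local path and the us-east-1 region are placeholders, not my actual values:

    import java.io.IOException;

    public class S3CopyWithRegion {
        public static void main(String[] args) throws IOException, InterruptedException {
            // Hypothetical local file, mirroring the job's dated naming.
            ProcessBuilder pb = new ProcessBuilder(
                "aws", "s3", "cp",
                "/data/incoming/adv_oli_20140910.dat",
                "s3://WoltersKluwerHealthMR/Acxiom/");
            // AWS_DEFAULT_REGION is the environment variable the AWS CLI reads;
            // supplying a region this way should avoid the error quoted above.
            pb.environment().put("AWS_DEFAULT_REGION", "us-east-1");
            pb.inheritIO();
            int exit = pb.start().waitFor();
            System.out.println("aws s3 cp exited with " + exit);
        }
    }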

Any help would be highly appreciated.
Thanks,
Shobhit
8 Replies
Anonymous
Not applicable
Author

Hi Shobhit,
Can you show the job design? I am not sure how you are using the tRedshiftRow_1 component to load data into Redshift.
Please check the example scenario for loading data into Redshift:
https://help.talend.com/pages/viewpage.action?pageId=19426484#tredshiftinputscenario
Could you elaborate on the design side?
Thanks
Vaibhav
Anonymous
Not applicable
Author

Hi Vaibhav,
Please find attached the screenshot of the job design.

[Attachment: 0683p000009MERN.jpg]
Anonymous
Not applicable
Author

Please find attached.
Anonymous
Not applicable
Author

I think there is an issue here. S3 is file storage, while tRedshiftRow fires a query against the database. How are you reading a file using a query? Can you show the query that reads the file from S3 and inserts the data into Redshift? Please look at the following scenario:
https://help.talend.com/pages/viewpage.action?pageId=19426484#tredshiftinputscenario
I think you may need to download the file to a local drive and then insert it into Redshift.
Please let me know if I have misunderstood.
Thanks
Vaibhav
Anonymous
Not applicable
Author

The raw files are picked up from FTP and put on S3 as shown in the screenshot: the tSystem_5 component downloads the file from FTP to EC2.
The second component copies the data from EC2 to S3 using the cp utility, run through a tSystem component; a sketch of the kind of command it issues is below.
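Assuming the cp utility here is the AWS CLI, which matches the region error in my first post, the command expression is along these lines; the local path is a placeholder, and note that nothing in it pins a region:

    "aws s3 cp /data/incoming/adv_oli_" + TalendDate.getDate("CCYYMMDD") + ".dat s3://WoltersKluwerHealthMR/Acxiom/"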
The third component copies the data from S3 to Redshift using the following query for a fixed-width raw file:
" create table phase2.adv_oli_increment_raw_temp as select top 1 * from phase2.adv_oli_increment_raw_2;
copy phase2.adv_oli_increment_raw_temp from 's3://WoltersKluwerHealthMR/Acxiom/adv_oli_"+TalendDate.getDate("CCYYMMDD")+".dat' CREDENTIALS '' 
 COMPUPDATE ON ACCEPTINVCHARS MAXERROR AS 50 REMOVEQUOTES TRUNCATECOLUMNS;
"

Thanks,
Shobhit
Anonymous
Not applicable
Author

OK, great. As per the error, the following path prefix does not exist:
The specified S3 prefix 'Acxiom/adv_oli_20140910.dat' does not exist
Do the bucket and the prefix actually exist in S3? Can you check manually? If they do, use a print statement to check the generated query,
and test the query using a DB client.
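For example, a quick standalone check along these lines, with plain SimpleDateFormat standing in for Talend's TalendDate routine, prints the key to verify in the S3 console and the full statement to paste into a DB client:

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class CheckS3Key {
        public static void main(String[] args) {
            // TalendDate.getDate("CCYYMMDD") formats the current date with a
            // four-digit year, month, day -- yyyyMMdd in SimpleDateFormat terms.
            String stamp = new SimpleDateFormat("yyyyMMdd").format(new Date());
            String key = "Acxiom/adv_oli_" + stamp + ".dat";

            System.out.println("expected S3 key: " + key);
            System.out.println(
                "copy phase2.adv_oli_increment_raw_temp"
              + " from 's3://WoltersKluwerHealthMR/" + key + "'"
              + " CREDENTIALS ''" // redacted, as in the job
              + " COMPUPDATE ON ACCEPTINVCHARS MAXERROR AS 50"
              + " REMOVEQUOTES TRUNCATECOLUMNS;");
        }
    }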

Vaibhav
Anonymous
Not applicable
Author

@Shobhit_G

I would like to know where Talend is installed. Is it on an EC2 instance? I have the same requirement, so this will help me.
Anonymous
Not applicable
Author

@Shobhit_G

Could you please post the configuration of 'Copy data from EC2 to S3', or suggest what to configure in the tSystem component?

Thanks & Regards

Hema KG