stucas
Contributor

tDBOutputBulk lacks "Output in row mode" - Any way round it?

Quick Synopsis:

I am using an iterate to read files from S3 in parallel and write/merge them to a single file. The target is a tDBOutputBulk component (targeting an AWS Aurora PostgreSQL RDS). I am finding that the concurrent writes to the file often cause issues, with malformed records causing the batch to fail.

 

The file created by the tDBOutputBulk component is essentially a CSV, and I have tried substituting the component with a tFileDelimited, which has an "Output in row mode" option ensuring that records are fully written. This seems to have solved the malformed-record issue but causes another: the tFileDelimited has no way to represent a "null string" and will always write an empty field as double quotes, rather than skipping it with no value as the tDBOutputBulk would. DB errors thus occur when passing "" as a date, even though the tDBBulkExec has been set with "" in the Null String property.

 

Has anyone else hit this scenario? If so, I would be grateful for any pointers. The only thing I can now think of doing is creating a mask representing a null value, which is written out and detected by the tDBBulkExec - however, this means that I am changing the data, and *if* my mask were actually entered in a field it would not transfer. It would also mean that I would have to redefine the types of the schemas, which would then cause schema issues.
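The null-representation problem above can be made concrete with a small sketch. This is not Talend's generated code - it is a hypothetical CSV formatter (`toCsvLine` is an invented helper) showing why a loader that treats "" as null needs nulls emitted as an unquoted empty token rather than a quoted empty string:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CsvNulls {
    // Hypothetical formatter: a null field becomes the given null token
    // (e.g. "" for an unquoted empty field, "\N" for PostgreSQL COPY's
    // default), while real values stay quoted so the loader can tell a
    // genuine empty string apart from a missing value.
    public static String toCsvLine(List<String> fields, String nullToken) {
        return fields.stream()
                .map(f -> f == null ? nullToken
                                    : "\"" + f.replace("\"", "\"\"") + "\"")
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        List<String> row = Arrays.asList("42", null, "2024-01-01");
        System.out.println(toCsvLine(row, ""));     // "42",,"2024-01-01"
        System.out.println(toCsvLine(row, "\\N"));  // "42",\N,"2024-01-01"
    }
}
```

A component that always quotes, as described above, would emit `"42","","2024-01-01"` instead, and the bulk loader then tries to parse `""` as a date.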

9 Replies
Anonymous
Not applicable

Writing to a flat file in a multi-threaded way is not optimal. If you are reading the files from S3 in parallel to improve performance there, why not write the data to an in-memory HSQLDB database instead, then write the file in a single thread from that? Then you can use the tDBOutputBulk without issue.
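The pattern being suggested - many parallel readers funnelled into one single-threaded writer - can be sketched in plain Java. Here a `BlockingQueue` stands in for the in-memory HSQLDB staging store purely for illustration; the point is that only one thread ever produces the output, so records can never interleave mid-line:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

public class ParallelFunnel {
    static final String EOF = "<<EOF>>";  // poison pill marking end of input

    public static List<String> run(List<String> files, int readers) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(readers);

        // Parallel "readers": each pushes its whole records into the shared queue.
        List<Future<?>> jobs = new ArrayList<>();
        for (String f : files) {
            jobs.add(pool.submit(() -> queue.add("record-from-" + f)));
        }
        for (Future<?> j : jobs) j.get();  // wait for all readers to finish
        queue.add(EOF);
        pool.shutdown();

        // Single-threaded "writer": drains the queue one record at a time.
        List<String> output = new ArrayList<>();
        for (String rec = queue.take(); !rec.equals(EOF); rec = queue.take()) {
            output.add(rec);
        }
        return output;
    }
}
```

Staging in HSQLDB works the same way conceptually, with the added benefit that the single-threaded drain can be an SQL query.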

stucas
Contributor
Author

It never even occurred to me that that was feasible - and it was not suggested by Talend Professional Services in a review we had done when working with them.

 

Are there any Talend examples for this? I'm happy to try.

Anonymous
Not applicable

Sorry about the delay in getting back to you. I've been a bit busy this morning, and when I read your response I wanted to knock up a very quick example, as I couldn't find one that came close to what you are trying to do. But these components are not too difficult to get your head around once you've had a play with them. Essentially, they are just like any other DB components.

 

So here is a quick example I put together.....

[screenshot: example job layout]

I'm randomly generating data and inserting it into the tHSQLDBOutput. I have configured the tHSQLDBOutput as below:

[screenshot: tHSQLDBOutput configuration]

There are 3 things to keep in mind with this:

 

  1. Set the "Running Mode" to "HSQLDB in Memory"
  2. Ensure that your "Action on table" is set to "Create table if does not exist". I've put a red box around that one as it is easy to overlook.
  3. Ensure your schema has size and precision values (you will get an error if you do not ensure that).

 

The tHSQLDBInput should be set up in the same way.

 

The documentation on the tHSQLDB components can be found here...

https://help.talend.com/r/en-US/8.0/hsqldb/hsqldb

 

If you have ever used the tHash components, these are a step up from those. As well as storing your data in memory, they enable you to use SQL to return the data (instead of just returning it all), and you can use the data from ANYWHERE in your job (even child jobs) so long as the main job is still running.

stucas
Contributor
Author

Thanks - don't worry about delay, any help is greatly appreciated.

 

I've been looking at this, but I'm having issues and think it might be something to do with the scope of subjobs. I've simplified the requirement to take 10 files (from my PC, not S3), load them into HSQLDB, then do a tDBOutputBulkExec.

 

Simplified job:

[screenshot: simplified job]

 

When using connections as "in-memory"

This only worked when the iteration count was 1. Any change to the iteration value causes a "java.sql.SQLSyntaxErrorException: object name already exists" error. The db-out component is set to "create table if not exist", yet it's the second iteration that fails, which shouldn't happen (unless it's rebuilding the db for some reason).

 

[screenshot: component settings]

[screenshot: iteration error]

 

If I try to use "In-Process persistent"

Firstly, you can create a metadata connection, but this seems to need a fully qualified path; otherwise the job components seem to use it as a relative path and end up creating databases in other locations!

[screenshot: metadata connection settings]
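That relative-path behaviour is consistent with how file paths resolve in Java generally: a relative database path resolves against the JVM's current working directory, which for a launched Talend job is wherever the job was started from, not necessarily the project folder (an assumption about the cause, not confirmed Talend behaviour). A minimal sketch of the resolution rule:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class DbPathResolution {
    // A relative path like "data/mydb" resolves against the working
    // directory; a fully qualified path is used as-is. Passing an
    // absolute path to the connection sidesteps the surprise.
    public static String resolve(String dbPath, String workingDir) {
        Path p = Paths.get(dbPath);
        return p.isAbsolute()
                ? p.toString()
                : Paths.get(workingDir).resolve(p).normalize().toString();
    }
}
```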

Various issues here, but fundamentally 2:

  1. I had the job working (subsequent runs, though, fail!) yet it leaves a footprint behind that would need to be deleted - I'm assuming that's down to the "persistence" in the name 🙂 So physically deleting this might be an option.
  2. Second iterations also fail with an "object already exists" error, which seems even stranger as it's a persistent database and the table was created in the first iteration.

 

It's obvious I'm missing something here, but even when a run works, subsequently added subjobs, like a tHSQLDBRow querying the tables for a row count, also fail with "Object does not exist".

 

I'm fighting to understand this, but the Talend components hide a lot and the generated code is a little "esoteric" to follow.

 

References used:

http://www.hsqldb.org/doc/2.0/guide/running-chapt.html

https://www.datalytyx.com/working-thsqldb-components-talend-studio/

stucas
Contributor
Author

Looks like we cross-posted - am reading your reply now.

stucas
Contributor
Author

I've simplified my job down to what is your first subjob, but with an iterate - you're writing sequential rows. If you iterate from a tFileList with anything greater than 1, it fails.

 

[screenshot: job with iterate]

[screenshot: error output]

I'm assuming that under the covers, "Create table if not exists" is actually forcing a straight "create table" on both the first and second iteration. Could this be a bug?

 

Anonymous
Not applicable

Ah, I suspected that this could be a problem. But there is an easy workaround for this.

 

Before your subjob which loads the data (it looks like your first subjob), add another one which will be used to create the table in memory. Just use something like a tFixedFlowInput with the same schema (Number of rows set to 0) and connect it to a tHSQLDBOutput set up as I explained before. Then connect that subjob to the subjob shown above and run the job.

 

The intention of this is to create the table in memory before this is attempted by multiple threads. If you have 4 threads all initiating a write to a table that doesn't exist, they will all try to create it. One of them will complete first and the others will try and fail soon after. This is what you are seeing above. By creating the table in a single thread, it removes this problem.
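The race being described can be demonstrated in miniature. This is a toy model, not HSQLDB's actual code: the catalog is a plain set, and "create table if not exists" is modelled as a non-atomic check-then-act, which is an assumption about what happens under the covers. Pre-creating in a single thread makes the concurrent attempts all see the table and skip the create:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CreateTableRace {
    // Toy stand-in for the database catalog.
    static final Set<String> catalog = ConcurrentHashMap.newKeySet();

    static void createIfNotExists(String table) {
        if (!catalog.contains(table)) {      // check: table missing?
            try { Thread.sleep(10); } catch (InterruptedException ignored) {}
            if (!catalog.add(table)) {       // act: add() is false if another thread won
                throw new IllegalStateException("object name already exists: " + table);
            }
        }
    }

    // Runs `threads` concurrent "create table" attempts; returns how many failed.
    public static int raceFailures(int threads, boolean preCreate) throws Exception {
        catalog.clear();
        if (preCreate) createIfNotExists("stage");  // single-threaded create first
        AtomicInteger failures = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    start.await();
                    createIfNotExists("stage");
                } catch (IllegalStateException e) {
                    failures.incrementAndGet();
                } catch (InterruptedException ignored) {}
            });
        }
        start.countDown();                  // release all threads at once
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return failures.get();
    }
}
```

With `preCreate` set, every worker finds the table already present and no create is attempted, which is exactly what the extra tFixedFlowInput subjob achieves.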

stucas
Contributor
Author

Many thanks.

It's working now - though as I dig more into its use, there are some annoyances, e.g. schema datatype variances between HSQLDB and Postgres - but "c'est la vie" 🙂

 

As an interesting sidenote/investigation, I've directly substituted the HSQLDB for a HashMap - that works quite well and would actually seem faster to read/write than HSQLDB.

 

Running more decisive tests now.

 

Once again, thank you.

Anonymous
Not applicable

When you say a HashMap, do you mean the Java class or the tHash component? The HashMap class is not thread safe. You aren't necessarily going to have massive issues in this use case, but you should be aware of that. I believe the tHash components have the same issue. Just something to be aware of. Talend Studio basically writes Java applications. So you need to be aware of the pros and cons of the Java classes that are used when working in a multi-threaded environment.