dan2
Contributor III

Talend Studio feedback for Talend internal dev team

We are starting our adoption of Talend and, in general, it is much more robust than our current Informatica software; however, there is some room for improvement. After working through the creation of my first production job, I have the following constructive feedback regarding Talend Studio:

 

  1. Does not let you auto-organize job components
  2. Workflow lines between components cannot be moved directly, so they overlap components when you continue a flow underneath a row of components/subjobs. The only solution I have found is to make the last component in the line misaligned with the rest on the same row.
  3. Cannot automatically zoom to fit all components on screen
  4. Components can get stuck in ways where you cannot double-click them
  5. Subjob bubbles cannot easily be resized manually
  6. The red error box around component icons is very hard to click into for error details most of the time
  7. The Code Viewer and the Code editor should just be the same thing. It was initially confusing to figure out why line numbers would not show in the Code Viewer after making the settings change.
  8. Searching the Palette for a component by its exact name does not bring that component to the top, and the component can still be hidden in an unexpanded hierarchy.
  9. Items are not in alphabetical order in certain areas of the Repository hierarchy, such as under Metadata. For example, it always takes me a while to find Generic Schemas.
  10. If I try to change the case of a Context Variable, it won't let me until I fully delete the variable since the uniqueness test is triggered improperly in a self-referential manner.
  11. When I retrieve the schema of an existing Table schema after there were updates on the database side, it sometimes says that there were no updates, even though it does in fact update the schema.
  12. There is no consistency in product offering naming conventions. Even though I am using what is considered Talend Studio, the title at the top of my screen is Talend Cloud Real-Time Big Data Platform (7.3.1.20200219_1130). If I go into Help > About Talend Studio, it also says that it's Talend Cloud. If I download Data Studio while logged in, I get something different than when not logged in. It is very confusing.
  13. Regarding the Talend Help Center ribbon, which looks like a browser advertisement bar, it is not apparent how it can be removed. There are almost no components available in the Exchange. Videos does not go to Talend Academy or a full content provider and only shows four videos.
  14. Components do not display a help icon linking to their documentation page.
  15. Many of the Cheat Sheets under help are missing and lead to "Cheat sheet content file * not found."
  16. If you maximize a panel, then minimize it, it fully vanishes. Then, for example, you need to double-click into a Component in the designer to get it back, which makes the panel show up in an obscure place; clicking maximize again will put the panel back in the correct place.
  17. Clicking Debug Run tab under Run freezes the application for several minutes.
  18. Project files will not import unless you start from the initial splash screen.
  19. I should not need multiple logins to go to the Talend forum and other areas of the Talend website. That was very confusing.
  20. tDBBulkExec has no database Schema override, so the Schema option set in tDBConnection is the only one that can be targeted, and multiple connections are needed to target different schemas.
  21. Filter cannot be used to route multiple output rows. For example, one file has more than one report, identified by the first column of the row. Rows need to be routed to different paths. This does not appear to be easily achievable.
  22. There is a Prejob and Postjob, but no component that I can use for the parallel flows that occur in the main job, so I am forced to use something like tSleep and label it Start_Main_Job_1 in order to make it look nice.
  23. YOUR FORUM DOES NOT HAVE A FEEDBACK AND FEATURE SUGGESTION TOPIC....
17 Replies
dan2
Contributor III
Author

Thanks for these thorough explanations. This is a big help and it's nice to see that the dev team is engaged with the user community.

 

6. Some errors do allow you to click them from the designer and get a little blurb. Agreed that the Code tab is better for this. Double-clicking the error box should simply open the offending line in the code editor.

 

7. There is a View called Code Viewer, which is separate from the Code tab. I ultimately removed it, but it is worth mentioning that line numbers cannot be added to it and it seems a bit redundant.

 

8. Even when clicking the magnifying glass or hitting enter, there are certain items that it finds that stay hidden under their hierarchy. It should automatically expand all related entries when doing a search. Typing into the designer is a neat trick and I'll definitely start using that more.

 

10. I'd go as far as to call this a bug.

 

14. Definitely a bit of an Easter Egg. Seems like adding that question mark icon in various places would improve user adoption of the tool. Added bonus if, when you hover over the question mark, the tooltip hints at the F1 hot-key. Please note, I am comparing this tool to Informatica, which has these bells and whistles.

 

20. This is more about Schema as a database target vs. its structure. For example, I have a database/schema named fin_staging and it has a table in it called test_table, so the full path to the table is fin_staging.test_table. In the tDBConnection, I have to set that Schema option to fin_staging. If I do not set it, anything that I do in tDBBulkExec will default to the Redshift "public" database/schema. If I set Table Name in tDBBulkExec to "fin_staging.test_table" and have my tDBConnection set to "xyz", it creates a table named "xyz.fin_staging.test_table" and does not override "xyz" in favor of "fin_staging". If I do not set Schema in the tDBConnection, it will default to "public.fin_staging.test_table". I'm trying to reuse the same connection, but am not able to when targeting multiple database schemas for this reason.

 

21. I think that I need to test this a bit more. If I'm understanding correctly, I would hit the plus for an additional output and can put in the other row schema, since it differs in columns and data types. Then, I'd use an expression filter like "RACT0010".equals(row3.record_type)? The tFileInputDelimited requires a defined schema as well, so do I just use a more generic schema to read in the file, like one that assumes the maximum number of potentially available columns and defaults them all to varchar(255)?

 

22. With the pre-job, I have seen the behavior that the rest of the job will continue on even if some of the tDBRow steps fail. Do I avoid this by adding tDie to them with Priority "Fatal"? I am seeing a growing trend where people simply chain their main job to the Prejob, since it avoids the issue I mention and looks a bit cleaner than having the main job component chain look orphaned without a nice-looking starting icon like the Prejob and Postjob have. Is the best practice for the main job chains/threads to be added with a tParallelize or tPartitioner starter component, even if there is just one main chain/thread?

 

25. As time permits, I'll try to recreate the issue and test your solution. For my use case, I might not need so many mappings anyway, provided that the solution for 21 works.

 

26. Similar to 22, it would be nice to have a component with an icon that signifies the end of the main job; not necessarily the Postjob, which I use just to close my database and S3 connections, even if there are failures. I'll definitely take a look at the linked tutorial when I can.

 

I ran into an additional issue yesterday worth mentioning, when establishing my Redshift tDBConnections and then re-using them throughout the job. Our database is seemingly configured aggressively to boot connections that stay idle for 60 seconds, which kills the established connections. The Redshift components don't know what to do when it is their turn to run, and there is no option to tell them to attempt a reconnect. The job gets stuck in an endless loop reporting that the connection was booted and never throws an error. I had to add Additional parameters: tcpKeepAlive=true&TCPKeepAliveMinutes=1 in order to resolve the issue. It would be nice if some of these pitfall parameters covering the timeout scenarios were standard options when configuring the Redshift connection. When browsing the documentation, it seems that they may have been at one point, but are not there anymore. It took me several hours to solve that issue, since I really wanted to use the same connection and commit all the changes at the end of the job.
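
For anyone hitting the same thing: those Additional parameters just get appended to the connection's JDBC URL, so a rough plain-JDBC sketch of what the keep-alive setup amounts to is below (the cluster endpoint, database, and credentials are placeholders, not our real ones):

import java.sql.Connection;
import java.sql.DriverManager;

public class RedshiftKeepAliveSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint, database, and credentials; only the two keep-alive
        // parameters at the end are the ones that resolved the idle-timeout issue.
        String url = "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/mydb"
                + "?tcpKeepAlive=true&TCPKeepAliveMinutes=1";
        try (Connection conn = DriverManager.getConnection(url, "my_user", "my_password")) {
            conn.setAutoCommit(false); // keep one session open and commit once at the end
            // ... staging statements would run here ...
            conn.commit();
        }
    }
}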

 

Thanks,

Dan

dan2
Contributor III
Author

  • I linked the Prejob to the main job in order to ensure that any errors in the Prejob would not allow the main job to run. I was experiencing tDBRow errors where the main job still ran.
  • The above screenshots were mostly to demonstrate the overlap issues with the OnSubjobOk lines, regardless of their usage. If I wanted to simply continue the Prejob onto another grid row, I'd have to lower the last component on the top row instead of simply being able to modify the shape of the line directly like you would in a charting tool.
gjeremy1617088143

In the configuration of the Studio you can choose Line instead of Curve for the link design; personally I prefer Line, it's clearer.

gjeremy1617088143

(screenshots attached) Hope this trick can help you with the line overlapping problems.

Anonymous
Not applicable

Thanks Dan for your further explanations. I have got our product managers taking a look at your list of issues, so this elaboration certainly helps.

 

20) Oh, I see what you mean. I tend to be less concerned with the number of connections I am using unless I want to achieve an atomic transaction across multiple locations. Is there a reason why a connection per schema would be a deal breaker here? That said, I do see that this would be frustrating, especially when your login can access multiple schemas simultaneously.

 

21) Yes, just hit + and add as many outputs as you wish. The row structure or schema can be as you wish for each output. If your file has a different schema for each row, it can get tricky, but you sound like you have the right idea to start with.
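
To make that concrete, here is a rough Java sketch of the routing decision each tMap output filter would make per row (the "RACT0010" value and the record_type idea come from Dan's example; the second report code and the delimiter are made up for illustration):

public class RecordTypeRoutingSketch {
    public static void main(String[] args) {
        // Two rows from the same file, distinguished by the value in the first column.
        String[] lines = {
            "RACT0010;2021-05-01;100.00",
            "RACT0020;2021-05-01;something;with;more;columns"
        };
        for (String line : lines) {
            String recordType = line.split(";", 2)[0];
            // Equivalent of a tMap output filter expression such as
            // "RACT0010".equals(row3.record_type)
            if ("RACT0010".equals(recordType)) {
                System.out.println("report A output <- " + line);
            } else {
                System.out.println("report B output <- " + line);
            }
        }
    }
}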

 

22) The tPreJob won't stop the main flow (not attached to the tPreJob or tPostJob) from starting unless there is a major issue in the tPreJob. However, you can control this with a bit of logic at the beginning of the main flow if you like. I suspect this could be handled with a tJava and some RunIf connectors. It's tricky to give you a solution without seeing the problem. But there will be a simple fix to this.

 

It is only the tPostJob that has guarantees on running.

 

Your issue with the connection closing is a fair point, and your suggestion seems reasonable. As an alternative, it might be a solution to use the connection in your Redshift component that is selecting/inserting/updating/deleting. This will only be triggered when you have something to do with the DB, so you are less likely to see that timeout. If you are opening a connection and then doing some potentially lengthy calculations over hundreds of thousands/millions of rows, your connection may be dropped in that time. Just a suggestion for another way of tackling this 🙂

 

dan2
Contributor III
Author

20) I prefer to commit all database changes at the end of the job after ensuring that everything staged properly. Spreading those changes across two or more sessions makes it a bit more complicated to manage multiple commits.

 

22) If I use the tJava approach to learn whether all Prejob items succeeded, instead of just linking to the main flow via a chain of OnSubJobOk's, is there a status variable associated with the overall successful completion of the Prejob? Otherwise, it sounds like a lot of extra work just to eliminate a linking line.

 

Regarding the timeouts, that hopefully won't occur while a connection is actively being used. Per the earlier comment about commits, I'd rather use a single session for the whole job and commit at the end vs. opening a new session for each component step in the flow. This also makes it easier for our DBAs to potentially anticipate/manage resource availability for our process, since they can look at the trending of a single session instead of many.

Anonymous
Not applicable

I came from an Informatica background (a LONG time ago) and I had to change the way that I thought about a mapping (that expression takes me back somewhat). With Talend, you have A LOT more freedom to create, but that does come with some extra work I guess. However, it doesn't take long to get into that way of thinking and that is when you see the doors it opens for you.

 

For example, you can do anything you can in Java using Talend (apart from GUI stuff of course....unless you try REALLY hard). With regard to 22, this is where you see this. As part of every Talend Job there is a HashMap called globalMap. This can be used similarly to how "global variables" are used in Informatica. So, if you have a way of assessing the success or failure of your tPreJob (or any other SubJob, series of SubJobs...pretty much anything), you can assign the status of that to the globalMap like so (assuming the value is true/false here).....

 

globalMap.put("tPreJobSuccess", true);

 

If you want to check it later in the job, you can access it like this (again, this is a boolean so the value stored as an object needs to be cast)....

 

((Boolean)globalMap.get("tPreJobSuccess"))
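
Tying the two snippets together, one possible wiring (the flag name is just the one used above; adapt as needed) is to set the flag in a tJava at the end of the tPreJob chain and guard the first main-flow component with a RunIf condition. A null-safe check also covers the case where the tPreJob failed before the flag was ever written:

// In a tJava at the end of the tPreJob chain:
globalMap.put("tPreJobSuccess", true);

// RunIf condition on the trigger into the main flow (null-safe, so a tPreJob
// that died before the tJava ran counts as "not successful"):
Boolean.TRUE.equals(globalMap.get("tPreJobSuccess"))

// And the inverse, e.g. on a link to a tDie or logging component:
!Boolean.TRUE.equals(globalMap.get("tPreJobSuccess"))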

 

Regarding your reasoning for using single commits, that's fair enough. I was just curious.

qiongli
Employee

Hi @dan2,

 

About '15. Many of the Cheat Sheets under help are missing and lead to "Cheat sheet content file * not found."'

Which Cheat Sheets are missing? I tested it on release 7.3.1 with the 'Talend Cloud Real-Time Big Data Platform' license. The cheat sheets are as follows, and there is no error like "Cheat sheet content file * not found":