This blog post will explore three different options for orchestrating Talend Jobs in the Qlik Platform.
Talend Jobs are designed in Studio and ultimately published to Talend Cloud where the Job Artifacts are configured and scheduled as Tasks. One way of orchestrating Tasks is to use Execution Plans. Execution Plans allow you to define a series of sequential steps. Each step can run one or more Tasks in parallel.
All Job execution and configuration is delegated to the Task, so the Remote Engine or Remote Engine Cluster used to run the Job is specified on the Task. The same is true for Task parameters, which are ultimately mapped to Context Variables in the Job.
Since Task parameters are static, Execution Plans are also static. Their parameters can only be specified once, during Task definition. This may be sufficient for some Jobs, for example where a network folder is specified and then scanned for files on a schedule. But many Tasks would benefit from more dynamic parameters. For example, a Job might receive a URL for the location of a file to be processed, and this URL could vary based on the context of the orchestration. In that case Execution Plans do not work.
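To make the distinction concrete, here is a minimal plain-Java sketch (not Talend-generated code) contrasting the two patterns. The folder-scanning case works with a value fixed once in the Task definition, while the per-run URL case has nothing static to bind to; the names inputFolder and fileUrl are purely illustrative.

```java
import java.io.File;

public class StaticVsDynamicParameter {

    // Static pattern: the folder is set once in the Task definition and never changes.
    // In a real Talend Job this would be a context variable mapped to a Task parameter.
    static void scanConfiguredFolder(String inputFolder) {
        File[] files = new File(inputFolder).listFiles();
        if (files == null) {
            System.out.println("Nothing to scan in " + inputFolder);
            return;
        }
        for (File f : files) {
            System.out.println("Would process: " + f.getName());
        }
    }

    // Dynamic pattern: the file location differs on every run, so it cannot be
    // baked into a static Task parameter; it has to arrive at launch time.
    static void processFileAt(String fileUrl) {
        System.out.println("Would download and process: " + fileUrl);
    }

    public static void main(String[] args) {
        scanConfiguredFolder("/data/inbox");                 // fits a static Task parameter
        processFileAt("https://example.com/batch-42.csv");   // needs a per-run value
    }
}
```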
Execution Plans are also static in the sense that there is no conditional or looping logic. So a Task cannot be executed in response to any conditions or the output of prior steps. Nor can it be executed a variable number of times.
Execution Plans are also limited because they have no Error Handlers, which are really just another form of conditional processing. They do offer the ability to rerun a Plan, but this requires manual interaction in the TMC UI. Programmatic control is possible with the TMC API for Plan Executions, but at that point you might as well build the logic into the Job.
Execution Plans do benefit from being able to run multiple tasks in parallel in a single step. But this is also limited. A frequent design strategy is to scale a process by running multiple instances of the same Task in parallel to spread the workload across multiple servers. Assuming that the data can be partitioned, e.g. by folders, files, or some partitioning key(s) in a database table, multiple instances of a Task can be configured to run concurrently. The degree of concurrency can be managed at the Remote Engine level.
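For a database workload, one common way to partition, assuming a numeric key column, is to give each Task instance a worker index and let it select only its own slice of the rows. The sketch below is illustrative only; the table and column names are invented, and in a real Job the two values would arrive as context variables set on each Task.

```java
// Illustrative partitioning query builder: each of the N concurrent Task instances
// receives its own workerIndex (0..N-1) and selects a disjoint slice of the table.
public class PartitionedQuery {

    static String buildQuery(int workerIndex, int totalWorkers) {
        // MOD on a numeric key yields N disjoint, roughly equal partitions.
        return "SELECT * FROM orders WHERE MOD(order_id, " + totalWorkers + ") = " + workerIndex;
    }

    public static void main(String[] args) {
        int totalWorkers = 4; // degree of concurrency, e.g. one Task instance per engine slot
        for (int i = 0; i < totalWorkers; i++) {
            System.out.println("Task instance " + i + " runs: " + buildQuery(i, totalWorkers));
        }
    }
}
```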
The limitation is that although Tasks can be run concurrently on the same Remote Engine, Execution Plans do not allow multiple instances of the same Task in the same Step. Running parallel instances of a Task therefore requires the TMC API.
From an SDLC perspective, Execution Plans make integration testing a bit more difficult since they involve running the Execution Plan in the operations environment (the TMC) rather than the development environment (Studio). Integration test automation can overcome this by using the TMC API to launch the Plan, but debugging can still be cumbersome.
With that said, Execution Plans still offer some benefits. They provide a simple user-friendly interface for an operations team. This supports separation of duties between operations and development for simple static orchestration.
Finally, Execution Plans run inside the TMC control plane in the Qlik Cloud, so any orchestration metadata passes outside of the customer’s network. But since Execution Plans are so simple and so statically constrained, there is no risk of customer data being exposed. After all, orchestration is completely static and configuration is completely delegated to the child Tasks.
The table below summarizes the advantages and limitations of Execution Plans.
| Advantages | Limitations |
| --- | --- |
| • Modular Composition<br>• Simple browser UI in TMC<br>• Separation of Duties | • Static specification of Tasks<br>• Static configuration of Tasks<br>• Cannot execute the same Task concurrently<br>• Limited orchestration semantics<br>• Multiple development contexts (IDE and browser)<br>• More difficult to create integration tests |
The tRunJob component in Talend Studio is used to run other “child” jobs from a “parent” job. This promotes modularity and re-use. Since both the parent and child jobs are designed in Studio, they are also easy to test. Child jobs can be tested individually, and integration tests can be done by running the parent.
The main limitation of using tRunJob is that child jobs run in the same JVM process as the parent, which means there is no ability to scale horizontally. However, multi-threading is possible by using the tParallelize component or by enabling parallel iterators on components such as tFlowToIterate, tWaitForFile, or tLoop.
Since tRunJob is a component used within a Job, it benefits from the conditional and control flow supported by all Talend Jobs.
While flexible, it must be emphasized that child jobs are specified at design time and hence are static. A parent job can use conditional flow to select the appropriate child job for a particular stage of processing, provided that all variations of child jobs at a given step are known at design time. But when the variation is unknown at design time, when extensions may need to be deployed without rebuilding the job, or when there are too many variants, the static nature of tRunJob can be a problem.
Unlike Execution Plans, tRunJob can also specify the parameters for the child job. These can even be derived from the output(s) of previous steps in the parent job. If a parent job simply wants to pass all of its context variables to the child, this can be done as well. While not as disciplined as formally observing the contract with the child, this can be convenient.
Child jobs can also return a dataset using the tBufferOutput component. The outbound datastream is returned by the child job and each record can be processed by the parent job. This allows the output of a child job to be iterated over as a collection by a parent job. In some cases each output row may be an input parameter to other child jobs. This is usually done using tFlowToIterate on the output of the first child job and then using a second tRunJob in the iterator.
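Conceptually the pattern looks like the plain-Java sketch below: the first child returns a collection of rows (the role tBufferOutput plays), and the parent iterates over that collection, running the second child once per row. This is not the code Studio generates; the method names are hypothetical and only meant to show the shape of the tFlowToIterate plus tRunJob wiring.

```java
import java.util.List;

public class IterateChildOutput {

    // First child job: returns a dataset to the parent (the role tBufferOutput plays).
    // Here it is simulated as a list of file paths discovered by the child.
    static List<String> listFilesChildJob() {
        return List.of("/data/inbox/a.csv", "/data/inbox/b.csv", "/data/inbox/c.csv");
    }

    // Second child job: processes a single file, parameterized per row of the first child's output.
    static void processFileChildJob(String filePath) {
        System.out.println("Processing " + filePath);
    }

    public static void main(String[] args) {
        // Parent job: iterate over the first child's returned rows (tFlowToIterate)
        // and run the second child once per row (tRunJob inside the iterate link).
        for (String filePath : listFilesChildJob()) {
            processFileChildJob(filePath);
        }
    }
}
```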
Another advantage of using tRunJob is that re-use of the child job is done at the design stage, so it does not require publication of the child job to the TMC, since the child job is incorporated into any parent jobs that use it. The result is less clutter in the TMC. Of course, the flip side of this is that parent jobs are monolithic and the child job may be redundantly embedded in many parent jobs.
Talend Studio manages dependencies for you, so changes to the child job can be viewed based on the Impact Analysis. Whenever a parent job is run in the Studio it will detect any changes in the child job and rebuild it if necessary. And if CI/CD is used per standard best practice then the Tasks corresponding to the parent job will also be rebuilt when the child job is modified.
Unlike Execution Plans there is no separation of duties between development and operations because all orchestration is done at design time. However, context variables can be used with control flow to externalize selection of child jobs subject to the static design time constraints above.
Finally, orchestration using tRunJob takes place in the job itself, and as such it runs in the data plane. This means that any configuration logic or passing of parameters to child jobs stays within the customer’s network, so there is no risk of data leakage to the Cloud.
The table below summarizes the advantages and limitations of tRunJob.
| Advantages | Limitations |
| --- | --- |
| • Modular<br>• Easy to test<br>• Multi-threading support via tParallelize and iterators<br>• Flexible configuration of Context Parameters across Jobs<br>• Conditional and flow control<br>• No extra Task clutter in TMC<br>• Return datasets<br>• Runs in Data Plane | • Static specification of Child Jobs<br>• Does not scale horizontally<br>• Monolithic deployment<br>• Limited separation of duties |
The TMC API can be used via tRESTClient to trigger Tasks in the TMC on demand. This provides an effective means of orchestration based on well-defined service contracts that promote modularity and re-use. It also provides loose coupling.
This approach scales horizontally across multiple servers using a Remote Engine Cluster. Unlike Execution Plans, the same Task can be run concurrently to scale out partitioned workloads.
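As a rough sketch of what this looks like outside of Studio, the plain-Java program below posts an execution request for the same Task several times, once per partition, and issues the requests concurrently. The endpoint URL, the JSON body shape, and the token and task ID placeholders are assumptions for illustration only; check the Talend Cloud API documentation for the exact contract, and in a real parent job the call would typically be made with tRESTClient or wrapped in a reusable child job.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class TriggerPartitionedTask {

    // Assumed endpoint shape for illustration; adjust region and path per the TMC API docs.
    static final String EXECUTIONS_URL = "https://api.eu.cloud.talend.com/processing/executions";
    static final String TOKEN = System.getenv("TMC_TOKEN"); // personal access token (placeholder)
    static final String TASK_ID = "task-id-goes-here";      // placeholder Task identifier

    static HttpRequest buildRequest(int workerIndex, int totalWorkers) {
        // Assumed body shape: the Task to execute plus per-run parameters that
        // would map onto the Job's context variables.
        String body = String.format(
            "{\"executable\":\"%s\",\"parameters\":{\"workerIndex\":\"%d\",\"totalWorkers\":\"%d\"}}",
            TASK_ID, workerIndex, totalWorkers);
        return HttpRequest.newBuilder(URI.create(EXECUTIONS_URL))
            .header("Authorization", "Bearer " + TOKEN)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        int totalWorkers = 4; // one execution of the same Task per partition of the workload

        List<CompletableFuture<HttpResponse<String>>> futures = new ArrayList<>();
        for (int i = 0; i < totalWorkers; i++) {
            futures.add(client.sendAsync(buildRequest(i, totalWorkers),
                    HttpResponse.BodyHandlers.ofString()));
        }

        // Wait for all launch requests and print the responses; each response would
        // normally contain an execution ID that can be polled for status.
        futures.forEach(f -> {
            HttpResponse<String> resp = f.join();
            System.out.println(resp.statusCode() + " -> " + resp.body());
        });
    }
}
```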
Since the API is being used within a Studio job, all Studio conditional and control flows are available.
Like tRunJob, parameters can be passed to the child services via the API, and the context parameters can be derived dynamically from the output of previous steps.
Unlike tRunJob, because of the loose coupling there is no design time limitation on specifying which child services to call. Which child services are to be run can be specified via configuration by an operator, promoting separation of duties. But they can also be controlled programmatically for truly extensible workflows.
Since specification of orchestrated tasks is loosely coupled and can be dynamically configured, it can also be externalized for clear separation of duties with the operations team.
Since the parent job doing the orchestration is itself running in the data plane, the only data passing outside the customer’s network goes through the well-defined interfaces of the TMC API. Child service parameters can be passed as references such as URLs or other means of indirection, so that no business data is sent to the Cloud.
Testing is facilitated by the modular service orientation. Integration tests are also easier to perform than with Execution Plans because the parent job is itself a job which can be debugged in Studio. However, unlike with tRunJob, the parent and child jobs run in separate processes, so it is not possible to debug both in the same environment.
The table below summarizes the advantages and limitations of the TMC API-based approach.
| Advantages | Limitations |
| --- | --- |
| • Modular<br>• Loose Coupling<br>• Horizontal scaling<br>• Conditional and flow control<br>• Flexible configuration of Context Parameters across Jobs<br>• Dynamically specify Tasks<br>• Extensible<br>• Separation of Duties<br>• Runs in Data Plane | • Somewhat more difficult to create integration tests<br>• More complex TMC API calls |
Comparing the three options, it is clear that the TMC API approach provides better modularity, looser coupling, more dynamic and extensible behavior, and horizontal scalability. The only limitation is that the jobs must make more complex TMC API calls.
We will cover the TMC API in the next post, and in subsequent posts we will provide sample jobs that show how to encapsulate the complexity of the TMC API as re-usable child jobs.