Hi,
I'm looking to create a custom component that takes a primary FLOW input and optionally performs row-by-row joins between that primary input and batches of, or ALL rows from, a second input. The logic is essentially a SQL join, as in the diagram below. This was possible in the Javajet version of custom components by accessing a connection's entire result list and iterating through it, entirely or in part, to populate a lookup map. Is there a way to accomplish this with the current Java annotations?
Hello @lli,
It depends on a few criteria:
1. If the reference table is always a SQL table, the simplest approach is to add the connection metadata to your component configuration (if needed), select the data you need in either a @PostConstruct or a @BeforeGroup method (depending on whether you assume it can change at some point), and manage the join yourself in your component, which would then have a single input.
2. Talend Component Kit lets the environment (Studio or a Big Data engine) manage the "joins", typically a GroupByKey in your case, so another component must prepare the data to match your schema before it is processed by a Kit component.
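For the first option, the join itself is just a hash lookup. A minimal plain-Java sketch of that logic, with the Kit annotations and the actual connection handling omitted (class and method names here are hypothetical): the reference table is read once into a map, then each incoming flow row is enriched against it, like a SQL left join.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: in a real Kit component the loadReference step would
// run in a @PostConstruct or @BeforeGroup method using the configured
// connection, and join would run per element in the @ElementListener.
public class LookupJoin {

    private final Map<String, Map<String, Object>> referenceById = new HashMap<>();

    // Load the reference table once, keyed by its join column.
    public void loadReference(List<Map<String, Object>> referenceRows) {
        for (Map<String, Object> row : referenceRows) {
            referenceById.put((String) row.get("id"), row);
        }
    }

    // Per-row join: merge the matching reference row into the flow row,
    // or pass the flow row through unchanged when there is no match.
    public Map<String, Object> join(Map<String, Object> flowRow) {
        Map<String, Object> out = new HashMap<>(flowRow);
        Optional.ofNullable(referenceById.get((String) flowRow.get("id")))
                .ifPresent(out::putAll);
        return out;
    }
}
```

Note that this is exactly the pattern that breaks down on a distributed engine: the map lives in one worker's memory, which is why option 2 pushes the grouping to the runtime instead.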
Hi Romain, thanks for your response!
It is certainly not the former case - the component is agnostic to the source of the data and therefore can't rely on creating its own connection and pulling behind the scenes.
Could you elaborate on the GroupByKey and schema matching that you mention in the second point? The schema for the lookup input is well defined and it can be assumed that data flowing into that input is of a known format, the issue I'm having is being able to access the data in bulk per row of the flow input, instead of only simultaneously seeing one flow row and one lookup row.
Component is being designed with Studio in mind if that makes any difference.
Thanks!
Alec
The issue comes from the fact that components are usable in a big data environment. "Big data" is a trendy way to say "distributed processing", but the storage of the big data engine is generally local to each worker. This means the reference data would be loaded on one worker but not on the others, and you would end up with inconsistencies.
In the current Studio integration only a 1-1 flow is supported, so if you need more you have to use other existing components to prepare the input to match your need. It can be as simple as having a component do the extraction of the reference data so that your component receives a payload looking like:
{ record: { id: ...., ..... }, reference: { id1: {...}, id2: {...}, .... } }
The GroupByKey primitive is generally implemented natively and not done in components directly.
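Assuming a prepared payload of the shape above, the per-record work inside the Kit component reduces to a local lookup into the embedded reference map. A plain-Java sketch (names are illustrative, and plain maps stand in for the Kit's record types):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical processor body: each incoming element already carries both the
// flow record and the reference rows it may need, so the "join" needs no
// shared state across workers.
public class PreparedJoin {

    @SuppressWarnings("unchecked")
    public static Map<String, Object> process(Map<String, Object> payload) {
        Map<String, Object> record = (Map<String, Object>) payload.get("record");
        Map<String, Object> reference = (Map<String, Object>) payload.get("reference");
        // Start from the flow record, then merge the matching reference entry.
        Map<String, Object> out = new HashMap<>(record);
        Object match = reference.get(String.valueOf(record.get("id")));
        if (match instanceof Map) {
            out.putAll((Map<String, Object>) match);
        }
        return out;
    }
}
```

Because every element is self-contained, this version behaves the same on a single Studio JVM and on a distributed engine.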
So, in other words, managing this in a single component will no longer be supported for custom components starting with Studio version 7? That isn't meant as a criticism, just something I need to be aware of while migrating my Javajet components to the new framework.
Yes, exactly.