stephbzr
Contributor III

Complex SQL query in tDBInput

Hello,

 

I have input data coming from a tMap. I need to use some columns (in bold in the query below) to run a SQL query from a tDBInput "Lookup". I have several thousand rows of input, so I don't want to use the "Reload at each row" option, as it takes too long. I just need a method to correctly inject my tMap data into the query and get the result.

 

The query :

 

select distinct a.ID_FAM,

into :idFam%, :idAy%, :idOrg1%, :IdProd$

from AY a, FPROD fp, PROD P, ORG O, DCSE D

where a.AY_SS = 'TALEND_COL1'

and a.ID_FAM = fp.ID_FAM

and fp.ID_ORG = a.ID_ORG

and fp.ID_ORG = P.ID_ORG

and fp.ID_PROD = P.ID_PROD

and fp.ID_ORG = O.ID_ORG

and D.DCSE_NOM = 'NOEMIE'

and ((('NOEMIE' <> 'NOEMIE' and D.DCSE_PP = 0) and fp.ID_PROD = 'SANTE') or (('NOEMIE' = 'NOEMIE' or D.DCSE_PP = 1) and (fp.ID_PROD = O.ORG_PROD or P.PROD_PRESTATION_SANTE = 1)))

and fp.ID_FAM = a.ID_FAM

and ((a.AY_DATNAIS = decode('TALEND_COL2', 1, to_date('TALEND_COL2', 'dd/mm/yyyy'), a.AY_DATNAIS)) or (a.AY_DATNAIS2 = 'TALEND_COL2'))

and a.AY_RNG = TALEND_COL3

and exists (select 'a'

from ADC ad

      where ad.ID_AY = a.ID_AY

      and ad.ID_PROD = fp.ID_PROD

      and nvl(ad.id_fprod, fp.id_fprod) = fp.id_fprod

      and to_date('TALEND_COL4', 'DD/MM/RR') between ad.ADC_DATDEB and nvl( ad.ADC_DATRAD, sysdate+1))

      and to_date('TALEND_COL4', 'DD/MM/RR') between fp.FP_DATADH and nvl( fp.FP_DATRAD, sysdate+1);

 

A small illustration of what I have started to do : 

 

[attached screenshots: 0695b00000aEerRAAS.png, 0695b00000aEenAAAS.png, 0695b00000aEenjAAC.png]

I moved the conditions requiring input values after the select, except for TALEND_COL4, where I wrote the value in myself; obviously that's why I need your help.

16 Replies
jlolling
Creator III

What's wrong with the "Reload at each row" option? You can use globalMap values in the where clause so that the lookup select quickly returns exactly the expected data.

And BTW, thousands of rows are not a problem for the tMap if you limit the columns to the ones you really need. I have used lookups with 20 million rows with "Load once"!
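A minimal sketch in plain Java of how such a globalMap-driven lookup query can be assembled (the key names and the trimmed-down select are illustrative, not the actual job code):

```java
import java.util.HashMap;
import java.util.Map;

public class LookupQueryBuilder {

    // Builds the lookup select from values stored in globalMap.
    // In a real job, a tFlowToIterate (or tJavaRow) would put the current
    // row's values into globalMap; the keys below are made up for the example.
    static String buildQuery(Map<String, Object> globalMap) {
        return "select distinct a.ID_FAM from AY a"
                + " where a.AY_SS = '" + globalMap.get("row1.col1") + "'"
                + " and a.AY_RNG = " + globalMap.get("row1.col3");
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("row1.col1", "1234567890123");
        globalMap.put("row1.col3", 1);
        System.out.println(buildQuery(globalMap));
    }
}
```

The point is that the where clause ends up comparing plain column values, so an index on AY_SS / AY_RNG can be used by the database.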

stephbzr
Contributor III
Author

Hello @Jan Lolling​ , 

 

Thank you very much for your feedback. I would like to benefit from it and get the same speed results as you. It's true that I was missing out on an interesting option, but when I tested it my processing time got much longer, which worried me. I don't think I'm using it properly, so I'm happy to show you how I do it.

 

[attached screenshots: 0695b00000aEf9pAAC.png, 0695b00000aEf91AAC.png]

For information, the input flow "output_data2" represents about 4,000 rows for the file I am processing. Here is an overview of my input columns:

 

[attached screenshot: 0695b00000aEfCjAAK.png]

It takes 1 minute in total when I don't use globalMap, as in the screenshots from my first comment. To improve the query's performance a bit, I added the distinct after the select and enabled the advanced settings option. After adding globalMap, the job has now been running for 10 minutes and still hasn't finished...

jlolling
Creator III

One issue is also that the tDBInput component for the lookup does not use an external connection. Without one, the component establishes a new connection for EACH row, and this slows things down a lot.

Make sure you put a tDBConnection before the entire subjob containing the mentioned tMap.

The next thing is to make sure the where condition is supported by an index. Your query currently cannot use an index, because the where condition wraps the given values in functions, and this prevents index use. I would recommend using a helper table to simplify the select.

E.g. you could collect all the repo_datro values that will occur in your source and build a temporary table containing all the related records; then you have a very fast table for the lookup.
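To illustrate the helper-table idea, here is a hedged Java sketch that collects the distinct date values from the source and generates the INSERT statements for a temporary lookup table (the table and column names are made up for the example; a real job would execute these through its DB connection):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class HelperTableLoader {

    // Collects the distinct date strings seen in the source file and builds
    // one INSERT per distinct value for a small helper table.
    static List<String> buildInserts(List<String> inputDates) {
        Set<String> distinct = new LinkedHashSet<>(inputDates);
        return distinct.stream()
                .map(d -> "insert into TMP_LOOKUP_DATES (DAT) values (to_date('"
                        + d + "', 'DD/MM/RR'))")
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> src = List.of("18/05/1990", "01/01/2000", "18/05/1990");
        buildInserts(src).forEach(System.out::println);
    }
}
```

With the dates materialized in a table, the lookup can join against it instead of applying to_date() to a literal per row.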

stephbzr
Contributor III
Author

Thank you for this information, but I admit I have difficulty understanding it, or at least picturing it. If you have a demonstration or an illustration, that would be great. In any case, I will think about it.

Anonymous
Not applicable

I'm curious, where is this data coming from? Is it all from the same database? If so, why not carry out the join in a single query? A database is built for effective querying of data. If it is all in the same DB, there is no point bringing it out to carry out a join.

 

If this data is not all in the same DB, then if you do not want to fire a query for every one of your main rows, you can just load all of the data from your sources and join in a tMap. Now, there may be some memory issues doing it this way, but it is worth giving it a try.
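As a sketch of that second approach, here is minimal Java that loads a lookup flow once into a HashMap and probes it per main row, which is essentially what a tMap "Load once" lookup does (the row layout is illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMemoryJoin {

    // Loads the whole lookup flow once into a map keyed on the join column.
    // r[0] = join key, r[1] = payload column (both illustrative).
    static Map<String, String> loadLookup(List<String[]> lookupRows) {
        Map<String, String> byKey = new HashMap<>();
        for (String[] r : lookupRows) {
            byKey.put(r[0], r[1]);
        }
        return byKey;
    }

    public static void main(String[] args) {
        Map<String, String> lookup = loadLookup(List.of(
                new String[] {"FAM1", "ORG_A"},
                new String[] {"FAM2", "ORG_B"}));
        // Per main row: a single O(1) probe instead of a DB round trip.
        System.out.println(lookup.getOrDefault("FAM1", "NOT_FOUND")); // ORG_A
    }
}
```

Memory is the trade-off: the whole lookup must fit in the JVM heap, which is why limiting the lookup to the needed columns matters.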

 

I am also confused by your SQL. You have a couple of conditions that do not make any sense. For example,.....

 

('NOEMIE' <> 'NOEMIE' and D.DCSE_PP = 0)

 

.... will always be false. 'NOEMIE' is always equal to 'NOEMIE'.

 

Likewise, this will be true all the time....

 

('NOEMIE' = 'NOEMIE' or D.DCSE_PP = 1)

 

....since 'NOEMIE' is always equal to 'NOEMIE' and since you are using an "OR" it doesn't matter about the D.DCSE_PP = 1.

 

These extra conditions that achieve nothing will not help your performance.

jlolling
Creator III

Yes, there are some aliases missing. Actually you are right; why the heck are there output records at all?

I would actually expect an SQL error because of the ambiguous columns!

jlolling
Creator III

The "Reload at each row" option only makes sense if it is the way to get exactly the records related to the incoming records of the main flow, or if you want multiple output records based on one input record.

You have to explain your use case in a bit more depth.

I guess the explanation about the tDBConnection is clear, or not?

I mean: add a tDBConnection to the job and run it before you trigger the subjob containing the mentioned tDBInput. Then reference this tDBConnection from your lookup tDBInput. This will prevent the job from establishing a new connection for every record.

The other thing, with the temp table, can only be made more concrete once we understand your use case.

stephbzr
Contributor III
Author

Hello @Richard Hall​ and @Jan Lolling​,

 

The data is in a file that I need to process and integrate into a single database. So in Talend the flow is tJavaFlex --> tMap --> tDBOutput. The columns are extracted from the file with substring() in the tJavaFlex. To come back to the complex query, the objective is to execute it using that data (the columns). If it returns a result (checking that a member exists in the database using their identifier, date of birth, etc., injected into the query), then I set the value of another column.

 

So then what I'll do is : 

 

[attached screenshots: 0695b00000aEgvGAAS.png, 0695b00000aEguwAAC.png]

So for example, with this data as input to the tJavaFlex, the query should return no rows (tested in the database) and return 1 in the other column.

 

[attached screenshots: 0695b00000aEgwYAAS.png, 0695b00000aEgwTAAS.png]

As described above, to inject the first columns into Talend, I removed those conditions from the query and placed the relevant fields in the SELECT (and defined them in the schema). I then assigned the data from my tJavaFlex to each of the schema columns (in the tMap).

But as you will have understood, I don't know how to do the same with:

 

and to_date('18/05/1990', 'DD/MM/RR') between ad.ADC_DATDEB and nvl(ad.ADC_DATRAD, sysdate+1))

and to_date('18/05/1990', 'DD/MM/RR') between fp.FP_DATADH and nvl( fp.FP_DATRAD, sysdate+1)

 

This is what is currently blocking me. It's the first time I've encountered a query like this. Usually, when the condition is of the form "col1 = value", I do as with the first columns of this query. But this one is different, because it is "col1 between x and y". I can't just move col1 into the SELECT, because then what do I do with the "between ... and"?
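One option (a sketch, not the job's actual code) is to have the lookup return the ADC_DATDEB / ADC_DATRAD bounds as columns and evaluate the BETWEEN on the Java side, e.g. in a tMap expression. The equivalent logic in plain Java, assuming the date string is dd/MM/yyyy:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class BetweenCheck {

    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    // Java-side equivalent of
    //   to_date(col4, ...) between DATDEB and nvl(DATRAD, sysdate + 1)
    // radDate may be null, like a nullable ADC_DATRAD column.
    static boolean between(String col4, LocalDate debDate, LocalDate radDate) {
        LocalDate d = LocalDate.parse(col4, FMT);
        LocalDate end = (radDate != null) ? radDate : LocalDate.now().plusDays(1);
        return !d.isBefore(debDate) && !d.isAfter(end);
    }

    public static void main(String[] args) {
        System.out.println(between("18/05/1990",
                LocalDate.of(1990, 1, 1), null)); // true: inside open-ended range
    }
}
```

This keeps the lookup query free of row-dependent literals; the row-specific filtering happens after the lookup, in the tMap.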

 

[attached screenshots: 0695b00000aEgxlAAC.png, 0695b00000aEgxqAAC.png]

Thank you for the remark about the confusing query. I should have kept the original, because 'NOEMIE' is actually a value from a column in my file. I hard-coded 'NOEMIE' because I am 100% sure all my rows have this value, which removes one difficulty of injecting a column from my tMap into the query. But that is not the case for TALEND_COL4, where the dates vary and are not fixed.

Anonymous
Not applicable

OK, let's go through this in stages. Can you test this query (it may require a few tweaks) and replace your query with it? Your query had some unnecessary sections and wasn't efficient. You even had a cartesian join in there: the DCSE table was not joined to anything, and no data was being returned from it.

 

SELECT distinct a.ID_FAM, a.AY_SS, a.AY_DATNAIS, a.AY_DATNAIS2, a.AY_RNG

FROM AY a INNER JOIN FPROD fp ON a.ID_FAM = fp.ID_FAM AND

a.ID_ORG = fp.ID_ORG

INNER JOIN PROD P ON fp.ID_ORG = P.ID_ORG AND

fp.ID_PROD = P.ID_PROD

INNER JOIN ORG O ON fp.ID_ORG = O.ID_ORG

INNER JOIN ADC ad ON a.ID_AY = ad.ID_AY

 

WHERE a.AY_SS = 'TALEND_COL1' AND

((fp.ID_PROD = 'SANTE') OR (fp.ID_PROD = O.ORG_PROD OR P.PROD_PRESTATION_SANTE = 1)) AND

((a.AY_DATNAIS = decode('TALEND_COL2', 1, to_date('TALEND_COL2', 'dd/mm/yyyy'), a.AY_DATNAIS)) OR (a.AY_DATNAIS2 = 'TALEND_COL2')) AND

a.AY_RNG = TALEND_COL3 AND

ad.ID_PROD = fp.ID_PROD AND

nvl(ad.id_fprod, fp.id_fprod) = fp.id_fprod AND

to_date('TALEND_COL4', 'DD/MM/RR') between ad.ADC_DATDEB and nvl( ad.ADC_DATRAD, sysdate+1) AND

to_date('TALEND_COL4', 'DD/MM/RR') between fp.FP_DATADH and nvl( fp.FP_DATRAD, sysdate+1);

 

This should be all you need. Obviously, I haven't been able to test it. So you may need to tweak it, but not by much. This is based on the SQL you showed in your first post. There are some Talend variables in here (I've kept everything named the same), so you will have to modify this with values to test it. This query should be significantly quicker.
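For binding the remaining TALEND_COLx placeholders to row values, a simple token replacement could look like this (a sketch: in the real job the values would come from globalMap, and string values should be quote-escaped before substitution):

```java
public class QueryParams {

    // Replaces the TALEND_COLx placeholders in the lookup SQL with the values
    // of the current row. The placeholder names follow the thread's convention.
    static String bind(String sql, String col1, String col2, int col3, String col4) {
        return sql.replace("TALEND_COL1", col1)
                  .replace("TALEND_COL2", col2)
                  .replace("TALEND_COL3", String.valueOf(col3))
                  .replace("TALEND_COL4", col4);
    }

    public static void main(String[] args) {
        String sql = "a.AY_SS = 'TALEND_COL1' AND a.AY_RNG = TALEND_COL3";
        System.out.println(bind(sql, "1234567890123", "01/01/1990", 1, "18/05/1990"));
    }
}
```

Note the difference between quoted placeholders (string columns like TALEND_COL1) and unquoted ones (numeric columns like TALEND_COL3); the substituted value must match the surrounding quoting in the SQL.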