Subquery - using where in (select from...

Anonymous — Wed, 13 Apr 2016 08:33:26 GMT

Hi,
Need help in creating the job in the most efficient manner.
---------------------------------------------------------------------------
I have a main table containing 440,000,000 rows with schema similar to this -
Customer ID, Trip ID, Amount, Items Purchased ...
I want to filter rows where Count of trips for a customer id is more than 8.
One way is to use the query - Select * from Main_Table where Customer_ID in (select Customer_ID from Main_Table group by Customer_ID having count(Trip_ID)>=8)
Other way is to create an Aggregated Table with Count_Of_Trips as a column for a given Customer_ID - which simplifies the query -
Select * from Main_Table where Customer_ID in (Select Customer_ID from Aggregated_Table where Count_Of_Trips >=8)
---------------------------------------------------------------------------
Now I take aggregated data in to a tHashOutput component and use it for lookup on the Main Table using tMap Component. This seems time consuming.

What would be the most efficient, minimal time consuming job designs or steps for the above subquery problem?

Re: Subquery - using where in (select from...

Anonymous — Tue, 26 Apr 2016 11:36:01 GMT

Hello
I would suggest you not to cache the data in memory with tHashOutput for a large of data set, this way consumes much memory and reduce the performance. Filter the rows directly in the query will be efficient.

Regards
Shong

topic Re: Subquery - using where <column_name> in (select <column_name> from... in Talend Studio

Subquery - using where in (select from...

Re: Subquery - using where in (select from...