Spark partition before join

Hi, quick question about Spark in Talend.

In order to join data, Spark needs the rows to be joined (i.e., all rows that share a given key) to live on the same partition.
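To make concrete what I mean by co-location, here is a minimal sketch in plain Python (not Spark or Talend code). The function names and sample data are made up for illustration. It hash-partitions both flows by key and then joins partition by partition, which is roughly what I understand a shuffle join to do:

```python
from collections import defaultdict


def partition_by_key(rows, num_partitions):
    """Assign each (key, value) row to a partition by hashing its key."""
    partitions = defaultdict(list)
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def partitioned_join(main_rows, lookup_rows, num_partitions=4):
    """Inner join that only ever compares rows within the same partition."""
    main_parts = partition_by_key(main_rows, num_partitions)
    lookup_parts = partition_by_key(lookup_rows, num_partitions)
    joined = []
    for pid, main_part in main_parts.items():
        # Index the lookup rows that landed in this same partition.
        lookup_index = defaultdict(list)
        for key, value in lookup_parts.get(pid, []):
            lookup_index[key].append(value)
        # Matching keys are guaranteed to be here, since both sides
        # were partitioned with the same hash function.
        for key, value in main_part:
            for lookup_value in lookup_index.get(key, []):
                joined.append((key, value, lookup_value))
    return joined


# Hypothetical sample data: (key, payload) pairs.
main_flow = [(1, "order-a"), (2, "order-b"), (1, "order-c"), (3, "order-d")]
lookup_flow = [(1, "alice"), (2, "bob"), (4, "dave")]

result = sorted(partitioned_join(main_flow, lookup_flow))
print(result)
```

Because both sides use the same hash function and partition count, no partition ever needs rows from another one. The cost in a real cluster is the data movement required to get the rows there in the first place.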

If we are using key-based components such as tMap or a join in tsql, is it wise to use them without explicit partitioning for small files, and rely on Spark to repartition the lookup flow based on the main flow?

Is there a guideline on when we must partition explicitly versus when we can rely on the Spark framework to repartition? I am especially interested in the case where the lookup data is too big to broadcast but not very heavy either, e.g. > 1 GB.

0 Replies

Big Data

v7.x