3 Replies Latest reply: Sep 11, 2013 11:24 AM by Hugo Sheng RSS


    Jamie Hicks

      I have a data flow that is processing 4 years of data. Each year contains approx 12 million rows - so approx 48 million rows across all 4 years. Currently, I have 4 SQL Query inputs (one for each year). Each Year is partitioned 24 times. The partition key is an integer with values 1-24. The integer is created based on the day & month of the date on each record.




        ELSE (EXTRACT(MONTH FROM Date)+12) END AS PartitionField


      Down stream from these inputs, I will need to join to other inputs. My question is: If I use the same partitioning methodology across all inputs, can i be guaranteed that rows from January 1-15 will always be in partition 1? Do I need to sort the query results by date to ensure that my records match throughout the dataflow, or does partition assignment happen automatically based on the value of the integer?


      Thank you,


      Jamie Hicks

        • Re: Partitioning
          Hugo Sheng

          Your best bet is to use a key based partition on the date field and that guarantees that all values with the same date will be in the same partition.  So all records with January 5 will be in the same partition, but all January 6 records could wind up in a separate partition.  However, this guarantees that your join will work correctly as the common values will be co-located across partitions and inputs.


          Another way to view this is to perform a key partition based on the join key(s).

            • Re: Partitioning
              Jamie Hicks

              I'm pretty sure my partitioning methodology is ok. I just need to know if there is a way to guarantee that, in your example, January 5th will always be in partition 5 & January 6th will always be in partition 6. If I always use the same partition field, regardless of input operator, will the same value always be sent to the same partition?