Splitting a dataset into multiple datasets based on column values

ecurren-lmi — Sat, 16 Nov 2024 01:52:08 GMT

I’ve been racking my brain for a few days now as to how to accomplish this task, and I really need some help.

I have a dataset with a few thousand rows. The schema is this:

SheetColumn

SubjectId

RowId

Value

WorksheetName

WorksheetIndex

WorksheetRow

SortRowId

I sort the data like so:

First by SortRowId. Within that sorted list sort by WorksheetIndex. Within that sorted list sort by WorksheetRow. Finally, within that list sort by SheetColumn.

Now I need to be able to break the data set into multiple data sets, first by splitting up the data sets by SortRowId, then split the SortRowId data sets into a set of new data sets by WorksheetIndex, then by Worksheet Row.

Within each data set I then need to identify all of the RowId values that occur more than once.

For each of the sets of RowIds that occur more than once I need to append an incremental integer to the end of it.

So, for example, if I have the following data – (I have left out those columns that are not part of the calculations in order to fit the data better):

An example:

Let’s say that I have now removed all of the rows that do not contain more than one instance of the RowId within the set of rows within the same WorksheetIndex and we have the data set shown below.

SheetColumn RowId WorksheetIndex WorksheetRow SortRowId

A 40 1 9 123456789

E 40 1 9 123456789

R 40 1 9 123456789

R 94 1 14 987654321

AC 94 1 14 987654321

BE 94 1 14 987654321

BY 94 1 14 987654321

I need to append values to the RowId as below.

SheetColumn RowId WorksheetIndex WorksheetRow SortRowId

D 40_1 1 9 123456789

U 40_2 1 9 123456789

AR 40_3 1 9 123456789

G 94_1 1 14 987654321

N 94_2 1 14 987654321

BE 94_3 1 14 987654321

BY 94_4 1 14 987654321

Again, please note that there will be 10s to 100s of rows between each instance of a repeating RowId value.

Any and all help is appreciated.

Re: Splitting a dataset into multiple datasets based on column values

Anonymous — Sat, 18 Jul 2020 01:12:38 GMT

Call built-in function Numeric.sequence(row4.RowId,1,1) to generate a sequence id for each RowId on tMap, see

Please try and let me know if you have any questions.

Regards

Shong

Re: Splitting a dataset into multiple datasets based on column values

ecurren-lmi — Mon, 20 Jul 2020 19:08:27 GMT

Fantastic!! Thanks Shong.

topic Splitting a dataset into multiple datasets based on column values in Talend Studio

Splitting a dataset into multiple datasets based on column values

Re: Splitting a dataset into multiple datasets based on column values

Re: Splitting a dataset into multiple datasets based on column values