_AnonymousUser
Specialist III

how to create a sequence code for a group of records with the same id

Hello,
Here is the task I am trying to implement in Talend.
I have a fixed-width text file with consumer names, addresses, and some other fields. Every unique household (defined as the combination of the address and last-name fields) has a household identifier (HHID), so every record on the file has an HHID, and all records with the same address and last name share the same HHID.
What I need to do is assign a sequence code starting at 1 to the first record in each group with the same HHID, then assign 2 to the second record in the group, and so on.
My understanding is that I should sort the data file by HHID first and then do a group-by in an aggregation step, but I am puzzled about how to get back down to the record level and generate the sequence.
34 Replies
Anonymous
Not applicable

Hello,
You can do this task by creating a variable inside tMap and assigning it the following value:
var.my_seq=Numeric.sequence(row1.adress+row1.lastname,1,1) will generate one sequence for each distinct value of (row1.adress+row1.lastname).
See the screenshot below.
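For reference, Numeric.sequence keeps one named counter per distinct first argument. A minimal sketch of those semantics (not Talend's actual routine, just an illustration of the behaviour) might look like this:

import java.util.HashMap;
import java.util.Map;

public class SequenceDemo {
    // One counter per sequence name, kept for the lifetime of the job.
    private static final Map<String, Integer> counters = new HashMap<>();

    // Mirrors the semantics of Numeric.sequence(name, startValue, step).
    public static int sequence(String name, int startValue, int step) {
        Integer current = counters.get(name);
        int next = (current == null) ? startValue : current + step;
        counters.put(name, next);
        return next;
    }

    public static void main(String[] args) {
        System.out.println(sequence("HH001", 1, 1)); // 1
        System.out.println(sequence("HH001", 1, 1)); // 2
        System.out.println(sequence("HH002", 1, 1)); // 1 - a new counter per name
    }
}

Because one counter is retained per distinct name, memory use grows with the number of unique keys, which becomes relevant further down this thread.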
Anonymous
Not applicable

Use the built-in sequence function. The first argument is the "name" of the sequence - use your HHID as the name and you will get a separate sequence for each group of HHIDs.
In a tMap:
Perl:
sequence($row,1,1)
Java:
sequence(row1.HHID,1,1)
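For illustration, on input sorted by HHID this yields one independent counter per household (hypothetical values):

HHID   seq
H001   1
H001   2
H002   1
H002   2
H003   1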
_AnonymousUser
Specialist III
Author

Thank you for the prompt response, guys! I tried that and it works on a small file. When I ran it on an 8-million-record file, my process failed after a few minutes with an out-of-memory error.
Not a very nice thing if you want to use it for production jobs... I did notice that Talend was consuming more and more memory, then allocated all available memory in the OS, including the page file, and then crashed. Any ideas how to overcome that?
Exception in thread "main" java.lang.Error: java.lang.OutOfMemoryError: Java heap space
disconnected
at demo5min.boristest_0_1.boristest.tFileInputPositional_1Process(boristest.java:1913)
at demo5min.boristest_0_1.boristest.runJobInTOS(boristest.java:2090)
at demo5min.boristest_0_1.boristest.main(boristest.java:1962)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Unknown Source)
at java.lang.String.<init>(Unknown Source)
at java.io.BufferedReader.readLine(Unknown Source)
at java.io.BufferedReader.readLine(Unknown Source)
at org.talend.fileprocess.delimited.RowParser.readRecord(RowParser.java:156)
at demo5min.boristest_0_1.boristest.tFileInputPositional_1Process(boristest.java:1267)
Anonymous
Not applicable

Post a full screenshot of your job. We'll help you to identify where the problem is coming from.
_AnonymousUser
Specialist III
Author

Here you go... I appreciate your help, guys.
alevy
Specialist

The suggested approach creates a new variable for each HHID, so you will run out of memory if there are a large number of unique HHIDs in your data.
A better way might be to replace your tMap with a tFilterColumns to simplify your flow to just (hhid, city, state) before you sort it, and then use a tJavaRow to create the sequence along these lines:
// Compare in this direction to avoid a NullPointerException on the
// first row, where "PreviousHHID" has not been set yet.
if (input_row.hhid.equals(globalMap.get("PreviousHHID"))) {
    // Same household as the previous row: continue the sequence.
    globalMap.put("SeqNum", (Integer)globalMap.get("SeqNum") + 1);
} else {
    // New household: restart the sequence at 1.
    globalMap.put("SeqNum", 1);
}
output_row.hhid = input_row.hhid;
output_row.seqnum = (Integer)globalMap.get("SeqNum");
output_row.city = input_row.city;
output_row.state = input_row.state;
// Remember the current HHID for the next row's comparison.
globalMap.put("PreviousHHID", input_row.hhid);
Anonymous
Not applicable

I don't know how many rows you're processing, but try the "sort on disk" option of the tSort component.
Anonymous
Not applicable

(quoting alevy's tJavaRow suggestion above)


FYI, a component in the newer versions, tMemorizeRows, was created to encapsulate and simplify this behaviour (i.e. increment on field change).
_AnonymousUser
Specialist III
Author

I don't know how many rows you're processing, but try the "sort on disk" option of the tSort component.

My sample file is a bit over 4 million records, I have 2 GB of RAM on my desktop, and the process crashes consistently when it reaches somewhere around 400,000 records.
I just tried the sort-on-disk option and observed interesting results. I moved the sort component before the tMap, and I also reduced the number of columns passed, using tFilterColumns as suggested by emaxt6.
After I did that, I set the tSort buffer to 600,000 bytes and the process crashed again with an out-of-memory error. Then I set it to 1,000,000 bytes and this time it crashed with another error ("compete for multiple threads failed"). Then I set it to 300,000 bytes; it ran beyond my previous crash point, but it was so terribly slow that I had to kill the process.
Based on what I see, I cannot get past the sorting, so I cannot try the other things suggested here by alevy and emaxt6.
To be honest, I do not like the approach suggested with tJavaRow, because it looks to me like hardcoding, which we try to avoid by any means in our company due to the huge number of issues hardcoded processes have caused in the past.
I also failed to find any documentation on tMemorizeRows - it is simply missing from the help file.
So I guess my question is how to sort the data first - my file is over 4 million rows and the job crashes every time it reaches a bit over 400,000 records.
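One general answer to the sorting question, when the file does not fit in the JVM heap, is an external merge sort: sort manageable chunks in memory, spill each to a temporary file, then merge the sorted chunks in a single streaming pass. A minimal sketch follows (assuming one record per line and sorting by the whole line; file names and chunk size are illustrative, not from this thread):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSort {
    // Lines per in-memory chunk; tune to the available heap.
    static final int CHUNK_SIZE = 100_000;

    public static void main(String[] args) throws IOException {
        List<Path> chunks = splitAndSort(Paths.get("input.txt"));
        merge(chunks, Paths.get("sorted.txt"));
    }

    // Phase 1: read fixed-size chunks, sort each in memory, spill to temp files.
    static List<Path> splitAndSort(Path input) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == CHUNK_SIZE) {
                    chunkFiles.add(writeSortedChunk(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) chunkFiles.add(writeSortedChunk(chunk));
        }
        return chunkFiles;
    }

    static Path writeSortedChunk(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path tmp = Files.createTempFile("sortchunk", ".txt");
        Files.write(tmp, chunk);
        return tmp;
    }

    // Phase 2: k-way merge of the sorted chunks via a min-heap of cursors.
    static void merge(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.line));
        for (Path p : chunks) {
            Cursor c = new Cursor(Files.newBufferedReader(p));
            if (c.advance()) heap.add(c);
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();          // cursor holding the smallest line
                out.write(c.line);
                out.newLine();
                if (c.advance()) heap.add(c);    // refill from the same chunk
                else c.reader.close();
            }
        }
    }

    // Wraps a chunk reader together with its current (smallest unconsumed) line.
    static class Cursor {
        final BufferedReader reader;
        String line;
        Cursor(BufferedReader reader) { this.reader = reader; }
        boolean advance() throws IOException {
            line = reader.readLine();
            return line != null;
        }
    }
}

For the job at hand, though, the simpler first step may just be to raise the JVM heap (e.g. an -Xmx argument, if your Talend version exposes JVM arguments in the Run settings), since a 2 GB desktop leaves the default heap quite small.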