_AnonymousUser
Specialist III

how to create a sequence code for a group of records with the same id

Hello,
Here is the task I am trying to implement in Talend.
I have a fixed-width text file with consumer names, addresses and some other fields. Every unique household (defined as a combination of the address and last name fields) has a household identifier (HHID), so every record in the file has an HHID, and all records with the same address and last name share the same HHID.
What I need to do is assign a sequence code starting at 1 for the first record in each group with the same HHID, then 2 for the second record in the group, and so on.
My understanding is that I should first sort the data file by HHID and then use a calculate step with a group by, but I am puzzled about how to get back down to the record level and generate the sequence.
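
To make the requirement concrete, here is a minimal plain-Java sketch of the numbering itself, assuming the records have already been sorted by HHID. The class and method names (`HouseholdSequence`, `sequenceSorted`) are made up for illustration and are not Talend components; the idea is just that the counter restarts at 1 whenever the HHID changes, so no group-by is needed once the data is sorted.

```java
import java.util.ArrayList;
import java.util.List;

class HouseholdSequence {
    /**
     * Returns one sequence number per input HHID.
     * Assumes the input is already sorted (or at least grouped) by HHID.
     */
    static List<Integer> sequenceSorted(List<String> sortedHhids) {
        List<Integer> result = new ArrayList<>();
        String previous = null;
        int counter = 0;
        for (String hhid : sortedHhids) {
            // Restart at 1 whenever the HHID differs from the previous record.
            counter = hhid.equals(previous) ? counter + 1 : 1;
            result.add(counter);
            previous = hhid;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hhids = List.of("A", "A", "B", "C", "C", "C");
        System.out.println(sequenceSorted(hhids)); // prints [1, 2, 1, 1, 2, 3]
    }
}
```

Only the previous HHID and one counter are kept in memory, so the cost of this step is independent of file size; the expensive part is the sort that comes before it.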
34 Replies
Anonymous
Not applicable

@boris
You should really post test data and the job definition if you want the problem solved, so that it can be reviewed more precisely; it could be a subtle bug somewhere in the stack or a memory leak.
See this benchmark, which sorts a 3.3 billion row, 415 GB dataset:
http://blogs.sun.com/aja/entry/talend_s_new_data_processing
@talend
Please correct the documentation regarding the buffer size of tsort; it is misleading.
Thanks
Anonymous
Not applicable

Hi all,
I hope I have understood what you're looking for.
First, read the input file, filter the columns to keep only HHID, sort it, and keep one instance of each HHID with tUniqRow.
Then, in a tMap, assign a sequence to each HHID (this is my lookup).
Main flow: read the whole input file again, do an inner join in the tMap on HHID, and write the result to a file.
The downside is that the whole input is read twice, but I have not found another solution so far!
I hope this helps.
PS:
my sequence test
aa;jdk;aaadr1;adr1
abc;hd;abcadr2;adr2
aa;idf;aaadr1;adr1
cc;djf;ccadr1;adr1

Number of rows: nearly 5 000 000
Sort: on disk
_AnonymousUser
Specialist III
Author

Hi kzone, thank you for the time and effort to help me out!
You got that right and your flow looks great, but what I cannot do is get past the sorting. Even if I filter out all columns at the start of the flow and keep only HHID, the process still crashes with an out-of-memory error.
HHID is a 13-byte number in my case: all digits, no letters.
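
Since the HHID is all digits and 13 digits fit in a Java `long`, one alternative worth considering is to skip the sort and keep a running counter per HHID while reading the file once. The sketch below is plain Java with a hypothetical class name (`HhidCounter`), not a Talend component, and it swaps the sort for a hash map, so memory grows with the number of distinct households instead of the number of records.

```java
import java.util.HashMap;
import java.util.Map;

class HhidCounter {
    // One counter per distinct HHID seen so far.
    private final Map<Long, Integer> counts = new HashMap<>();

    /** Returns 1 for the first record seen with this HHID, 2 for the second, and so on. */
    int next(String hhid) {
        long key = Long.parseLong(hhid.trim()); // a 13-digit number fits in a long
        // merge() stores 1 for a new key, otherwise adds 1, and returns the new value.
        return counts.merge(key, 1, Integer::sum);
    }

    public static void main(String[] args) {
        HhidCounter counter = new HhidCounter();
        String[] hhids = {"1000000000001", "1000000000002", "1000000000001"};
        for (String hhid : hhids) {
            System.out.println(hhid + " -> " + counter.next(hhid));
        }
    }
}
```

Records of the same household do not need to be adjacent for this to work, but the output stays in input order. Inside a tMap expression, Talend's built-in `Numeric.sequence(name, start, step)` routine should behave in a similar way if the HHID is passed as the sequence name; please verify that against your Talend version before relying on it.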
Anonymous
Not applicable

Can you post your job code so we can take a look?
Right-click on the job name in the client and select "export items".
In the dialog, select "archive file" and define the output location and name.
Upload the zip file here (it should be very small).
_AnonymousUser
Specialist III
Author

Here you go:
http://dl.dropbox.com/u/1351927/testsort_0.1.zip
I removed the sequencing and the tMap since I could not get past tsort. It crashes every time it reaches about 450,000 records. I would send you my test file, but I cannot since it contains some proprietary info.
Anonymous
Not applicable

Hello boris, could you provide the exported items for your job, please? testsort_0.1.zip is an exported job script file.
_AnonymousUser
Specialist III
Author

Hi gatigossou, sorry about that. Please see here:
http://dl.dropbox.com/u/1351927/testsort.zip
Anonymous
Not applicable

Hello boris,
This job shows an example configuration for sorting about 5 000 000 records.
The JVM arguments are set to -Xms256M -Xmx1548M and the buffer size is set to 500000.
You can download my job here:
http://www.talendforge.org/exchange/tos/extension_view.php?eid=312

Best regards,
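
If it is unclear whether the job actually picked up the -Xmx setting, a quick check is to print the maximum heap from inside the JVM. This is a minimal standalone sketch (the class name `HeapCheck` is made up for illustration); the same `Runtime` call could be placed in a tJava component, assuming one is available in your job.

```java
class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        // Run with e.g.: java -Xms256M -Xmx1548M HeapCheck.java
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```

The reported value can be somewhat lower than the -Xmx figure, depending on the JVM and garbage collector, so treat it as a sanity check and not an exact readback.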
Anonymous
Not applicable

Hurray! I adjusted the tsort settings on my test job as you said, and it finally completed without an out-of-memory error. It took a while on my desktop, though: 18 minutes to sort 6.3 million records. I would still be concerned about using this for production jobs. Having to fine-tune every time and guess whether it will fail does not work that well, especially when you are talking about files of 200 million records (a typical size for US nationwide consumer listing files, for example).
Thank you for your help! Now I need to go back and see whether my original task (creating the sequence number) works.
Anonymous
Not applicable

Yes, sorting is such an important requirement that building some kind of heuristic into it (i.e. auto-tuning the buffer, falling back to disk when memory pressure is high) would be a very welcome addition from Talend.
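
For anyone curious what "fall back to disk" means in practice, below is a rough sketch of a generic external merge sort in plain Java. It is not how tsort is implemented; the class name `ExternalSortSketch` and the `maxLinesInMemory` parameter are hypothetical. It sorts whole lines as text, which matches HHID order only if the fixed-width HHID is the leading field of each record.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class ExternalSortSketch {
    /** One open sorted chunk file plus its smallest unread line. */
    private static final class Run {
        final BufferedReader reader;
        String head;
        Run(Path file) throws IOException {
            reader = Files.newBufferedReader(file);
            head = reader.readLine();
        }
        void advance() throws IOException { head = reader.readLine(); }
    }

    /** Sorts the lines of input into output, buffering at most maxLinesInMemory lines at a time. */
    static void sort(Path input, Path output, int maxLinesInMemory) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= maxLinesInMemory) {
                    chunks.add(spill(buffer)); // buffer full: sort it and write it to disk
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) chunks.add(spill(buffer));
        }
        merge(chunks, output);
        for (Path chunk : chunks) Files.deleteIfExists(chunk);
    }

    /** Sorts the buffer in memory and writes it to a temporary chunk file. */
    private static Path spill(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("sort-chunk-", ".txt");
        Files.write(chunk, buffer);
        return chunk;
    }

    /** K-way merge: repeatedly emits the smallest head line among all chunks. */
    private static void merge(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<Run> queue = new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        List<Run> runs = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (Path chunk : chunks) {
                Run run = new Run(chunk);
                runs.add(run);
                if (run.head != null) queue.add(run);
            }
            while (!queue.isEmpty()) {
                Run smallest = queue.poll();
                out.write(smallest.head);
                out.newLine();
                smallest.advance();
                if (smallest.head != null) queue.add(smallest); // re-queue with its next line
            }
        } finally {
            for (Run run : runs) run.reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ExternalSortSketch.java <input file> <output file>
        sort(Path.of(args[0]), Path.of(args[1]), 500_000);
    }
}
```

The buffer size here plays the same role as the value that had to be tuned by hand in this thread: a smaller value uses less heap but creates more chunk files to merge. An auto-tuning version would have to pick that value from the available heap, which this sketch does not attempt.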