_AnonymousUser
Specialist III

how to create a sequence code for a group of records with the same id

Hello,
here is my task that I am trying to implement in Talend.
I have a fixed-width text file with consumer names, addresses, and some other fields. Every unique household (defined as the combination of the address and last-name fields) has a household identifier (HHID), so every record in the file has an HHID, and all records with the same address and person's last name share the same HHID.
What I need to do is assign a sequence code starting at 1 to the first record in the group with the same HHID, then 2 to the second record in the group, and so on.
My understanding is that I should first sort the data file by HHID and then do a group-by, but I am puzzled about how to get back down to the record level and generate the sequence.
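In pseudocode terms, the logic I am after looks something like this (a minimal plain-Java sketch for illustration, with made-up sample data, assuming the records are already sorted by HHID):

```java
import java.util.Arrays;
import java.util.List;

public class GroupSequence {
    public static void main(String[] args) {
        // Hypothetical records, already sorted by HHID: {HHID, rest of record}.
        List<String[]> records = Arrays.asList(
            new String[] {"H001", "Smith, 12 Oak St"},
            new String[] {"H001", "Smith, 12 Oak St"},
            new String[] {"H002", "Jones, 5 Elm Rd"},
            new String[] {"H002", "Jones, 5 Elm Rd"});

        String previousHhid = null;
        int sequence = 0;
        for (String[] record : records) {
            String hhid = record[0];
            // Restart the counter at 1 whenever a new HHID group begins.
            sequence = hhid.equals(previousHhid) ? sequence + 1 : 1;
            previousHhid = hhid;
            System.out.println(hhid + "\t" + sequence + "\t" + record[1]);
        }
    }
}
```

In Talend I imagine this compare-with-previous-row logic would sit in something like a tJavaRow or a tMap after the sort, but that is exactly the part I am unsure about.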
Anonymous
Not applicable

Do you need sorted *output*? If not, have you tried removing the sort step? It does not seem necessary if all you need is to assign an ID counter with a sequence.
If you do need sorted output, have you tried writing the data with the generated IDs to a temporary embedded file database such as HSQLDB (which is embedded in Talend), perhaps creating an index beforehand on the column you want to sort by? A rough JDBC sketch of that idea follows below.
Otherwise, please provide a sample of test data so the problem can be reproduced.
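Something like this, as a rough JDBC sketch of the idea (it assumes the HSQLDB jar is on the classpath; the table and column names are just examples):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HsqldbStaging {
    public static void main(String[] args) throws Exception {
        Class.forName("org.hsqldb.jdbcDriver"); // classic HSQLDB driver class
        // Embedded, file-based database; CACHED tables are stored on disk,
        // so the staged rows do not all have to fit in the JVM heap.
        Connection conn = DriverManager.getConnection(
                "jdbc:hsqldb:file:/tmp/stagedb", "SA", "");
        Statement st = conn.createStatement();
        st.execute("CREATE CACHED TABLE staging (hhid VARCHAR(13), rec VARCHAR(1000))");
        st.execute("CREATE INDEX idx_hhid ON staging (hhid)");
        st.execute("INSERT INTO staging VALUES ('H002', 'Jones')");
        st.execute("INSERT INTO staging VALUES ('H001', 'Smith')");
        // The index lets the engine return rows in order without
        // re-sorting the whole data set in memory.
        ResultSet rs = st.executeQuery(
                "SELECT hhid, rec FROM staging ORDER BY hhid");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        st.execute("SHUTDOWN"); // flush the file-based database cleanly
        conn.close();
    }
}
```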
alevy
Specialist

I also failed to find any documentation on tMemorizeRows

It is only in v4.1.0, which is yet to be released.
I have 2 GB of RAM on my desktop

Frankly, you should have at least 4 GB, and preferably 8 GB, to work with large data sets.
Anonymous
Not applicable

Yes. If, under the hood, Talend cannot rely on SQL cursors alone and is forced to push data into memory (as in the case of sorting), you certainly need a large amount of RAM: for every parsed line you store the raw data plus the Java object overhead, times the number of rows (a row of a few hundred bytes of raw data can easily occupy several times that once parsed into Java objects). Go figure...
alevy is right... if you need to handle such volumes in production, do yourself a favour and use a 64-bit platform with a 64-bit JVM...
bye
_AnonymousUser
Specialist III
Author

This simple job is my way of quickly evaluating an ETL tool: take a file with at least 4-5 million records, sort it, build a group sequence code, and write it back to a file.
As you said, sorting and sequencing are very heavy operations, so you can easily see how the product behaves under stress load.
Since Talend is now positioned as a professional, enterprise-grade product, I think it must behave better, for example by using file buffers to do the job when it runs out of RAM (much more slowly, of course, but without crashing on users).
Do not get me wrong: Talend is a fantastic and, I would say, unique product. Talend's team does a great job and has made a huge leap in functions and features in the last 2 years, but I think it still needs some polishing.
Regarding RAM requirements... I am going to try to run this job on our 64-bit, 4-core CPU server with 4 GB and will report back how it went, but I did install quite a few ETL tools on my desktop (not free and not open-source, though) and they all did the job without crashing; some were terribly slow, some were significantly better.
Also, I must say that the source file for my job is not that scarily big and wide: the record length is a bit over 1,000 bytes and there are 4,000,000 records, so roughly 4 GB of raw data before any object overhead.
That is nothing nowadays, when you have to deal with terabytes of data.
Thank you all for your help!
Anonymous
Not applicable

Since Talend is now positioned as a professional, enterprise-grade product, I think it must behave better, for example by using file buffers to do the job when it runs out of RAM (much more slowly, of course, but without crashing on users).
Do not get me wrong: Talend is a fantastic and, I would say, unique product. Talend's team does a great job and has made a huge leap in functions and features in the last 2 years, but I think it still needs some polishing.

Here I surely agree with you; Talend is a very flexible tool, but it certainly needs more control and refinement regarding memory management... too many things are delegated to plain Java without additional layers... Talend is quite naive in this matter (i.e., no use of NIO memory-mapped files, memory compression, etc.).
bye
_AnonymousUser
Specialist III
Author

Here I surely agree with you; Talend is a very flexible tool, but it certainly needs more control and refinement regarding memory management...

Understood, and as I said, I like Talend and want to make it work, but if I have these challenges with a simple job, I do not even want to think about how our production jobs would be handled (our typical job deals with 100-200 million records, 10 lookup tables, and 50-70 data-processing steps such as conversion, flagging, deduping, etc.).
I am going to try to run it on our server, but do you have any tips on how to manage memory better in Talend?
What I have understood so far:
1) Limit the number of fields passed (I already tried that by reducing the field set to the HHID field and a few flags only; it did not help much).
2) Turn on the sort-on-disk option and play with the buffer size.
3) Change the JVM settings? (see the heap-check sketch below)
Anything else I can try?
Thanks a bunch!
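On point 3, my understanding is that what matters is the JVM heap ceiling (the standard -Xmx flag), not the physical RAM on the box. A quick generic-Java way to see what cap a job actually runs under (nothing Talend-specific):

```java
public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reports the ceiling set via -Xmx; an
        // OutOfMemoryError fires when the heap hits this value,
        // no matter how much physical RAM is still free.
        System.out.println("max heap:   " + rt.maxMemory() / (1024 * 1024) + " MB");
        System.out.println("total heap: " + rt.totalMemory() / (1024 * 1024) + " MB");
        System.out.println("free heap:  " + rt.freeMemory() / (1024 * 1024) + " MB");
    }
}
```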
Anonymous
Not applicable

You have the tExternalSortedRow component available, which uses the popular GNU sort binary (see the GNU website). It should solve your memory issue, but your job will need an external resource to be started...
Anonymous
Not applicable

I am going to try to run it on our server, but do you have any tips on how to manage memory better in Talend?

For production purposes I certainly suggest a 64-bit platform with memory sized according to the data to be handled.
But despite that, your test job MUST NOT crash in your test case, even with limited RAM and on a 32-bit platform, if you are using the sort-on-disk option, because a chunk-based sort algorithm is designed for exactly that situation. It does crash, so it may be a bug; you should post the test data and the job, and open a ticket.
A note to the Talend team: please verify the option "buffer size of external sort", or change its name, because it can be read as a number of bytes (and that is what the documentation states)... but if I look at the generated code, it is the SIZE of an array of objects, so it does not translate directly into memory usage.
Sorting is a very common requirement, so it really ought to be a first-class component in every ETL tool... by delegating such a basic thing to an external process like Unix sort you lose control and portability; it feels like a hacked-in solution and, finally, it is quite an admission of defeat for a tool designed to handle data.
If I may give some advice, have a look at http://brie.di.unipi.it/smalltext/ ; it is a pure-Java library that implements external sorting with mergesort on text data. A rough sketch of the chunk-and-merge idea is below.
hope it helps
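For illustration, the chunk-based approach is roughly this in plain Java (a minimal sketch of the algorithm, not that library's actual API): sort heap-sized chunks, spill each to a temporary file, then k-way merge the sorted runs through a priority queue.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSort {

    // One sorted run on disk, with its current (peeked) line.
    private static final class Run {
        final BufferedReader reader;
        String current;
        Run(Path p) throws IOException {
            reader = Files.newBufferedReader(p);
            current = reader.readLine();
        }
    }

    public static void sort(Path in, Path out, int linesPerChunk) throws IOException {
        // Phase 1: read chunks that fit in the heap, sort, spill to temp files.
        List<Path> runFiles = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(in)) {
            List<String> chunk = new ArrayList<>(linesPerChunk);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == linesPerChunk) {
                    runFiles.add(spill(chunk));
                }
            }
            if (!chunk.isEmpty()) {
                runFiles.add(spill(chunk));
            }
        }

        // Phase 2: k-way merge of the sorted runs via a priority queue.
        PriorityQueue<Run> heap =
            new PriorityQueue<>((a, b) -> a.current.compareTo(b.current));
        for (Path p : runFiles) {
            Run r = new Run(p);
            if (r.current != null) heap.add(r);
        }
        try (BufferedWriter writer = Files.newBufferedWriter(out)) {
            while (!heap.isEmpty()) {
                Run r = heap.poll();
                writer.write(r.current);
                writer.newLine();
                r.current = r.reader.readLine();
                if (r.current != null) heap.add(r); else r.reader.close();
            }
        }
        // Temp run files are left behind for brevity; a real job would delete them.
    }

    private static Path spill(List<String> chunk) throws IOException {
        Collections.sort(chunk);          // in-memory sort of one chunk only
        Path run = Files.createTempFile("sortrun", ".txt");
        Files.write(run, chunk);
        chunk.clear();
        return run;
    }

    public static void main(String[] args) throws IOException {
        sort(Paths.get(args[0]), Paths.get(args[1]), 500_000);
    }
}
```

The point is that memory usage is bounded by the chunk size plus one buffered line per run, no matter how big the input file is.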
_AnonymousUser
Specialist III
Author

emaxt6, you are a real asset on this forum!
Well, this morning I tried to sort my sample file on one of our data servers (Xeon 2.33 GHz, 1 CPU x 4 cores, 4 GB RAM, 64-bit Windows Server 2003 R2).
Guess what? I am getting exactly the same errors, just a lot faster.
The weird thing is that I can see there is still 1 GB of RAM available when it crashes with a heap out-of-memory error. I tried sort in memory, sort on disk, and changing the sort buffer parameter; the result is the same.
A few times, though, I saw this error:
Exception in thread "main" java.util.ConcurrentModificationException
at java.util.LinkedList$ListItr.checkForComodification(Unknown Source)
at java.util.LinkedList$ListItr.next(Unknown Source)
at routines.system.RunStat.sendMessages(RunStat.java:244)
at routines.system.RunStat.stopThreadStat(RunStat.java:228)
at boristest.achsort_0_1.achsort.tFileInputPositional_1Process(achsort.java:1890)
at boristest.achsort_0_1.achsort.runJobInTOS(achsort.java:2066)
at boristest.achsort_0_1.achsort.main(achsort.java:1940)
I even significantly reduced the number of fields passed to tSortRow: now it is only 4 fields with 35 bytes of total record length. The field used for sorting is a 13-byte number.
I also did a quick search on this forum, and it seems I am not the only one who has issues with sorting...
_AnonymousUser
Specialist III
Author

You have the tExternalSortedRow component available, which uses the popular GNU sort binary (see the GNU website). It should solve your memory issue, but your job will need an external resource to be started...

Thank you for the tip, but if I need to use external components for something as essential in the ETL world as sorting, I just do not see much value in the product. Besides, I am really concerned about the crashes: the sort algorithm may run slowly when it reaches the limits of RAM, CPU, etc., but it should not crash... at least not if you position your product for commercial applications.