Anonymous
Not applicable

Unable to process big file in tXMLMap

Hello,
I have a text file with more than 10 million records and I am trying to process it with tXMLMap to output an XML file. During this process I found that tXMLMap buffers all the data within it, which causes a java.lang.OutOfMemoryError. Could someone please help me with an alternative, or any configuration that can be done to resolve the issue? Thanks in advance.

[Attached screenshot]


Thanks
Bala

14 Replies
Anonymous
Not applicable
Author

I assume it's heap that you're running out of.
It's not always the best option, but it doesn't look like you've got much scope for anything else here.
You could try increasing -Xmx, which can be found on the Run tab under Advanced settings.
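For example (the figures are purely illustrative and assume your machine has the RAM to spare), tick the "Use specific JVM arguments" option on that tab and raise the -Xmx value, e.g. -Xmx8192M. If you want to confirm what the job actually received, a one-line check in a tJava component will tell you:

// Prints the effective heap ceiling the running job was given
System.out.println("Max heap: " + (Runtime.getRuntime().maxMemory() / (1024 * 1024)) + " MB");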
Anonymous
Not applicable
Author

tal00000 wrote:
I assume it's heap that you're running out of.
It's not always the best option, but it doesn't look like you've got much scope for anything else here.
You could try increasing -Xmx, which can be found on the Run tab under Advanced settings.

Thank you for the reply. Yes, it is the heap that is running out. I already tried increasing -Xmx and didn't succeed. Also, I think setting -Xmx is not a permanent solution, since we may get an unpredictable amount of data in the future. Hence, I believe there could be some other option for tXMLMap, like the temp path setting in tMap.

Is there any way we can split the tXMLMap job?
Thanks
Anonymous
Not applicable
Author

I think setting -Xmx is a reasonable solution if you have the memory, you can set it to a value high enough to cater for your largest data set, and overall performance is acceptable.
I don't know your data, but 10M+ rows in XML is quite a lot. Will the receiving system have similar issues?
It's simple enough to split your data into multiple XML files, and then either retain them as smaller files or use less memory-intensive components to reconstruct a larger file.
Anonymous
Not applicable
Author

tal00000 wrote:
I think setting -Xmx is a reasonable solution if you have the memory, you can set it to a value high enough to cater for your largest data set, and overall performance is acceptable.
I don't know your data, but 10M+ rows in XML is quite a lot. Will the receiving system have similar issues?
It's simple enough to split your data into multiple XML files, and then either retain them as smaller files or use less memory-intensive components to reconstruct a larger file.

Even for the split, we need tXMLMap, correct? Because the input is a text format file. Can you please let me know how we can split it into smaller files and process them? I am very new to this technology. Thanks
Anonymous
Not applicable
Author

This is tricky since your data is only in that massive file. I take it you understand the XML structure of that file. If that is the case, is there a natural point in that structure to split it? For example, is there a main loop you could split it on? If so, you *may* be able to get round this in a slightly different way. Creating a Document object uses more memory than simply creating a String. What you could try is to read the file in using a tFileInputRaw as a String. This will still require a lot of memory, but maybe not as much as reading it in as a Document. Then use simple String parsing techniques to split your file into multiple files (you will need to wrap each file in suitable opening and closing XML elements so that the resultant files can be read as XML). Then simply read the resultant files as you intended to read the file you are having trouble with. There is a rough sketch of the splitting idea below.
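Purely to illustrate the String-parsing part, here is a rough sketch in plain Java; the element names, chunk size and paths are made up, it assumes well-formed input, and in a job this logic would more likely sit in a tJavaFlex or a routine:

// Illustrative only: split a large XML String on a repeating <record> element
// and write chunks wrapped in a root element so every output file is valid XML.
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SplitXmlString {
    public static void main(String[] args) throws IOException {
        String xml = new String(Files.readAllBytes(Paths.get("C:/data/big.xml")), "UTF-8");

        String openTag = "<record>";     // the repeating "main loop" element (assumed name)
        String closeTag = "</record>";
        int recordsPerFile = 100000;     // tune to what comfortably fits in memory downstream

        int fileIndex = 0;
        int count = 0;
        StringBuilder chunk = new StringBuilder();

        int pos = xml.indexOf(openTag);
        while (pos >= 0) {
            int end = xml.indexOf(closeTag, pos) + closeTag.length();
            chunk.append(xml, pos, end);
            count++;
            if (count == recordsPerFile) {
                writeChunk(chunk.toString(), fileIndex++);
                chunk.setLength(0);
                count = 0;
            }
            pos = xml.indexOf(openTag, end);
        }
        if (chunk.length() > 0) {
            writeChunk(chunk.toString(), fileIndex);
        }
    }

    // Wrap each chunk in opening/closing elements so the resultant file is readable as XML.
    private static void writeChunk(String body, int index) throws IOException {
        try (FileWriter out = new FileWriter("C:/data/chunk_" + index + ".xml")) {
            out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root>" + body + "</root>");
        }
    }
}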
Anonymous
Not applicable
Author

Yes, I am aware of the structure of the XML. However, since the input file is a text file, I don't know at what point I have to use the other component to split the records. Could you please throw some light on this? Please look at the image of how my job is currently designed. Thanks
Anonymous
Not applicable
Author

I think I misunderstood your request; I was thinking you were reading a large XML file. One thing I would say is that producing what would be a very large XML file is only going to lead to further problems down the line. What is your input format? Is it something like CSV? Something row based? If so, how is the XML structured? Does one data row equal one XML loop? If so, why not chop the input file up by rows (1,000 or 10,000 at a time) and then produce an XML file per chunk? Splitting a CSV file up into chunks is pretty easy (Google it, or see the sketch below).
Otherwise, if you must try to produce one XML file, you could try using the tFileOutputMSXML component to do this and remove the tXMLMap. You should be able to map the columns directly to the tFileOutputMSXML component, which will reduce the memory required... although not by much.
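Just to illustrate the chunking idea outside of the Studio, here is a minimal Java sketch; the paths, chunk size and the assumption of a single header line are mine, not taken from your job:

// Illustrative only: split a row-based text file into fixed-size chunks,
// repeating the header line in every chunk.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class SplitRowFile {
    public static void main(String[] args) throws IOException {
        int rowsPerChunk = 10000;
        int fileIndex = 0;
        int rowCount = 0;

        try (BufferedReader in = new BufferedReader(new FileReader("C:/data/input.txt"))) {
            String header = in.readLine();              // assume exactly one header line
            FileWriter out = newChunk(fileIndex, header);
            String line;
            while ((line = in.readLine()) != null) {
                if (rowCount == rowsPerChunk) {         // current chunk is full, start the next one
                    out.close();
                    out = newChunk(++fileIndex, header);
                    rowCount = 0;
                }
                out.write(line);
                out.write(System.lineSeparator());
                rowCount++;
            }
            out.close();
        }
    }

    // Start a new chunk file and write the header into it.
    private static FileWriter newChunk(int index, String header) throws IOException {
        FileWriter out = new FileWriter("C:/data/chunk_" + index + ".txt");
        out.write(header);
        out.write(System.lineSeparator());
        return out;
    }
}

However you implement it, the principle is the same: never hold more than one chunk's worth of rows in memory at a time.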
Anonymous
Not applicable
Author

It is text format and, yes, it is row based. But it is not one row equals one XML loop, since we are grouping rows based on member id. I think it could be difficult to split at arbitrary points, since we need to keep all the rows for a member id together.
Anonymous
Not applicable
Author

Sorry for not responding sooner; I have been a bit busy today. Your problem has actually inspired a new idea for a component which I will look into building. However, I have a question and a suggestion based on the answer. The question is: do you have any grouping item in your data that would let you bundle the member ids into a limited number of groups? By that I mean, do you have member groups? If yes, you could use a tMap to point each member group to a different output file. Then you could build several XML files based on that grouping.

Alternatively, if you only have member ids, you could do this. Let's say that your member ids go from 1 to 1000 (I know, an easy example, but stay with it and maybe you can extrapolate). If that is the case, then you could use a tMap component to split the data out to 10 files by grouping the member ids: 1-99, 100-199, 200-299, 300-399, 400-499, 500-599, 600-699, 700-799, 800-899, 900-999 (or 1000). This can be achieved with a simple algorithm using the built-in filtering of the tMap output tables (there is a rough sketch of the filter expressions below). Then you can ensure that your initial file is split into 10 relatively equally sized files, which are grouped as you need them for your final XML data.
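A minimal sketch of what those output filters might look like, assuming an input row named row1 with an Integer column called member_id (the names and ranges are only illustrative):

// Filter expression on the first tMap output table (members 1-99):
row1.member_id >= 1 && row1.member_id <= 99

// Filter expression on the second output table (members 100-199):
row1.member_id >= 100 && row1.member_id <= 199

// ...and so on for the remaining ranges. Equivalently, compute a bucket with
// integer division and give each output table one bucket number; capping at 9
// keeps id 1000 in the last group:
Math.min(row1.member_id / 100, 9) == 0

Because the filters are plain Java expressions and the ranges do not overlap, each row lands in exactly one output file.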

This is assuming that it is OK to produce multiple final XML files instead of one massive single file. The reason I am pointing you toward this solution (if possible) is that with this amount of data, it is very hard to do everything you need in memory.
If you need to produce one big XML file, then you could try grouping your data from the input file using the tSortRow component. The Advanced settings allow you to "Sort on disk", which should help with memory issues. Then, if you need to transform the data, you can use the tMap component's "store on disk" functionality (Advanced settings tab). Then, to load it into an XML file, you could use the tAdvancedFileOutputXML component and set the "Generation mode" (on the Advanced settings tab) to "Fast with low memory consumption". You will need to check the documentation for these components to be able to use them as I have suggested, but this might work for you.

There are always a couple of ways to solve problems in Talend. Sometimes it requires you to think about the problem in different ways. I think one of the two ways I have suggested should work for you and there may well be more solutions to your problem. I would be interested in hearing what does solve this for you. Meanwhile I will be looking into creating a component which will allow data to be added to different files based on a key field 🙂