Anonymous
Not applicable

OutOfMemoryError: GC overhead limit exceeded on large XML files

Hi,
I am using Talend 3.0.4 (mandatory, I think, because SpagoBI 3.6 has a Talend engine v3.0.4). One job extracts data using tFileInputXML in SAX mode (with the other modes I get heap out-of-memory errors, which I think is worse) from large XML files, currently up to 2 GB and possibly bigger in the future.
It's quite a simple job (tFileInputXML ---> tMap (no processing, just mapping fields) ---> tMysqlOutput). I even tried raising Xmx to 2048, but that didn't help.
I also tried something I've seen here on the forum: I set the number-of-lines buffer on the tMap to 1000 and the commit limit on the tMysqlOutput to 1000. That didn't help either.
First: I would like to know whether I can use a more recent Talend version with SpagoBI 3.6. (I'm afraid of building big, complex jobs only to find out later that I can't deploy them to the server, or that there are compatibility problems.)
Second: is there a way to solve this problem? (Two days ago I had a problem copying large files with tFileCopy. It turned out to be a bug that was fixed in later versions, so I downloaded the fixed filecopy.jar, replaced the one I had, and it worked like a charm.)
Thank you.
3 Replies
Anonymous
Not applicable
Author

Your version of Talend is older than mine, but I usually run into errors with large XML files too.
One approach we developed here is to split the XML file into smaller chunks (usually never larger than 64 MB) using some Java code in a routine, and then process all the files in sequence.
This allows me to use the regular XML parser (way faster than SAX) and allows better XPaths in the schema definition.
The function I use is below. It only works if your loop tag doesn't appear anywhere else in the XML file, for example nested inside the loop elements themselves:
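// Imports needed at the top of the routine class:
// import java.io.FileInputStream;
// import java.io.FileOutputStream;
// import java.io.IOException;
// import java.io.PrintStream;
// import java.util.Scanner;
// import java.util.regex.Pattern;

/**
 * Splits filename into files named <name>_partNNNN.xml of roughly maxpart characters each.
 * tagname is the loop element, roottag is the document root, and nsdeclaration holds the
 * attributes to repeat on the root element of every part after the first.
 */
public static boolean split_file(String filename, int maxpart, String tagname, String roottag, String nsdeclaration) {
    PrintStream outstream = null;
    Scanner s = null;
    int part = 0;
    int partsize = 0;
    String closetag = "</" + tagname + ">";
    String closeroot = "</" + roottag + ">";
    String partfile = filename.replaceFirst("\\.xml$", "");
    try {
        s = new Scanner(new FileInputStream(filename), "utf-8");
        // the delimiter is a regex, so quote the tag to match it literally
        s.useDelimiter(Pattern.quote(closetag));
        while (s.hasNext()) {
            String token = s.next();
            if (outstream == null) { // begin a new part file
                String suffix = String.format("_part%04d.xml", part);
                // write utf-8 explicitly so the bytes match the XML declaration
                outstream = new PrintStream(new FileOutputStream(partfile + suffix), false, "utf-8");
                if (part > 0) { // part 0 keeps the original declaration and root tag
                    outstream.println("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
                    outstream.println("<" + roottag + " " + nsdeclaration + ">");
                }
                partsize = 0;
            }
            // just append tokens
            outstream.print(token);
            // the last chunk already carries the closing root tag
            boolean last = token.indexOf(closeroot) >= 0;
            // otherwise restore the closing tag that the scanner consumed as delimiter
            if (!last) outstream.println(closetag);
            partsize += token.length();
            if (!last && partsize > maxpart) { // time to wrap it up
                outstream.println(closeroot);
                outstream.close(); // also closes the underlying FileOutputStream
                outstream = null;
                part++;
            }
        }
        // Scanner swallows read errors, so surface them here
        IOException readError = s.ioException();
        if (readError != null) throw readError;
        return true;
    } catch (Exception e) {
        System.out.println(e.getMessage());
        return false;
    } finally {
        // runs on success and failure, so nothing stays open
        if (s != null) s.close();
        if (outstream != null) outstream.close();
    }
}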
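Because the function above works on the text and not on the XML structure, it can be worth checking that every part still parses before feeding it to the job. Below is a minimal sketch of such a check, assuming the parts sit next to the source file and follow the _partNNNN.xml naming used by split_file. The class name PartCheck and its two helper methods are made up for this example and are not part of Talend.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Arrays;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Talend: lists the parts written for a
// source file and checks that each one is well-formed XML.
public class PartCheck {

    // Returns the part files for sourceXml, sorted by name.
    // Assumes the "<name>_partNNNN.xml" naming used by split_file.
    public static File[] listParts(File sourceXml) {
        final String prefix = sourceXml.getName().replaceFirst("\\.xml$", "") + "_part";
        File dir = sourceXml.getAbsoluteFile().getParentFile();
        File[] parts = dir.listFiles(new FilenameFilter() {
            public boolean accept(File d, String name) {
                return name.startsWith(prefix) && name.endsWith(".xml");
            }
        });
        if (parts == null) {
            return new File[0];
        }
        // Zero-padded part numbers sort correctly as text (up to 9999 parts).
        Arrays.sort(parts);
        return parts;
    }

    // True if the file parses as well-formed XML. SAX streams the file,
    // so the check itself needs very little heap.
    public static boolean isWellFormed(File xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(xml, new DefaultHandler());
            return true;
        } catch (Exception e) {
            System.out.println(xml.getName() + ": " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.out.println("usage: java PartCheck <source.xml>");
            return;
        }
        for (File part : listParts(new File(args[0]))) {
            System.out.println(part.getName() + " -> " + (isWellFormed(part) ? "ok" : "BROKEN"));
        }
    }
}
```

Run it with the path of the original (unsplit) file as the only argument. A part reported as BROKEN usually means the loop tag also appears somewhere inside the data, which is the limitation mentioned above.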
Anonymous
Not applicable
Author

Thank you for this neat code, this might save my project! Luckily my looping tag does not show up in the data tags.
Do you suggest I add a new routine and then call it from a tJava component, or create a new component altogether? I ask because later I will have to deploy the jobs on the SpagoBI server's Talend engine, and I don't know exactly what will be deployed!
I'm a bit new to tweaking Talend to fit my needs.
EDIT: I created a new routine and called the function from a tJava component with the help of a tFileList, and it works like a charm. Now, with XML files of at most 60 MB, the parsing runs smoothly with no heap or GC exceptions.
As for the SpagoBI deployment, I will look at that later when I set up the server.
Anonymous
Not applicable
Author

Sorry for the delay in answering. I usually add a tJava in a tPreJob component.
Thiago