
[resolved] Split a really big XML file into multiple XML files

Hello,
I have to split a 160 GB XML file.
I found a solution in this topic: https://community.talend.com/t5/Design-and-Development/resolved-Can-we-Split-One-XML-File-into-Multi...
But my file is so big (160 GB) that I can't use tFileInputXML: I get an OutOfMemory error.
So I wonder whether there is another way to split huge XML files using Talend (or maybe a small program that I can run from the tSSH component).

Just for your information, this is what the XML file looks like:
<ExampleDatabase>
<DatabaseEntry>
A lot of things.
</DatabaseEntry>
<DatabaseEntry>
Other things
</DatabaseEntry>
<DatabaseEntry>
Other things again
</DatabaseEntry>
</ExampleDatabase>

I want to split it at the boundaries between <DatabaseEntry> elements, so that no entry is cut in half.

Thank you.

Accepted Solutions
Anonymous
Not applicable
Author

Thank you for your help, Mbaroudi!
I just found this one this morning: http://linux.die.net/man/1/xml_split
This Linux command splits the file into files of a chosen size and keeps the XML structure.
But I think I'm going to try your way, Mbaroudi (so the job will also run correctly on Windows if needed).
Pikerman: sorry, I don't know much about PHP (please create a separate topic about this).
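
For readers who want something portable in the same spirit as xml_split, here is a minimal Python sketch of a streaming splitter. It is not what xml_split or Talend does internally; it is only an illustration. The function name split_xml, the part_NNNNN.xml file names, and the choice to split by number of entries rather than by size are all assumptions made for the example. It also assumes that the <DatabaseEntry> elements are direct children of the root, and that the root has no attributes or namespaces (they would not be carried over).

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_xml(source, out_dir, entries_per_file, tag="DatabaseEntry"):
    """Stream `source` and write `entries_per_file` <tag> elements per output file.

    Each output file repeats the root element so that it is well-formed on
    its own. Returns the list of paths written. (Hypothetical helper.)
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None          # currently open output file, if any
    in_chunk = 0        # entries written to the current file so far

    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    open_tag = f"<{root.tag}>\n"
    close_tag = f"</{root.tag}>\n"

    for event, elem in context:
        if event != "end" or elem.tag != tag:
            continue
        if out is None:
            path = out_dir / f"part_{len(written) + 1:05d}.xml"
            out = path.open("w", encoding="utf-8")
            out.write('<?xml version="1.0" encoding="UTF-8"?>\n' + open_tag)
            written.append(path)
        elem.tail = "\n"  # one entry per line block in the output
        out.write(ET.tostring(elem, encoding="unicode"))
        in_chunk += 1
        root.clear()  # drop entries already written so memory stays flat
        if in_chunk == entries_per_file:
            out.write(close_tag)
            out.close()
            out, in_chunk = None, 0

    if out is not None:  # close the last, partially filled file
        out.write(close_tag)
        out.close()
    return written
```

Because only one entry is held in memory at a time, memory use should not grow with the size of the input. A call such as split_xml("big.xml", "parts", 100000) would write parts/part_00001.xml, parts/part_00002.xml, and so on (the file name and chunk size here are placeholders).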


4 Replies
Anonymous
Not applicable
Author

In Talend 5.3.1, the tFileInputXML component has an advanced option, Generation Mode, which can be set to "Fast and low memory consumption (SAX)".
Anonymous
Not applicable
Author

Yes, I know; I already use SAX. But even with it, a 160 GB XML file is far too big.
Anonymous
Not applicable
Author

Hi,
You can use XSLT to split a huge XML file with Talend's tXSLT component:
Source code:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- 1-based, inclusive range of DatabaseEntry elements to keep in this run -->
  <xsl:param name="startPosition"/>
  <xsl:param name="endPosition"/>
  <!-- Identity transform: copy anything not matched by a more specific template -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Parent of the entries: select only DatabaseEntry children, so that
       position() counts entries and not the whitespace text nodes between them -->
  <xsl:template match="ExampleDatabase">
    <xsl:copy>
      <xsl:apply-templates select="DatabaseEntry"/>
    </xsl:copy>
  </xsl:template>
  <!-- Copy an entry only if it falls inside the requested range;
       "<" must be escaped as &lt; inside an attribute value -->
  <xsl:template match="DatabaseEntry">
    <xsl:if test="position() &gt;= $startPosition and position() &lt;= $endPosition">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>

(Note, by the way, that because this is based on the identity transform, it works even if ExampleDatabase isn't the top-level element.)
You still need to count the DatabaseEntry elements in the source XML and run the transform repeatedly, each time with the $startPosition and $endPosition values for the chunk you want.
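
The counting step and the choice of parameter values can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not part of the tXSLT component: the helper names count_entries and chunk_ranges are made up for the example, and the sketch assumes the DatabaseEntry elements are direct children of the root. The count uses a streaming parse so that the whole file is never held in memory.

```python
import xml.etree.ElementTree as ET


def count_entries(source, tag="DatabaseEntry"):
    """Count <tag> elements with a streaming parse (hypothetical helper)."""
    count = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1
            root.clear()  # drop entries already counted so memory stays flat
    return count


def chunk_ranges(total, chunk_size):
    """Return 1-based, inclusive (startPosition, endPosition) pairs covering 1..total."""
    return [
        (start, min(start + chunk_size - 1, total))
        for start in range(1, total + 1, chunk_size)
    ]


if __name__ == "__main__":
    # "big.xml" and the chunk size are placeholders for this sketch.
    total = count_entries("big.xml")
    for start, end in chunk_ranges(total, 100000):
        print(f"run the transform with startPosition={start} endPosition={end}")
```

Each pair is then passed to one run of the stylesheet. Keep in mind that every run reads the whole source file again, and that an XSLT 1.0 processor typically builds the full input tree in memory, so this approach may hit the same memory limit on very large inputs.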