veryimportantdude
Contributor III

XML iteration is slow

I have a tFileInputXML component and I map it to another XML.

The file is quite big (26 MB) with more than 8,000 records (products).

If I set the generation mode to memory-consuming (Xerces or Dom4j), it is too slow to be usable, but if I set it to fast (SAX), the XML is not read correctly and the majority of the data is missing.

Any suggestions or ideas?

Thank you

1 Solution

Accepted Solutions
Anonymous
Not applicable

I have written a quick test job to demonstrate what I mean. You can see it here:

[Image: 0695b00000OC5U1AAL.png]

The tFileInputXML_1 component is set to use SAX and is configured as below. It is used solely to split the XML into multiple mini "product" XML documents.

[Image: 0695b00000OC5UpAAL.png]

Notice how I have configured the fields and that the output schema includes a column called "product" of type Document.

The next component is the tExtractXMLField component. This is configured as below:

[Image: 0695b00000OC5VOAA1.png]

It returns all of the data you require (according to your last post), but processes each product section individually. The sections are extracted using SAX and then processed individually in memory by the tExtractXMLField component. This uses Dom4j, but it stores much smaller chunks of data in memory and should therefore be significantly faster.

 

FYI, I took the XML sample you gave me, repeated it three times, and wrapped it in the structure your XPaths seemed to suggest.
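
For readers who want to see the same idea outside Talend, here is a minimal Python sketch of the two stages: stream the file to cut out one "product" at a time, then query each small fragment in memory. It is only an illustration of the pattern, not what the Talend components do internally. The `xmlData` root is assumed from the XPaths in the question, the second product is invented for the example, and the names `iter_products` and `extract_row` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed sample: the first product comes from the thread, the second is
# invented for illustration. The <xmlData> root is an assumption.
SAMPLE = b"""<xmlData>
  <product id="1">
    <idents><ident type="sapid" value="1009913">1009913</ident></idents>
    <stock><quantity uom="ea" status="1" value="1">1 kos</quantity></stock>
  </product>
  <product id="2">
    <idents><ident type="sapid" value="1009914">1009914</ident></idents>
    <stock><quantity uom="ea" status="1" value="5">5 kos</quantity></stock>
  </product>
</xmlData>"""


def iter_products(source):
    """Stage 1: stream the input and yield one complete <product> at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "product":
            yield elem
            # Drop the finished subtree so memory use stays small.
            elem.clear()


def extract_row(product):
    """Stage 2: run the per-record queries against the small fragment."""
    return {
        "id": product.get("id"),
        "sapid": product.findtext("idents/ident[@type='sapid']"),
        "quantity": product.find("stock/quantity").get("value"),
    }


rows = [extract_row(p) for p in iter_products(io.BytesIO(SAMPLE))]
for row in rows:
    print(row)
```

Each query only ever sees one product, so the cost of a lookup no longer grows with the size of the whole file.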

 


11 Replies
Anonymous
Not applicable

SAX is fast because it doesn't store the whole XML in memory. It processes elements as it reads them. As such, it doesn't work particularly well with XPath queries especially if they need to look back. This is probably why you are missing data when using SAX.
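
To make the "processes elements as it reads them" point concrete, here is a small sketch using Python's standard `xml.sax` module. It is not Talend's SAX mode, just the same event-driven style. The handler (a hypothetical `SapidHandler`) only ever sees the current element, so anything it wants to use later has to be remembered by hand. The `xmlData` root is an assumption.

```python
import xml.sax


class SapidHandler(xml.sax.ContentHandler):
    """Collects the sapid of each product using forward-only events."""

    def __init__(self):
        super().__init__()
        self.in_sapid = False
        self.buffer = []
        self.sapids = []

    def startElement(self, name, attrs):
        # Only the current element and its attributes are visible here;
        # earlier elements are gone unless we stored them ourselves.
        if name == "ident" and attrs.get("type") == "sapid":
            self.in_sapid = True
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_sapid:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "ident" and self.in_sapid:
            self.sapids.append("".join(self.buffer))
            self.in_sapid = False


SAMPLE = b"""<xmlData><product id="1"><idents>
<ident type="sapid" value="1009913">1009913</ident>
<ident type="ean" value="731304003281">731304003281</ident>
</idents></product></xmlData>"""

handler = SapidHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.sapids)
```

A query that needs to go back to an element that has already been read has nothing to go back to, which is consistent with data going missing in SAX mode.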

 

You have to keep in mind that XML is essentially just text. The Complete Works of Shakespeare is only something like 5 MB of text, so you are processing A LOT of data. How long does it take when using Xerces or Dom4j?

veryimportantdude
Contributor III
Author

It is processing about 3 rows per second, and there are a lot of records (something like 8k). I did not even let the whole job go through, but let's just say too long 🙂 Is there a workaround?

veryimportantdude
Contributor III
Author

[Image: 0695b00000OC3hhAAD.png]

Anonymous
Not applicable

There are ways in which you can possibly speed this up, but it's very difficult to assess without seeing the data and without knowing what you need to get from it. As an example, if your file is just a list of looped records (and you only need to extract data from those records without cross-referencing) you could try to use SAX to split the "loops" into individual XML segments (so a loop per row) and then use a tExtractXMLField component to extract the data from each segment.

veryimportantdude
Contributor III
Author

Here is an example of the XML, and I added some XPaths at the end.

I am extracting things like sapid, ean, picture URL, tax, etc.

 

<product id="1">

  <idents>

   <ident type="sapid" value="1009913">1009913</ident>

   <ident type="partnumber" value="RBC6">RBC6</ident>

   <ident type="ean" value="731304003281">731304003281</ident>

  </idents>

  <base>

   <name>APC BatteryKit 1000I 1000INET BP SU SUA</name>

   <longname>APC BatteryKit 1000I 1000INET BP SU SUA</longname>

   <vendor value="80000572">APC</vendor>

   <categoryName value="I10002003">Diskovni sistemi in knjižnice / UPS / Baterije</categoryName>

   <category sequence="1">Diskovni sistemi in knjižnice</category>

   <category sequence="2">UPS</category>

   <category sequence="3">Baterije</category>

   <dimensions unit="MM" length="327" width="280" height="201">327 x 280 x 201 (MM)</dimensions>

   <weight unit="KG" value="8.208">8.208 KG</weight>

   <warranty unit="month" value="0">Brez garancije</warranty>

   <countryoforigin value="CN">China</countryoforigin>

   <commodity value="85078000">85078000</commodity>

   <url link="https://also.com/ec/cms5/5820/ProductDetailData.do?prodId=1009913" />

   <minorderqty uom="ea" value="1">1 kos</minorderqty>

  </base>

  <stock>

   <quantity uom="ea" status="1" value="1">1 kos</quantity>

   <backlogs>

    <backlog uom="ea" value="0" isotimestamp="2022-02-14T10:02:40.004+00:00" unixtimestamp="1644832960" timezone="Europe/Ljubljana">20220225: Ni naročeno</backlog>

   </backlogs>

  </stock>

  <prices>

   <price type="purchase" currency="EUR" value="191.09">191.09</price>

   <price type="recommended" currency="EUR" value="0" />

   <price type="recommendedwvat" currency="EUR" value="0" />

   <tax type="vat" rate="22" value="42.04" currency="EUR">42.04</tax>

  </prices>

  <pictures>

   <picture sequence="0" view="Glavna slika" width="200" height="150" link="https://actebis-images.com/productimages/530fc204-60f3-41a8-9951-ba3f14d177d2.jpg">Glavna slika</picture>

   <picture sequence="1" view="Product shot Right-angle" width="400" height="300" link="http://cdn.cnetcontent.com/10/37/1037ab71-d709-4e3b-9b46-2d66eda464ab.jpg">Product shot Right-angle</picture>

   <picture sequence="2" view="Product shot Right-angle" width="200" height="150" link="http://cdn.cnetcontent.com/33/e9/33e9534a-7436-48af-be58-443cfe124f5f.jpg">Product shot Right-angle</picture>

   <picture sequence="3" view="Product shot Right-angle" width="640" height="480" link="http://cdn.cnetcontent.com/9c/e4/9ce4f526-8748-4a94-a959-857f8659405a.jpg">Product shot Right-angle</picture>

   <picture sequence="4" view="Product shot Right-angle" width="75" height="75" link="http://cdn.cnetcontent.com/f7/ab/f7ab1f4f-7b14-457d-a959-ead6c112b7c4.jpg">Product shot Right-angle</picture>

  </pictures>

  <attributes>

   <marketingtext lang="SL"></marketingtext>

   <specification lang="SL" name="Število baterij" value="1">1</specification>

   <specification lang="SL" name="Baterija" value="7">Lead-acid</specification>

   <specification lang="SL" name="Oblika baterije" value="9">Plug-in module</specification>

   <specification lang="SL" name="Barva" value="6">Črn</specification>

   <specification lang="SL" name="Zasnovano za" value="8">P/N: DLA1500J, SC1000, SMC1500, SMC15000I, SMC1500C, SMC1500I, SMC15000IC, SMC1500TW, SMT10000C, SMT1000I, SMT1000I-6W, SMT1000I-AR, SMT10000IC, SMT1000TW, SMT1000US, SU1000RM, SU1000RMI, SUA1000-BR, SUA1000ICH, SUA1000ICH-45, SUA1000I-IN, SUA1000J, SUA1000J3W, SUA1000-TW, SUA1500J3W, SUVS1000</specification>

   <specification lang="SL" name="Vrsta naprave" value="10">UPS baterija</specification>

   <specification lang="SL" name="Dimenzije (WxDxH)" value="2">19,6 cm x 15,2 cm x 9,4 cm</specification>

   <specification lang="SL" name="Garancija proizvajalca" value="3">2-letna garancija</specification>

   <specification lang="SL" name="Opis izdelka" value="5">APC Nadomestna baterijska kartuša #6 - UPS baterija - Lead-acid</specification>

   <specification lang="SL" name="Teža" value="4">7,68 kg</specification>

   <document type="User Manual" source="cnet" sequence="1" link="http://cdn.cnetcontent.com/76/0b/760b0306-02e2-4bed-824d-b980dc14b62c.pdf" />

  </attributes>

 </product>

 

And some of the XPaths:

 

base/categoryName

/xmlData/product/idents/ident[@type='sapid']

/xmlData/product/base/category[@sequence='1']

/xmlData/product/base/category[@sequence='3']

/xmlData/product/stock/quantity/@value

/xmlData/product/pictures/picture[@sequence='6']/@link
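
A quick way to check paths like these outside Talend is Python's `xml.etree.ElementTree`; the sketch below runs them against a trimmed copy of the sample. Two assumptions to note: the `xmlData` root is inferred from the XPaths, and ElementTree only supports a subset of XPath, so the paths are written relative to the root and a trailing `/@value` or `/@link` is replaced by reading the attribute with `.get()`. The helper `attr` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample product, wrapped in an assumed <xmlData> root.
SAMPLE = """<xmlData>
  <product id="1">
    <idents>
      <ident type="sapid" value="1009913">1009913</ident>
      <ident type="ean" value="731304003281">731304003281</ident>
    </idents>
    <base>
      <category sequence="1">Diskovni sistemi in knjižnice</category>
      <category sequence="2">UPS</category>
      <category sequence="3">Baterije</category>
    </base>
    <stock><quantity uom="ea" status="1" value="1">1 kos</quantity></stock>
    <pictures>
      <picture sequence="0" link="https://actebis-images.com/productimages/530fc204-60f3-41a8-9951-ba3f14d177d2.jpg">Glavna slika</picture>
    </pictures>
  </product>
</xmlData>"""

root = ET.fromstring(SAMPLE)  # root is the <xmlData> element


def attr(element, name):
    """Return an attribute value, or None when the element was not found."""
    return element.get(name) if element is not None else None


sapid = root.findtext("product/idents/ident[@type='sapid']")
first_category = root.findtext("product/base/category[@sequence='1']")
third_category = root.findtext("product/base/category[@sequence='3']")
quantity = attr(root.find("product/stock/quantity"), "value")
# The sample has no picture with sequence 6, so this lookup finds nothing.
picture6 = attr(root.find("product/pictures/picture[@sequence='6']"), "link")

print(sapid, first_category, third_category, quantity, picture6)
```

Every one of these queries stays inside a single product, which is the case where splitting the file per product and querying each piece separately is possible.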

 

veryimportantdude
Contributor III
Author

SAX is not working, since it does not read all the data in my XML. As you mentioned, SAX only extracts data as it reads it, and I have lots of cross-references.

veryimportantdude
Contributor III
Author

When I do as in your example, I get 0 rows processed after tExtractXMLField. I had a similar problem yesterday when I was exploring some options. Any ideas what I am doing wrong? I set all the settings as you did, I think 🙂

veryimportantdude
Contributor III
Author

Found the error. It was the Loop XPath query: I was one level too high.

Hope it works now.