tldusr
Contributor

XML special chars not converted

Hi all,

I have an Excel file like this

0695b00000L2lSFAAZ.jpg

and a Talend Job (TOS 8) that reads the file and produces an output file (XML)

0695b00000L2lSPAAZ.jpg

but the special chars double quote (") and single quote (') are not converted:

0695b00000L2lSjAAJ.jpg

Does anyone know how to fix this?

Thanks!

5 Replies
Anonymous
Not applicable

In the scenario you are demonstrating there, there is no need to convert the " or ' chars. That is perfectly acceptable XML. However, if you want to convert all Strings regardless of the necessity for it, you can use some code like this.....

 

System.out.println(TalendString.replaceSpecialCharForXML("A test String with ' and \", > and <"));

 

Dump the above in a tJava component to test it. The "System.out.println" just prints out to the output window. The important bit is the "TalendString.replaceSpecialCharForXML("A test String with ' and \", > and <")" section. This method will replace any special chars in your Strings.

tldusr
Contributor
Author

Thanks for the reply.

I am using software that requires " and ' to be converted (they are special chars in an XML file).

I knew about the solution you are suggesting, but it does not work well for my purpose because, as I expected, the & is converted twice.
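The double conversion can be reproduced outside Talend. The following is a minimal sketch, assuming the JDK's built-in DOM and Transformer classes rather than whatever tAdvancedOutputXML uses internally. The class `DoubleEscapeDemo` and the helper `escapeForXml` are hypothetical stand-ins for a manual pre-escaping step; they are not Talend's implementation of `replaceSpecialCharForXML`.

```java
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DoubleEscapeDemo {

    // Hypothetical manual escaper (an assumption, not Talend's code).
    // The ampersand must be replaced first, otherwise the other entities get mangled.
    static String escapeForXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Puts the given text into a <value> element and serializes it with the JDK Transformer.
    static String serializeAsElementText(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element value = doc.createElement("value");
            value.setTextContent(text);
            doc.appendChild(value);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Could not serialize the demo document", e);
        }
    }

    public static void main(String[] args) {
        String raw = "Tom & \"Jerry\"";
        // Escaped once, by the serializer only
        System.out.println(serializeAsElementText(raw));
        // Escaped by hand first, then again by the serializer
        System.out.println(serializeAsElementText(escapeForXml(raw)));
    }
}
```

In the second call the serializer sees the & of each hand-written entity as ordinary text and escapes it again, so &quot; ends up as &amp;quot; in the output. That is the "converted twice" effect described above.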

 

0695b00000LvXXuAAN.jpg 

Anyway, IMHO, this is a bug in the tAdvancedOutputXML component.

 

Thanks again, but the problem still remains.

Anonymous
Not applicable

I see your problem here, so I have spent some time looking into this to see how I could help. Unfortunately I don't think you will like what I have found. This is not a Talend issue, I'm afraid. It comes from Java and the XML specification. Single quotes and double quotes are perfectly acceptable in XML element values, so they are not automatically converted. Talend is not making this decision itself; the conversion is handled by the Java libraries being used, and the behaviour appears to be pretty consistent.

 

To prove this, I have built a quick demo that you can try out on your machine. It is simply a job with a tJava and a routine I have hacked together quickly. The tJava code is below.....

 

String text = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><note><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>";

System.out.println(text);

System.out.println(routines.XMLUtils.updateXML(text));

 

Essentially what I am doing here is creating a simple XML String as "text". I am then printing it to the Sys.out. Then I am calling the routine I will share next to edit the XML.

 

The routine.....

 

package routines;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class XMLUtils {

    // Parses the XML String, overwrites every text node, and serializes the result back to a String.
    public static String updateXML(String xml) {
        try {
            DocumentBuilder docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();

            // Encode as UTF-8 so the bytes match the encoding declared in the XML header
            Document document = docBuilder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            visit(document, 0);

            StringWriter writer = new StringWriter();
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.transform(new DOMSource(document), new StreamResult(writer));
            return writer.toString();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            // Fail with a clear error instead of a NullPointerException on a writer that was never created
            throw new IllegalStateException("Could not parse or serialize the XML", e);
        }
    }

    // Recursively prints each node (debug output) and overwrites the value of every text node.
    public static void visit(Node node, int level) {
        System.out.println("Name:" + node.getNodeName());
        System.out.println("Value:" + node.getNodeValue());

        NodeList list = node.getChildNodes();
        System.out.println("Number:" + list.getLength());
        for (int i = 0; i < list.getLength(); i++) {
            Node childNode = list.item(i);

            visit(childNode, level + 1);

            if (childNode.getNodeType() == Node.TEXT_NODE) {
                // Set the value in place: removing and re-appending nodes would reorder the live NodeList while it is being iterated
                childNode.setNodeValue("<>' \""); //<-- I'm changing all text values to be <>'" here
            }
        }
    }

}

 

I started building this trying to provide a fix for you, but when I saw it working, I realised that the problem is in the Java XML libraries. You can just copy and paste the routine above into your Studio. Notice the section that says....

 

"//<-- I'm changing all text values to be <>'" here"

 

....this is where I was previously taking the original element value and converting it. This time I am simply setting every element to the same thing. I am not manually converting the <, >, ', or " here. I am just adding that string to each element.

 

When you run this job you will see the original XML printed out like this......

 

<?xml version="1.0" encoding="UTF-8"?><note><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>

 

Then you will see some debugging output, which you can ignore. But when you get to the end you will see this.....

 

<?xml version="1.0" encoding="UTF-8" standalone="no"?><note><to>&lt;&gt;' "</to><from>&lt;&gt;' "</from><heading>&lt;&gt;' "</heading><body>&lt;&gt;' "</body></note>

 

Notice that only the < and > are translated and the ' and " remain as they were originally. This shows that Java does not expect those values to cause problems.
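Attribute values are a separate case, because a double quote inside a double-quoted attribute would end the value early. The sketch below is an assumption-laden illustration using only the JDK's built-in DOM and Transformer classes (not Talend components); the class name `QuoteSerializationDemo` and the element and attribute names are made up for the demo. It puts the same kinds of characters into an attribute and into element text so the two outputs can be compared.

```java
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

class QuoteSerializationDemo {

    // Builds <note title="...">...</note> from raw (unescaped) values and serializes it.
    static String serialize(String attributeValue, String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element note = doc.createElement("note");
            note.setAttribute("title", attributeValue);
            note.setTextContent(text);
            doc.appendChild(note);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Could not serialize the demo document", e);
        }
    }

    public static void main(String[] args) {
        // Same quote characters in both places; compare how each is written out
        System.out.println(serialize("say \"hi\"", "it's \"ok\""));
    }
}
```

With the default JDK serializer, the double quotes in the attribute should come out as &quot; while the quotes in the element text stay as they are. Other XML libraries may make different choices, so treat this as a check to run on your own setup, not a guarantee.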

 

Now, I understand that the product you are working with can't work with this. Given what you have seen here, the likely conclusion is that the issue lies with that product. However, we can still potentially mitigate this.....but it won't be easy.

 

My suggestion is to build your XML using a tXMLMap and then convert it to a String. Once in String format, you can use String manipulation to find double and single quotes that need altering. Alter those values (be careful not to alter quotes in the XML header and in attributes, etc), then write the converted String to a tFileOutputRaw. This will result in a file that your other application will be able to read.

 

I know this is a pain, but it is possible to do this. You may need to use regular expressions to make this as safe as possible.
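The String manipulation step could look something like the sketch below. It is only a starting point under stated assumptions: the class `QuoteEscaper` and method `escapeQuotesInText` are hypothetical names, and the regular expression only touches runs of characters that sit directly between a > and the next <, so the XML header and attribute values are left alone. It does not understand CDATA sections, comments, or processing instructions, so check it against your real files before relying on it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteEscaper {

    // Text that directly follows a '>' and runs up to the next '<' (element content only)
    private static final Pattern TEXT_BETWEEN_TAGS = Pattern.compile("(?<=>)[^<>]+(?=<)");

    // Replaces " and ' with &quot; and &apos; in element text, leaving markup untouched.
    static String escapeQuotesInText(String xml) {
        Matcher matcher = TEXT_BETWEEN_TAGS.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String escaped = matcher.group()
                    .replace("\"", "&quot;")
                    .replace("'", "&apos;");
            // quoteReplacement stops '$' and '\' in the text from being read as group references
            matcher.appendReplacement(out, Matcher.quoteReplacement(escaped));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<note who=\"O'Neil\"><body>Don't say \"hi\" &amp; bye</body></note>";
        System.out.println(escapeQuotesInText(xml));
    }
}
```

Because this runs on the already serialized String and only replaces the two quote characters, entities such as &amp; that the serializer has already written are not escaped a second time. In a job, the method could be called from a tJavaRow or a routine before the String is written out with tFileOutputRaw.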

 

tldusr
Contributor
Author

Hi rhall, maybe you are right: the conversion is correct. A double quote is converted when it is in an attribute and a single quote is not, but the external tool accepts this XML as valid input. Sorry if you wasted your time 😞

 

0695b00000LvpumAAB.jpg

Anonymous
Not applicable

Not a problem at all. I learnt something from looking into this, so my time was not wasted at all 🙂