Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi,
I am pulling data from an API that returns data in an XML format. I am now trying to put that data in to a CSV.
The XML data is not in a standard format where the filed names are either the xml nodes or an attribute in the node. The way the XML is structured is that the filed names are all listed first under <columnNames> and then each value is under <rows>. The standard tXMLMap and tExtractXMLField do not meet the needs for me. Wondering if anyone else has come across a similar XML and has found a way to make it work.
Appreciate any help on this.
Here's a sample of what the XML looks like.
<?xml version="1.0" encoding="UTF-8"?><root><count>3312</count><name>Service Request Report</name><columnNames>Incident ID</columnNames><columnNames>Reference #</columnNames><columnNames>Subject</columnNames><columnNames>Status</columnNames><columnNames>Source Hierarchy</columnNames><columnNames>Category ID</columnNames><columnNames>Service</columnNames><columnNames>Severity</columnNames><columnNames>District</columnNames><columnNames>Lattitude</columnNames><columnNames>Longitude</columnNames><columnNames>Date Created</columnNames><columnNames>Date Last Updated</columnNames><columnNames>Department</columnNames><columnNames>Escalated</columnNames><columnNames>Year</columnNames><columnNames>Quarter</columnNames><columnNames>Week</columnNames><columnNames>Total Incidents</columnNames><columnNames>No Of Days </columnNames><rows>41191</rows><rows>170824-000522</rows><rows>My San Jose -General Request-City</rows><rows>Closed</rows><rows>CX Console</rows><rows>City</rows><rows>General Request</rows><rows/><rows/><rows/><rows/><rows>'2017-08-24 15:45:30'</rows><rows>'2017-08-26 23:01:36'</rows><rows>Police</rows><rows/><rows>2017</rows><rows>3</rows><rows>34</rows><rows>1</rows><rows>6</rows><rows>41194</rows><rows>170824-000523</rows><rows>My San Jose -General Request-City</rows><rows>Closed</rows><rows>CX Console</rows><rows>City</rows><rows>General Request</rows><rows/><rows/><rows/><rows/><rows>'2017-08-24 15:53:24'</rows><rows>'2017-08-24 15:53:27'</rows><rows>Information Technology</rows><rows/><rows>2017</rows><rows>3</rows><rows>34</rows><rows>1</rows><rows>6</rows><rows>41196</rows><rows>170824-000529</rows><rows>My San Jose -General Request-City</rows><rows>Closed</rows><rows>CX Console</rows><rows>City</rows><rows>General Request</rows><rows/><rows/><rows/><rows/><rows>'2017-08-24 15:54:15'</rows><rows>'2017-08-26 23:01:35'</rows><rows>Police</rows><rows/><rows>2017</rows><rows>3</rows><rows>34</rows><rows>1</rows><rows>6</rows><rows>41198</rows><rows>170824-000531</rows><rows>My San Jose -General Request-Other</rows><rows>Closed</rows><rows>CX Console</rows><rows>Other</rows><rows>General Request</rows><rows/><rows/><rows/><rows/><rows>'2017-08-24 15:59:06'</rows><rows>'2017-08-26 23:01:34'</rows><rows>External</rows><rows/><rows>2017</rows><rows>3</rows><rows>34</rows><rows>1</rows><rows>6</rows></root>
That is not nice XML to work with. I would suggest trying to use the tExtractXMLField component (https://help.talend.com/reader/jomWd_GKqAmTZviwG_oxHQ/Pc9RZi6IiWw1Kb~Jf0bE0Q) with some XPath queries.
Having looked at the XML you gave as an example, it looks like the "row" elements holding the values (these should probably be "columns" within "row" elements) are just repeated for several actual rows. I suspect that if you have 10 columns identified above, every 10 "row" elements will likely represent a real row, with the values in the same order as the "column" data above. So if you have 30 "row" elements, that represents 3 actual rows. Not nice to work with, but you can certainly achieve your goal using a tExtractXMLField and XPaths
You got it right. The column names are listed in the beginning in columnNames elements and the data or the values are under rows elements. And I agree this is not a nice XML but this is what I got .
I used the tExtractXMLField and it gives me all the rows and the column names. I am not sure how to create the mapping between the column names and the rows. Do you have any example of how this can be done?
I am guessing this needs working with the code instead of the designer view. I am fairly new to Talend so would need some hand holding here.
Thanks.
Are you writing your job specifically for this XML (ie with the columns you have shown) or for this type of XML layout with varying columns? I suspect it is the second. If it is the second way, there are a couple of things to consider. First is, how have you managed to get your column and row data using the tExtractXMLField component? Did you simply loop on "column" and then loop on "row" (returning a row for every column and every row)? That is probably the approach that I would use for an XML file which isn't fixed (ie the column numbers can change). While looping through the columns I would keep a count of how many rows there are and concatenate each value with the next (with a comma between them). That would sort the CSV header row (if you need on). To do this you can simply use a tMap and a tMap variable where you append the column values to it. These do not drop values between rows, so by the end of the columns, your tMap variable will be the complete CSV header row. The next component would be a tAggregateRow. You would simply group by a hard coded 1 (set up in the tMap) and then use the "LAST" function to return only the last row (the completed Header row).
Once that is done, you can use the same sort of logic with the row data BUT you will need to make sure you create a new row for every n "row" values concatenated. You can do this in a similar way, but your "group key" (used for the tAggregateRow) will need to increment by 1 for every complete row.
I've not necessarily given you everything you will need for this, but I think I have given you enough to work with. The best way to learn Talend is to try things out. See if you can extrapolate from what I have said here and see if you can get closer to your required solution. If you have any problems after trying something, give us a shout
Hi,
Here is a suggestion:
1-Transform your file with a tReplace components to get "proper" records. The expected output is:
?xml version="1.0" encoding="UTF-8"?>
<root>
<count>3312</count>
<name>Service Request Report</name>
<columnNames>Incident ID</columnNames>
<columnNames>Reference #</columnNames>
<columnNames>Subject</columnNames>
<columnNames>Status</columnNames>
<columnNames>Source Hierarchy</columnNames>
<columnNames>Category ID</columnNames>
<columnNames>Service</columnNames>
<columnNames>Severity</columnNames>
<columnNames>District</columnNames>
<columnNames>Lattitude</columnNames>
<columnNames>Longitude</columnNames>
<columnNames>Date Created</columnNames>
<columnNames>Date Last Updated</columnNames>
<columnNames>Department</columnNames>
<columnNames>Escalated</columnNames>
<columnNames>Year</columnNames>
<columnNames>Quarter</columnNames>
<columnNames>Week</columnNames>
<columnNames>Total Incidents</columnNames>
<columnNames>No Of Days </columnNames>
<record>
<rows>41191</rows>
<rows>170824-000522</rows>
<rows>My San Jose -General Request-City</rows>
<rows>Closed</rows>
<rows>CX Console</rows>
<rows>City</rows>
<rows>General Request</rows>
<rows/>
<rows/>
<rows/>
<rows/>
<rows>'2017-08-24 15:45:30'</rows>
<rows>'2017-08-26 23:01:36'</rows>
<rows>Police</rows>
<rows/>
<rows>2017</rows>
<rows>3</rows>
<rows>34</rows>
<rows>1</rows>
<rows>6</rows>
</record>
<record>
<rows>41194</rows>
...
With such a tReplace, it should work:
2-Then with a tXmlMap, you can change the name of your elements.
Here is an example. My input file is:
<Root>
<Record>
<Value>A</Value>
<Value>B</Value>
</Record>
<Record>
<Value>C</Value>
<Value>C</Value>
</Record>
</Root>
I want to get:
<?xml version="1.0" encoding="ISO-8859-15"?>
<Root>
<Record>
<Value1>A</Value1>
<Value2>A</Value2>
</Record>
<Record>
<Value1>C</Value1>
<Value2>C</Value2>
</Record></Root>
Here is my tXmlMap:
Eric
Thanks eric44.
I am unable to get tReplace to work. Here is the error message I get when executing.
Thanks for the suggestion. I am trying eric44 solutions. If that doesn't work I will try your recommendation.
Hi,
What is your input type format for your tReplace ? Document ? If yes, you have to convert it to string with a tConverttype component before your tReplace.
Eric
Hi,
I have a similar issue where I make a SOAP request with tSOAP component but the returned document is not a properly formatted XML document but has the data as comma separated values and I don't know how to extract each field and row. I've tried a tXMPMap and tExtractXMLField but no success.
The fields re First Name, Last Name, Student ID, Email, Request #, Approved, Semester, Description, College and a sample of the returned document is below:
<SOAP-ENV:Envelope SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xmlnsOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="rollins-accommodate.symplicity.com/ws/report_api.php" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlnsOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/">
<SOAP-ENV:Body>
<ns1:getReportDataResponse>
<report_data xsi:type="xsd:string"><![CDATA["","","","","","Approved","","",""
"","","","","","Approved","","",""
"","","","","","Approved","Fall 2013","201309",""
"","","","","","Approved","","",""
"","","","","","Approved","Summer 2014","201406",""
"","","","","","Approved","","",""
"","","","","","Approved","","",""
"","","","","","Approved","Fall 2013","201309",""
"","","","","","Approved","Fall 2015","201509",""
"Kristel","Symplicity","00000","kristel@symplicity.edu","A00157-2019","Approved","Fall 2019","Fall 2019",""
"Sonia","General","01058104","sgeneral@school.edu","A00010-2015","Approved","","",""
"Jane","Doe","R01192767","jdoe@school.edu","A00127-2018","Approved","Fall 2018","Fall 2018","CLA"]]></report_data>
</ns1:getReportDataResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
Hi @Dobby, there are two steps to this process. First you need to extract the String from the XML element called "report_data". Once you have done that, there are two ways you can approach this. Either dump it out to a file using the tFileOutputRaw component, then read it in again using the tFileInputDelimited. This is the easy way, but may not be the most elegant way.
Alternatively you can extract the "report_data" and connect your component to a tFlowToIterate component. This will store the value in the globalMap. Then connect (via an iterate link) to a tJavaFlex. Create the output schema of the tJavaFlex and then work out the new line/carriage return character(s) used. Retrieve the globalMap value (the "report-data") in the Start Code of the tJavaFlex and using the newline character, add the lines to an ArrayList. Then iterate over the ArrayList (opening the iterator loop...for or while) in the Start Code section. In the Main Code section, split the String up by comma and output each of the values to the correct column you defined in your tJavaFlex schema. The in the End Code section, simply close the loop opened in the Start Code section.
Without Java experience, I would opt for the first method.