Help with tExtractXMLField for XHTML

Anonymous · ‎2018-04-23

I am writing a job to extract content from Word documents and .html files and load it into Elasticsearch. I am using tTikaExtractor to extract the contents of the files. I have the following components in my job.

tFileList-->tTikaExtractor-->tRowGenerator-->tExtractXML-->tFileOutputDelimited

The process seems to work up to tRowGenerator. However, tExtractXML is not returning any data. I have the following settings in the tExtractXML component:

loop xpath query = "/html/head/"

The column / XPath query mappings are:

"title" = "/title"

"body" = "/html/body"

I am not sure how to extract the creator value from <meta name="dc:creator" content="Tshak"/> in the data.

The following is the output from tRowGenerator:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2018-04-20T14:18:00Z"/>
<meta name="cp:revision" content="4"/>
<meta name="Total-Time" content="1"/>
<meta name="extended-properties:AppVersion" content="16.0000"/>
<meta name="meta:paragraph-count" content="1"/>
<meta name="meta:word-count" content="11"/>
<meta name="dc:creator" content="Tshak"/>
<meta name="extended-properties:Company" content="Tshak"/>
<meta name="Word-Count" content="11"/>
<meta name="publisher" content="Tshak"/>
<meta name="meta:page-count" content="1"/>
<meta name="dc:publisher" content="Tshak"/>
<title>Test Extraction</title>
</head>
<body>Help Desk
<a name="_GoBack"/>First paragraph content

Helpdesk Portal
Second paragraph content


</body></html>

Appreciate your help!

manodwhb · ‎2018-04-24

@Tshak, did you check the link below?

https://help.talend.com/reader/ixBASPZJ7IvqUQVupZwWbg/EFuE5Nul595D24TRwbFnbw

Anonymous · ‎2018-04-25

Thanks for your response, Manohar. Your suggestion is working! I am able to extract the title and body content from the XML (XHTML).
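The thread does not show the final component settings or how the dc:creator value was read, so here is only a rough sketch outside Talend, using plain Java and the JDK's javax.xml.xpath. One likely reason the original queries returned nothing is that the document declares a default namespace (xmlns="http://www.w3.org/1999/xhtml"), so an unprefixed path such as /html/head matches no element in a namespace-aware XPath 1.0 evaluation. The sketch sidesteps that with local-name(). The class name XhtmlFieldSketch and the helper extract are hypothetical names chosen for illustration, and the sample string is a shortened copy of the data posted above.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XhtmlFieldSketch {

    // Path to <head> that ignores the XHTML default namespace.
    private static final String HEAD = "/*[local-name()='html']/*[local-name()='head']";

    // Hypothetical helper (not a Talend API): reads title, creator and body
    // from a Tika-style XHTML string.
    static Map<String, String> extract(String xhtml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // parse with the default namespace intact
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xp = XPathFactory.newInstance().newXPath();
        Map<String, String> out = new LinkedHashMap<>();
        out.put("title", xp.evaluate(HEAD + "/*[local-name()='title']", doc));
        // The creator lives in an attribute, selected by the meta name.
        out.put("creator", xp.evaluate(
                HEAD + "/*[local-name()='meta'][@name='dc:creator']/@content", doc));
        // normalize-space collapses the line breaks inside <body>.
        out.put("body", xp.evaluate(
                "normalize-space(/*[local-name()='html']/*[local-name()='body'])", doc));
        return out;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
                + "<meta name=\"dc:creator\" content=\"Tshak\"/>"
                + "<title>Test Extraction</title></head>"
                + "<body>Help Desk\n<a name=\"_GoBack\"/>First paragraph content\n</body></html>";
        extract(sample).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The same idea should carry over to any XPath 1.0 engine: either make the query namespace-agnostic as above, or have the tool ignore or bind the XHTML namespace before using paths like /html/head/title.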
