I am writing a job to extract content from Word documents and .html files and load it into Elasticsearch. I am using tTikaExtractor to extract the contents of the files. I have the following components in my job:
tFileList-->tTikaExtractor-->tRowGenerator-->tExtractXML-->tFileOutputDelimited
The process seems to work up to tRowGenerator. However, tExtractXML is not returning any data. I have the following settings in the tExtractXML component:
loop xpath query = "/html/head/"
Mapping values for title/xpath query are
"title" = "/title"
"body" = "/html/body"
I am also not sure how to extract the creator value from <meta name="dc:creator" content="Tshak"/> in the data.
The following is the output from tRowGenerator:
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2018-04-20T14:18:00Z"/>
<meta name="cp:revision" content="4"/>
<meta name="Total-Time" content="1"/>
<meta name="extended-properties:AppVersion" content="16.0000"/>
<meta name="meta:paragraph-count" content="1"/>
<meta name="meta:word-count" content="11"/>
<meta name="dc:creator" content="Tshak"/>
<meta name="extended-properties:Company" content="Tshak"/>
<meta name="Word-Count" content="11"/>
<meta name="publisher" content="Tshak"/>
<meta name="meta:page-count" content="1"/>
<meta name="dc:publisher" content="Tshak"/>
<title>Test Extraction</title>
</head>
<body><p><b><u>Help Desk</b></u></p>
<p><a name="_GoBack"/>First paragraph content</p>
<p/>
<p><b><u>Helpdesk Portal</b></u></p>
<p>Second paragraph content</p>
<p/>
<p/>
</body></html>
Appreciate your help!
@Tshak, did you check the link below?
https://help.talend.com/reader/ixBASPZJ7IvqUQVupZwWbg/EFuE5Nul595D24TRwbFnbw
Thanks for your response, Manohar. Your suggestion is working! I am able to extract the title and body content from the XML (XHTML).
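For anyone who lands here with the still-open question about the creator value: in XPath terms it is the content attribute of the meta element whose name attribute is "dc:creator", i.e. something like meta[@name='dc:creator']/@content relative to the head element. The sketch below checks that logic outside Talend, in plain Python with the standard library. It is not the tExtractXML configuration itself, and it assumes the real data is well-formed XHTML like the sample above. SAMPLE is a trimmed copy of the posted output, and extract_fields is a hypothetical helper name. Note that the document declares a default namespace (xmlns="http://www.w3.org/1999/xhtml"), so a namespace-aware parser will not match unprefixed paths such as /html/head. Whether that was the cause of the empty output in the job is an assumption on my part.

```python
import xml.etree.ElementTree as ET

# Prefix "x" is an arbitrary local alias for the XHTML default namespace.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Trimmed copy of the output posted above (assumed well-formed).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2018-04-20T14:18:00Z"/>
<meta name="dc:creator" content="Tshak"/>
<title>Test Extraction</title>
</head>
<body><p>First paragraph content</p>
<p>Second paragraph content</p>
</body></html>"""


def extract_fields(xml_bytes):
    """Return title, creator and flattened body text from Tika-style XHTML."""
    root = ET.fromstring(xml_bytes)

    # Paths are relative to the root <html> element and namespace-qualified.
    head = root.find("x:head", XHTML_NS)
    title = head.findtext("x:title", default="", namespaces=XHTML_NS)

    # ElementTree's XPath subset cannot select "/@content" directly,
    # so select the element by its name attribute and read the attribute.
    creator_el = head.find("x:meta[@name='dc:creator']", XHTML_NS)
    creator = creator_el.get("content") if creator_el is not None else None

    # Concatenate all text nodes under <body> and normalise whitespace.
    body = root.find("x:body", XHTML_NS)
    body_text = " ".join("".join(body.itertext()).split())

    return {"title": title, "creator": creator, "body": body_text}


if __name__ == "__main__":
    print(extract_fields(SAMPLE))
    # The unprefixed lookup finds nothing because of the default namespace.
    print(ET.fromstring(SAMPLE).find("head"))
```

If the same idea is carried over to the job, the mapping queries would presumably need to be relative to the loop element (for example "title" rather than "/title"), and the body cannot be reached from a loop on head without stepping back up to the parent. Treat that as a starting point to test, not a confirmed setting.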