Anonymous
Not applicable

Help with tExtractXMLField for XHTML

I am writing a job to extract content from Word documents and .html files and load it into Elasticsearch. I am using tTikaExtractor to extract the contents of the files. I have the following components in my job:

 

tFileList-->tTikaExtractor-->tRowGenerator-->tExtractXMLField-->tFileOutputDelimited

 

The process seems to work up to tRowGenerator. However, tExtractXMLField is not returning any data. I have the following settings in the tExtractXMLField component:

loop xpath query =   "/html/head/"

The column / XPath query mappings are:

"title" = "/title"

"body" = "/html/body" 

I am also not sure how to extract the creator value from <meta name="dc:creator" content="Tshak"/> in the data.

 

The following is the output coming out of tRowGenerator:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2018-04-20T14:18:00Z"/>
<meta name="cp:revision" content="4"/>
<meta name="Total-Time" content="1"/>
<meta name="extended-properties:AppVersion" content="16.0000"/>
<meta name="meta:paragraph-count" content="1"/>
<meta name="meta:word-count" content="11"/>
<meta name="dc:creator" content="Tshak"/>
<meta name="extended-properties:Company" content="Tshak"/>
<meta name="Word-Count" content="11"/>
<meta name="publisher" content="Tshak"/>
<meta name="meta:page-count" content="1"/>
<meta name="dc:publisher" content="Tshak"/>
<title>Test Extraction</title>
</head>
<body><p><b><u>Help Desk</b></u></p>
<p><a name="_GoBack"/>First paragraph content</p>
<p/>
<p><b><u>Helpdesk Portal</b></u></p>
<p>Second paragraph content</p>
<p/>
<p/>
</body></html>

 

Appreciate your help!

2 Replies
manodwhb
Champion II

Anonymous
Not applicable
Author

Thanks for your response, Manohar. Your suggestion is working! I am able to extract the title and body content from the XML (XHTML).
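
For anyone wanting to check XPath expressions against this kind of Tika output outside of Talend, here is a minimal sketch in plain Java using the standard javax.xml.xpath API. It is not the Talend component configuration, only a way to test the expressions. Assumptions: the class name XhtmlXPathSketch, its evaluate helper, and the constants are hypothetical; SAMPLE is a trimmed, well-formed excerpt of the output posted above; and the default XHTML namespace is sidestepped with local-name() rather than a namespace binding. Note also that the loop query in the question, "/html/head/", ends in a trailing slash, which is not valid XPath.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
public class XhtmlXPathSketch {

    // Trimmed, well-formed excerpt of the tRowGenerator output from the question.
    static final String SAMPLE =
          "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
        + "<head>"
        + "<meta name=\"dc:creator\" content=\"Tshak\"/>"
        + "<title>Test Extraction</title>"
        + "</head>"
        + "<body><p>First paragraph content</p></body>"
        + "</html>";

    // local-name() matches elements regardless of the default XHTML namespace.
    static final String TITLE_XPATH =
        "/*[local-name()='html']/*[local-name()='head']/*[local-name()='title']";

    // Select the content attribute of the meta element whose name is dc:creator.
    static final String CREATOR_XPATH =
        "/*[local-name()='html']/*[local-name()='head']"
        + "/*[local-name()='meta'][@name='dc:creator']/@content";

    // Parse the XML string and return the string value of the XPath result.
    static String evaluate(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(evaluate(SAMPLE, TITLE_XPATH));   // prints "Test Extraction"
        System.out.println(evaluate(SAMPLE, CREATOR_XPATH)); // prints "Tshak"
    }
}
```

The attribute predicate [@name='dc:creator'] treats dc:creator as a plain string value, so no namespace prefix needs to be declared for it.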