Anonymous
Not applicable

Help with tExtractXMLField for XHTML

I am writing a job to extract content from Word documents and .html files and load it into Elasticsearch. I am using tTikaExtractor to extract the contents of the files. I have the following components in my job:

 

tFileList-->tTikaExtractor-->tRowGenerator-->tExtractXML-->tFileOutputDelimited

 

The process seems to work up to tRowGenerator. However, tExtractXML does not return any data. I have the following settings in the tExtractXML component:

loop xpath query =   "/html/head/"

Mapping values for title/xpath query are

"title" = "/title"

"body" = "/html/body" 

I am also not sure how to extract the creator value from <meta name="dc:creator" content="Tshak"/> in the data.

 

The following is the output from tRowGenerator:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2018-04-20T14:18:00Z"/>
<meta name="cp:revision" content="4"/>
<meta name="Total-Time" content="1"/>
<meta name="extended-properties:AppVersion" content="16.0000"/>
<meta name="meta:paragraph-count" content="1"/>
<meta name="meta:word-count" content="11"/>
<meta name="dc:creator" content="Tshak"/>
<meta name="extended-properties:Company" content="Tshak"/>
<meta name="Word-Count" content="11"/>
<meta name="publisher" content="Tshak"/>
<meta name="meta:page-count" content="1"/>
<meta name="dc:publisher" content="Tshak"/>
<title>Test Extraction</title>
</head>
<body><p><b><u>Help Desk</b></u></p>
<p><a name="_GoBack"/>First paragraph content</p>
<p/>
<p><b><u>Helpdesk Portal</b></u></p>
<p>Second paragraph content</p>
<p/>
<p/>
</body></html>

 

Appreciate your help!

2 Replies
manodwhb
Champion II

Anonymous
Not applicable
Author

Thanks for your response, Manohar. Your suggestion is working! I am able to extract the title and body content from the XML (XHTML).
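
For anyone who wants to check the expected values outside Talend, here is a minimal sketch in Python using only the standard library. It is not the suggestion from the reply above (that is not recorded in this thread), and it is not how tExtractXML works internally. It assumes an abbreviated, well-formed copy of the Tika output, and the names SAMPLE, NS and extract_fields are made up for illustration. The sample declares a default XHTML namespace, so in a namespace-aware parser an unprefixed path such as "head" matches nothing; whether that is also why the original job returned no rows is only a guess. In plain XPath 1.0 terms, the creator value corresponds to an attribute selection like meta[@name='dc:creator']/@content relative to the head element.

```python
import xml.etree.ElementTree as ET

# Abbreviated, well-formed copy of the Tika output from the question
# (assumption: inline tags are properly nested in the real data).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="dc:creator" content="Tshak"/>
<meta name="publisher" content="Tshak"/>
<title>Test Extraction</title>
</head>
<body><p><b><u>Help Desk</u></b></p>
<p>First paragraph content</p>
<p/>
<p><b><u>Helpdesk Portal</u></b></p>
<p>Second paragraph content</p>
</body></html>"""

# Hypothetical prefix "x" bound to the XHTML namespace declared on <html>.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_fields(xhtml):
    """Return title, creator and body text from a Tika-style XHTML string."""
    root = ET.fromstring(xhtml)

    # Title text; an empty string if the element is missing.
    title = root.findtext("x:head/x:title", default="", namespaces=NS)

    # Select the <meta> element by its name attribute, then read "content".
    meta = root.find("x:head/x:meta[@name='dc:creator']", NS)
    creator = meta.get("content") if meta is not None else None

    # Flatten each non-empty paragraph, dropping inline markup like <b>/<u>.
    paragraphs = []
    for p in root.iterfind("x:body/x:p", NS):
        text = "".join(p.itertext()).strip()
        if text:
            paragraphs.append(text)

    return {"title": title, "creator": creator, "body": "\n".join(paragraphs)}


if __name__ == "__main__":
    # Without the namespace prefix the lookup finds nothing.
    print(ET.fromstring(SAMPLE).find("head"))
    print(extract_fields(SAMPLE))
```

The body is joined paragraph by paragraph here only to make the result easy to read; a job loading into Elasticsearch might prefer to keep the raw markup or a single flattened string instead.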