I've got a webquery that generates an XML document in the browser. I'm using this as a web file data source in QlikView, and it works as expected, pulling in the XML schema and data. However, one of the fields is full of HTML markup, and I'm not sure the best way to get it all out. Since the XML is generated dynamically on an internet site, it never hits the server file system and so I can't do anything on that end. Also, the HTML is pretty extensive, with lots of things like:
<TD STYLE="BORDER-BOTTOM: black 0.5pt solid; BORDER-LEFT: black 0.5pt solid; BACKGROUND-COLOR: white; WIDTH: 208pt; HEIGHT: 12.75pt;">
So it's not as simple as a few replace statements to strip <p> and </p>.
Yeap, it is not that simple.
There also a function subfield(FIELD,'>',2)
Then apply subfield(subfield(FIELD,'>',2),'<',1)
With this last thing, you have something.
For this case, the result would be So it's not as simple as a few replace statements to strip
Somehow you're gonna have to do many tries and maybe several steps in your load.
You could start by spliting your code in lines...
On way could be to pull out all texts between < > into a mapping table and then use mapsubstring to remove all occurances..
I could work out an example if you would upload a doc.
Based on Gysbert's example, I did it with pure script code using a Mapping Table and MapSubstring() to show the power of QlikView script language and to omit the Macro which should speedup the load:
LOAD RecNo() as RecNo, Field
FROM comm71215.xls (biff, embedded labels, table is [Sheet1$]);
MAPPING LOAD DISTINCT '<' & TextBetween('<' & SubField(Field, '<'),'<','>') & '>' as HtmlTag, '' as Substitute
JOIN(Data) LOAD RecNo, MapSubstring('HtmlTag_Map', Field) as Field_Cleansed
Thanks Ralf! works great. I wrote it using an inline - in case anyone is interested:
LOAD RecNo() as RecNo, html_field inline [
'<body><p>Hello!</p><p></p><p>DATA TEST DATA TEST </p><p></p><p>TESTING THIS OTHER SEGMENT.</p><p> also this one in lower case</p><p></p><p>lots of different types of html tags <span test="test" __test_macro="testmac" class="test_macro" src="test.gif"></span></p></body>'
MAPPING LOAD DISTINCT '<' & TextBetween('<' & SubField(html_field, '<'),'<','>') & '>' as HtmlTag, '' as Substitute
JOIN(Test) LOAD RecNo, MapSubstring('HtmlTag_Map', html_field) as Field_Cleansed
Ralf this is fantastic, thank you.
If my xml datafile has multiple html fields (and I need to do this to all of them), how would you recommend I deal with that in my script in the most efficient way?
For some background for you, I am currently connecting my QV to a SharePoint site where I have an input form. The form includes checkboxes, single line text fields etc including multiline rich text fields (which generates HTML in the XML feed).
My script looks something like:
ows_ID as ID,
ows_Decision_x0020_Comments as [Decision Comments],
ows_Appeal as Appeal,
ows_Appeal_x0020_Comments as [Appeal Comments],
ows_ReReview as Re-Review,
ows_ReReview_x0020_Comments as [Re-Review Comments],
From <link to sharepoint>
When i display these fields in QV, the comments fields above output all the HTML tags which is not what I want. Your code is brilliant to remove the html tags from 1 field (e.g. [Decision Comments] but when I have multiple html fields (e.g. Decision Comments, Appeal Comments, Re-Review Comments etc) I would love advice on how best (most efficiently) to deal with that?
at the moment, I have your code following my initial Load:
MAPPING LOAD Distinct '<' & TextBetween('<' & SubField([Decision Comments], '<', IterNo()), '<','>') & '>' as HTMLTag, '' as Substitute
Join (Decisions) LOAD ID, MapSubString('HTMLTag_Map', [Decision Comments]) as [Decision Comments Cleansed]
while not ISNULL (SubField([Decision Comments], '<', IterNo()));
But how do i then do the same for the next Comments field e.g. [Appeal Comments] and then [Re-Review Comments}?
Do I do a separate Mapping and join after each field or can I do everything at once? Apologies if I'm missing the obvious, quite new to QV.