- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Removing HTML markup code
I've got a webquery that generates an XML document in the browser. I'm using this as a web file data source in QlikView, and it works as expected, pulling in the XML schema and data. However, one of the fields is full of HTML markup, and I'm not sure the best way to get it all out. Since the XML is generated dynamically on an internet site, it never hits the server file system and so I can't do anything on that end. Also, the HTML is pretty extensive, with lots of things like:
<TD STYLE="BORDER-BOTTOM: black 0.5pt solid; BORDER-LEFT: black 0.5pt solid; BACKGROUND-COLOR: white; WIDTH: 208pt; HEIGHT: 12.75pt;">
So it's not as simple as a few replace statements to strip <p> and </p>.
Any ideas?
- « Previous Replies
- Next Replies »
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yeap, it is not that simple.
There also a function subfield(FIELD,'>',2)
Then apply subfield(subfield(FIELD,'>',2),'<',1)
With this last thing, you have something.
For this case, the result would be So it's not as simple as a few replace statements to strip
Somehow you're gonna have to do many tries and maybe several steps in your load.
You could start by spliting your code in lines...
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
On way could be to pull out all texts between < > into a mapping table and then use mapsubstring to remove all occurances..
I could work out an example if you would upload a doc.
- Ralf
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Try using a small vbscript function found here. See attached qvw
talk is cheap, supply exceeds demand
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Based on Gysbert's example, I did it with pure script code using a Mapping Table and MapSubstring() to show the power of QlikView script language and to omit the Macro which should speedup the load:
Data:
LOAD RecNo() as RecNo, Field
FROM comm71215.xls (biff, embedded labels, table is [Sheet1$]);
HtmlTag_Map:
MAPPING LOAD DISTINCT '<' & TextBetween('<' & SubField(Field, '<'),'<','>') & '>' as HtmlTag, '' as Substitute
Resident Data;
JOIN(Data) LOAD RecNo, MapSubstring('HtmlTag_Map', Field) as Field_Cleansed
Resident Data;
- Ralf
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks Ralf! works great. I wrote it using an inline - in case anyone is interested:
Test:
LOAD RecNo() as RecNo, html_field inline [
html_field
'<body><p>Hello!</p><p></p><p>DATA TEST DATA TEST </p><p></p><p>TESTING THIS OTHER SEGMENT.</p><p> also this one in lower case</p><p></p><p>lots of different types of html tags <span test="test" __test_macro="testmac" class="test_macro" src="test.gif"></span></p></body>'
];
HtmlTag_Map:
MAPPING LOAD DISTINCT '<' & TextBetween('<' & SubField(html_field, '<'),'<','>') & '>' as HtmlTag, '' as Substitute
Resident Test;
JOIN(Test) LOAD RecNo, MapSubstring('HtmlTag_Map', html_field) as Field_Cleansed
Resident Test;
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Ralf this is fantastic, thank you.
If my xml datafile has multiple html fields (and I need to do this to all of them), how would you recommend I deal with that in my script in the most efficient way?
Many thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Karen,
could you upload a sample file?
- Ralf
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Ralf,
For some background for you, I am currently connecting my QV to a SharePoint site where I have an input form. The form includes checkboxes, single line text fields etc including multiline rich text fields (which generates HTML in the XML feed).
My script looks something like:
Decisions:
LOAD
ows_ID as ID,
…..
ows_Decision_x0020_Comments as [Decision Comments],
ows_Appeal as Appeal,
ows_Appeal_x0020_Comments as [Appeal Comments],
ows_ReReview as Re-Review,
ows_ReReview_x0020_Comments as [Re-Review Comments],
…..
From <link to sharepoint>
When i display these fields in QV, the comments fields above output all the HTML tags which is not what I want. Your code is brilliant to remove the html tags from 1 field (e.g. [Decision Comments] but when I have multiple html fields (e.g. Decision Comments, Appeal Comments, Re-Review Comments etc) I would love advice on how best (most efficiently) to deal with that?
at the moment, I have your code following my initial Load:
HTMLTag_Map:
MAPPING LOAD Distinct '<' & TextBetween('<' & SubField([Decision Comments], '<', IterNo()), '<','>') & '>' as HTMLTag, '' as Substitute
Resident Decisions;
Join (Decisions) LOAD ID, MapSubString('HTMLTag_Map', [Decision Comments]) as [Decision Comments Cleansed]
Resident Decisions;
while not ISNULL (SubField([Decision Comments], '<', IterNo()));
But how do i then do the same for the next Comments field e.g. [Appeal Comments] and then [Re-Review Comments}?
Do I do a separate Mapping and join after each field or can I do everything at once? Apologies if I'm missing the obvious, quite new to QV.
Huge Thanks,
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Karen,
I would suggest to do the procedure for every field the same way, so 3 times in a row:
Load the Field
Load the Map
Join & Strip the HTML Tags
- Ralf
- « Previous Replies
- Next Replies »