mhassinger
Creator

Removing HTML markup code

I've got a web query that generates an XML document in the browser. I'm using this as a web file data source in QlikView, and it works as expected, pulling in the XML schema and data. However, one of the fields is full of HTML markup, and I'm not sure of the best way to strip it all out. Since the XML is generated dynamically on an internet site, it never hits the server file system, so I can't do anything on that end. Also, the HTML is pretty extensive, with lots of things like:

<TD STYLE="BORDER-BOTTOM: black 0.5pt solid; BORDER-LEFT: black 0.5pt solid; BACKGROUND-COLOR: white; WIDTH: 208pt; HEIGHT: 12.75pt;">

So it's not as simple as a few replace statements to strip <p> and </p>.

Any ideas?

45 Replies
chasafd
Contributor II

Thanks, this is very helpful.  Do you know how to have it turn the <br> into a CR/LF?  I want to strip out the tags but not lose some basic formatting.

I've used a nice extension (MinimalisticHtmlTextBox) that works well but only in Full Browser Mode.  I'd like to pull data from SharePoint that works for users who prefer the IE Plugin.

rbecher
MVP

You could replace '<br>' with '\n' before stripping all other HTML tags.
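
A minimal sketch of that idea for the VBScript macro module (Tools > Edit Module), assuming a regex-based tag stripper. The function name stripHTMLKeepBreaks is hypothetical, and vbCrLf is used because VBScript string literals do not interpret '\n' as a line break.

```vbscript
' Hypothetical helper: turns <br> variants into CR/LF, then strips all other tags.
Function stripHTMLKeepBreaks(strHTML)
  Dim objRegExp, strOutput

  Set objRegExp = New RegExp
  objRegExp.IgnoreCase = True
  objRegExp.Global = True

  ' Match <br>, <br/>, <br /> and <br ...attributes...>
  objRegExp.Pattern = "<br\b[^>]*>"
  strOutput = objRegExp.Replace(strHTML, vbCrLf)

  ' Strip every remaining tag
  objRegExp.Pattern = "<(.|\n)+?>"
  strOutput = objRegExp.Replace(strOutput, "")

  stripHTMLKeepBreaks = strOutput
  Set objRegExp = Nothing
End Function
```

In the load script it would be called like any other function, for example stripHTMLKeepBreaks([Your field name]) as CleanField, where both field names are placeholders.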

Astrato.io Head of R&D
bmesolutions
Partner - Contributor II

This is brilliant cheers Ralf

everest226
Creator III

Step 1: In your extract QVW, add the VBScript code below under Tools > Edit Module.

Set Requested Module Security to System Access and Current Local Security to Allow System Access.

Function stripHTML(strHTML)
  'Strips the HTML tags from strHTML
  Dim objRegExp, strOutput

  Set objRegExp = New RegExp
  objRegExp.IgnoreCase = True
  objRegExp.Global = True
  objRegExp.Pattern = "<(.|\n)+?>"

  'Replace all HTML tag matches with the empty string
  strOutput = objRegExp.Replace(strHTML, "")

  'Replace any remaining < and > with &lt; and &gt;
  strOutput = Replace(strOutput, "<", "&lt;")
  strOutput = Replace(strOutput, ">", "&gt;")

  stripHTML = strOutput    'Return the value of strOutput
  Set objRegExp = Nothing
End Function

Step 2: In Edit Script, add the following to your LOAD statement, after the field:

replace(replace(stripHTML([content/properties/Your field name])
        ,'&#58;',':')
        ,'&#160;',' ') as newcleanfieldname,
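
The nested replace() calls above handle only &#58; and &#160;. If other numeric entities turn up in the data, a more general decoder could go in the same module. This is only a sketch under assumptions: decodeNumericEntities is a hypothetical name, it handles decimal entities only, and it leaves code points above 65535 untouched because ChrW cannot represent them.

```vbscript
' Hypothetical helper: decodes decimal entities such as &#58; into characters.
Function decodeNumericEntities(strText)
  Dim objRegExp, objMatches, objMatch, strOutput, lngCode

  Set objRegExp = New RegExp
  objRegExp.Global = True
  ' At most 5 digits, so CLng cannot overflow
  objRegExp.Pattern = "&#(\d{1,5});"

  strOutput = strText
  Set objMatches = objRegExp.Execute(strText)

  For Each objMatch In objMatches
    lngCode = CLng(objMatch.SubMatches(0))
    ' Skip values that ChrW cannot convert
    If lngCode > 0 And lngCode <= 65535 Then
      strOutput = Replace(strOutput, objMatch.Value, ChrW(lngCode))
    End If
  Next

  decodeNumericEntities = strOutput
  Set objRegExp = Nothing
End Function
```

Note that this turns &#160; into a non-breaking space rather than the plain space used in Step 2, so the explicit replace for &#160; may still be wanted.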

Anonymous
Not applicable

What would be the code for fields that come from a database?

Not applicable

Ralf,

Have you run into a situation where there are just too many values in your HTML_Tag_Map table?

The code works fine for the first 70 records I load - which correlates to 118 lines fetched, but then after that, the script just fails for apparently no reason.

Melisa

rbecher
MVP

Melisa, can you attach an HTML file here to illustrate?

Astrato.io Head of R&D
Not applicable

Ralf,

Thanks for reaching out. It wasn’t the number of records. There was actually some sort of corruption on record 71.

Melisa

cbaqir
Specialist II

This is great, thanks! Can you do a replace() after the stripHTML function?

I have to replace "_" and  .

Anil_Babu_Samineni

You could use this:

stripHTML_Rep = Replace(stripHTML, "_", ".")

Best, Anil. When applicable, please mark the correct/appropriate replies as "solution" (you can mark up to 3 "solutions"). Please LIKE threads if the provided solution is helpful.