Skip to main content
Woohoo! Qlik Community has won “Best in Class Community” in the 2024 Khoros Kudos awards!
Announcements
Nov. 20th, Qlik Insider - Lakehouses: Driving the Future of Data & AI - PICK A SESSION
cancel
Showing results for 
Search instead for 
Did you mean: 
mhassinger
Creator
Creator

Removing HTML markup code

I've got a webquery that generates an XML document in the browser. I'm using this as a web file data source in QlikView, and it works as expected, pulling in the XML schema and data. However, one of the fields is full of HTML markup, and I'm not sure the best way to get it all out. Since the XML is generated dynamically on an internet site, it never hits the server file system and so I can't do anything on that end. Also, the HTML is pretty extensive, with lots of things like:

<TD STYLE="BORDER-BOTTOM: black 0.5pt solid; BORDER-LEFT: black 0.5pt solid; BACKGROUND-COLOR: white; WIDTH: 208pt; HEIGHT: 12.75pt;">

So it's not as simple as a few replace statements to strip <p> and </p>.

Any ideas?

45 Replies
christian77
Partner - Specialist
Partner - Specialist

Yeap, it is not that simple.

There also a function subfield(FIELD,'>',2)

Then apply subfield(subfield(FIELD,'>',2),'<',1)

With this last thing, you have something.

For this case, the result would be     So it's not as simple as a few replace statements to strip

Somehow you're gonna have to do many tries and maybe several steps in your load.

You could start by spliting your code in lines...

rbecher
MVP
MVP

On way could be to pull out all texts between < > into a mapping table and then use mapsubstring to remove all occurances..

I could work out an example if you would upload a doc.

- Ralf

Astrato.io Head of R&D
Gysbert_Wassenaar

Try using a small vbscript function found here. See attached qvw


talk is cheap, supply exceeds demand
rbecher
MVP
MVP

Based on Gysbert's example, I did it with pure script code using a Mapping Table and MapSubstring() to show the power of QlikView script language and to omit the Macro which should speedup the load:

Data:

LOAD RecNo() as RecNo, Field

FROM comm71215.xls (biff, embedded labels, table is [Sheet1$]);

HtmlTag_Map:

MAPPING LOAD DISTINCT '<' & TextBetween('<' & SubField(Field, '<'),'<','>') & '>' as HtmlTag, '' as Substitute

Resident Data;

JOIN(Data) LOAD RecNo, MapSubstring('HtmlTag_Map', Field) as Field_Cleansed

Resident Data;

- Ralf

Astrato.io Head of R&D
Not applicable

Thanks Ralf! works great.  I wrote it using an inline - in case anyone is interested:

Test:

LOAD RecNo() as RecNo, html_field inline [

html_field

'<body><p>Hello!</p><p></p><p>DATA TEST DATA TEST </p><p></p><p>TESTING THIS OTHER SEGMENT.</p><p> also this one in lower case</p><p></p><p>lots of different types of html tags <span test="test" __test_macro="testmac" class="test_macro" src="test.gif"></span></p></body>'

];

HtmlTag_Map:

MAPPING LOAD DISTINCT '<' & TextBetween('<' & SubField(html_field, '<'),'<','>') & '>' as HtmlTag, '' as Substitute

Resident Test;

JOIN(Test) LOAD RecNo, MapSubstring('HtmlTag_Map', html_field) as Field_Cleansed

Resident Test;

Not applicable

Ralf this is fantastic, thank you. 

If my xml datafile has multiple html fields (and I need to do this to all of them), how would you recommend I deal with that in my script in the most efficient way?

Many thanks

rbecher
MVP
MVP

Hi Karen,

could you upload a sample file?

- Ralf

Astrato.io Head of R&D
Not applicable

Hi Ralf,

For some background for you, I am currently connecting my QV to a SharePoint site where I have an input form.  The form includes checkboxes,  single line text fields etc including multiline rich text fields (which generates HTML in the XML feed). 

My script looks something like:

Decisions:

LOAD

ows_ID as ID,

…..

ows_Decision_x0020_Comments as [Decision Comments],

ows_Appeal as Appeal,

ows_Appeal_x0020_Comments as [Appeal Comments],

ows_ReReview as Re-Review,

ows_ReReview_x0020_Comments as [Re-Review Comments],

…..

From <link to sharepoint>

When i display these fields in QV, the comments fields above output all the HTML tags which is not what I want.  Your code is brilliant to remove the html tags from 1 field (e.g. [Decision Comments] but when I have multiple html fields (e.g. Decision Comments, Appeal Comments, Re-Review Comments etc) I would love advice on how best (most efficiently) to deal with that?

at the moment, I have your code following my initial Load:

HTMLTag_Map:

MAPPING LOAD Distinct '<' & TextBetween('<' & SubField([Decision Comments], '<', IterNo()), '<','>') & '>' as HTMLTag, '' as Substitute

Resident Decisions;

Join (Decisions) LOAD ID, MapSubString('HTMLTag_Map', [Decision Comments]) as [Decision Comments Cleansed]

Resident Decisions;

while not ISNULL (SubField([Decision Comments], '<', IterNo()));

But how do i then do the same for the next Comments field e.g. [Appeal Comments] and then [Re-Review Comments}?

Do I do a separate Mapping and join after each field or can I do everything at once?  Apologies if I'm missing the obvious, quite new to QV.

Huge Thanks,

rbecher
MVP
MVP

Karen,

I would suggest to do the procedure for every field the same way, so 3 times in a row:

Load the Field

Load the Map

Join & Strip the HTML Tags

- Ralf

Astrato.io Head of R&D