Removing HTML markup code - Qlik Community

mhassinger · ‎2013-01-28

I've got a webquery that generates an XML document in the browser. I'm using this as a web file data source in QlikView, and it works as expected, pulling in the XML schema and data. However, one of the fields is full of HTML markup, and I'm not sure the best way to get it all out. Since the XML is generated dynamically on an internet site, it never hits the server file system and so I can't do anything on that end. Also, the HTML is pretty extensive, with lots of things like:

So it's not as simple as a few replace statements to strip <p> and </p>.

Any ideas?

christian77 · ‎2013-01-28

Yeap, it is not that simple.

There also a function subfield(FIELD,'>',2)

Then apply subfield(subfield(FIELD,'>',2),'<',1)

With this last thing, you have something.

For this case, the result would be So it's not as simple as a few replace statements to strip

Somehow you're gonna have to do many tries and maybe several steps in your load.

You could start by spliting your code in lines...

rbecher · ‎2013-01-28

On way could be to pull out all texts between < > into a mapping table and then use mapsubstring to remove all occurances..

I could work out an example if you would upload a doc.

- Ralf

Data & AI Engineer at Orionbelt.ai - a GenAI Semantic Layer Venture, Inventor of Astrato Engine

Gysbert_Wassenaar · ‎2013-01-28

Try using a small vbscript function found here. See attached qvw

talk is cheap, supply exceeds demand

rbecher · ‎2013-02-11

Based on Gysbert's example, I did it with pure script code using a Mapping Table and MapSubstring() to show the power of QlikView script language and to omit the Macro which should speedup the load:

Data:
LOAD RecNo() as RecNo, Field
FROM comm71215.xls (biff, embedded labels, table is [Sheet1$]);
HtmlTag_Map:
MAPPING LOAD DISTINCT '<' & TextBetween('<' & SubField(Field, '<'),'<','>') & '>' as HtmlTag, '' as Substitute
Resident Data;
JOIN(Data) LOAD RecNo, MapSubstring('HtmlTag_Map', Field) as Field_Cleansed
Resident Data;

- Ralf

Data & AI Engineer at Orionbelt.ai - a GenAI Semantic Layer Venture, Inventor of Astrato Engine

Report Inappropriate Content · ‎2013-08-21

Thanks Ralf! works great. I wrote it using an inline - in case anyone is interested:

Test:
LOAD RecNo() as RecNo, html_field inline [
html_field
'<body><p>Hello!</p><p></p><p>DATA TEST DATA TEST </p><p></p><p>TESTING THIS OTHER SEGMENT.</p><p> also this one in lower case</p><p></p><p>lots of different types of html tags <span test="test" __test_macro="testmac" class="test_macro" src="test.gif"></span></p></body>'
];

HtmlTag_Map:
MAPPING LOAD DISTINCT '<' & TextBetween('<' & SubField(html_field, '<'),'<','>') & '>' as HtmlTag, '' as Substitute
Resident Test;

JOIN(Test) LOAD RecNo, MapSubstring('HtmlTag_Map', html_field) as Field_Cleansed
Resident Test;

Report Inappropriate Content · ‎2014-04-17

Ralf this is fantastic, thank you.

If my xml datafile has multiple html fields (and I need to do this to all of them), how would you recommend I deal with that in my script in the most efficient way?

Many thanks

rbecher · ‎2014-04-17

Hi Karen,

could you upload a sample file?

- Ralf

Data & AI Engineer at Orionbelt.ai - a GenAI Semantic Layer Venture, Inventor of Astrato Engine

Report Inappropriate Content · ‎2014-04-17

Hi Ralf,

For some background for you, I am currently connecting my QV to a SharePoint site where I have an input form. The form includes checkboxes, single line text fields etc including multiline rich text fields (which generates HTML in the XML feed).

My script looks something like:

Decisions:

LOAD

ows_ID as ID,

…..

ows_Decision_x0020_Comments as [Decision Comments],

ows_Appeal as Appeal,

ows_Appeal_x0020_Comments as [Appeal Comments],

ows_ReReview as Re-Review,

ows_ReReview_x0020_Comments as [Re-Review Comments],

…..

From <link to sharepoint>

When i display these fields in QV, the comments fields above output all the HTML tags which is not what I want. Your code is brilliant to remove the html tags from 1 field (e.g. [Decision Comments] but when I have multiple html fields (e.g. Decision Comments, Appeal Comments, Re-Review Comments etc) I would love advice on how best (most efficiently) to deal with that?

at the moment, I have your code following my initial Load:

HTMLTag_Map:

MAPPING LOAD Distinct '<' & TextBetween('<' & SubField([Decision Comments], '<', IterNo()), '<','>') & '>' as HTMLTag, '' as Substitute

Resident Decisions;

Join (Decisions) LOAD ID, MapSubString('HTMLTag_Map', [Decision Comments]) as [Decision Comments Cleansed]

Resident Decisions;

while not ISNULL (SubField([Decision Comments], '<', IterNo()));

But how do i then do the same for the next Comments field e.g. [Appeal Comments] and then [Re-Review Comments}?

Do I do a separate Mapping and join after each field or can I do everything at once? Apologies if I'm missing the obvious, quite new to QV.

Huge Thanks,

rbecher · ‎2014-04-17

Karen,

I would suggest to do the procedure for every field the same way, so 3 times in a row:

Load the Field

Load the Map

Join & Strip the HTML Tags

- Ralf

Data & AI Engineer at Orionbelt.ai - a GenAI Semantic Layer Venture, Inventor of Astrato Engine