Skip to main content
Announcements
See what Drew Clarke has to say about the Qlik Talend Cloud launch! READ THE BLOG
cancel
Showing results for 
Search instead for 
Did you mean: 
mhassinger
Creator
Creator

Removing HTML markup code

I've got a webquery that generates an XML document in the browser. I'm using this as a web file data source in QlikView, and it works as expected, pulling in the XML schema and data. However, one of the fields is full of HTML markup, and I'm not sure the best way to get it all out. Since the XML is generated dynamically on an internet site, it never hits the server file system and so I can't do anything on that end. Also, the HTML is pretty extensive, with lots of things like:

<TD STYLE="BORDER-BOTTOM: black 0.5pt solid; BORDER-LEFT: black 0.5pt solid; BACKGROUND-COLOR: white; WIDTH: 208pt; HEIGHT: 12.75pt;">

So it's not as simple as a few replace statements to strip <p> and </p>.

Any ideas?

45 Replies
Not applicable

Great, thanks so much Ralf.  I was planning on doing that but wanted to check whether I was missing a trick or building something very inefficient! 

rbecher
MVP
MVP

It could work in another way, too:

Load the Fields

Load the Map for Field1 into Map1

Load the Map for Field2 into Map2

Load the Map for Field3 into Map3

Join/Strip with 3x different MapSubstr()

- Ralf

Astrato.io Head of R&D
Not applicable

I'm getting an error when i try this: MapSubString function takes a constant map as first parameter

I'm loading a table from a stored procedure like so:

CRMActivity:

REPLACE SQL EXEC sp_getdata '$(vEmail)';

HtmlTag_Map:

MAPPING LOAD DISTINCT '<' & TextBetween('<' & SubField(description, '<', IterNo()),'<','>') & '>' as HtmlTag, '' as Substitute

Resident CRMActivity

While NOT IsNull(SubField(description, '<', IterNo()));

JOIN(CRMActivity) LOAD activityid, description, objectname, ownername, subject, MapSubstring('HtmlTag_Map', description) as Field_Cleansed

Resident CRMActivity;

The fields returned from the stored procedure are activityid, description, objectname, ownername, subject.  I'm trying to clean the description field of html.

Can anyone see what I'm doing wrong?

rbecher
MVP
MVP

Strange error, never got that. Can you load the mapping table into a normal table to see what's in it:

HtmlTag_Map:

LOAD DISTINCT '<' & TextBetween('<' & SubField(description, '<', IterNo()),'<','>') & '>' as HtmlTag, '' as Substitute

Resident CRMActivity

While NOT IsNull(SubField(description, '<', IterNo()));

Astrato.io Head of R&D
Not applicable

I tried this and it appears the mapping table is blank.  Any suggestions?

rbecher
MVP
MVP

Can you give an example what's in table CRMActivity?

Astrato.io Head of R&D
Not applicable

Here's one of the records:

[1000145,

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"><HTML>

<META NAME="Generator" CONTENT="MS Exchange Server version 14">

<TITLE></TITLE>

</HEAD>

<BODY>

<!-- Converted from text/plain format -->

<BR>

<P><FONT SIZE=2>Hello Jackson,<BR>

<BR>

<BR>

Any idea what the following error means:<BR>

<BR>

<BR>

Error: Disconnected: No supported methods available<BR>

<BR>

</BODY>

</HTML>

, Email, User1, RE: Support Case-30084]

rbecher
MVP
MVP

Maybe we have to purge the line breaks before:

HtmlTag_Map:

MAPPING LOAD DISTINCT '<' & TextBetween('<' & SubField(PurgeChar(description, chr(13)&chr(10)), '<', IterNo()),'<','>') & '>' as HtmlTag, '' as Substitute

Resident CRMActivity

While NOT IsNull(SubField(description, '<', IterNo()));

JOIN(CRMActivity) LOAD activityid, description, objectname, ownername, subject, MapSubstring('HtmlTag_Map', PurgeChar(description, chr(13)&chr(10))) as Field_Cleansed

Resident CRMActivity;

Astrato.io Head of R&D
Not applicable

Same error:

Error in expression:MapSubString function takes a constant map as first parameter

JOIN(CRMActivity) LOAD activityid, description, objectname, ownername, subject, MapSubstring('HtmlTag_Map', PurgeChar(description, chr(13)&chr(10))) as Field_Cleansed

Resident CRMActivity

Thanks for trying.  I'm pulling my hair out

Not applicable

Just to clear this up.  The mapping works.  Even through it throws that error in the script, the field_cleansed column contains the text without the html.  Its just that it throws that error every time a partial refresh is done.