Skip to main content
Announcements
Qlik Connect 2024! Seize endless possibilities! LEARN MORE
cancel
Showing results for 
Search instead for 
Did you mean: 
mhassinger
Creator
Creator

Removing HTML markup code

I've got a webquery that generates an XML document in the browser. I'm using this as a web file data source in QlikView, and it works as expected, pulling in the XML schema and data. However, one of the fields is full of HTML markup, and I'm not sure the best way to get it all out. Since the XML is generated dynamically on an internet site, it never hits the server file system and so I can't do anything on that end. Also, the HTML is pretty extensive, with lots of things like:

<TD STYLE="BORDER-BOTTOM: black 0.5pt solid; BORDER-LEFT: black 0.5pt solid; BACKGROUND-COLOR: white; WIDTH: 208pt; HEIGHT: 12.75pt;">

So it's not as simple as a few replace statements to strip <p> and </p>.

Any ideas?

45 Replies
sudhakar_budde
Creator
Creator

Hi Ralf,

I am using the below script to remove the HTML tags from a field from the xlsx file.

While loading, I am getting a general script error! I can't figure out what's wrong in the script or this is something to do with the data in the file? If it is data, then the file has got 30000 records. How can I spot the issue?

Can you please suggest what might be wrong ?

 

//***Script to clean HTML Tags from a xlsx file ***//


Data:
LOAD
 %key,
Description
FROM

[lib://xxx/file_8-1-2019 11-27-24.xlsx]
(ooxml, embedded labels, table is sheet1]);


HtmlTag_Map:
MAPPING
LOAD DISTINCT '<' & TextBetween('<' & SubField(Description, '<', IterNo()),'<','>') & '>' as HtmlTag
, '' as Substitute
Resident Data
While NOT IsNull(SubField(Description, '<', IterNo()));

CSSTag_Map:
MAPPING
LOAD DISTINCT '{' & TextBetween('{' & SubField(Description, '{', IterNo()),'{','}') & '}' as HtmlTag
, '' as Substitute
Resident Data
While NOT IsNull(SubField(Description, '{', IterNo()));

ReplaceTag_Map:
mapping
Load
Replace(Replace(ReplaceTag, '$<', '['), '>$', ']') as ReplaceTag
,Substitute ;
load * inline [
ReplaceTag, Substitute
table, ''
body, ''
td,''
.shape,''
$<dir=rtl>$,''
.directional_text_wrapper ,''
&nbsp;,''
&amp;,''
&quot;,''
&quote;,''
&gt;,''
&lt;,''
&#43;,''
div.outlookmessageheader, ''
print{hs,''
.mozemailheaders,''
mailcontainerbody,''
*.location,''
.rtl,''
.info,''
.h2,''
h2,''
.location address, ''
.location, ''
.header, ''
.file,''
.createTD,''
.updateTD,''
.var,''
.mapicon,''
#entryInfo,''
.adr,''
a:hover adress,''
a:hover,''
span,''
*,''
file-name,''
$*,'',
v\:*,''
o\:*,''
u\:*,''
] ;



JOIN(Data)
LOAD
%key,
Trim(MapSubstring('ReplaceTag_Map', MapSubstring('CSSTag_Map',MapSubstring('HtmlTag_Map', Description)))) as Field_Cleansed
Resident Data;

sudhakar_budde
Creator
Creator

Hi Ralf,

I am using the below script to remove the HTML tags from a field from the xlsx file.

While loading, I am getting a general script error! I can't figure out what's wrong in the script or this is something to do with the data in the file? If it is data, then the file has got 30000 records. How can I spot the issue?

Can you please suggest what might be wrong ?

 

//***Script to clean HTML Tags from a xlsx file ***//


Data:
LOAD
 %key,
Description
FROM

[lib://xxx/file_8-1-2019 11-27-24.xlsx]
(ooxml, embedded labels, table is sheet1]);


HtmlTag_Map:
MAPPING 
LOAD DISTINCT '<' & TextBetween('<' & SubField(Description, '<', IterNo()),'<','>') & '>' as HtmlTag
, '' as Substitute
Resident Data
While NOT IsNull(SubField(Description, '<', IterNo()));

CSSTag_Map:
MAPPING 
LOAD DISTINCT '{' & TextBetween('{' & SubField(Description, '{', IterNo()),'{','}') & '}' as HtmlTag
, '' as Substitute
Resident Data
While NOT IsNull(SubField(Description, '{', IterNo()));

ReplaceTag_Map:
mapping
Load
Replace(Replace(ReplaceTag, '$<', '['), '>$', ']') as ReplaceTag 
,Substitute ;
load * inline [
ReplaceTag, Substitute
table, ''
body, ''
td,''
.shape,''
$<dir=rtl>$,''
.directional_text_wrapper ,''
&nbsp;,''
&amp;,''
&quot;,''
&quote;,''
&gt;,''
&lt;,''
&#43;,''
div.outlookmessageheader, ''
print{hs,''
.mozemailheaders,''
mailcontainerbody,''
*.location,''
.rtl,''
.info,''
.h2,''
h2,''
.location address, ''
.location, ''
.header, ''
.file,''
.createTD,''
.updateTD,''
.var,''
.mapicon,''
#entryInfo,''
.adr,''
a:hover adress,''
a:hover,''
span,''
*,''
file-name,''
$*,'',
v\:*,''
o\:*,''
u\:*,''
] ;

JOIN(Data) 
LOAD 
%key, 
Trim(MapSubstring('ReplaceTag_Map', MapSubstring('CSSTag_Map',MapSubstring('HtmlTag_Map', Description)))) as Field_Cleansed
Resident Data;

sudhakar_budde
Creator
Creator

Hi Rolfe,

This code is not working for me! I am wondering what I am doing wrong.

Data:

LOAD RecNo() as RecNo,
HTML_description

FROM [$(vPath)RawData\xyz.qvd] (qvd);

HtmlTag_Map:

MAPPING LOAD DISTINCT '<' & TextBetween('<' & SubField(HTML_description, '<'),'<','>') & '>' as HtmlTag, '' as Substitute

Resident Data;

Left JOIN(Data) LOAD RecNo,HTML_description, MapSubstring('HtmlTag_Map', HTML_description) as Field_Cleansed

Resident Data;

Exit Script;

I am also trying  similar code suggested here, but this is working as expected but it will be taking hours to load thousands of rows:

https://community.qlik.com/t5/QlikView-Deployment/strip-HTML-codes-from-a-field-during-load-script/m...

Script:

==================

tab1:

LOAD RecNo() as RecNo,
HTML_description

FROM [$(vPath)RawData\xyz.qvd] (qvd);

For vRow = 1 to NoOfRows('tab1')
Let vID = Peek('RecNo',vRow-1,'tab1');
Let vtext = Peek('HTML_description',vRow-1,'tab1');

do while len(TextBetween(vtext, '<', '>'))>0vtext=Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(vtext, TextBetween(vtext, '<', '>'), ''), '<>', ''),TextBetween(vtext, '{', '}'),''),'{}',''),'&ns;',''), '&nsp;',''),'v\:*',''),'o\:*', ''),'w\:*',''),'..shape','');


loop;

tab2:
load
RecNo,
HTML_description,
'$(vtext)' as text
Resident tab1
Where RecNo=$(vID);

next vRow;

drop Table tab1;

Exit Script;

 

Any suggestions please?

 

Thanks

 

 

sudhakar_budde
Creator
Creator

Hi Rolfe, 

This code is not working for me! I am wondering what I am doing wrong.

Data:

LOAD RecNo() as RecNo,
HTML_description

FROM [$(vPath)RawData\xyz.qvd] (qvd);

HtmlTag_Map:

MAPPING LOAD DISTINCT '<' & TextBetween('<' & SubField(HTML_description, '<'),'<','>') & '>' as HtmlTag, '' as Substitute

Resident Data;

Left JOIN(Data) LOAD RecNo,HTML_description, MapSubstring('HtmlTag_Map', HTML_description) as Field_Cleansed

Resident Data;

Exit Script;

I am also trying  similar code suggested here, but this is working as expected but it will be taking hours to load thousands of rows:

https://community.qlik.com/t5/QlikView-Deployment/strip-HTML-codes-from-a-field-during-load-script/m...

Script:

==================

tab1:

LOAD RecNo() as RecNo,
HTML_description

FROM [$(vPath)RawData\xyz.qvd] (qvd);

For vRow = 1 to NoOfRows('tab1')
Let vID = Peek('RecNo',vRow-1,'tab1');
Let vtext = Peek('HTML_description',vRow-1,'tab1');

do while len(TextBetween(vtext, '<', '>'))>0vtext=Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(vtext, TextBetween(vtext, '<', '>'), ''), '<>', ''),TextBetween(vtext, '{', '}'),''),'{}',''),'&ns;',''), '&nsp;',''),'v\:*',''),'o\:*', ''),'w\:*',''),'..shape','');


loop;

tab2:
load
RecNo,
HTML_description,
'$(vtext)' as text
Resident tab1
Where RecNo=$(vID);

next vRow;

drop Table tab1;

Exit Script;

 

Any suggestions please?

Thanks

SB

 

 

sudhakar_budde
Creator
Creator

Hi Rbecher,

Thanks for your solution that you have provided on this post.

I have used this, but I am getting a 'General Script Error'!!! I couldn't figure out what's going on here. Can you please suggest?

Please see y script here:

 


Email_Message_Table:
LOAD
RecNo() as RecNo
,HTML_description as HTML_Message

Resident Table;

//>>>>>>>>>>>>>>>>>>>>>>>> Mapping Loads <<<<<<<<<<<<<<<<<<<<<<<<<//

HtmlTag_Map:

MAPPING LOAD DISTINCT '<' & TextBetween('<' & SubField(HTML_Message, '<', IterNo()),'<','>') & '>' as HtmlTag, '' as Substitute
Resident Email_Message_Table
While NOT IsNull(SubField(HTML_Message, '<', IterNo()));

Special_Char_Map:

MAPPING LOAD DISTINCT '{' & TextBetween('{' & SubField(HTML_Message, '{', IterNo()),'{','}') & '}' as HtmlTag1, '' as Substitute
Resident Email_Message_Table
While NOT IsNull(SubField(HTML_Message, '{', IterNo()));


ReplaceTag_Map:
mapping Load * inline [
HTML_Message, Substitute
table,''
&#13;,''
&#10;,''
&#13;,''
&#10;,''
body, ''
td,''
.shape,''
$<dir=rtl>$,''
.directional_text_wrapper ,''
&nbsp;,''
&amp;,'&'
&quot;,''
&quote;,''
&gt;,''
&lt;,''
&#43;,''
div.outlookmessageheader, ''
print{hs,''
.mozemailheaders,''
mailcontainerbody,''
*.location,''
.rtl,''
.info,''
.h2,''
h2,''
.location address, ''
.location, ''
.header, ''
.file,''
.createTD,''
.updateTD,''
.var,''
.mapicon,''
#entryInfo,''
.adr,''
a:hover adress,''
a:hover,''
span,''
*,''
file-name,''
$*,''
v\:*,''
o\:*,''
u\:*,''
w\:,''
<!--,''
/ Font Definitions /,''
@font-face,''
/ Style Definitions / ,''
p.MsoNormal,''
i.MsoNormal,''
div.MsoNormal,''
a:link,''
.MsoHyperlink,''
a:visited,''
.MsoHyperlinkFollowed,''
p.msonormal0,''
li.msonormal0,''
div.msonormal0,''
p.emailquote,''
li.emailquote,''
div.emailquote,''
.E-mailStijl20,''
.MsoChpDefault,''
@page WordSection1,''
div.WordSection1,''
-->,''
/ Font Definitions /,''
/ Style Definitions /,''
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.d">,''
-name,''
p.MsoTitle, li.MsoTitle, div.MsoTitle,''
.TitleChar,''
.Heading2Char,''
.EmailStyle20,''
p.Default, li.Default, div.Default,''
/ List Definitions /,''
@list l0,''
:level1,''
:level2,''
:level3,''
:level4,''
:level5,''
:level6,''
:level7,''
:level8,''
:level9,''
] ;


// >>>>>>>>>>>>>>> Transformation <<<<<<<<<<<<<<<<<<<<<<<<<<<<<< //

JOIN(Email_Message_Table)

LOAD
RecNo,
MapSubstring('HtmlTag_Map', HTML_Message) as Description
Resident Email_Message_Table;

 

 

 

 

 

NRC
Contributor
Contributor

HI Ralf,

 

Greeting of the day,

 

How we can able to do for multiple columns ?

could you please advise on this.

 

Thanks & Best Regards,

Naga