Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
I've got a webquery that generates an XML document in the browser. I'm using this as a web file data source in QlikView, and it works as expected, pulling in the XML schema and data. However, one of the fields is full of HTML markup, and I'm not sure the best way to get it all out. Since the XML is generated dynamically on an internet site, it never hits the server file system and so I can't do anything on that end. Also, the HTML is pretty extensive, with lots of things like:
<TD STYLE="BORDER-BOTTOM: black 0.5pt solid; BORDER-LEFT: black 0.5pt solid; BACKGROUND-COLOR: white; WIDTH: 208pt; HEIGHT: 12.75pt;">
So it's not as simple as a few replace statements to strip <p> and </p>.
Any ideas?
Hi Ralf,
I am using the below script to remove the HTML tags from a field from the xlsx file.
While loading, I am getting a general script error! I can't figure out what's wrong in the script or this is something to do with the data in the file? If it is data, then the file has got 30000 records. How can I spot the issue?
Can you please suggest what might be wrong ?
//***Script to clean HTML Tags from a xlsx file ***//
Data:
LOAD
%key,
Description
FROM
[lib://xxx/file_8-1-2019 11-27-24.xlsx]
(ooxml, embedded labels, table is sheet1]);
HtmlTag_Map:
MAPPING
LOAD DISTINCT '<' & TextBetween('<' & SubField(Description, '<', IterNo()),'<','>') & '>' as HtmlTag
, '' as Substitute
Resident Data
While NOT IsNull(SubField(Description, '<', IterNo()));
CSSTag_Map:
MAPPING
LOAD DISTINCT '{' & TextBetween('{' & SubField(Description, '{', IterNo()),'{','}') & '}' as HtmlTag
, '' as Substitute
Resident Data
While NOT IsNull(SubField(Description, '{', IterNo()));
ReplaceTag_Map:
mapping
Load
Replace(Replace(ReplaceTag, '$<', '['), '>$', ']') as ReplaceTag
,Substitute ;
load * inline [
ReplaceTag, Substitute
table, ''
body, ''
td,''
.shape,''
$<dir=rtl>$,''
.directional_text_wrapper ,''
,''
&,''
",''
"e;,''
>,''
<,''
+,''
div.outlookmessageheader, ''
print{hs,''
.mozemailheaders,''
mailcontainerbody,''
*.location,''
.rtl,''
.info,''
.h2,''
h2,''
.location address, ''
.location, ''
.header, ''
.file,''
.createTD,''
.updateTD,''
.var,''
.mapicon,''
#entryInfo,''
.adr,''
a:hover adress,''
a:hover,''
span,''
*,''
file-name,''
$*,'',
v\:*,''
o\:*,''
u\:*,''
] ;
JOIN(Data)
LOAD
%key,
Trim(MapSubstring('ReplaceTag_Map', MapSubstring('CSSTag_Map',MapSubstring('HtmlTag_Map', Description)))) as Field_Cleansed
Resident Data;
Hi Ralf,
I am using the below script to remove the HTML tags from a field from the xlsx file.
While loading, I am getting a general script error! I can't figure out what's wrong in the script or this is something to do with the data in the file? If it is data, then the file has got 30000 records. How can I spot the issue?
Can you please suggest what might be wrong ?
//***Script to clean HTML Tags from a xlsx file ***//
Data:
LOAD
%key,
Description
FROM
[lib://xxx/file_8-1-2019 11-27-24.xlsx]
(ooxml, embedded labels, table is sheet1]);
HtmlTag_Map:
MAPPING
LOAD DISTINCT '<' & TextBetween('<' & SubField(Description, '<', IterNo()),'<','>') & '>' as HtmlTag
, '' as Substitute
Resident Data
While NOT IsNull(SubField(Description, '<', IterNo()));
CSSTag_Map:
MAPPING
LOAD DISTINCT '{' & TextBetween('{' & SubField(Description, '{', IterNo()),'{','}') & '}' as HtmlTag
, '' as Substitute
Resident Data
While NOT IsNull(SubField(Description, '{', IterNo()));
ReplaceTag_Map:
mapping
Load
Replace(Replace(ReplaceTag, '$<', '['), '>$', ']') as ReplaceTag
,Substitute ;
load * inline [
ReplaceTag, Substitute
table, ''
body, ''
td,''
.shape,''
$<dir=rtl>$,''
.directional_text_wrapper ,''
,''
&,''
",''
"e;,''
>,''
<,''
+,''
div.outlookmessageheader, ''
print{hs,''
.mozemailheaders,''
mailcontainerbody,''
*.location,''
.rtl,''
.info,''
.h2,''
h2,''
.location address, ''
.location, ''
.header, ''
.file,''
.createTD,''
.updateTD,''
.var,''
.mapicon,''
#entryInfo,''
.adr,''
a:hover adress,''
a:hover,''
span,''
*,''
file-name,''
$*,'',
v\:*,''
o\:*,''
u\:*,''
] ;
JOIN(Data)
LOAD
%key,
Trim(MapSubstring('ReplaceTag_Map', MapSubstring('CSSTag_Map',MapSubstring('HtmlTag_Map', Description)))) as Field_Cleansed
Resident Data;
Hi Rolfe,
This code is not working for me! I am wondering what I am doing wrong.
Data:
LOAD RecNo() as RecNo,
HTML_description
FROM [$(vPath)RawData\xyz.qvd] (qvd);
HtmlTag_Map:
MAPPING LOAD DISTINCT '<' & TextBetween('<' & SubField(HTML_description, '<'),'<','>') & '>' as HtmlTag, '' as Substitute
Resident Data;
Left JOIN(Data) LOAD RecNo,HTML_description, MapSubstring('HtmlTag_Map', HTML_description) as Field_Cleansed
Resident Data;
Exit Script;
I am also trying similar code suggested here, but this is working as expected but it will be taking hours to load thousands of rows:
Script:
==================
tab1:
LOAD RecNo() as RecNo,
HTML_description
FROM [$(vPath)RawData\xyz.qvd] (qvd);
For vRow = 1 to NoOfRows('tab1')
Let vID = Peek('RecNo',vRow-1,'tab1');
Let vtext = Peek('HTML_description',vRow-1,'tab1');
do while len(TextBetween(vtext, '<', '>'))>0vtext=Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(vtext, TextBetween(vtext, '<', '>'), ''), '<>', ''),TextBetween(vtext, '{', '}'),''),'{}',''),'&ns;',''), '&nsp;',''),'v\:*',''),'o\:*', ''),'w\:*',''),'..shape','');
loop;
tab2:
load
RecNo,
HTML_description,
'$(vtext)' as text
Resident tab1
Where RecNo=$(vID);
next vRow;
drop Table tab1;
Exit Script;
Any suggestions please?
Thanks
Hi Rolfe,
This code is not working for me! I am wondering what I am doing wrong.
Data:
LOAD RecNo() as RecNo,
HTML_description
FROM [$(vPath)RawData\xyz.qvd] (qvd);
HtmlTag_Map:
MAPPING LOAD DISTINCT '<' & TextBetween('<' & SubField(HTML_description, '<'),'<','>') & '>' as HtmlTag, '' as Substitute
Resident Data;
Left JOIN(Data) LOAD RecNo,HTML_description, MapSubstring('HtmlTag_Map', HTML_description) as Field_Cleansed
Resident Data;
Exit Script;
I am also trying similar code suggested here, but this is working as expected but it will be taking hours to load thousands of rows:
Script:
==================
tab1:
LOAD RecNo() as RecNo,
HTML_description
FROM [$(vPath)RawData\xyz.qvd] (qvd);
For vRow = 1 to NoOfRows('tab1')
Let vID = Peek('RecNo',vRow-1,'tab1');
Let vtext = Peek('HTML_description',vRow-1,'tab1');
do while len(TextBetween(vtext, '<', '>'))>0vtext=Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(vtext, TextBetween(vtext, '<', '>'), ''), '<>', ''),TextBetween(vtext, '{', '}'),''),'{}',''),'&ns;',''), '&nsp;',''),'v\:*',''),'o\:*', ''),'w\:*',''),'..shape','');
loop;
tab2:
load
RecNo,
HTML_description,
'$(vtext)' as text
Resident tab1
Where RecNo=$(vID);
next vRow;
drop Table tab1;
Exit Script;
Any suggestions please?
Thanks
SB
Hi Rbecher,
Thanks for your solution that you have provided on this post.
I have used this, but I am getting a 'General Script Error'!!! I couldn't figure out what's going on here. Can you please suggest?
Please see y script here:
Email_Message_Table:
LOAD
RecNo() as RecNo
,HTML_description as HTML_Message
Resident Table;
//>>>>>>>>>>>>>>>>>>>>>>>> Mapping Loads <<<<<<<<<<<<<<<<<<<<<<<<<//
HtmlTag_Map:
MAPPING LOAD DISTINCT '<' & TextBetween('<' & SubField(HTML_Message, '<', IterNo()),'<','>') & '>' as HtmlTag, '' as Substitute
Resident Email_Message_Table
While NOT IsNull(SubField(HTML_Message, '<', IterNo()));
Special_Char_Map:
MAPPING LOAD DISTINCT '{' & TextBetween('{' & SubField(HTML_Message, '{', IterNo()),'{','}') & '}' as HtmlTag1, '' as Substitute
Resident Email_Message_Table
While NOT IsNull(SubField(HTML_Message, '{', IterNo()));
ReplaceTag_Map:
mapping Load * inline [
HTML_Message, Substitute
table,''
,''
,''
,''
,''
body, ''
td,''
.shape,''
$<dir=rtl>$,''
.directional_text_wrapper ,''
,''
&,'&'
",''
"e;,''
>,''
<,''
+,''
div.outlookmessageheader, ''
print{hs,''
.mozemailheaders,''
mailcontainerbody,''
*.location,''
.rtl,''
.info,''
.h2,''
h2,''
.location address, ''
.location, ''
.header, ''
.file,''
.createTD,''
.updateTD,''
.var,''
.mapicon,''
#entryInfo,''
.adr,''
a:hover adress,''
a:hover,''
span,''
*,''
file-name,''
$*,''
v\:*,''
o\:*,''
u\:*,''
w\:,''
<!--,''
/ Font Definitions /,''
@font-face,''
/ Style Definitions / ,''
p.MsoNormal,''
i.MsoNormal,''
div.MsoNormal,''
a:link,''
.MsoHyperlink,''
a:visited,''
.MsoHyperlinkFollowed,''
p.msonormal0,''
li.msonormal0,''
div.msonormal0,''
p.emailquote,''
li.emailquote,''
div.emailquote,''
.E-mailStijl20,''
.MsoChpDefault,''
@page WordSection1,''
div.WordSection1,''
-->,''
/ Font Definitions /,''
/ Style Definitions /,''
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.d">,''
-name,''
p.MsoTitle, li.MsoTitle, div.MsoTitle,''
.TitleChar,''
.Heading2Char,''
.EmailStyle20,''
p.Default, li.Default, div.Default,''
/ List Definitions /,''
@list l0,''
:level1,''
:level2,''
:level3,''
:level4,''
:level5,''
:level6,''
:level7,''
:level8,''
:level9,''
] ;
// >>>>>>>>>>>>>>> Transformation <<<<<<<<<<<<<<<<<<<<<<<<<<<<<< //
JOIN(Email_Message_Table)
LOAD
RecNo,
MapSubstring('HtmlTag_Map', HTML_Message) as Description
Resident Email_Message_Table;
HI Ralf,
Greeting of the day,
How we can able to do for multiple columns ?
could you please advise on this.
Thanks & Best Regards,
Naga