Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
richters
Partner - Contributor III
Partner - Contributor III

Include Website-Data - div / ul / li - tags

Hi all,

in order to include website data into my QV application, I now come across some big problems.

The problem is that the tables on the website are not implemented with <table> - tags but with a combination of <div>, <ul>, <li> - tags. I totally know how to read html, this is not the problem. The problem is: How can I easily get access to the data there?

Here is an example of the given code:

<html>

<body>

<div class="Topic">

     <dd>

          <ul>

               <li class="Under Topic1">

                    <a href=""...>Text</a>

               <li class="Under Topic1">

                    <a href=""...>Text</a>

               <ul class="New">

                        <li>

                              <a href="">Text</a>

               ....

Just to give you an idea of what we are talking about. The problem is also that there are several other html-tags in it.

Does anyone has got an idea how to retrieve the data out of this structure?

I got different ideas eg. trying to convert it to xml, just load the html into QV, try several reguloar expressions in an editor to get an better format but QV does not load a table in a table (if I try to convert in into <table>-tags, there are tables in tables).

Maybe one of you has got a great idea.

Thanks in advance!

Kind regards,

Michael

1 Solution

Accepted Solutions
rbecher
MVP
MVP

With this script you can pull out the text (more or less..) between the HTML tags:

html:

LOAD RecNo() as Rec, @1:n as Line

FROM

[http://en.wikipedia.org/wiki/QlikView]

(fix, utf8);

html1:

NoConcatenate

Load * Where Line <> '>';

Load RecNo() as Rec,

subfield(Line,'>') & '>' as Line

Resident html

Order By Rec;

html2:

NoConcatenate

Load * Where left(Line, 1) <> '<' and len(trim(Line))>0;

Load Rec, if(Right(Line,1)='>' and Left(Line, 1) <> '<', '</' & Line, Line) as Line;

Load Rec,

subfield(Line,'</') as Line

Resident html1

Order By Rec;

drop tables html, html1;

Astrato.io Head of R&D

View solution in original post

7 Replies
Not applicable

I am having this problem also

rbecher
MVP
MVP

You can load a web site line-wise like this:

LOAD RecNo() as Rec, @1:n as Line

FROM

[http://en.wikipedia.org/wiki/QlikView]

(fix, utf8);

Then you can process/filter the rows of the resulting resident table and make string operations on the tags with string functions like Replace (tags) and BetweenText etc.

But this all is a great effort..

- Ralf

Astrato.io Head of R&D
richters
Partner - Contributor III
Partner - Contributor III
Author

Hi,

that is defenitely great! But there is one more problem:

The html-code is very unformatted. Is there an easy way of format it easily in QV (one line the open-tag, then content, then closing tag)?

The only thing that has to be done is create a new record-line after every '>' but I don't know how to do that.

Got an idea?

rbecher
MVP
MVP

How this can work if the tags are nested?

Astrato.io Head of R&D
richters
Partner - Contributor III
Partner - Contributor III
Author

Yes, I tried now:

html:

LOAD RecNo() as Rec, @1:n as Line

FROM

[Website]

(
fix, utf8);





html1:

Load

subfield(Line,'>') & '>' as Line,

RecNo() as ID

Resident html;

drop table html;

This is better now but the records are not in the right order by now and the end-tags are in the line of the content.

Thanks in advance!

rbecher
MVP
MVP

With this script you can pull out the text (more or less..) between the HTML tags:

html:

LOAD RecNo() as Rec, @1:n as Line

FROM

[http://en.wikipedia.org/wiki/QlikView]

(fix, utf8);

html1:

NoConcatenate

Load * Where Line <> '>';

Load RecNo() as Rec,

subfield(Line,'>') & '>' as Line

Resident html

Order By Rec;

html2:

NoConcatenate

Load * Where left(Line, 1) <> '<' and len(trim(Line))>0;

Load Rec, if(Right(Line,1)='>' and Left(Line, 1) <> '<', '</' & Line, Line) as Line;

Load Rec,

subfield(Line,'</') as Line

Resident html1

Order By Rec;

drop tables html, html1;

Astrato.io Head of R&D
richters
Partner - Contributor III
Partner - Contributor III
Author

Hi,

thanks for this post.

As I am working on it now for some hours, here is my result:

SET

ThousandSep='.';

SET DecimalSep=',';

SET MoneyThousandSep='.';

SET MoneyDecimalSep=',';

SET MoneyFormat='#.##0,00 €;-#.##0,00 €';

SET TimeFormat='hh:mm:ss';

SET DateFormat='DD.MM.YYYY';

SET TimestampFormat='DD.MM.YYYY hh:mm:ss[.fff]';

SET MonthNames='Jan;Feb;Mrz;Apr;Mai;Jun;Jul;Aug;Sep;Okt;Nov;Dez';

SET DayNames='Mo;Di;Mi;Do;Fr;Sa;So';



html:

LOAD RecNo() as Rec, @1:n as Line

FROM



(
fix, utf8);





html1:

Load

subfield(Line,'>') & '>' as Line,

rowno() as ID

Resident html;

drop table html;



html2:

Load

ID,

Line,

if(Trim(Line) <> '>',0,1) as del1,

if(Line='<body>' or peek(del2,-1)=0, 0, 1) as del2

resident html1;

drop table html1;



html3:

Load

rowno() as ID,

if(left(Line,1)='<',Line, if(right(Line,1)='>','<' & subfield(Line,'<'),Line)) as Line

Resident html2

where del1=0 and del2=0;

drop table html2;



html4:

Load ID,

if(right(Line,1)='>', Line, right(Line, len(Line)-1)) as Linie

Resident html3;

drop table html3;



html41:

Load ID,

Linie,

if(Linie='<hr/>' or Linie='<br/>', 0, 1) as del

Resident html4;

drop table html4;



html5:

Load ID,

Linie,

if((left(Linie,1)='<' and Left(Linie,2)<> '</'  and peek(alteEbene,-1)<>'alt' and left(peek(Linie,-1),1)='<' and left(peek(Linie,-1),2)<>'</') or (left(Linie,1)<>'<' and peek(neueEbene,-1) = 'neu' and peek(alteEbene,-1)<>'alt') or (left(Linie,1)<>'<' and peek(neueEbene,-1) <> 'neu' and left(peek(Linie,-1),1)='<'), 'neu', 'nichtneu') as neueEbene,

if(left(Linie,2)='</' and(left(peek(Linie,-1),2)='</' or left(peek(Linie,-1),1)<>'<' ) , 'alt', 'nichtalt') as alteEbene,

1
as ebene

Resident html41 where del='1';

drop table html41;





html6:

Load ID,

Linie,

neueEbene,

alteEbene,

if(ID='1', 1, if(neueEbene='neu', peek(EbeneNummer,-1) +1,if(alteEbene='alt', peek(EbeneNummer,-1) -1,peek(EbeneNummer,-1)))) as EbeneNummer

Resident html5;

drop table html5;

In this script several things are done. You can see the result by creating a diagram (Pivot table) having the ID on the left, the EbeneNummer (sorry for german field names though) on the top and as formula "Linie".

The only problem with this script would be some '<',... in the text of the website. I am lucky because all websites I want to include does not have anyone of them so I did not implemented a solution for this!