7 Replies Latest reply: Oct 15, 2012 5:01 AM by Sabrina Richter RSS

    Include Website-Data - div / ul / li - tags

    Sabrina Richter

      Hi all,

      in order to include website data into my QV application, I now come across some big problems.

      The problem is that the tables on the website are not implemented with <table> - tags but with a combination of <div>, <ul>, <li> - tags. I totally know how to read html, this is not the problem. The problem is: How can I easily get access to the data there?

      Here is an example of the given code:

       

      <html>
      <body>
      <div class="Topic">
           <dd>
                <ul>
                     <li class="Under Topic1">
                          <a href=""...>Text</a>
                     <li class="Under Topic1">
                          <a href=""...>Text</a>
                     <ul class="New">
                              <li>
                                    <a href="">Text</a>
                     ....
      

       

      Just to give you an idea of what we are talking about. The problem is also that there are several other html-tags in it.

      Does anyone has got an idea how to retrieve the data out of this structure?

      I got different ideas eg. trying to convert it to xml, just load the html into QV, try several reguloar expressions in an editor to get an better format but QV does not load a table in a table (if I try to convert in into <table>-tags, there are tables in tables).

      Maybe one of you has got a great idea.

      Thanks in advance!

       

      Kind regards,

      Michael

        • Re: Include Website-Data - div / ul / li - tags

          I am having this problem also

          • Re: Include Website-Data - div / ul / li - tags
            Ralf Becher

            You can load a web site line-wise like this:

             

            LOAD RecNo() as Rec, @1:n as Line

            FROM

            [http://en.wikipedia.org/wiki/QlikView]

            (fix, utf8);

             

            Then you can process/filter the rows of the resulting resident table and make string operations on the tags with string functions like Replace (tags) and BetweenText etc.

             

            But this all is a great effort..

             

            - Ralf

              • Re: Include Website-Data - div / ul / li - tags
                Sabrina Richter

                Hi,

                 

                that is defenitely great! But there is one more problem:

                The html-code is very unformatted. Is there an easy way of format it easily in QV (one line the open-tag, then content, then closing tag)?

                The only thing that has to be done is create a new record-line after every '>' but I don't know how to do that.

                Got an idea?

                  • Re: Include Website-Data - div / ul / li - tags
                    Ralf Becher

                    How this can work if the tags are nested?

                      • Re: Include Website-Data - div / ul / li - tags
                        Sabrina Richter

                        Yes, I tried now:

                         

                         

                         

                        html:
                        LOAD RecNo() as Rec, @1:n as Line
                        
                        FROM
                        
                        [Website]
                        
                        (fix, utf8);
                        
                        
                        
                        
                        
                        html1:
                        
                        Load
                        
                        subfield(Line,'>') & '>' as Line,
                        
                        RecNo() as ID
                        
                        Resident html;
                        
                        drop table html; 
                        
                        

                         

                        This is better now but the records are not in the right order by now and the end-tags are in the line of the content.

                        Thanks in advance!

                          • Re: Include Website-Data - div / ul / li - tags
                            Ralf Becher

                            With this script you can pull out the text (more or less..) between the HTML tags:

                             

                            html:

                            LOAD RecNo() as Rec, @1:n as Line

                            FROM

                            [http://en.wikipedia.org/wiki/QlikView]

                            (fix, utf8);

                             

                            html1:

                            NoConcatenate

                            Load * Where Line <> '>';

                            Load RecNo() as Rec,

                            subfield(Line,'>') & '>' as Line

                            Resident html

                            Order By Rec;

                             

                            html2:

                            NoConcatenate

                            Load * Where left(Line, 1) <> '<' and len(trim(Line))>0;

                            Load Rec, if(Right(Line,1)='>' and Left(Line, 1) <> '<', '</' & Line, Line) as Line;

                            Load Rec,

                            subfield(Line,'</') as Line

                            Resident html1

                            Order By Rec;

                             

                            drop tables html, html1;

                              • Re: Include Website-Data - div / ul / li - tags
                                Sabrina Richter

                                Hi,

                                thanks for this post.

                                As I am working on it now for some hours, here is my result:

                                 

                                SET
                                 ThousandSep='.';
                                
                                SET DecimalSep=',';
                                
                                SET MoneyThousandSep='.';
                                
                                SET MoneyDecimalSep=',';
                                
                                SET MoneyFormat='#.##0,00 €;-#.##0,00 €';
                                
                                SET TimeFormat='hh:mm:ss';
                                
                                SET DateFormat='DD.MM.YYYY';
                                
                                SET TimestampFormat='DD.MM.YYYY hh:mm:ss[.fff]';
                                
                                SET MonthNames='Jan;Feb;Mrz;Apr;Mai;Jun;Jul;Aug;Sep;Okt;Nov;Dez';
                                
                                SET DayNames='Mo;Di;Mi;Do;Fr;Sa;So';
                                
                                
                                
                                html:
                                
                                LOAD RecNo() as Rec, @1:n as Line
                                
                                FROM
                                
                                [C:\Users\richter\Desktop\test\block\block-a00-a09.htm]
                                
                                (fix, utf8);
                                
                                
                                
                                
                                
                                html1:
                                
                                Load
                                
                                subfield(Line,'>') & '>' as Line,
                                
                                rowno() as ID
                                
                                Resident html;
                                
                                drop table html;
                                
                                
                                
                                html2:
                                
                                Load 
                                
                                ID,
                                
                                Line,
                                
                                if(Trim(Line) <> '>',0,1) as del1,
                                
                                if(Line='<body>' or peek(del2,-1)=0, 0, 1) as del2
                                
                                resident html1;
                                
                                drop table html1;
                                
                                
                                
                                html3:
                                
                                Load 
                                
                                rowno() as ID,
                                
                                if(left(Line,1)='<',Line, if(right(Line,1)='>','<' & subfield(Line,'<'),Line)) as Line 
                                
                                Resident html2
                                
                                where del1=0 and del2=0;
                                
                                drop table html2;
                                
                                
                                
                                html4:
                                
                                Load ID,
                                
                                if(right(Line,1)='>', Line, right(Line, len(Line)-1)) as Linie
                                
                                Resident html3;
                                
                                drop table html3;
                                
                                
                                
                                html41:
                                
                                Load ID,
                                
                                Linie,
                                
                                if(Linie='<hr/>' or Linie='<br/>', 0, 1) as del
                                
                                Resident html4;
                                
                                drop table html4;
                                
                                
                                
                                html5:
                                
                                Load ID,
                                
                                Linie,
                                
                                if((left(Linie,1)='<' and Left(Linie,2)<> '</'  and peek(alteEbene,-1)<>'alt' and left(peek(Linie,-1),1)='<' and left(peek(Linie,-1),2)<>'</') or (left(Linie,1)<>'<' and peek(neueEbene,-1) = 'neu' and peek(alteEbene,-1)<>'alt') or (left(Linie,1)<>'<' and peek(neueEbene,-1) <> 'neu' and left(peek(Linie,-1),1)='<'), 'neu', 'nichtneu') as neueEbene,
                                
                                if(left(Linie,2)='</' and(left(peek(Linie,-1),2)='</' or left(peek(Linie,-1),1)<>'<' ) , 'alt', 'nichtalt') as alteEbene,
                                
                                1 as ebene
                                
                                Resident html41 where del='1';
                                
                                drop table html41;
                                
                                
                                
                                
                                
                                html6:
                                
                                Load ID,
                                
                                Linie,
                                
                                neueEbene,
                                
                                alteEbene,
                                
                                if(ID='1', 1, if(neueEbene='neu', peek(EbeneNummer,-1) +1,if(alteEbene='alt', peek(EbeneNummer,-1) -1,peek(EbeneNummer,-1)))) as EbeneNummer
                                
                                Resident html5;
                                
                                drop table html5; 
                                
                                

                                 

                                 

                                In this script several things are done. You can see the result by creating a diagram (Pivot table) having the ID on the left, the EbeneNummer (sorry for german field names though) on the top and as formula "Linie".

                                The only problem with this script would be some '<',... in the text of the website. I am lucky because all websites I want to include does not have anyone of them so I did not implemented a solution for this!