Web Crawling and Text Analysis with Yahoo! Pipes and QlikView

    This is a re-post from my blog: TIQView

     

    A few years ago I worked a lot with Yahoo! Pipes to manage and aggregate RSS feeds about QlikView, data quality and other fields of interest for my Netvibes dashboard. Eventually I stopped keeping up with this information overload. However, a current project requires extracting opinions from websites, especially from forum and blog posts (including comments), for further text, content and sentiment analysis.

     

    Remembering Pipes, I created a pipe that loops through a blog’s RSS feed to get the full content of each post, not only the teaser text. Although you could also loop through the feed’s links in QlikView, I thought it would be a nicer solution to have it all together in one QlikView load statement against a web source.

     

    Thanks to Barry Harmsen, I’m allowed to use his famous blog as a source of inspiration AND data for my example:

    The Qlik Fix! (don’t click :D)

     

    Here is the link to the pipe I’ve created: QlikView-Crawling-Example

    You can clone the pipe and edit it:

     

    Yahoo_Pipes_QlikView_01.png

    If you render the pipe as RSS and look at the page source (Ctrl+U in the browser), you will see the post content extracted from the website in the <content:encoded> tag.
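
    To illustrate what a consumer of the rendered feed sees, here is a minimal Python sketch that reads <content:encoded> from an RSS string. It is not part of the pipe or the QlikView script; the sample feed, the example.com link and the function name read_items are made up for the illustration. The namespace URI is the one defined by the RSS content module.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS "content" module, which defines <content:encoded>.
CONTENT_NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

# Hypothetical sample feed, standing in for the RSS rendered by the pipe.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first-post</link>
      <content:encoded><![CDATA[<div id="content"><p>Full text</p></div>]]></content:encoded>
    </item>
  </channel>
</rss>"""


def read_items(rss_xml):
    """Return a list of (title, link, encoded_html) tuples from an RSS 2.0 string."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        # The prefix "content" is resolved through the namespace mapping.
        encoded = item.findtext("content:encoded", default="", namespaces=CONTENT_NS)
        items.append((title, link, encoded))
    return items


if __name__ == "__main__":
    for title, link, encoded in read_items(SAMPLE_FEED):
        print(title, link, len(encoded))
```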

    The pipe loops through the RSS feed, fetches the page behind each item’s link and extracts that page’s content.

    The following XPath expression is used (discovered with Firebug from the related <div> tag): //*[@id="content"] This cuts out only the content of the post, nothing from the surrounding page frame. It looks like this in the editor; I marked the important properties:

     

    Yahoo_Pipes_QlikView_02.png
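
    For readers without access to Pipes, the extraction step can be approximated outside of it. The following is only a sketch in Python using the standard library, not what the pipe runs internally: it returns the inner markup of the first element whose id is "content", which roughly mirrors the XPath //*[@id="content"] (the XPath selects the element itself, this sketch keeps only what is inside it). The names ContentExtractor and extract_by_id are hypothetical. With an XPath-capable parser such as lxml the expression could be used directly.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentExtractor(HTMLParser):
    """Collect the inner markup of the first element with a given id."""

    def __init__(self, target_id="content"):
        # Keep character references such as &amp; untouched in the output.
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.depth = 0      # 0 = outside the target element
        self.done = False   # True once the target element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            if (not self.done and tag not in VOID_TAGS
                    and dict(attrs).get("id") == self.target_id):
                self.depth = 1
            return
        self.parts.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are copied but do not change the depth.
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth > 0:
            self.parts.append("</%s>" % tag)
        else:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth > 0:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth > 0:
            self.parts.append("&#%s;" % name)


def extract_by_id(page_html, target_id="content"):
    """Return the inner HTML of the element with the given id ('' if absent)."""
    parser = ContentExtractor(target_id)
    parser.feed(page_html)
    parser.close()
    return "".join(parser.parts)
```

    The sketch assumes reasonably well-formed markup; unbalanced tags inside the content area would throw off the depth counter.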

     

    In the next step everything is loaded into QlikView and the HTML tags are stripped (I’ve used my example code, which I also posted on Gist). Now we have the plain text for further use with text analysis and sentiment APIs.

     

    Yahoo_Pipes_QlikView_03.png
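
    The tag stripping itself happens in the QlikView script from the Gist, which is not reproduced here. As a language-neutral illustration of the same idea, here is a small Python sketch using only the standard library; the names TagStripper and strip_tags and the choice of block-level tags are my assumptions for this example, not a port of the QlikView code.

```python
import re
from html.parser import HTMLParser

# Tags after which words should not run together once the markup is gone.
BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "tr", "td", "th",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"}
# Tags whose text content is code or styling, not readable text.
SKIP_TAGS = {"script", "style"}


class TagStripper(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        # The default convert_charrefs=True decodes references like &amp;.
        super().__init__()
        self.chunks = []
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self.skip:
                self.skip -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data)


def strip_tags(html_text):
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()
```

    Inline tags such as <b> are dropped without adding a space, so "Hello <b>world</b>!" keeps its punctuation attached, while block-level tags are replaced by a single space.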

     

    You can download the example here: QlikView-Crawling-Example.zip

    Before opening the QVW file, please install the QlikView Minimalistic HtmlTextBox Object Extension from Stefan Walther (probably the extension with the longest name).

    In the next post I will show how to process the plain text with text analysis and sentiment APIs. Have fun so far, keep on Qliking!

     

    - Ralf