A few years ago I worked a lot with Yahoo! Pipes to manage and aggregate RSS feeds about QlikView, data quality and other fields of interest for my Netvibes dashboard. Eventually I stopped following this flood of information. However, a current project requires extracting opinions from websites, especially from forum and blog posts (including comments), for further text, content and sentiment analysis.
Remembering Pipes, I created a pipe that loops through the RSS feed of a blog to get the full content of each post, not only the teaser text. Although you could also loop through the feed’s links in QlikView, I thought it would be nicer to have it all together in one QlikView load statement from a web source.
Thanks to Barry Harmsen, I’m allowed to use his famous blog as a source of inspiration AND data for my example:
If you render the pipe as RSS and look at the page source (Ctrl+U in the browser), you will see the snippet of post content extracted from the website in the <content:encoded> tag.
The pipe loops through the RSS feed, follows the link address of each item and fetches the content of the linked page.
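Outside of Pipes, the same loop can be sketched in a few lines of Python. This is only an illustration of the idea, not what the pipe runs internally: the function name `expand_feed`, the sample feed and the in-memory page lookup are all made up for the example, and the page fetcher is passed in as a parameter so the sketch runs without network access.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List


def expand_feed(rss_xml: str, fetch_page: Callable[[str], str]) -> List[Dict[str, str]]:
    """For every <item> in the feed, fetch the linked page and keep its body."""
    root = ET.fromstring(rss_xml)
    posts = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # skip items without a link address
        posts.append({
            "title": (item.findtext("title") or "").strip(),
            "link": link,
            "content": fetch_page(link),  # full page instead of the teaser
        })
    return posts


# Hypothetical feed and pages, used instead of a real blog.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Post A</title><link>http://example.com/a</link></item>
<item><title>Post B</title><link>http://example.com/b</link></item>
</channel></rss>"""

FAKE_PAGES = {
    "http://example.com/a": "<html>full text of A</html>",
    "http://example.com/b": "<html>full text of B</html>",
}

posts = expand_feed(SAMPLE_FEED, FAKE_PAGES.__getitem__)
for post in posts:
    print(post["title"], "->", post["content"])
```

Against a live feed, `fetch_page` could be a small wrapper around `urllib.request.urlopen` that reads and decodes the response.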
The following XPath expression is used (discovered with Firebug from the related <div> tag): //*[@id="content"] This cuts out only the content of the post, nothing from the frame around it. It looks like this in the editor, with the important properties marked.
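To see what this XPath expression selects, here is a minimal Python sketch using only the standard library. The sample page is invented and deliberately well-formed; real blog HTML is usually not valid XML, so a lenient HTML parser would be needed there. ElementTree also supports only a subset of XPath and expects a relative path, which is why the expression starts with a dot.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page with a frame around the post content.
SAMPLE_PAGE = """<html><body>
<div id="header">Site navigation</div>
<div id="content"><h1>Post title</h1><p>The full post text.</p></div>
<div id="footer">Copyright</div>
</body></html>"""


def extract_content(xhtml: str):
    """Return the first element whose id attribute is 'content', or None."""
    root = ET.fromstring(xhtml)
    # Same idea as //*[@id="content"], written as a relative path.
    return root.find(".//*[@id='content']")


node = extract_content(SAMPLE_PAGE)
content_html = ET.tostring(node, encoding="unicode")  # markup of the post only
content_text = "".join(node.itertext())               # text without the tags

print(content_html)
print(content_text)
```

The header and footer blocks are left out of both results, which is the effect the pipe relies on.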
In the next step everything is loaded into QlikView and the HTML tags are stripped (I also used the example code I posted on Gist). Now we have the plain text for further use with text analysis and sentiment APIs.
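The tag stripping itself is not tied to QlikView. As a rough illustration, and explicitly not the QlikView code from the Gist, this is how the same step might look in Python with the standard library's `html.parser`. The class name `TagStripper` and the helper `strip_tags` are made up for the example.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_text: str) -> str:
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


print(strip_tags("<p>Hello <b>World</b> &amp; more</p>"))
```

Note that this simple version also keeps the text inside <script> and <style> elements, so such blocks would have to be removed beforehand if the fetched content contains them.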