Anonymous
Not applicable

How to extract data from a website?

Hi,
I've got two websites. One website supports SOAP, imports and so on.
The other website holds about 7000 HTML documents in an identical format, with the information in tables.
Now, with the relaunch, I have to move the content of the 7000 files into a database / CMS / SOAP.
I saw that Talend is able to connect via HTTP.
Can I also extract data from HTML tables?
Thank you.
Bye, Chris

20 Replies
Anonymous
Not applicable
Author

I don't think there is any dedicated way to extract data from an HTML table, but if the pages contain only tables you could use a regular expression.
Anonymous
Not applicable
Author

Hello Chris,
as Olivier wrote, there is no special component. I had the same problem and it ended up as a tJavaRow with many regexes. But that depends on your HTML structure. I've experimented a little with html2xml converters. If you search on Google you should find different tools (including open source ones). In the end I couldn't use them because my input was very badly formed.
If you find a solution, please give us some feedback.
Bye
Volker
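
The regex approach mentioned in the replies above could look roughly like the following. This is only a minimal sketch, not Volker's actual tJavaRow code: the class name `HtmlTableRegex` and the method `extractRows` are made up for illustration, and it assumes simple tables without nesting where every row and cell has an explicit closing tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Talend.
public class HtmlTableRegex {
    // One <tr>...</tr> block; DOTALL lets '.' span line breaks.
    private static final Pattern ROW = Pattern.compile(
            "<tr\\b[^>]*>(.*?)</tr>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // One <td> or <th> cell inside a row.
    private static final Pattern CELL = Pattern.compile(
            "<t[dh]\\b[^>]*>(.*?)</t[dh]>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Any remaining inline markup inside a cell, e.g. <b> or <a href=...>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns one list of cell texts per table row found in the input.
    public static List<List<String>> extractRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher rowMatcher = ROW.matcher(html);
        while (rowMatcher.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cellMatcher = CELL.matcher(rowMatcher.group(1));
            while (cellMatcher.find()) {
                // Strip inner tags and surrounding whitespace from the cell.
                cells.add(TAG.matcher(cellMatcher.group(1)).replaceAll("").trim());
            }
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table><tr><th>Name</th><th>Price</th></tr>"
                    + "<tr><td><b>Widget</b></td><td> 9.99 </td></tr></table>";
        for (List<String> row : extractRows(html)) {
            System.out.println(String.join(" | ", row));
        }
        // prints:
        // Name | Price
        // Widget | 9.99
    }
}
```

Inside a Talend job the body of `extractRows` could be placed in a routine and called from a tJavaRow. As Volker notes, this only works when the HTML structure is regular; missing closing tags or nested tables will break these patterns.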
Anonymous
Not applicable
Author

I have written an open source function for converting bad HTML to well-formed XML (http://sourceforge.net/projects/light-html2xml) and I would appreciate the chance to test it with your input.
It is a single-pass automaton and it does not need specific objects. It is not yet written in Java but in C# and in PHP5 (I will soon rewrite it in Java, especially if you're interested...).
Anonymous
Not applicable
Author

Yes, I think it would be a really good idea to write it in Java. Then I will create a specific Talend component to perform this action.
Anonymous
Not applicable
Author

Hi,
For internal stats we use some Talend jobs that call http://cpan.uwinnipeg.ca/module/HTML::TokeParser in tPerl/tPerlRow. We could put a new component on the stack if you need it.
Hope this helps
Anonymous
Not applicable
Author

The Java version of the html2xml function I have written can now be downloaded at http://sourceforge.net/projects/light-html2xml
Please send me your comments and remarks about it so that I can fix bugs.
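
Once a page has been converted to well-formed XML, by whatever tool, the table cells can be read with the XML parser and XPath support that ship with the JDK. The sketch below does not use or show the light-html2xml API. It assumes the converted document is already available as a string and that element names are lowercase. `XmlTableReader` and `readRows` are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: reads table rows from already well-formed XML/XHTML.
public class XmlTableReader {

    // Returns one list of cell texts per <tr> element in the document.
    public static List<List<String>> readRows(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select every row anywhere in the document.
            NodeList rowNodes = (NodeList) xpath.evaluate(
                    "//tr", doc, XPathConstants.NODESET);
            List<List<String>> rows = new ArrayList<>();
            for (int i = 0; i < rowNodes.getLength(); i++) {
                // Select the direct td/th children of this row.
                NodeList cellNodes = (NodeList) xpath.evaluate(
                        "td|th", rowNodes.item(i), XPathConstants.NODESET);
                List<String> cells = new ArrayList<>();
                for (int j = 0; j < cellNodes.getLength(); j++) {
                    // getTextContent() drops nested markup such as <b>.
                    cells.add(cellNodes.item(j).getTextContent().trim());
                }
                rows.add(cells);
            }
            return rows;
        } catch (Exception e) {
            // Parsing fails here if the input is not well-formed XML.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><body><table>"
                   + "<tr><th>Name</th><th>Price</th></tr>"
                   + "<tr><td><b>Widget</b></td><td> 9.99 </td></tr>"
                   + "</table></body></html>";
        for (List<String> row : readRows(xml)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Compared with regular expressions, this tolerates attributes and nested inline markup without extra patterns, but it depends entirely on the conversion step producing well-formed output.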
_AnonymousUser
Specialist III

Yes, you can extract all the data from the 7000 pages. I'm also working on this.
Anonymous
Not applicable
Author

I found another helpful tool for this:
http://www.iopus.com/imacros/firefox/?ref=fxmoz
It is an amazing tool for automating the web, and even data extraction works fine.
One could combine its output (e.g. Excel) with Talend to load the data into another database.
_AnonymousUser
Specialist III

Use the vder software to extract data from Amazon.com and output it in XML format. View a screenshot: http://binhgiang.sourceforge.net/xmlalbum/slides/vietspider%20xml%20list%20detail%201.html
Download it from: http://binhgiang.sourceforge.net/site/download.jsp