Extract Data from URL in Talend

Anonymous · ‎2018-03-16

Hello everyone. I'm trying to do web crawling with Talend. I have managed it with the tHTTPTableInput component from Talend Exchange, and also with Java code by importing the jsoup library into a tJavaFlex component. The results are great, and I would like to be able to do the same on any site, but I'm still new to this field. Could someone give me a short overview and tell me what I'm missing? For simple, static sites such as "http://www.imdb.com/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe...", which is a movie rating site, I can do it with a few lines of Java. For more complex, non-static sites such as "https://www.risultati.it", which is a live soccer results site, I cannot. What am I missing?

Is jsoup not powerful enough to crawl all kinds of sites? Thanks in advance to anyone who takes the time to open up this field to a newcomer.

Anonymous · ‎2018-03-16

I wrote a tutorial on exactly this about 3 years ago. You can find it here: https://www.rilhia.com/tutorials/using-third-party-java-library-scrape-content-table-web-page

I suspect that better libraries are now available, but the thing I learnt doing this is that you need to build your job with the website it is going to scrape in mind. Think of a website like XML: you can't build a single job to handle all types of XML, and the same applies to websites.

I hope the tutorial gives you a few ideas.

Anonymous · ‎2018-03-19

Hi rhall_2_0, thank you so much for your solution. It is a very nice and well-structured tutorial.

Unfortunately, it does not solve my problem. I tried your solution and I cannot get the data out of dynamic sites like this one: "https://www.risultati.it/partita/MPX1oKd9/#informazioni-partita". There are many others of this type. I'm interested in a particular football match, and I cannot get its data because I don't understand how the data is exposed on the site. I think it is loaded from a database in some way. The HTML that you get with tHttpRequest in Talend does not contain all of the data that I need. Any other ideas? Thank you again in advance.

Anonymous · ‎2018-03-19

I've looked at the source of the page you posted, and I think you might struggle with this. The page appears to have been written to obfuscate the data and prevent scraping, and it uses a lot of JavaScript. I am not sure you will be able to do this. You *could* try saving the pages locally as HTML and then processing them. That *might* make it slightly easier.
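
One quick way to confirm this for a given page is to check whether a value you can see in the browser is present in the raw HTML once the script blocks are set aside. Below is a minimal sketch in plain Java, assuming you already have the response body as a String (for example from tHttpRequest). The class name StaticHtmlCheck, the method name and the sample pages are made up for illustration, and the regex is only a rough filter, not an HTML parser.

```java
import java.util.regex.Pattern;

public class StaticHtmlCheck {

    // Matches whole <script>...</script> blocks, ignoring case and spanning lines.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*>.*?</script>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /**
     * Returns true if expectedText occurs in the markup after script blocks
     * are removed, i.e. the value is part of the HTML the server sent rather
     * than something only referenced from JavaScript.
     */
    public static boolean isInStaticMarkup(String rawHtml, String expectedText) {
        String withoutScripts = SCRIPT_BLOCK.matcher(rawHtml).replaceAll("");
        return withoutScripts.contains(expectedText);
    }

    public static void main(String[] args) {
        // Hypothetical sample pages: one static, one filled in by a script.
        String staticPage =
                "<html><body><table><tr><td>Home</td><td>2</td></tr></table></body></html>";
        String scriptedPage =
                "<html><body><div id=\"app\"></div>"
                + "<script>render('Home', 2);</script></body></html>";

        System.out.println(isInStaticMarkup(staticPage, "Home"));
        System.out.println(isInStaticMarkup(scriptedPage, "Home"));
    }
}
```

If the check comes back false, the value is most likely filled in by JavaScript after the page loads. jsoup does not execute JavaScript, so it will not find that value in the raw HTML either.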

Anonymous · ‎2018-03-20

Yes, I noticed this. In fact, I have been fighting with it for a while. It is not urgent, because I do not have to do this at work, but it is something I find very interesting and would like to do. Thank you once again for your time! I hope to hear from you again if anyone finds a way to do it.

Anonymous · ‎2019-08-23

Hello,
First of all, thank you for taking the time to help me.
I want to parse XML / HTML from the site https://www.cert.ssi.gouv.fr/

When I try to extract data from the HTML page that comes with the component, everything works fine, but that page is very simple: it has no divs or blockquotes and is structured using tables only. When I use a page with more HTML tags, such as blockquotes, tHTTPTableInput does not seem to recognize the tables, so it throws
"Exception in component tHTTPTableInput_1 java.lang.ArrayIndexOutOfBoundsException:"

I'm expecting to have a table like that

That is, I want to parse the HTML and extract all the CERTEFs, each with a title and a publication date, and all the VECs that exist in each CERTEF.
I do not know which component I can use, or with which configuration, to extract exactly that table.

Thank you for helping me.

Anonymous · ‎2019-08-23

This is not going to be easy, and there is no component I know of which will just do it for you. I think you will need to use a bit of code. I have written a post which describes how I achieved something very similar. It is very complicated, but I am afraid there is no "easy" way of achieving this.

https://community.talend.com/t5/Design-and-Development/Extract-Multiple-table-using-tHTTPTableInput-...
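
To give a rough idea of what "a bit of code" could look like, here is a minimal sketch. It is not the method from the linked post. It uses the HTML parser bundled with the JDK (javax.swing.text.html) instead of a third-party library, purely so that it runs without extra jars. The class name TableRowSketch, the helper cell(...) and the sample HTML are hypothetical. The sketch collects the cells of each table row wherever the table is nested (inside a div or blockquote, for instance) and reads cells through a bounds-checked helper, so a short row gives a fallback value instead of an index exception. Whether short or missing rows are what causes the ArrayIndexOutOfBoundsException in the component is only a guess.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class TableRowSketch {

    /** Collects the text of each td/th, grouped by tr, at any nesting depth. */
    static class RowCollector extends HTMLEditorKit.ParserCallback {
        final List<List<String>> rows = new ArrayList<>();
        private List<String> currentRow;
        private StringBuilder currentCell;

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.TR) {
                currentRow = new ArrayList<>();
            } else if ((tag == HTML.Tag.TD || tag == HTML.Tag.TH) && currentRow != null) {
                currentCell = new StringBuilder();
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            // Only keep text that sits inside a table cell.
            if (currentCell != null) {
                currentCell.append(data);
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if ((tag == HTML.Tag.TD || tag == HTML.Tag.TH) && currentCell != null) {
                currentRow.add(currentCell.toString().trim());
                currentCell = null;
            } else if (tag == HTML.Tag.TR && currentRow != null) {
                rows.add(currentRow);
                currentRow = null;
                currentCell = null;
            }
        }
    }

    /** Parses the HTML string and returns one list of cell texts per table row. */
    public static List<List<String>> extractRows(String html) {
        RowCollector collector = new RowCollector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.rows;
    }

    /** Returns the cell at index, or fallback when the row is shorter than expected. */
    public static String cell(List<String> row, int index, String fallback) {
        return index < row.size() ? row.get(index) : fallback;
    }

    public static void main(String[] args) {
        // Made-up sample: a table wrapped in blockquote and div, second row is short.
        String html = "<html><body><blockquote><div><table>"
                + "<tr><td>ADVISORY-001</td><td>2019-08-20</td></tr>"
                + "<tr><td>ADVISORY-002</td></tr>"
                + "</table></div></blockquote></body></html>";
        for (List<String> row : extractRows(html)) {
            System.out.println(cell(row, 0, "") + " | " + cell(row, 1, "(no date)"));
        }
    }
}
```

In a Talend job, logic like this would typically live in a routine or a tJavaFlex, with the page body passed in as a String. For real pages, a dedicated library such as the one described in the tutorial is likely to cope better with messy markup than the JDK parser.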

Anonymous · ‎2019-08-27

First, thank you for your answer.
Anyway, I created a job as follows, but I have a problem writing the code.

I searched the questions in the community and found this link, but it doesn't work: https://www.rilhia.com/tutorials/using-third-party-java-library-scrape-content-table-web-page

I also don't know how to apply this approach to my project, because the site has several divs, PDFs and links, and the data is not exactly in specific tables.

thanks

regards

Anonymous · ‎2019-08-27

@mitra1367 That link no longer works. I recreated the work that was hosted there in this post: https://community.talend.com/t5/Design-and-Development/Extract-Multiple-table-using-tHTTPTableInput-...

I think I may have said that it is tricky and you need to understand the third party Java API that I mention. Take a look at the documentation for that.

Unfortunately, scraping websites is notoriously hard because there is no standard way of displaying data, so your solution will usually be entirely bespoke to your problem.

Big Data

Java

v6.x