Hi rhall_2_0, thank you so much for your solution. It is a very nice and well-structured tutorial.
Unfortunately, it does not solve my problem. I tried your solution and I cannot get data out of dynamic sites like this one: "https://www.risultati.it/partita/MPX1oKd9/#informazioni-partita". There are many others of this type. I'm interested in a particular football match, and I cannot extract it because I don't understand how the data is exposed on the site; I think it is loaded from a database in some way. The HTML that you get with tHttpRequest in Talend does not contain all of the data I need. Any other ideas? Thank you again in advance.
I've looked at the source of the page you posted, and I think you might struggle with this. The page appears to have been written to obfuscate the data and prevent scraping, and it uses a lot of JavaScript. I am not sure you will be able to do this. You *could* try saving the pages locally as HTML and then processing them. That *might* make it slightly easier.
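Before building a job around saved pages, it may be worth checking whether the values you can see in the browser are actually present in the saved HTML. The following is only a rough sketch using the standard Java library (Java 11 or later); the class name, the file name `match.html` and the team names are placeholders I have made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SavedPageCheck {

    // Returns the expected values that do NOT appear anywhere in the HTML text.
    static List<String> missingValues(String html, List<String> expected) {
        List<String> missing = new ArrayList<>();
        for (String value : expected) {
            if (!html.contains(value)) {
                missing.add(value);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file: a page saved from the browser after it finished loading.
        Path saved = Path.of(args.length > 0 ? args[0] : "match.html");
        String html = Files.readString(saved, StandardCharsets.UTF_8);

        // Hypothetical values that are visible on the rendered page.
        List<String> expected = List.of("Juventus", "Milan");

        System.out.println("Not found in saved HTML: " + missingValues(html, expected));
    }
}
```

If every value is reported as missing, the data is probably being loaded by JavaScript after the page is delivered, and parsing the saved HTML is unlikely to help.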
Yes, I noticed this. In fact, I have been fighting with it for a while. It is not urgent, because I do not have to do this at work, but it is something I find very interesting and would like to do. I would like to thank you once again for your time, and I hope to hear back if anyone finds a way to do it.
Hello,
First of all, thank you for taking the time to help me.
I want to parse XML / HTML from the site https://www.cert.ssi.gouv.fr/
When I try to extract data from the HTML page that comes with the component, everything works fine, but that page is very simple: it has no divs or blockquotes and is structured using only tables. When I use a page with more HTML tags, such as blockquotes, tHTTPTableInput does not seem to recognize the tables, and it throws:
"Exception in component tHTTPTableInput_1 java.lang.ArrayIndexOutOfBoundsException:"
I'm expecting to get a table like that,
i.e. I want to parse the HTML and extract all the CERTEFs, each with its title and publication date, along with all the VECs listed in each CERTEF.
I do not know which component to use, or with which configuration, to extract exactly that table.
Thank you for helping me.
This is not going to be easy, and there is no component I know of which will just do it for you. I think you will need to use a bit of code. I have written a post which describes how I achieved something very similar. It is very complicated, but I am afraid there will not be an "easy" way of achieving this.
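To give an idea of the kind of code involved, here is a minimal sketch that walks an HTML document and collects the text of every table row. It is not the approach from my post: it uses the HTML parser that ships with the JDK (`javax.swing.text.html`) rather than any third-party library, it assumes the data sits in simple, non-nested tables, and the class name and sample HTML are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class TableScrapeSketch {

    // Collects the cell text of every <tr>, one inner list per row.
    // Assumes tables are not nested inside each other.
    static List<List<String>> extractRows(String html) {
        List<List<String>> rows = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private List<String> currentRow;
            private StringBuilder currentCell;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TR) {
                    currentRow = new ArrayList<>();
                } else if (tag == HTML.Tag.TD || tag == HTML.Tag.TH) {
                    currentCell = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Only keep text that appears inside a table cell.
                if (currentCell != null) {
                    currentCell.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if ((tag == HTML.Tag.TD || tag == HTML.Tag.TH) && currentCell != null) {
                    if (currentRow != null) {
                        currentRow.add(currentCell.toString().trim());
                    }
                    currentCell = null;
                } else if (tag == HTML.Tag.TR && currentRow != null) {
                    rows.add(currentRow);
                    currentRow = null;
                }
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample markup, not taken from any real site.
        String html = "<div><table>"
                + "<tr><th>Reference</th><th>Title</th></tr>"
                + "<tr><td>X-1</td><td>Example</td></tr>"
                + "</table></div>";
        for (List<String> row : extractRows(html)) {
            System.out.println(row);
        }
    }
}
```

In a Talend job, logic like this would typically live in a routine or a tJavaFlex component, and pages that keep their data in divs and links rather than tables would need different tag handling.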
First, thank you for your answer.
I created a job as follows, but I am having trouble writing the code.
I searched the questions in the community and found this tutorial, but the link doesn't work: https://www.rilhia.com/tutorials/using-third-party-java-library-scrape-content-table-web-page
I also don't know how I could use this approach for my project, because the site has several divs, PDFs and links, and the data is not in specific tables.
Thanks,
Regards
@mitra1367 That link no longer works, so I recreated the work that was hosted there. You can find it here: https://community.talend.com/t5/Design-and-Development/Extract-Multiple-table-using-tHTTPTableInput-...
I think I may have said that it is tricky and you need to understand the third party Java API that I mention. Take a look at the documentation for that.
Unfortunately, scraping websites is notoriously hard because there is no standard way of displaying data, so your solution will usually be entirely bespoke to your problem.