topic Re: parsing XML/HTML in Talend Studio

parsing XML/HTML

Anonymous — Fri, 23 Aug 2019 13:38:46 GMT

Hello everyone,
First of all, thank you for taking the time to help me.
I want to parse XML/HTML from the site https://www.cert.ssi.gouv.fr/
I'm expecting to get a table like this:

That is, I want to parse the HTML and extract all the CERTEFs with their title and publication date, along with all the VECs that exist in each CERTEF.
I do not know which component to use, or with which configuration, to extract exactly that table.

Thank you for your help.

Re: parsing XML/HTML

fdenis — Tue, 27 Aug 2019 09:29:08 GMT

Hi,
There is no dedicated component for that, but you can open HTML pages as XML and parse them using the XML components.
Be advised that many sites today fill in their content using JavaScript, so you cannot always access the data directly!
Is there a way to export the data as XLS or CSV? If yes, that is the best way.
Another possibility is to use RPA (Robotic Process Automation) to extract data from the web.
good luck
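
To illustrate the "open HTML as XML" idea, here is a minimal Java sketch using only the JDK's DOM and XPath classes. It only works when the page is well-formed XHTML; real-world HTML often is not and would need cleaning first. The markup, the class names (`item`, `date`) and the method name `extractItems` are made up for the example and are not taken from the actual site.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XhtmlXPathSketch {

    // Returns one "title | date" string per <div class="item"> block.
    // The element and class names are hypothetical, not the real site's markup.
    static List<String> extractItems(String xhtml) {
        try {
            // Parse the page as XML; this throws if the markup is not well-formed.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList items = (NodeList) xpath.evaluate(
                    "//div[@class='item']", doc, XPathConstants.NODESET);
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // Relative XPath expressions, evaluated against the current block.
                String title = xpath.evaluate("h3", item).trim();
                String date = xpath.evaluate("span[@class='date']", item).trim();
                rows.add(title + " | " + date);
            }
            return rows;
        } catch (Exception e) {
            throw new RuntimeException("Page is not parseable as XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div class='item'><h3>First notice</h3><span class='date'>2019-08-23</span></div>"
                + "<div class='item'><h3>Second notice</h3><span class='date'>2019-08-27</span></div>"
                + "</body></html>";
        for (String row : extractItems(page)) {
            System.out.println(row);
        }
    }
}
```

In a Talend job, the same XPath expressions would normally be configured in an XML component rather than hand-written like this.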

Re: parsing XML/HTML

Anonymous — Tue, 27 Aug 2019 11:18:08 GMT

First, thank you for your answer. No, I cannot export the site as CSV or XLS,
but maybe I just do not know how. Could you take a look at the site?
Anyway, I created a job as follows, but I have a problem writing the code.

I searched the community questions and found this: https://community.talend.com/t5/Design-and-Development/Extract-Multiple-table-using-tHTTPTableInput-...

But I do not know how to apply this approach to my project, because the site has several divs, PDFs and links, and the data is not laid out in specific tables.

thanks

regards

Re: parsing XML/HTML

fdenis — Tue, 27 Aug 2019 12:39:22 GMT

tHttpRequest allows you to get an HTTP response, such as REST, HTML or SOAP.
tJavaFlex is a free-form Java code component. I think the data is extracted in this component.
Regards,
good luck
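
As a rough idea of the kind of Java one might write in a free-form code component once the HTTP response body is available as a string, here is a minimal sketch that pulls identifiers out of raw HTML with a regular expression. It assumes the identifiers look like `CVE-YYYY-NNNN`, which is only a guess at what the "VECs" in the question are; the class and method names are hypothetical, and how the response string reaches this code inside a job is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdExtractorSketch {

    // Assumed identifier format (CVE-YYYY-NNNN...); adjust to the real data.
    private static final Pattern ID_PATTERN = Pattern.compile("CVE-\\d{4}-\\d{4,}");

    // Returns each distinct identifier found in the HTML, in order of first appearance.
    static List<String> extractIds(String html) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher m = ID_PATTERN.matcher(html);
        while (m.find()) {
            seen.add(m.group());
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        String body = "<p>See <a href='#'>CVE-2019-0001</a> and CVE-2019-12345."
                + " CVE-2019-0001 is mentioned twice.</p>";
        System.out.println(extractIds(body));
    }
}
```

A regular expression like this is fragile against markup changes, but it does not require the page to be well-formed XML.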