behi5542
Contributor

Retrieve Data from HTML document and insert it into database

Hi everyone,

I want to extract data from an HTML document and insert it into a table, so I've built this job. In the tJavaFlex, I am using this code to remove the HTML tags:

row15.content = row14.content.replaceAll("\\<.*?\\>", "").replaceAll("\\<\\?html(.+?)\\?\\>", "").replaceAll("<style([\\s\\S]+?)</style>", "").replaceAll("&nbsp;"," ").trim();

[Image: 0695b00000RiXYMAA3.png]

But after executing the job, it always shows this error: "Error on line 1 of document : Content is not allowed in prolog"

Can anyone please help me find a solution so I can extract the data from the HTML document? Thanks in advance.

2 Replies
Anonymous
Not applicable

@behi behi The error 'Content is not allowed in prolog' usually means there is something wrong in the file; you can read more discussions about this error on this page.
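To illustrate the error message quoted above: in Java, an XML parser reports it when it finds character data before the first piece of markup in the input. The sketch below reproduces that with the JDK's built-in parser. The class name `PrologDemo`, the helper `parsesAsXml`, and the sample strings are hypothetical, and whether this is what actually happens inside the job is an assumption, not something the thread confirms.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of any Talend component.
class PrologDemo {

    // Returns true if the text parses as XML, false if the parser rejects it.
    static boolean parsesAsXml(String text) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            // The parser's own message is printed so the cause is visible.
            System.out.println("Rejected: " + e.getMessage());
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed XML: accepted.
        System.out.println(parsesAsXml("<root>Hello</root>"));
        // Plain text ahead of the root element: rejected at line 1.
        System.out.println(parsesAsXml("Hello world <root/>"));
    }
}
```

If the job passes tag-stripped text (or a file that starts with stray characters) to an XML-reading component, the same kind of failure would be expected, so it may be worth checking which component in the job is parsing the content as XML.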

BTW, there is no official component for extracting data from an HTML document. Here is a custom component shared by a community user on Talend Exchange; you can download it and test whether it fits your needs.

Regards

Shong

behi5542
Contributor
Author

Hi @Shicong Hong, thanks for replying. The problem is that the HTML document is stored locally on my laptop, which is why I had to use regular expressions to remove the HTML tags.

I don't know whether thtmlinput is suitable in this case.

PS: I also tried the TtikaExtractor component, but it didn't work either.

Regards
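A note on the regular-expression approach used in this thread: the order of the `replaceAll` calls matters. In the expression from the original post, the generic tag pattern runs first, so the `<style>` and `</style>` tags are already gone by the time the style-block pattern runs, and the CSS text between them stays in the output. A minimal sketch of a reordered version follows; the class name `HtmlTextSketch` and the method `stripTags` are hypothetical, and regular expressions are only an approximation of real HTML parsing.

```java
// Hypothetical helper, shown only to illustrate the ordering of the replacements.
class HtmlTextSketch {

    static String stripTags(String html) {
        return html
            // 1. Remove whole <style> blocks, contents included
            //    ((?i) = case-insensitive, (?s) = dot also matches newlines).
            .replaceAll("(?is)<style\\b.*?</style>", "")
            // 2. Remove every remaining tag; [^>] also matches newlines.
            .replaceAll("<[^>]*>", "")
            // 3. Turn non-breaking-space entities into plain spaces.
            .replace("&nbsp;", " ")
            .trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style></head>"
                    + "<body><p>Hello&nbsp;world</p></body></html>";
        System.out.println(stripTags(html)); // prints "Hello world"
    }
}
```

In a tJavaFlex, the same chain could be applied to `row14.content` and assigned to `row15.content`, assuming those columns are plain Java strings as in the original post.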