Retrieve Data from HTML document and insert it int... - Qlik Community

behi5542 · 2022-05-25

Hi everyone,

I want to extract data from an HTML document and insert it into a table, so I built this job. In the tJavaFlex I am using this code to remove the HTML tags:

row15.content = row14.content.replaceAll("\\<.*?\\>", "").replaceAll("\\<\\?html(.+?)\\?\\>", "").replaceAll("<style([\\s\\S]+?)</style>", "").replaceAll(" "," ").trim();

But after executing the job, it always fails with this error: "Error on line 1 of document : Content is not allowed in prolog"

Can anyone please help me find a solution so I can extract the data from the HTML document? Thanks in advance.
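One thing worth noting about the expression in the post: the generic tag pattern runs first, so the later `<?html ...?>` and `<style>...</style>` patterns can no longer match, and any CSS text between the style tags stays in the output. The sketch below reorders the steps. It is only an illustration under assumptions: the class and method names are made up, the declaration pattern is widened to any `<?...?>` block, tags are replaced by a space instead of an empty string so adjacent cell values stay separated, and the last call in the post is assumed to have targeted the `&nbsp;` entity.

```java
import java.util.regex.Pattern;

public class HtmlTextSketch {

    // Block-level removals run before the generic tag pattern; otherwise only
    // the <style> tags disappear and the CSS text between them is left behind.
    private static final Pattern STYLE_BLOCK =
            Pattern.compile("<style[\\s\\S]*?</style>", Pattern.CASE_INSENSITIVE);
    private static final Pattern DECLARATION = Pattern.compile("<\\?[\\s\\S]*?\\?>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String stripHtml(String html) {
        String text = STYLE_BLOCK.matcher(html).replaceAll("");
        text = DECLARATION.matcher(text).replaceAll("");
        // A space (not "") keeps neighbouring cell values from running together.
        text = ANY_TAG.matcher(text).replaceAll(" ");
        // Assumption: the final replaceAll in the post was meant for &nbsp;.
        text = text.replace("&nbsp;", " ");
        // Collapse the whitespace left behind by the removed tags.
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<?xml version=\"1.0\"?><html><head><style>td { color: red; }</style></head>"
                + "<body><table><tr><td>Alice</td><td>42&nbsp;kg</td></tr></table></body></html>";
        System.out.println(stripHtml(html)); // prints: Alice 42 kg
    }
}
```

In a tJavaFlex the same steps could be chained on `row14.content`. The result is plain text, not XML, which matters if a later component expects an XML document.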

Anonymous · 2022-05-30

@behi behi the error 'Content is not allowed in prolog' usually means there is something wrong with the file; you can read more discussions about this error on this page.

BTW, there is no official component for extracting data from an HTML document. Here is a custom component shared by a community user on Talend Exchange; you can download it and test whether it fits your needs.
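For context, this message is raised by the JDK's XML parser when it finds non-whitespace characters before the root element. A minimal sketch with the standard JAXP API, assuming (this is not confirmed in the thread) that some component in the job parses the content as XML; the class and method names are illustrative only:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class PrologErrorSketch {

    /** Returns null if the text parses as XML, otherwise a description of the parse error. */
    static String xmlError(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(text)));
            return null;
        } catch (SAXParseException e) {
            // The parser reports where it gave up, e.g. line 1 for leading text.
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        // Well-formed XML: no error expected.
        System.out.println(xmlError("<rows><row>Alice</row></rows>"));
        // Plain text, such as HTML with its tags stripped: rejected at line 1.
        System.out.println(xmlError("Alice 42 kg"));
    }
}
```

If this is what happens in the job, the fix is less about the file itself and more about not handing tag-stripped text (or raw HTML) to an XML-reading component.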

Regards

Shong

behi5542 · 2022-05-30

Hi @Shicong Hong, thanks for replying. The problem is that the HTML document is stored locally on my laptop, which is why I had to use regular expressions to remove the HTML tags.

I don't know if thtmlinput is suitable in this case!

PS: I also tried the TtikaExtractor component, but it didn't work either.
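For a file on the local disk, the content can also be loaded with plain Java before any tag stripping. The sketch below is a rough illustration, not a confirmed fix: the class name, the method names, and the path are hypothetical, the file is assumed to be UTF-8, and the byte order mark handling is included only because a leading BOM is a common cause of "Content is not allowed in prolog".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocalHtmlReaderSketch {

    /** Decodes bytes as UTF-8 (an assumption about the file) and drops a leading byte order mark. */
    static String decode(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    static String readHtml(Path file) throws IOException {
        return decode(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass the real location as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "C:/data/report.html");
        if (!Files.exists(file)) {
            System.out.println("File not found: " + file);
            return;
        }
        String html = readHtml(file);
        System.out.println(html.length() + " characters read");
    }
}
```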

Regards

Retrieve Data from HTML document and insert it into database

Java

ORACLE

Other

Talend Data Integration

v8.x