Hi everyone,
I want to extract data from an HTML document and insert it into a table, so I've built this job. In the tJavaFlex I am using this code to remove the HTML tags:
row15.content = row14.content.replaceAll("\\<.*?\\>", "").replaceAll("\\<\\?html(.+?)\\?\\>", "").replaceAll("<style([\\s\\S]+?)</style>", "").replaceAll(" "," ").trim();
But after executing the job, it always shows this error: "Error on line 1 of document : Content is not allowed in prolog"
Can anyone please help me find a solution so I can extract the data from the HTML document? Thanks in advance.
@behi behi the error 'Content is not allowed in prolog' usually means there is something wrong with the file; you can read more discussions about this error on this page.
BTW, there is no official component for extracting data from an HTML document. Here is a custom component shared by a community user on Talend Exchange; you can download it and test whether it meets your needs.
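To illustrate what the message means: it comes from an XML parser that found non-whitespace text before the root element (the "prolog" is everything ahead of the first tag). The sketch below is plain Java using the JDK's built-in parser, not part of any Talend component; the class name `PrologCheck` and the sample strings are made up for illustration. It only shows that text which is no longer well-formed XML, such as HTML whose tags have already been stripped, is rejected on line 1.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class PrologCheck {

    /** Returns null if the text parses as XML, otherwise the parser's error message. */
    static String parseError(String text) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Well-formed XML: no error expected.
        System.out.println(parseError("<root><item>ok</item></root>"));
        // Plain text ahead of the first tag: the parser rejects it in the prolog.
        System.out.println(parseError("Some stripped text <root/>"));
    }
}
```

If a downstream component in the job expects XML, feeding it tag-stripped text would plausibly trigger the same failure, so it is worth checking which component actually raises the error.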
Regards
Shong
Hi @Shicong Hong, thanks for replying. The problem is that the HTML document is stored locally on my laptop, which is why I had to use regular expressions to remove the HTML tags.
I don't know if thtmlinput is suitable in this case!
PS: I also tried the TtikaExtractor component, but it didn't work either.
Regards
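
For reference, the regex approach from the first post can be sketched in plain Java as below. This is only a rough sketch, not a tested Talend routine: the class name `HtmlText`, the method `stripTags`, and the file path argument are hypothetical. One detail matters: `<style>` and `<script>` blocks have to be removed before the generic tag pattern runs, otherwise only their tags disappear and the CSS or JavaScript text stays in the output. Regular expressions also cannot handle every HTML edge case, so a real HTML parser is the safer choice for messy documents.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
class HtmlText {

    /** Removes style/script blocks, comments, and tags, then collapses whitespace. */
    static String stripTags(String html) {
        return html
            // 1. Drop whole <style>...</style> and <script>...</script> blocks first.
            .replaceAll("(?is)<(style|script)\\b.*?</\\1\\s*>", " ")
            // 2. Drop HTML comments.
            .replaceAll("(?s)<!--.*?-->", " ")
            // 3. Drop every remaining tag; a space keeps adjacent words apart.
            .replaceAll("<[^>]*>", " ")
            // 4. Collapse runs of whitespace and trim the ends.
            .replaceAll("\\s+", " ")
            .trim();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is a path to a local HTML file (requires Java 11+ for readString).
        String html = Files.readString(Path.of(args[0]));
        System.out.println(stripTags(html));
    }
}
```

Inside a tJavaFlex, and assuming the method were made available to the job (for example as a user routine), the main code would reduce to something like `row15.content = HtmlText.stripTags(row14.content);`.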