Hi Talend Team, I just want to read some data from a Word file. Is there a component that can read a Word file directly, or is there another way to do it? Regards, Sandeep.
Hi,
there is a discussion on LinkedIn about this topic (or was it you who asked the question there?):
http://www.linkedin.com/groupItem?view=&gid=812977&type=member&item=107111395&qid=06608beb-085b-4573...
Still, I say the problem with a Word document is that it is unstructured. I mean, it can contain tables, text, images, links, headers, other documents... You can read data from an Excel sheet because there, at least, the data sits in tables. So you cannot go directly from a Word doc; you need a step to extract any structured information first. In theory, you could create a script to save your Word document as clear text, but won't you lose some information?
If you know what is in the Word document, e.g. CSV (comma-separated values), you can use the POI API or Visual Basic to extract the data from Word, usually as delimited values (CSV), and then use Talend to do something useful with the data.
Carpe diem
Gabriel
Hi Gabriel, first of all, thank you for your reply. I have a requirement to read data from a Microsoft Word file. I am well aware that a Word file is unstructured, but I just want to match a pattern in the file and read the data next to it. For example: Name : kathi Place : USA, with a specified delimiter. I want to match "Name" and read the value "kathi" in TOS. Regards, Sandeep.
Hi Sandeep,
then I'd create a script using the POI API (or any other Word manipulation API, e.g. Lucene) to extract the document's body as clear text. (I usually deploy all my routines as web services; it is easier and more accessible than trying to make a new Talend component.) Then:
- for every document (tFileList)
- extract content as clear text (tSSH, tWebService) into a temporary file
- read per row (tFileInputFullRow)
- check if file contains searched string (tFilterRow)
- read other rows necessary (tFileInputRegex)
But there is no out-of-the-box Talend component to extract clear text from a Word document. In theory, you could reuse a WordExtractor from the Lucene project (it uses POI as well).
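To illustrate the matching step once the clear text exists, here is a minimal sketch in plain Java (Talend routines are Java classes with static methods, so something like this could live in a routine). It assumes the text has already been extracted, and that each "label : value" pair sits on its own line with ":" as the delimiter; both are assumptions about your documents. The class name FieldExtractor and the method extractFields are made up for this example and are not part of Talend or POI.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Talend or POI class.
class FieldExtractor {

    // Assumption: one "label : value" pair per line, e.g. "Name : kathi".
    // [ \t]* is used instead of \s* so a match never spills onto the next line.
    private static final Pattern FIELD =
            Pattern.compile("^[ \\t]*(\\w+)[ \\t]*:[ \\t]*(.*?)[ \\t]*$", Pattern.MULTILINE);

    // Returns the labels (lower-cased) mapped to their values, in document order.
    static Map<String, String> extractFields(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            fields.put(m.group(1).toLowerCase(), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Stand-in for the clear text extracted from the Word document.
        String text = "Some heading\nName : kathi\nPlace : USA\n";
        Map<String, String> fields = extractFields(text);
        System.out.println(fields.get("name"));  // prints "kathi"
        System.out.println(fields.get("place")); // prints "USA"
    }
}
```

Lines without a ":" (like the heading) are simply skipped. If your labels contain spaces or the delimiter is different, the pattern would need adjusting.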
Gabriel
Hi Gabriel,
Thank you once again for your reply.
So, we can extract the text using a script based on the POI API. Could you please mail or post the procedure for creating a sample job? It would be a great help to me.