Anonymous
Not applicable

How to read data from a word file

Hi Talend Team,
I would like to read some data from a Word file.
Is there a component that can read a Word file directly?
If not, is there any other way to do it?
Regards,
Sandeep.
4 Replies
Anonymous
Not applicable
Author

Hi,
there is a discussion on LinkedIn about this topic (or was it you who asked the question there?): http://www.linkedin.com/groupItem?view=&gid=812977&type=member&item=107111395&qid=06608beb-085b-4573...
Still, the problem with a Word document is that it is unstructured: it can contain tables, text, images, links, headers, even other embedded documents. An Excel sheet is easier to read because it at least consists of tables. So you cannot go directly from a Word document to data; you need a step that extracts the structured information first. In theory you could write a script that saves the Word document as plain text, but won't you lose information that way?
If you know what is in the Word document, e.g. CSV (comma-separated values), you can use the POI API or Visual Basic to extract the data from Word, usually as delimited values (CSV), and then use Talend to do something useful with the data.
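To make the "extract plain text first" step concrete, here is a minimal sketch that uses only the JDK, with no POI. It is an alternative to the POI route mentioned above, not a replacement for it, and it rests on several assumptions: the file is a .docx (a ZIP archive), its main part sits at the conventional path word/document.xml, and it uses the transitional WordprocessingML namespace. Legacy binary .doc files are not covered and still need POI or a similar library. The class name DocxText is made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: pulls paragraph text out of a .docx using only the JDK.
public class DocxText {
    // Namespace bound to the "w:" prefix in transitional WordprocessingML (assumption).
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of each w:p element in word/document.xml, in document order. */
    public static List<String> paragraphs(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return parseParagraphs(zip);
                }
            }
        }
        throw new IOException("word/document.xml not found; is this a .docx file?");
    }

    private static List<String> parseParagraphs(InputStream xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations so untrusted files cannot pull in external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(xml);

        List<String> result = new ArrayList<>();
        NodeList paras = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paras.getLength(); i++) {
            // A paragraph's visible text is split across one or more w:t runs.
            NodeList runs = ((Element) paras.item(i)).getElementsByTagNameNS(W_NS, "t");
            StringBuilder text = new StringBuilder();
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent());
            }
            result.add(text.toString());
        }
        return result;
    }
}
```

Each returned string is one paragraph, so writing the list to a temporary file, one paragraph per line, gives Talend something it can read row by row. Formatting, images, and table layout are dropped, which is exactly the loss of information mentioned above.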
Carpe diem
Gabriel
Anonymous
Not applicable
Author

Hi Gabriel,
First of all thank you for your reply.
I have a requirement to read data from a Microsoft Word file.
I am well aware that a Word file is unstructured, but I just want to match a pattern in the file and read the data next to it.
For example:
Name : kathi
Place : USA
with a specified delimiter.
I want to match the label "Name" and read the value "kathi" in TOS.
Regards,
Sandeep.
Anonymous
Not applicable
Author

Hi Sandeep,
then I'd create a script using the POI API (or any other Word manipulation API, e.g. Lucene) to extract the document body as plain text. (I usually deploy all my routines as web services; that is easier and more accessible than building a new Talend component.) Then:
- for every document (tFileList)
- extract the content as plain text (tSSH, tWebService) into a temporary file
- read it row by row (tFileInputFullRow)
- check whether it contains the searched string (tFilterRow)
- read the other rows you need (tFileInputRegex)
There is no out-of-the-box Talend component that extracts plain text from a Word document, though. In theory, you could reuse the WordExtractor from the Lucene project (it uses POI as well).
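Once the text is in a plain file, the row-level steps above come down to one regular expression per line. A minimal sketch in plain Java of the kind of logic tFilterRow and tFileInputRegex would apply, assuming one "label : value" pair per line and ":" as the delimiter; the class name LabelValueParser is made up for illustration and is not a Talend API.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects "label : value" lines from extracted plain text.
public class LabelValueParser {
    // Group 1 is the label (everything before the first ':'), group 2 the value;
    // surrounding whitespace is trimmed from both.
    private static final Pattern LINE =
        Pattern.compile("^\\s*([^:]+?)\\s*:\\s*(.*?)\\s*$");

    /** Returns label -> value for every matching line; labels are lower-cased. */
    public static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {   // split on any line break
            Matcher m = LINE.matcher(line);
            if (m.matches()) {                    // lines without a ':' are skipped
                fields.put(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
            }
        }
        return fields;
    }
}
```

With the example from this thread, `LabelValueParser.parse("Name : kathi\nPlace : USA").get("name")` yields `kathi`. If a label occurs more than once, the last value wins; for a different delimiter, the ":" in the pattern would have to change.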
Gabriel
Anonymous
Not applicable
Author

Hi Gabriel,
Thank you once again for your reply.
So we can extract the text using a script based on the POI API. Could you please mail or post the procedure for creating a sample job? That would be a great help to me.

Regards,
Sandeep.