Solved: Data from PDF - Qlik Community

renjithpl · ‎2010-11-02

Hi,

I have very very urget requirement,

I have lot of pdf files (ocr converted) which is in table format.

I just want to read raw data from those pdf and load it in qlikview.

Is there any any way of doing it.

I tried copy pasting in excel, but some of alphabets (eg. S may change in to 8 and L may change into 1)

so what would you guys suggest, i am very eager to know about this...

Please help me.

Ren

prieper · ‎2010-11-05

Good day,

as kind of starting-idea you may use the below script, which reads only the first pages.
Major problem are the garbled base-data, meaning that whenever your OCR thinks that there needs a line to be added (like row 406). You may bring in a further validation to trap this error, but you will run into serious problem, whenever the OCR breaks one record into two lines.....
Also might make sense to transform the Dollar Amount into correct numbers (eliminating "S" at the end, which might have been originally "$", also eliminating the "," as thousand-separator etc).
You may also need to grap some header-information from the text, like bank account or the like, or to break up the date from the Reference etc.etc.

Script might be:

// ==== Load Line by Line and identify the headers ========================
 RawData:
 LOAD
 *,
 RowNo() AS LineNo,
 IF(WILDMATCH(Line, '*SERIAL*', '*DOLLAR*', '*REFERENCE*'),
 'HEADER', 'DATA') AS DataIdentifyer;
 LOAD
 @1:n AS Line
 FROM
 [.\test PDF2TXT.txt]
 (fix, codepage is 1252);
// ==== Indentify headers and starting-lines of each section ==============
 StartingPoints:
 LOAD
 LineNo +2 AS StartRow,
 Line AS Header
 RESIDENT
 RawData
 WHERE
 DataIdentifyer = 'HEADER'
 ORDER BY
 LineNo DESC;

// ==== From each Header downwards load 49 records ========================
 FOR i = 0 TO NOOFROWS('StartingPoints') -1
 LET sHeader = '[' & PEEK('Header', i, 'StartingPoints') & ']'; // Title of field
 LET iDataStart = PEEK('StartRow', i, 'StartingPoints'); // Starting Row to read
 LET iDataEnd = iDataStart + 49;
 IF i = 0 THEN // Load Parameter
 LET sLoadString = 'LOAD ';
 ELSE
 LET sLoadString = 'JOIN (Data) LOAD ';
 END IF

 Data:
 $(sLoadString)
 RowNo() AS RowNo_PartTable, // Join-Parameter
 Line AS $(sHeader)
 RESIDENT
 RawData
 WHERE
 LineNo > $(iDataStart)
 AND
 LineNo < $(iDataEnd);
 NEXT i

DROP TABLE RawData;
DROP TABLE StartingPoints;

HTH
Peter

View solution in original post

Report Inappropriate Content · ‎2010-11-02

With PDF you still have a bunch of scanned files. Configure the OCR software to write to text files instead

-Alex

prieper · ‎2010-11-02

You may also use - at least with Adobe Reader - the possibility to export to Text, provided the initial document is not scanned, but created by using an PDF-printer.

HTH
Peter

renjithpl · ‎2010-11-02

Hi Peter,

This is not an answer, but a question again,

I have saved my pdf file as .txt format,

see the attachement,

There is a "Serial Number" with 49 entries, then comes the "Dollar value", then "Date ISS" and then Reference Number

Please guide me how to load data from .txt, since this is my first time where i will be loading data from .txt file.

the .txt file has lot of pages, where ever the serialno, dollar value, date iss and referenceno in the file, it should read from one .txt file.

is it possible, or you can give any kind of suggestion.

My out put should look like.

SerialNO	DollarValue	DateISS	ReferenceNO
6381122	85.00S	1/17/2002	S2000849
6381207	15,465.00S	1/17/2002	S2000934
6381212	720.00S	1/17/2002	S2000939
6381300	599.99S	1/18/2002	A2003262

THanks in advance,

renjithpl · ‎2010-11-02

hi peter,

in my post i have attached the .txt file... Please have a look at it, and reply if you can figure out some kind of solution.

thanks

prieper · ‎2010-11-02

Is the PDF-file really having the columns shown one below the other?
It seems to me very much to be scanned text run through an OCX-program.

If it is always the serial with x entries, then followed by Dollar with exactly x entries etc, it should be workable, but looking into the text-file in the second section starting in line 240 there are only 17 entries for Dollar, then some references and then probably Dollars again.
This seems to be lots of garbage and might be cleansed within the program creating the PDF.

++
Peter

renjithpl · ‎2010-11-02

Hi Peter

Thanks for your reply, i know the data is really a mess. Ok lets take it this way. Take first two Pages where the data is almost clear with 49 entries, then how can i load data from those two pages. you can delete all other Pages and keep first two Pages so that it can be pretty clear.

You can even take one page too and work out.

The problem is i dont know how to load .txt file if such type of .txt file is given.

Thanks a lot Peter. 🙂

prieper · ‎2010-11-05

Good day,

as kind of starting-idea you may use the below script, which reads only the first pages.
Major problem are the garbled base-data, meaning that whenever your OCR thinks that there needs a line to be added (like row 406). You may bring in a further validation to trap this error, but you will run into serious problem, whenever the OCR breaks one record into two lines.....
Also might make sense to transform the Dollar Amount into correct numbers (eliminating "S" at the end, which might have been originally "$", also eliminating the "," as thousand-separator etc).
You may also need to grap some header-information from the text, like bank account or the like, or to break up the date from the Reference etc.etc.

Script might be:

// ==== Load Line by Line and identify the headers ========================
 RawData:
 LOAD
 *,
 RowNo() AS LineNo,
 IF(WILDMATCH(Line, '*SERIAL*', '*DOLLAR*', '*REFERENCE*'),
 'HEADER', 'DATA') AS DataIdentifyer;
 LOAD
 @1:n AS Line
 FROM
 [.\test PDF2TXT.txt]
 (fix, codepage is 1252);
// ==== Indentify headers and starting-lines of each section ==============
 StartingPoints:
 LOAD
 LineNo +2 AS StartRow,
 Line AS Header
 RESIDENT
 RawData
 WHERE
 DataIdentifyer = 'HEADER'
 ORDER BY
 LineNo DESC;

// ==== From each Header downwards load 49 records ========================
 FOR i = 0 TO NOOFROWS('StartingPoints') -1
 LET sHeader = '[' & PEEK('Header', i, 'StartingPoints') & ']'; // Title of field
 LET iDataStart = PEEK('StartRow', i, 'StartingPoints'); // Starting Row to read
 LET iDataEnd = iDataStart + 49;
 IF i = 0 THEN // Load Parameter
 LET sLoadString = 'LOAD ';
 ELSE
 LET sLoadString = 'JOIN (Data) LOAD ';
 END IF

 Data:
 $(sLoadString)
 RowNo() AS RowNo_PartTable, // Join-Parameter
 Line AS $(sHeader)
 RESIDENT
 RawData
 WHERE
 LineNo > $(iDataStart)
 AND
 LineNo < $(iDataEnd);
 NEXT i

DROP TABLE RawData;
DROP TABLE StartingPoints;

HTH
Peter

renjithpl · ‎2010-11-07

Thanks a lot Peter, it worked good, 🙂 i got a very good idea about how to load data from .txt as well.

i tried for the bigger file, its taking some time, so i am splitting the .txt file in to many and tried your method. Great.

God bless you..

Data from PDF

Related Topics