Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
renjithpl
Specialist
Specialist

Data from PDF

Hi,

I have very very urget requirement,

I have lot of pdf files (ocr converted) which is in table format.

I just want to read raw data from those pdf and load it in qlikview.

Is there any any way of doing it.

I tried copy pasting in excel, but some of alphabets (eg. S may change in to 8 and L may change into 1)

so what would you guys suggest, i am very eager to know about this...

Please help me.

Ren

1 Solution

Accepted Solutions
prieper
Master II
Master II

Good day,

as kind of starting-idea you may use the below script, which reads only the first pages.
Major problem are the garbled base-data, meaning that whenever your OCR thinks that there needs a line to be added (like row 406). You may bring in a further validation to trap this error, but you will run into serious problem, whenever the OCR breaks one record into two lines.....
Also might make sense to transform the Dollar Amount into correct numbers (eliminating "S" at the end, which might have been originally "$", also eliminating the "," as thousand-separator etc).
You may also need to grap some header-information from the text, like bank account or the like, or to break up the date from the Reference etc.etc.

Script might be:

// ==== Load Line by Line and identify the headers ========================
RawData:
LOAD
*,
RowNo() AS LineNo,
IF(WILDMATCH(Line, '*SERIAL*', '*DOLLAR*', '*REFERENCE*'),
'HEADER', 'DATA') AS DataIdentifyer;
LOAD
@1:n AS Line
FROM
[.\test PDF2TXT.txt]
(fix, codepage is 1252);
// ==== Indentify headers and starting-lines of each section ==============
StartingPoints:
LOAD
LineNo +2 AS StartRow,
Line AS Header
RESIDENT
RawData
WHERE
DataIdentifyer = 'HEADER'
ORDER BY
LineNo DESC;

// ==== From each Header downwards load 49 records ========================
FOR i = 0 TO NOOFROWS('StartingPoints') -1
LET sHeader = '[' & PEEK('Header', i, 'StartingPoints') & ']'; // Title of field
LET iDataStart = PEEK('StartRow', i, 'StartingPoints'); // Starting Row to read
LET iDataEnd = iDataStart + 49;
IF i = 0 THEN // Load Parameter
LET sLoadString = 'LOAD ';
ELSE
LET sLoadString = 'JOIN (Data) LOAD ';
END IF

Data:
$(sLoadString)
RowNo() AS RowNo_PartTable, // Join-Parameter
Line AS $(sHeader)
RESIDENT
RawData
WHERE
LineNo > $(iDataStart)
AND
LineNo < $(iDataEnd);
NEXT i

DROP TABLE RawData;
DROP TABLE StartingPoints;


HTH
Peter

View solution in original post

8 Replies
Not applicable

With PDF you still have a bunch of scanned files. Configure the OCR software to write to text files instead

-Alex

prieper
Master II
Master II

You may also use - at least with Adobe Reader - the possibility to export to Text, provided the initial document is not scanned, but created by using an PDF-printer.

HTH
Peter

renjithpl
Specialist
Specialist
Author

Hi Peter,

This is not an answer, but a question again,

I have saved my pdf file as .txt format,

see the attachement,

There is a "Serial Number" with 49 entries, then comes the "Dollar value", then "Date ISS" and then Reference Number

Please guide me how to load data from .txt, since this is my first time where i will be loading data from .txt file.

the .txt file has lot of pages, where ever the serialno, dollar value, date iss and referenceno in the file, it should read from one .txt file.

is it possible, or you can give any kind of suggestion.

My out put should look like.

SerialNODollarValueDateISSReferenceNO
638112285.00S1/17/2002S2000849
638120715,465.00S1/17/2002S2000934
6381212720.00S1/17/2002S2000939
6381300599.99S1/18/2002A2003262


THanks in advance,

renjithpl
Specialist
Specialist
Author

hi peter,

in my post i have attached the .txt file... Please have a look at it, and reply if you can figure out some kind of solution.

thanks

prieper
Master II
Master II

Is the PDF-file really having the columns shown one below the other?
It seems to me very much to be scanned text run through an OCX-program.


If it is always the serial with x entries, then followed by Dollar with exactly x entries etc, it should be workable, but looking into the text-file in the second section starting in line 240 there are only 17 entries for Dollar, then some references and then probably Dollars again.
This seems to be lots of garbage and might be cleansed within the program creating the PDF.

++
Peter

renjithpl
Specialist
Specialist
Author

Hi Peter

Thanks for your reply, i know the data is really a mess. Ok lets take it this way. Take first two Pages where the data is almost clear with 49 entries, then how can i load data from those two pages. you can delete all other Pages and keep first two Pages so that it can be pretty clear.

You can even take one page too and work out.

The problem is i dont know how to load .txt file if such type of .txt file is given.

Thanks a lot Peter. 🙂

prieper
Master II
Master II

Good day,

as kind of starting-idea you may use the below script, which reads only the first pages.
Major problem are the garbled base-data, meaning that whenever your OCR thinks that there needs a line to be added (like row 406). You may bring in a further validation to trap this error, but you will run into serious problem, whenever the OCR breaks one record into two lines.....
Also might make sense to transform the Dollar Amount into correct numbers (eliminating "S" at the end, which might have been originally "$", also eliminating the "," as thousand-separator etc).
You may also need to grap some header-information from the text, like bank account or the like, or to break up the date from the Reference etc.etc.

Script might be:

// ==== Load Line by Line and identify the headers ========================
RawData:
LOAD
*,
RowNo() AS LineNo,
IF(WILDMATCH(Line, '*SERIAL*', '*DOLLAR*', '*REFERENCE*'),
'HEADER', 'DATA') AS DataIdentifyer;
LOAD
@1:n AS Line
FROM
[.\test PDF2TXT.txt]
(fix, codepage is 1252);
// ==== Indentify headers and starting-lines of each section ==============
StartingPoints:
LOAD
LineNo +2 AS StartRow,
Line AS Header
RESIDENT
RawData
WHERE
DataIdentifyer = 'HEADER'
ORDER BY
LineNo DESC;

// ==== From each Header downwards load 49 records ========================
FOR i = 0 TO NOOFROWS('StartingPoints') -1
LET sHeader = '[' & PEEK('Header', i, 'StartingPoints') & ']'; // Title of field
LET iDataStart = PEEK('StartRow', i, 'StartingPoints'); // Starting Row to read
LET iDataEnd = iDataStart + 49;
IF i = 0 THEN // Load Parameter
LET sLoadString = 'LOAD ';
ELSE
LET sLoadString = 'JOIN (Data) LOAD ';
END IF

Data:
$(sLoadString)
RowNo() AS RowNo_PartTable, // Join-Parameter
Line AS $(sHeader)
RESIDENT
RawData
WHERE
LineNo > $(iDataStart)
AND
LineNo < $(iDataEnd);
NEXT i

DROP TABLE RawData;
DROP TABLE StartingPoints;


HTH
Peter

renjithpl
Specialist
Specialist
Author

Thanks a lot Peter, it worked good, 🙂 i got a very good idea about how to load data from .txt as well.

i tried for the bigger file, its taking some time, so i am splitting the .txt file in to many and tried your method. Great.

God bless you..