extract text from a pdf file

crackn101 · Nov 8, 2003

Hi everyone.
Here's my dilemma.
I have a PDF file that contains 9 pages.
What I need to do is extract a portion of the text from pages 1 and 3 and archive it in a text file.
I have been to CPAN and looked at all of the modules I could find that had anything to do with PDFs.
Almost all of the modules there are for creating PDF files.
I did find one, PDF::Extract, that would separate each of the
pages into single one-page PDF files, which also means that
I have direct access to the object created by PDF::Extract.
However, as I'm sure you know, the PDF data is
some kind of binary. I should point out that
there is no security enabled in this PDF.
You can print, edit, or change whatever you want using Adobe Acrobat.
The thing is, I want to upload this working script to my
web host provider and start archiving this PDF on a daily basis, so I am stuck using whatever modules they have installed.
I am currently using ActiveState Perl on my PC for testing
purposes.
Any suggestions would be greatly appreciated.
Thanks.
Crackn101

chazoid · Nov 10, 2003

Hi,
Reader 5.0 lets you export a PDF file to text, so if you could figure out a way to automate that and then parse the resulting text file, I think it would be the easiest method (if Reader is on the server).
Alternatively, you may want to look into Win32::OLE, although you might need the full version of Acrobat installed on the server to have access to the OLE object required. I'm not really sure.

It is possible to parse the binary PDF and extract compressed portions of text, then decompress it, but I think you'd need a fairly good knowledge of the internal structure of a PDF file. PDF files use several types of compression - for text I think they use either LZW or something called 'Flate'.
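
To make the idea concrete, here is a minimal sketch of that approach in Perl, not a complete PDF parser. It assumes the page content sits in Flate-compressed streams and that the text is stored as plain literal strings in parentheses, which is common but not guaranteed. The function name extract_pdf_text and the sample data are made up for illustration; Compress::Zlib is the only module used.

```perl
use strict;
use warnings;
use Compress::Zlib;    # inflateInit(), compress(), Z_OK, Z_STREAM_END

# Hypothetical helper: return the literal strings found in every
# Flate-compressed stream of the PDF data passed in.
sub extract_pdf_text {
    my ($pdf) = @_;
    my @chunks;

    # A stream body sits between "stream" + end-of-line and "endstream".
    # (A real parser would use the /Length entry instead of a regex.)
    while ($pdf =~ /stream\r?\n(.*?)endstream/sg) {
        my $data = $1;                    # copy: inflate() modifies its argument
        my $inflater = inflateInit() or next;
        my ($content, $status) = $inflater->inflate($data);

        # Skip streams that are not valid Flate data (images, LZW, etc.).
        next unless $status == Z_OK || $status == Z_STREAM_END;

        # Text operands look like (Hello) or [(PDF) -250 (world)].
        while ($content =~ /\(((?:\\.|[^\\()])*)\)/sg) {
            (my $text = $1) =~ s/\\([()\\])/$1/g;    # unescape \( \) \\
            push @chunks, $text;
        }
    }
    return join ' ', @chunks;
}

# Build a tiny stand-in for one PDF stream object to try the function on.
my $page   = "BT /F1 12 Tf 72 720 Td (Hello) Tj [(PDF) -250 (world)] TJ ET";
my $sample = "4 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
           . compress($page) . "\nendstream\nendobj\n";

print extract_pdf_text($sample), "\n";    # prints "Hello PDF world"
```

For a real file you would read the whole PDF in binary mode (for example, open it with the '<:raw' layer) and pass the contents to the function. Expect rough edges: the strings come out in the order they are drawn, hex strings and fonts with custom encodings are not decoded, and LZW streams are skipped.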

Hope this is of some help.. If I figure anything out I'll post it here. I work with PDF files a lot at work and this is something I'd like to do as well.

MikeLacey · Nov 11, 2003

http://theoryx5.uwinnipeg.ca/CPAN/data/PDF/PDF.html

Also, Text::PDF looks very good: not trivial, but it might well do everything you want and more.

Mike

Want to get great answers to your Tek-Tips questions? Have a look at faq219-2884

It's like this; even samurai have teddy bears, and even teddy bears get drunk.

crackn101 · Nov 11, 2003

Thanks for the suggestion.
I saw this module on CPAN, but didn't play around
with it very much. I think I will take a second look
at it and report back.
I do know that Adobe has a pdf2html Perl script hosted on
their site that I could probably call with all of the
needed parameters. They have an HTML form set up for
the public to use.
Something to play with.
Thanks.
Take Care.

Crackn101

siberian · Nov 11, 2003

Keep in mind also that PDF is not a single format; it's a wrapper format that can contain raster and vector data. What you get out will be HIGHLY dependent on what someone else put IN.

chazoid · Nov 11, 2003

You might want to check out this site too. At the bottom of the page, it says they have a tool for extracting text. I've tried their pdfsplit program, and it works fairly well for small PDF files, but it is a memory hog with large files. (Incidentally, it appears to have been written in Perl, then compiled with perl2exe.)
Anyway, you did say you couldn't install anything on the server, but I thought I'd throw this out there.

http://www.pdfeverywhere.com/

MikeLacey · Nov 12, 2003

From the pdfeverywhere web site.

PDF Text Extraction -- PDF Text extracts pure text information from the content of every page. The text content is extracted in the same order it's drawn on media thus may not be as well organized. Only deals with uncompressed or zip-compressed data. (07/2002)

Looks interesting, and Adobe certainly seem to know how to protect their investment: there don't seem to be many tools able to read and write PDFs properly.

Mike
12:54 12/11/2003 GMT

