extract text from a pdf file

Status
Not open for further replies.

crackn101

Hi everyone.
Here's my dilemma.
I have a pdf file that contains 9 pages.
What I need to do is extract a portion of the text from pages 1 and 3, and archive it into a text file.
I have been to CPAN, and looked at all of the modules I could find that had anything to do with PDFs.
Almost all of the modules there are for creating pdf files.
I did find one, PDF::Extract, that would separate each of the
pages into single-page PDF files, which also means that
I have direct access to the object created by PDF::Extract.
However, as I'm sure you know, the PDF data is
in some kind of binary format. I should point out that
there is no security enabled in this pdf.
You can print, edit, change whatever you want using adobe acrobat.
The thing is, I want to upload this working script to my
web host provider, and start archiving this pdf on a daily basis. So I am stuck using whatever modules they have installed.
I currently am using ActiveState perl on my pc for testing
purposes.
Any suggestions would be greatly appreciated.
Thanks.
Crackn101
 
Hi,
Reader 5.0 lets you export a pdf file to text, so if you could figure out a way to automate that, then parse the resulting text file, I think it would be the easiest method (if Reader is on the server).
Alternatively, you may want to look into Win32::OLE, although you might need the full version of Acrobat installed on the server to have access to the OLE object required. I'm not really sure.
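
For the "parse the resulting text file" step, here is a rough sketch in Python. It assumes the exporter puts a form feed character between pages, which is only an assumption; check what your export actually emits. The function names and the sample text are made up for illustration.

```python
def pages_from_export(text):
    """Split exported text into pages, assuming a form feed ends each page."""
    pages = text.split("\f")
    # A trailing form feed leaves an empty last element; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def archive_pages(text, wanted=(1, 3)):
    """Return the text of the wanted 1-based page numbers, joined for archiving."""
    pages = pages_from_export(text)
    picked = [pages[n - 1].strip() for n in wanted if 1 <= n <= len(pages)]
    return "\n\n".join(picked)


# Hypothetical exported text with three pages.
exported = "Page one text\fPage two text\fPage three text\f"
print(archive_pages(exported))
```

Pulling out only a portion of each page would then be a matter of running a regular expression over the page text that comes back.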

It is possible to parse the binary PDF and extract compressed portions of text, then decompress it, but I think you'd need a fairly good knowledge of the internal structure of a PDF file. PDF files use several types of compression - for text I think they use either LZW or something called 'Flate'.
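
To give an idea of what that involves, here is a minimal sketch in Python that handles only the Flate case. It assumes the text sits in Flate-compressed content streams and is drawn with simple "(string) Tj" operators; real files also use TJ arrays, escaped characters, custom font encodings and other filters such as LZW, none of which are handled. The function name and the sample data are made up for illustration. The same idea should carry over to Perl with a zlib binding such as Compress::Zlib.

```python
import re
import zlib


def extract_flate_text(pdf_bytes):
    """Return strings shown by Tj operators inside Flate-compressed streams."""
    texts = []
    # Each stream body sits between the 'stream' and 'endstream' keywords.
    pattern = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
    for match in pattern.finditer(pdf_bytes):
        try:
            # decompressobj tolerates a stream cut slightly short at the end.
            content = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (LZW, an image, or uncompressed)
        # Simple literal strings only: "(Hello) Tj".
        for literal in re.findall(rb"\(([^()\\]*)\)\s*Tj", content):
            texts.append(literal.decode("latin-1"))
    return texts


# Build a tiny fake stream object to exercise the function.
body = zlib.compress(b"BT /F1 12 Tf 72 720 Td (Hello PDF) Tj ET")
sample = (
    b"4 0 obj\n<< /Filter /FlateDecode /Length %d >>\nstream\n" % len(body)
    + body
    + b"\nendstream\nendobj\n"
)
print(extract_flate_text(sample))
```

A proper parser would read the /Length and /Filter entries of each stream dictionary instead of pattern matching on the keywords, but this shows the basic decompress-then-scan approach.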

Hope this is of some help.. If I figure anything out I'll post it here. I work with PDF files a lot at work and this is something I'd like to do as well.
 
Thanks for the suggestion.
I saw this module at cpan, but didn't play around
with it very much. I think I will take a second look
at it and report back.
I do know that adobe has a pdf2html perl script hosted on
their site, that I could probably call with all of the
needed parameters. They have an html form setup for
the public to use.
Something to play with.
Thanks.
Take Care.

Crackn101
 
Keep in mind also that pdf is not a single format; it's a wrapper format that can contain raster and vector data. What you get out will be HIGHLY dependent on what someone else put IN.
 
You might want to check out this site too. At the bottom of the page, it says they have a tool for extracting text. I've tried their pdfsplit program and it works fairly well for small pdf files, but it's a memory hog with large files. (Incidentally, it appears to have been written in Perl, then compiled with perl2exe.)
Anyway.. You did say you couldn't install anything on the server, but I thought I'd throw this out there.

 
From the pdfeverywhere web site.

PDF Text Extraction -- PDF Text extracts pure text information from the content of every page. The text content is extracted in the same order it's drawn on media thus may not be as well organized. Only deals with uncompressed or zip-compressed data. (07/2002)
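
The "same order it's drawn" caveat matters because a PDF can paint text fragments in any sequence. If an extractor also reports positions, the fragments can be regrouped into reading order. The sketch below is in Python and assumes hypothetical (x, y, text) tuples with y growing upward as in PDF page space; whether the tool described above exposes coordinates at all is not known.

```python
def reading_order(fragments, line_tolerance=2.0):
    """Group (x, y, text) fragments into lines, top to bottom, left to right."""
    lines = []  # each entry: [baseline_y, [(x, text), ...]]
    # Larger y is higher on the page, so visit fragments from the top down.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))  # same baseline, same line
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order the pieces by their x position.
    return "\n".join(
        " ".join(text for _, text in sorted(parts)) for _, parts in lines
    )


# Fragments listed in the order they were drawn, not the order they read.
drawn = [(300, 700, "World"), (72, 680, "Second line"), (72, 700.5, "Hello")]
print(reading_order(drawn))
```

The tolerance value is arbitrary; multi-column layouts would need more than this.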

Looks interesting, and Adobe certainly seem to know how to protect their investment; there don't seem to be many tools able to read and write PDFs properly.

Mike
12:54 12/11/2003 GMT

 