Hi Smidy,
We started out with a collection of hard-copy documents (books, papers, reports) that we wanted to make accessible to the public via the web. We scanned the docs into TIF images and then ran OCR on them to produce character output. Our original intent was to take the text versions of the docs, mark them up with XML, and work completely off the XML/text versions. However, we quickly ran into a major time issue with full XML markup. We wanted the XML docs to 'look' like the originals. We could do it, but even with a custom GUI XML markup tool it still took hours to do a short document.
So... we produced PDFs from the TIFs and display the PDF once we know which document and page the user wants. A minimal XML markup on the text versions gave us our search base, and the PDFs made for a good display to the client.
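To give you a feel for what 'minimal' means, a marked-up text file ends up looking roughly like this (the tag names here are just illustrative, not our exact markup):

    <doc>
      <title>...document title...</title>
      <author>...author(s)...</author>
      <publisher>...publisher...</publisher>
      <page n="1">
        ...raw OCR text for page 1...
      </page>
      <page n="2">
        ...raw OCR text for page 2...
      </page>
    </doc>

Just enough structure to search on, nothing that tries to reproduce the original layout.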
The software we use for the index is 'Isite'. It is free and available for several UNIX flavors and WinNT. Never heard of it, huh? It works well. The Isite package implements a communications protocol called Z39.50, which is widely used by libraries that expose their catalogues so they can search each other's holdings.
( See and )
You can get the most current version of Isite from
The Isite package contains an indexing tool, two index query clients (one for local queries and one for remote Z39.50 queries via remote zservers), and a server application (zserver) you can use to make your index available to remote Z39.50 clients. In our implementation, we use the indexing tool (Iindex) and the local query client (Isearch).
With our docs in a minimally marked-up state (title, authors, publisher, and page number), we build an index of the document files using Iindex. Each text file has a twin PDF. The Isite client (Isearch) enables fielded searching on any tagged fields plus full-text searching. For each hit, it returns the doc title and file name. We then read the doc, pull excerpts of the text containing the term the user asked for, and present each occurrence as a link to that page in the PDF (rough sketch below).
Query => Show Doc titles => Show excerpts from selected doc => Show pdf page.
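The excerpt/link step is plain Perl string munging. A stripped-down sketch is below; the <page n="..."> tags and the /pdfs/ path match the illustrative markup above rather than our exact setup, and the page links assume the browser's Acrobat plugin honors the #page=N URL fragment:

    #!/usr/bin/perl -w
    use strict;

    # Illustrative sketch -- given a marked-up text file and a search
    # term, print one link per matching page, pointing at the same
    # page of the twin PDF via the #page=N URL fragment.

    my ($txtfile, $term) = @ARGV;
    die "usage: $0 textfile term\n" unless defined $term;

    (my $pdffile = $txtfile) =~ s/\.txt$/.pdf/;

    open(TXT, $txtfile) or die "can't open $txtfile: $!\n";
    my $doc = do { local $/; <TXT> };    # slurp the whole file
    close(TXT);

    # Assumes pages are tagged <page n="..."> ... </page> as above.
    while ($doc =~ m{<page n="(\d+)">(.*?)</page>}gs) {
        my ($pageno, $text) = ($1, $2);
        next unless $text =~ /(.{0,40}\Q$term\E.{0,40})/is;
        my $excerpt = $1;
        print qq{<a href="/pdfs/$pdffile#page=$pageno">...$excerpt...</a><br>\n};
    }

You would also want to HTML-escape the excerpt before printing it, which I've left out here to keep the sketch short.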
The Isite software allows for indexing of files in a directory or from a list. You can delete single docs from the index and append to an existing index. I have another application that uses Isite to index a collection of over 800 web pages. The indexer takes about 12-15 minutes on an old, slow Solaris box, and queries on that 800-doc index run in about 1-2 seconds (same box). Isite has a switch to indicate that the files being indexed are HTML or XML, and it automatically maps tagged fields in the index.
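For what it's worth, the reindex of that web-page collection is just a small Perl wrapper along these lines (the Iindex switches shown are from memory, so verify them against the Isite docs; the paths are made up):

    #!/usr/bin/perl -w
    use strict;

    # Reindex the web-page collection in one shot.
    # NOTE: the -d (database name) and -t (document type) switches
    # are written from memory -- check them against the Iindex docs.

    my $docdir = '/export/docs/html';       # made-up paths
    my $dbname = '/export/index/webpages';

    my @files = glob("$docdir/*.html");
    die "no files to index\n" unless @files;

    system('Iindex', '-d', $dbname, '-t', 'html', @files) == 0
        or die "Iindex failed: $?\n";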
We are doing this with Perl and Apache on Solaris, and the PDF file serving is being done by a RH Linux/Apache box.
Feel free to ask further if you are interested.
HTH
keep the rudder amidships and beware the odd typo