Hi Smidy,
We started out with a collection of hard-copy documents (books, papers, reports) that we wanted to make accessible to the public via the web. We scanned the docs into TIF images and then ran OCR on them to produce character output. Our original intent was to take the text versions of the docs, mark them up with XML, and work completely off the XML/text versions. However, we quickly ran into a major time issue with full XML markup. We wanted the XML docs to 'look' like the originals. We could do it, but even with a custom GUI XML markup tool it still took hours to do a short document.
So... we produced PDFs from the TIFs and display the PDF once we know which document and page the user wants. A minimal XML markup on the text versions gave us our search base, and the PDFs made for a good display to the client.
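To give you a feel for what 'minimal' means, a marked-up text file ends up looking roughly like this (the tag names here are just illustrative, not our exact markup):

    <doc>
      <title>...document title...</title>
      <author>...author(s)...</author>
      <publisher>...publisher...</publisher>
      <page n="1">
        ...raw OCR text for page 1...
      </page>
      <page n="2">
        ...raw OCR text for page 2...
      </page>
    </doc>

Just enough structure to search on, nothing that tries to reproduce the original layout.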
The software we use for the index is 'Isite'. It is free and available for several UNIX flavors and WinNT. Never heard of it, huh? It works well. The Isite package implements a communications protocol called Z39.50, which is widely used by libraries that expose their catalogues so they can search each other's holdings.
( See and )
You can get the most current version of Isite from
The Isite package contains an indexing tool, two index query clients (one for local queries and one for remote Z39.50 queries via remote zservers), and a server application (zserver) you can use to make your index available to remote Z39.50 clients. In our implementation, we use the indexing tool (Iindex) and the local query client (Isearch).
With our docs in a minimally marked-up state (title, authors, publisher, and page number), we build an index of the document files using Iindex. Each text file has a twin PDF. The Isite client (Isearch) enables fielded searching on any tagged fields plus full-text searching. For each hit, it returns the doc title and file name. We then read the doc, pull excerpts of the text containing the term the user asked for, and present each occurrence as a link to that page in the PDF (rough sketch below).
Query => Show Doc titles => Show excerpts from selected doc => Show pdf page.
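The excerpt/link step is plain Perl string munging. A stripped-down sketch is below; the <page n="..."> tags and the /pdfs/ path match the illustrative markup above rather than our exact setup, and the page links assume the browser's Acrobat plugin honors the #page=N URL fragment:

    #!/usr/bin/perl -w
    use strict;

    # Illustrative sketch -- given a marked-up text file and a search
    # term, print one link per matching page, pointing at the same
    # page of the twin PDF via the #page=N URL fragment.

    my ($txtfile, $term) = @ARGV;
    die "usage: $0 textfile term\n" unless defined $term;

    (my $pdffile = $txtfile) =~ s/\.txt$/.pdf/;

    open(TXT, $txtfile) or die "can't open $txtfile: $!\n";
    my $doc = do { local $/; <TXT> };    # slurp the whole file
    close(TXT);

    # Assumes pages are tagged <page n="..."> ... </page> as above.
    while ($doc =~ m{<page n="(\d+)">(.*?)</page>}gs) {
        my ($pageno, $text) = ($1, $2);
        next unless $text =~ /(.{0,40}\Q$term\E.{0,40})/is;
        my $excerpt = $1;
        print qq{<a href="/pdfs/$pdffile#page=$pageno">...$excerpt...</a><br>\n};
    }

You would also want to HTML-escape the excerpt before printing it, which I've left out here to keep the sketch short.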
The Isite software allows for indexing of files in a directory or from a list. You can delete single docs from the index and append to an existing index. I have another application that uses Isite to index a collection of over 800 web pages. The indexer takes about 12-15 minutes on an old, slow Solaris box, and queries on that 800-doc index run in about 1-2 seconds (same box). Isite has a switch to indicate that the files being indexed are HTML or XML, and it automatically maps tagged fields in the index.
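For what it's worth, the reindex of that web-page collection is just a small Perl wrapper along these lines (the Iindex switches shown are from memory, so verify them against the Isite docs; the paths are made up):

    #!/usr/bin/perl -w
    use strict;

    # Reindex the web-page collection in one shot.
    # NOTE: the -d (database name) and -t (document type) switches
    # are written from memory -- check them against the Iindex docs.

    my $docdir = '/export/docs/html';       # made-up paths
    my $dbname = '/export/index/webpages';

    my @files = glob("$docdir/*.html");
    die "no files to index\n" unless @files;

    system('Iindex', '-d', $dbname, '-t', 'html', @files) == 0
        or die "Iindex failed: $?\n";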
We are doing this with Perl and Apache on Solaris, and the PDF file serving is being done by a RH Linux/Apache box.
Feel free to ask further if you are interested.
HTH
keep the rudder amidships and beware the odd typo