searchable pdf / word docs

(OP)
How would I go about building a search facility that could include the content of Word and PDF documents?

Do I need to consider some form of meta data that is stored in the DB with the document?

Would it not be practical to do a real-time search of a folder containing a bunch of Word and PDF docs?

All input gratefully received.

1DMF

"In complete darkness we are all the same, it is only our knowledge and wisdom that separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"

Free Dance Music Downloads

RE: searchable pdf / word docs

I've never actually coded anything like this, but I've thought about it a few times in response to some user requirements.

I'd always imagined having some kind of many-to-many relationship between the words you parse out of the documents, and the documents themselves (on a DBMS). That way you'd be able to search for a number of keywords and the document that had the most matches (as a count) would be the one you'd want to see at the top of the list. It also means that you'd be able to expend the CPU cycles (once) to parse and analyse the documents as they arrive, and the search cost would be lower. Kind of like a poor man's Google...
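The many-to-many layout described above could be sketched roughly as follows. This is only an illustration in Python using an in-memory SQLite database; the table names (document, word, document_word) and the helpers index_document and search are hypothetical and not taken from the thread.

```python
import sqlite3

# In-memory database for illustration; a real system would use a persistent DBMS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (id INTEGER PRIMARY KEY, locator TEXT NOT NULL UNIQUE);
CREATE TABLE word (id INTEGER PRIMARY KEY, word TEXT NOT NULL UNIQUE);
CREATE TABLE document_word (
    document_id INTEGER NOT NULL REFERENCES document(id),
    word_id     INTEGER NOT NULL REFERENCES word(id),
    occurrences INTEGER NOT NULL,
    PRIMARY KEY (document_id, word_id)
);
""")


def index_document(locator, word_counts):
    """Store one document and its per-word counts (paid once, when it arrives)."""
    doc_id = conn.execute(
        "INSERT INTO document (locator) VALUES (?)", (locator,)
    ).lastrowid
    for word, occurrences in word_counts.items():
        # Insert the word only if it is not already on the word table.
        conn.execute("INSERT OR IGNORE INTO word (word) VALUES (?)", (word,))
        word_id = conn.execute(
            "SELECT id FROM word WHERE word = ?", (word,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO document_word VALUES (?, ?, ?)",
            (doc_id, word_id, occurrences),
        )
    return doc_id


def search(keywords):
    """Return (locator, keywords matched, total occurrences), best match first."""
    unique = sorted(set(keywords))
    if not unique:
        return []
    placeholders = ",".join("?" for _ in unique)
    sql = f"""
        SELECT d.locator, COUNT(*) AS matched, SUM(dw.occurrences) AS total
        FROM document d
        JOIN document_word dw ON dw.document_id = d.id
        JOIN word w ON w.id = dw.word_id
        WHERE w.word IN ({placeholders})
        GROUP BY d.id
        ORDER BY matched DESC, total DESC
    """
    return conn.execute(sql, unique).fetchall()
```

Ranking first by the number of distinct keywords matched and then by total occurrences is one possible reading of "most matches (as a count)"; other weightings would work just as well.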

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::PerlDesignPatterns)

RE: searchable pdf / word docs

If those documents are permanent and there are hundreds or thousands of them, then you should go with including their full text content in a DB.
However, not all PDFs contain text (some are just scanned pages); for those you can only go with an OCR application (which cannot be fully automatic). Also, if you want to write your own Word or PDF text extraction routine, this is not a simple task (different versions...).

http://www.xcalcs.com : Online engineering calculations
http://www.megamag.it : Magnetic brakes for fun rides
http://www.levitans.com : Air bearing pads

RE: searchable pdf / word docs

prex1 is right about the scanned-content PDFs, although many scanning applications have OCR that adds machine-readable text, based on the content of the scanned document, into the PDF.

I was thinking of a workflow which went something like:
  1. parse the text of the document using split or some other regex-based mechanism
  2. normalise the 'words' by lower-casing, punctuation removal etc.
  3. exclude one, two, and some common three and four letter words like and, is, when, etc.
  4. use a hash with the word as a key and a count as a value, zip through the parsed words to get a count for this document
  5. add the document locator (file name, URL etc.) and document ID to the 'document' table
  6. for each item in your hash, check if 'word' exists on the word table, and if not, insert it
  7. add a row to the document-word table with the document ID, the word ID, and the count
I think that if your DBMS supports full-text indexing of TEXT and VARCHAR fields, you might be able to just slap the text into the table and let the DBMS take care of points 1 through 7 for you. But you might not want to store the full text on the DB. And it's not very challenging either, is it? :)
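Steps 1 through 4 of the workflow above could look something like this in Python. It is a minimal sketch: the STOP_WORDS set and the count_words name are made up for illustration, and a real stop list would be longer and tuned to the documents being indexed.

```python
import re
from collections import Counter

# Hypothetical stop list of common three- and four-letter words (step 3).
STOP_WORDS = {"and", "the", "for", "when", "with", "that", "this", "from", "have"}


def count_words(text, min_length=3):
    """Return a word -> count mapping for one document's extracted text."""
    # Steps 1 and 2: lower-case, then keep only runs of letters and digits,
    # which drops punctuation and whitespace in one pass.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Step 3: drop one- and two-letter words and anything on the stop list.
    kept = (w for w in words if len(w) >= min_length and w not in STOP_WORDS)
    # Step 4: a hash keyed by word, with the count as the value.
    return Counter(kept)
```

Note that this simple pattern splits on apostrophes and hyphens, so "don't" becomes "don" plus a discarded "t"; whether that matters depends on the documents.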

Steve

RE: searchable pdf / word docs

(OP)
Hmm, some interesting food for thought.

Quote:

I think that if your DBMS supports full-text indexing of text and VARCHAR fields, you might be able to just slap it in the table and let the DBMS take care of points 1 through 7 for you.

Do you mean store the BLOB of the document and the DBMS will index the text? Or is there now data typing of PDF / DOC / XLS etc.?

What am I slapping into the table?

If this is an option and not difficult to do, KISS is always better than a challenge, isn't it?

Why make life difficult for yourself? The re-write of the entire web app is challenging enough; anything that can give powerful enhancements with as little effort as possible is my kind of solution!

The PDFs & Word docs in question are mainly textual and fewer than 500 docs in total, so I'm sure I could use a parser to collect the words in each document and index accordingly, but if MS SQL 2008 R2 can do the donkey work for me, then that'll be a bonus.

RE: searchable pdf / word docs

(OP)
Well, a quick Bing and you could be on to something, Stevexff.

http://social.msdn.microsoft.com/forums/sqlserver/...

Looks like I might need to get an iFilter from Adobe to enable the full-text indexing / search to work, though apparently Word / XLS are built into MS SQL 2008.

RE: searchable pdf / word docs

1DMF

Not a BLOB, but a CLOB (character large object) for the text. I think M$ SQL Server might even support an indexable text column type. You will need the filter to pull the text out of the PDF, but you will only have to do that once per document, when you store it. It might even be practical to store only the text of the PDF and its file location or URL on the table; once you've found the PDF with the index, you can return the URL and the user can request the PDF by clicking on it.
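As a rough sketch of that "text plus location" idea in Python: the doc_text table and the store and find_locations helpers are hypothetical, and SQLite's LIKE is used purely as a stand-in for whatever full-text index the real DBMS provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc_text ("
    " id INTEGER PRIMARY KEY,"
    " location TEXT NOT NULL,"   # file path or URL of the original document
    " body TEXT NOT NULL)"       # text extracted once, when the document is stored
)


def store(location, extracted_text):
    """Keep only the extracted text and where the original file lives."""
    conn.execute(
        "INSERT INTO doc_text (location, body) VALUES (?, ?)",
        (location, extracted_text),
    )


def find_locations(term):
    """Return the locations of documents whose text contains the term."""
    # Escape LIKE wildcards so the term is matched literally.
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT location FROM doc_text"
        " WHERE body LIKE ? ESCAPE '\\' ORDER BY location",
        ("%" + escaped + "%",),
    )
    return [row[0] for row in rows]
```

The search returns only locations, so the document itself is fetched from disk or the web server when the user clicks through.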

You get the idea, anyway...

Steve

RE: searchable pdf / word docs

(OP)
Well, according to the link I posted, you store the PDF as a BLOB, load the iFilter into SQL Server, and then use a special join to perform a 'real-time' search of the PDF...

CODE

-- To list the document types this SQL installation can full-text index, use:

SELECT *
FROM sys.fulltext_document_types

-- After installing the iFilters on the server, run the following
-- so that the full-text search engine loads them:

EXEC sys.sp_fulltext_service 'load_os_resources', 1;
GO

EXEC sys.sp_fulltext_service 'update_languages', NULL;
GO

-- Then you can search the file content using CONTAINS or CONTAINSTABLE, like this.
-- A multi-word search phrase must be wrapped in double quotes inside FORMSOF.

SELECT [ID], [Name], [FileContent]
FROM [MyDatabase].[dbo].[Files]
INNER JOIN
CONTAINSTABLE ([MyDatabase].[dbo].[Files],
([Name], [FileContent]),
'ISABOUT( FORMSOF (INFLECTIONAL, "Here goes your searched text") WEIGHT(0.9))',
LANGUAGE 'English') AS res
ON res.[KEY] = [ID]

I have the iFilter installed and loaded into the full-text search system; I just need to store a PDF and try a search, and hopefully that'll be job done :)
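For a web app, the search condition passed to CONTAINSTABLE has to be built from whatever the user types. A minimal sketch of that in Python follows; build_search_condition is a hypothetical helper, and stripping everything except word characters is a deliberately conservative assumption so that no quotes or operators from the user reach the full-text engine.

```python
import re


def build_search_condition(user_text, weight=0.9):
    """Build an ISABOUT(...) condition with one weighted FORMSOF term per word."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0.0 and 1.0")
    # Keep only runs of word characters; this drops double quotes,
    # parentheses and other punctuation from the user's input.
    words = re.findall(r"\w+", user_text)
    if not words:
        raise ValueError("no searchable words in input")
    terms = [f'FORMSOF(INFLECTIONAL, "{word}") WEIGHT({weight})' for word in words]
    return "ISABOUT(" + ", ".join(terms) + ")"
```

The resulting string is best passed to the query as a bound parameter rather than concatenated into the SQL text.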
