Using XFRX to Extract PDF's Text


(OP)
Hi,

Does anyone know if the XFRX sdk can create a searchable pdf from an image when outputting to pdf?

Is there a function that returns the text output of the OCR process, so it can be inserted into a table for searching?

Why is OCR mentioned here? Because if I load a non-searchable pdf in Acrobat, I have to run the OCR process to make it searchable, and it is this text I'm trying to programmatically extract to a var.

The answer and a lengthy discussion can be found in this convoluted thread: https://www.tek-tips.com/viewthread.cfm?qid=182064...

This thread continues that subject matter and also serves as a link to it.

My apologies go out to Griff for hijacking his thread. Sorry, buddy. I'll start new threads for all my future questions, no matter how small they may be.

Thanks,
Stanley




RE: Using XFRX to Extract PDF's Text

If the PDF contains an image, there is no way of searching it, because you can only search for text. That's true even if there is text contained within the image.

There are, however, a number of tools available for converting PDFs to searchable text. One that I have used is Nitro PDF Pro, which can convert from PDF to Microsoft Word. It's not free, but there is a free trial period. It doesn't always do a great job of preserving the formatting, but that's not an issue if your aim is to search the text.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

RE: Using XFRX to Extract PDF's Text

It might help to know what your overall goal is here. Do you need to search for any arbitrary text anywhere in a PDF? Or is the search more structured - searching for documents with specific invoice numbers or customer names, for example? If the latter, then it might make more sense to store the searchable text in an ordinary table. The user would search the table in the normal way, and that in turn would give a pointer to the corresponding PDF.
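The table-plus-pointer idea can be sketched outside VFP as well. The following is only an illustration in Python with an in-memory SQLite table standing in for a VFP table; the column names (`invoice_no`, `customer`, `pdf_path`), the helper names, and the sample rows are all hypothetical.

```python
import sqlite3

def build_index(rows):
    """Create an in-memory table mapping searchable fields to PDF paths."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE documents ("
        "invoice_no TEXT, customer TEXT, pdf_path TEXT)"
    )
    con.executemany("INSERT INTO documents VALUES (?, ?, ?)", rows)
    return con

def find_pdfs(con, customer):
    """Return the PDF paths whose customer name contains the search term."""
    cur = con.execute(
        "SELECT pdf_path FROM documents WHERE customer LIKE ? ORDER BY pdf_path",
        (f"%{customer}%",),
    )
    return [row[0] for row in cur]

# Hypothetical sample data; in practice these rows would be keyed in
# or captured when each document is scanned.
con = build_index([
    ("INV-1001", "Acme Ltd", "scans/inv1001.pdf"),
    ("INV-1002", "Bolt & Co", "scans/inv1002.pdf"),
    ("INV-1003", "Acme Ltd", "scans/inv1003.pdf"),
])
```

Calling `find_pdfs(con, "Acme")` returns the paths of the matching documents, which the application can then open in a PDF viewer. The PDFs themselves never need to be searchable for this to work.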

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

RE: Using XFRX to Extract PDF's Text

Let me take this over from thread184-1820648: XFRX - Are newer versions any faster... to this thread, your own thread.

Quote (stanlyn)

You keep saying I should know what the text is

Maybe it's clear by now that I didn't realize these questions came from you; I thought they came from Griff.
Griff was generating PDFs from data, so he knows his texts.

I also wrongly addressed this to Griff:

Quote (me)

Or what did you use so far? Printing to a TIFF file that is by default just images of the pages and then converting that to PDF?

That is exactly what you have, and then you only have images in your PDF. Set aside the topic of how PDFs are generated from VFP FRXes, with or without tools like FoxyPreviewer or XFRX. Your PDFs don't come from FRXes; they have a completely different origin and generation process. A scanner generating a PDF will usually embed only the scanned images into the PDF, nothing else. A TIFF is even simpler: it is just a bunch of images, which are then embedded into a PDF, and that's it. Some scanners also come with OCR capabilities, but it's questionable whether their software combines OCR with PDF generation.

As so often, why don't you try for yourself whether you can search a PDF of your scanned documents in a PDF reader, or whether you can select text in it? Only if that works is it even viable to try the same thing programmatically with any tool. But XFRX is not the place to start.
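The same check can be roughly automated with the pdftotext utility mentioned in the other thread. This is a sketch only, in Python, assuming pdftotext is installed and on the PATH; the function names are made up for illustration, and an empty result is just a strong hint that the file is image-only, not proof.

```python
import shutil
import subprocess

def extract_text(pdf_path):
    """Run pdftotext on pdf_path and return whatever text it finds."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext is not installed or not on PATH")
    # "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")

def looks_image_only(extracted):
    """True when extraction yielded nothing but whitespace or page breaks."""
    # pdftotext separates pages with form feeds, which strip() also removes.
    return extracted.strip() == ""
```

For a scanned file at a hypothetical path such as `scans/deed.pdf`, `looks_image_only(extract_text("scans/deed.pdf"))` coming back true would mean there is no text layer to extract, so OCR has to run first.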

You can see from the name XFRX that it's all about processing FRXes into other output formats, not the other way around. It's not about PDF text extraction in the first place, and it's even unexpected that it has that reader feature. It does make sense: if you know how to embed text into a PDF file, you can also offer the reverse. Still, it's not a given for a tool that mainly acts as an FRX converter to other formats.

Martina already gave you the answers on what XFRX can and can't do for you. From my comments on TIFF and scanning you could also have deduced that those origins produce PDFs with images and not text. So you can't expect to read text from such PDFs, only images.

And one last thing, you asked atlopes:

Quote (stanlyn)

I looked and did not see pdf2xml or pdftotext utilities in the VFP9 help. Can you be more specific about their location?

Atlopes posted links; at least his post now contains links to pdf2xml and pdftotext, and I think they always were links. Are you not aware that a word or text underlined in blue is a link in a Tek-Tips post? Click on them.

Chriss

RE: Using XFRX to Extract PDF's Text


Quote (Chris Miller)

You can see from the name XFRX it's all about processing FRXes to other output formats, not the other way. It's not about PDFs text extraction in the first place. It's even unexpected it has that reader feature. That makes sense, if you know how to embed text into a PDF file you also can offer the reverse. It's not a given or natural to provide that in a tool that's mainly concerned to act as FRX converter to other formats, though.

Yes. The base functionality is converting VFP reports to some output, most often PDF.
But PDF#READER has four basic functions:
- Read information about the PDF object, because XFRX supports append mode for PDF (adding output from a VFP report to an existing file).
- Extract images
- Extract attachments
- Read page's content

mJindrova

RE: Using XFRX to Extract PDF's Text

I'm curious.

What kind of application must extract text AFTER the PDF is created but cannot capture it BEFORE the PDF is created (as suggested previously)? Either way (OCR or program code), it would seem to me that a separate file of text or some other type would be produced anyway.

Just curious.

Steve

RE: Using XFRX to Extract PDF's Text

Well, Steve, two of those options have been mentioned:

The origin of the PDF is a TIFF file which only contains images
The origin of the PDF is a scanner. Some scanners come with a scan-to-PDF button, and the accompanying software turns the scanned pages into a PDF, which again is usually images only.

And indeed, as Griff's thread was all about PDF generation from an FRX, that puzzled me too, but this thread isn't about FRXes. The problem case is that you have a bunch of PDFs. In the more general case, where the origin of the PDFs is unknown, you would rather work on them with a PDF-specific tool than with an FRX-specific tool. Stanlyn only asked because XFRX was mentioned in Griff's thread.

Chriss

RE: Using XFRX to Extract PDF's Text

Quote (Chris)

think of the more general case the origin of the PDFs would be unknown

Thanks Chris for explaining. I hadn't thought of that case.

Steve
