Extracting text from Word Documents via PHP and COM

(OP)
I want to extract text from Word documents via PHP and COM, but I am not able to open Word this way. I believe that COM is enabled in php.ini but am not sure. Here is the code that I am using to open a new Word application:

<?php
$word = new COM("word.application") or die ("Could not initialise MS Word object.");
$word->Documents->Open(realpath("Sample.doc"));

// Extract content.
$content = (string) $word->ActiveDocument->Content;

echo $content;

$word->ActiveDocument->Close(false);

$word->Quit();
$word = null;
unset($word);
?>

RE: Extracting text from Word Documents via PHP and COM

Assuming you are using the docx format, why do you need to use COM?

RE: Extracting text from Word Documents via PHP and COM

(OP)
Are you implying that I do not need to use COM if I am using .docx? And why doesn't it work with COM?

RE: Extracting text from Word Documents via PHP and COM

You need to provide more information before we can tell you why COM does not work.

But it is not needed for docx, which is just a ZIP archive of XML files that PHP can read directly.

1. Are you using Windows?
2. Have you completely installed a valid licensed version of MS Word? Does it start from the command line?
3. Is COM enabled in PHP? Telling us 'it might be' does not help. Go check.
4. Is the script running?
5. Is the script failing? If so, what is the error message? (You must have error reporting and display turned on in php.ini.)
6. Is anything being logged to an error log at PHP or system level?
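
For item 3, a minimal sketch of one way to check, assuming PHP 7 or later. The function name `comStatus` is made up for this example. Load it through the web server as well as from the command line, because the two can read different php.ini files.

```php
<?php
// Report whether the COM extension is available to this PHP build.
function comStatus(): array
{
    return [
        // COM is only available on Windows builds of PHP.
        'windows'          => stripos(PHP_OS, 'WIN') === 0,
        // The extension is registered under the name "com_dotnet".
        'extension_loaded' => extension_loaded('com_dotnet'),
        // The COM class only exists once the extension is loaded.
        'com_class_exists' => class_exists('COM'),
        // Path of the php.ini actually in use (false if none was loaded).
        'php_ini'          => php_ini_loaded_file(),
    ];
}

foreach (comStatus() as $name => $value) {
    echo $name, ': ', var_export($value, true), "\n";
}
```

If `extension_loaded` comes back false, the php.ini named in the last line is the file to edit.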

RE: Extracting text from Word Documents via PHP and COM

(OP)
1. Are you using Windows? Yes, Windows 7.
2. Have you completely installed a valid licensed version of MS Word? Does it start from the command line? Yes, MS Word runs fine on its own.
3. Is COM enabled in PHP? I do not know how to check. I went to php.ini but I do not know what to turn on.
4. Is the script running? I get a blank screen.
5. Is the script failing? If so, what is the error message? (You must have error reporting and display turned on in php.ini.)
6. Is anything being logged to an error log at PHP or system level? No.

RE: Extracting text from Word Documents via PHP and COM

(OP)
[COM]
; path to a file containing GUIDs, IIDs or filenames of files with TypeLibs
; http://php.net/com.typelib-file
;com.typelib_file =

; allow Distributed-COM calls
; http://php.net/com.allow-dcom
com.allow_dcom = true

; autoregister constants of a components typlib on com_load()
; http://php.net/com.autoregister-typelib
;com.autoregister_typelib = true

; register constants casesensitive
; http://php.net/com.autoregister-casesensitive
;com.autoregister_casesensitive = false

; show warnings on duplicate constant registrations
; http://php.net/com.autoregister-verbose
;com.autoregister_verbose = true

; The default character set code-page to use when passing strings to and from COM objects.
; Default: system ANSI code page
;com.code_page=

RE: Extracting text from Word Documents via PHP and COM

A blank screen suggests that the script is failing. My suspicion is that COM is not properly loaded and that you have not turned on error display.
Open php.ini and make sure these values are set:

CODE

error_reporting  =  E_ALL
display_errors = On
display_startup_errors = On
log_errors = On
log_errors_max_len = 2048 
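
If editing php.ini is awkward, the same settings can be requested at the top of a script. This is only a sketch of the per-script equivalent, with one known limitation: a parse error in the same file is raised before these lines run, so it stays invisible unless `display_errors` is already on in php.ini.

```php
<?php
// Per-script equivalent of the php.ini error settings.
error_reporting(E_ALL);
ini_set('display_errors', '1');
ini_set('display_startup_errors', '1');
ini_set('log_errors', '1');

// Confirm what is actually in effect for this request.
printf(
    "error_reporting=%d display_errors=%s\n",
    error_reporting(),
    ini_get('display_errors')
);
```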

Although the manual says otherwise, from PHP 5.3.15 the COM/.NET extension is no longer compiled statically into the PHP binary, so you must load it at runtime. If you are using this version of PHP (or later), also make sure that your php.ini has a section that looks like this:

CODE

[PHP_COM_DOTNET]
extension=php_com_dotnet.dll 

Restart your web server.

Next, make sure that the user account under which the web server (and thus PHP) runs has permission to access the sample document. This is a Windows issue, not a PHP one. If you suspect that this is the problem, it may be easier to put the file in a root directory with no permission restrictions on it.

Change your script as follows (essentially this adds permission checks and step-by-step progress output).

CODE

<?php
echo '<pre>';
echo "starting\n";
set_time_limit (30); //to allow time for Word to load.
$word = new COM("word.application") or die ("Could not initialise MS Word object.");
echo "COM instantiated\n";
$word->Application->Visible = False; 
echo "set visibility to false\n";

$doc = 'sample.doc';
$document = realpath($doc);
if (is_readable($document):
 echo "Document exists and is readable \n";
else:
 if(!is_file($document)):
   echo "Document does not exist\n";die();
 else:
   echo "Document is not readable\n"; die();
 endif;
endif;
$word->Documents->Open( $document ); 
echo "Document opened\n";
// Extract content. 
$content = $word->ActiveDocument->Content; 
echo "test\n----------\n";
print_r($content);
echo "test\n----------\n";
echo "Extracting string value of content\n";
$content = (string) $content;
echo "test\n----------\n";
echo $content;
echo "test\n----------\n";

echo $content; 

$word->ActiveDocument->Close(false); 
echo "Closed Document\n";
$word->Quit(); 
echo "Quit Word \n";
$word = null; 
unset($word); 
?> 

I say again that this is a suboptimal route for extracting text from docx files. If that is your aim, post back with some business context so that we can assist further.
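
For a genuine .docx file, the non-COM route could look roughly like the sketch below. It assumes PHP 7 or later with the zip extension enabled and a standard document whose body lives in `word/document.xml`. The function names `docxXmlToText` and `docxToText` are made up for this example. It ignores headers, footers, footnotes and tables-as-tables, so treat it as a starting point only.

```php
<?php
// Turn the XML of word/document.xml into plain text:
// one line per paragraph (<w:p>), a tab for each <w:tab/>, all other markup dropped.
function docxXmlToText(string $xml): string
{
    $xml  = preg_replace('~<w:tab\b[^>]*/>~', "\t", $xml);
    $xml  = preg_replace('~</w:p>~', "\n", $xml);
    $text = strip_tags($xml);
    return trim(html_entity_decode($text, ENT_QUOTES | ENT_XML1, 'UTF-8'));
}

// Read the main document part out of the .docx ZIP container.
function docxToText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a ZIP archive");
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException("No word/document.xml found in $path");
    }
    return docxXmlToText($xml);
}
```

Usage would be something like `echo docxToText('sample.docx');`. None of this applies to the older binary .doc format.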

RE: Extracting text from Word Documents via PHP and COM

(OP)
I made the changes to php.ini and copied your code.
I got this in the browser: "Parse error: syntax error, unexpected ':' in C:\xampp\htdocs\word7.php on line 12"
I tried to change the : to ; and got the same error but with ";".

I am trying to extract MS Word files (.doc or .docx) that are stored in a binary column of a table. I had a Visual FoxPro database that was able to open the MS Word documents embedded in its table. I converted the table to MySQL, and now when I click on the field all I get is a .bin file that opens in MS Word showing a few strange characters and no text.

RE: Extracting text from Word Documents via PHP and COM

A closing parenthesis was missing on the if line. It should read:

CODE

if (is_readable($document)): 

I see. I strongly suspect that the VFP table used OLE to link and embed the document, much like the Access equivalent. You will not be able to open this programmatically as a Word document. I struggled for a very, very long time trying to do the same with JPEGs. The OLE cell effectively wraps the binary in its own code. You might be able to delete this wrapper with a hex editor on a file-by-file basis.

The solution for JPEGs was to write a DLL that created a small form that iterated over the records, showed each one, and saved the contents of the OLE field as a normal image (in fact it was worse than that: it took a screenshot of a full-screen rendering).

I believe that it is possible to do a better job of it in VFP using the reportwriter class. A quick Google search has also brought this up, which looks promising.
http://stackoverflow.com/questions/467854/can-i-ex...
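
The hex-editor idea above can be approximated in PHP by searching each blob for the byte signature that starts a Word file and discarding everything before it. This is a rough sketch and has not been tested against real Visual FoxPro data. It assumes PHP 7.1 or later and that the payload sits in one contiguous piece after the wrapper, which may not hold for OLE fields. The constant and function names are made up for this example, and any trailing OLE data is left in place.

```php
<?php
// Signatures that mark the start of a Word file inside a larger blob.
const DOC_MAGIC  = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"; // OLE compound file (.doc)
const DOCX_MAGIC = "PK\x03\x04";                       // ZIP local file header (.docx)

// Return the blob from the first signature onward, or null if none is found.
function stripOleWrapper(string $blob): ?string
{
    $offsets = [];
    foreach ([DOC_MAGIC, DOCX_MAGIC] as $magic) {
        $pos = strpos($blob, $magic);
        if ($pos !== false) {
            $offsets[] = $pos;
        }
    }
    // Cut at the earliest signature so nothing before the payload survives.
    return $offsets ? substr($blob, min($offsets)) : null;
}
```

If the result still will not open in Word, the wrapper is probably not a simple prefix and the VFP-side export is the safer route.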

RE: Extracting text from Word Documents via PHP and COM

(OP)
I made the change: if (is_readable($document)): and now I get a blank screen.

RE: Extracting text from Word Documents via PHP and COM

(OP)
I made some changes (set visibility to true) and now get:

starting
COM instantiated
set visibility to true
Document exists and is readable
Document opened

Fatal error: Uncaught exception 'com_exception' with message '<b>Source:</b> Microsoft Word<br/><b>Description:</b> This command is not available because no document is open.' in C:\xampp\htdocs\open12.php:25
Stack trace:
#0 C:\xampp\htdocs\open12.php(25): unknown()
#1 {main}
thrown in C:\xampp\htdocs\open12.php on line 25

RE: Extracting text from Word Documents via PHP and COM

Changing visibility should have made no difference at all to the PHP script, so something else was probably changed too. A blank screen (PROVIDING YOU HAVE ERROR DISPLAY AND REPORTING turned ON) is an unlikely scenario. After 30 seconds you would at least have got a timeout error. The delay is most likely Word hanging when you tell it to open a file that is not a Word document but in fact an embedded Word document.

Anyway ...

The fatal error indicates that the document failed to open, which backs up my previous post. You cannot achieve your aims using this method. Check out the links in my post of 11 Sep 08h39.
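
For files that really are Word documents, a failed open can be reported cleanly instead of surfacing as an uncaught `com_exception`. The sketch below assumes PHP 7 or later on Windows with the COM extension loaded and Word installed. The function name `extractWordText` is made up for this example. It works on the Document object returned by `Documents->Open()` rather than on `ActiveDocument`.

```php
<?php
// Open a document in Word through COM and return its text.
// Every failure is reported as a RuntimeException.
function extractWordText(string $path): string
{
    if (!class_exists('COM')) {
        throw new RuntimeException('The COM extension is not loaded');
    }
    $real = realpath($path);
    if ($real === false || !is_readable($real)) {
        throw new RuntimeException("Cannot read $path");
    }

    $word = new COM('Word.Application');
    try {
        $word->Visible = false;
        $doc = $word->Documents->Open($real);
        if ($doc === null) {
            throw new RuntimeException("Word did not open $real");
        }
        $text = (string) $doc->Content->Text;
        $doc->Close(false); // close without saving changes
        return $text;
    } catch (com_exception $e) {
        throw new RuntimeException('Word reported: ' . $e->getMessage(), 0, $e);
    } finally {
        $word->Quit(); // always shut Word down, even after a failure
    }
}
```

This does not make an OLE-wrapped blob openable. It only turns the failure into a message you can log.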
