Extract all hyperlinks from a Word 2010 Document
(OP)
I would like to extract all hyperlinks from a Word document and list them all in one document.
RE: Extract all hyperlinks from a Word 2010 Document
However, unless you can tell us what issues you're having doing this, how the document is formatted, etc., it's hard to be sure if the following will work:
• Use Ctrl-A, then mark all text as hidden. If it disappears, click on the ¶ symbol on the toolbar/ribbon to make it all visible again.
• Using Find/Replace, do a Find for all text in the Hyperlink Style, setting the Replace parameter to 'Not Hidden'
• Using Find/Replace, do a Find for all hidden text, setting the Replace parameter to ^p
• Using a wildcard Find/Replace, delete the 'hidden text' setting and do a Find for [^13]{1,}, setting the Replace parameter to ^p
What you should end up with is a list of all hyperlinks in the document. All of the above assumes your hyperlinks are formatted as such, with the Hyperlink Style.
Cheers
Paul Edstein
[MS MVP - Word]
RE: Extract all hyperlinks from a Word 2010 Document
I guess I don't work in Word enough to understand what you are asking me for. What do you mean by formatted? It's just a typical Word doc with hyperlinks attached to text.
I am also not sure how to set the Replace to 'Not Hidden' or hidden text.
In the Find box, how do I do a Find for all text in the Hyperlink Style? Is there a special code? Thank you in advance, as we have over 2000 hyperlinks that we need to index at the end.
• Using Find/Replace, do a Find for all text in the Hyperlink Style, setting the Replace parameter to 'Not Hidden'
• Using Find/Replace, do a Find for all hidden text, setting the Replace parameter to ^p
• Using a wildcard Find/Replace, delete the 'hidden text' setting and do a Find for [^13]{1,}, setting the Replace parameter to ^p
RE: Extract all hyperlinks from a Word 2010 Document
The rest is simply a matter of learning to use the options available to you on the Find/Replace dialogue. You may need to click on the 'More' button to access them, especially the 'Format' options you'll need to use.
Cheers
Paul Edstein
[MS MVP - Word]
RE: Extract all hyperlinks from a Word 2010 Document
Cheers
Paul Edstein
[MS MVP - Word]
RE: Extract all hyperlinks from a Word 2010 Document
Once we completed all the steps, we saved the Word document as an XML document and were able to open it with Excel, so we have a list of the targets (each hyperlink's true value, the PDFs).
I hope I can make a macro to do all the steps.
Thank you again.
RE: Extract all hyperlinks from a Word 2010 Document
Here's a macro to do the job:
CODE --> VBA
Cheers
Paul Edstein
[MS MVP - Word]
RE: Extract all hyperlinks from a Word 2010 Document
I'm very far from a Word VBA guru, but would this macro not be a bit simpler? You get out a clean Word doc with all the hyperlinks listed in paragraphs.
CODE
RE: Extract all hyperlinks from a Word 2010 Document
Your code might be 'simpler', but it's far less efficient once you get beyond a few hyperlinks. FWIW, for all its extra lines, my code does all the extraction, even in a document with 100,000 hyperlinks, in four simple steps. Yours would probably still be running hours after mine has finished.
Cheers
Paul Edstein
[MS MVP - Word]
RE: Extract all hyperlinks from a Word 2010 Document
CODE
'Private Declare Function GetTickCount Lib "kernel32" () As Long

Public Sub GetHyperlinks()
    Dim myDoc As Document
    Dim wombat As Hyperlink
    ' Dim starttime As Long
    Dim CurrentDoc As Document

    Application.ScreenUpdating = False
    Set CurrentDoc = ActiveDocument
    Set myDoc = Application.Documents.Add()
    ' starttime = GetTickCount
    For Each wombat In CurrentDoc.Hyperlinks
        myDoc.Range.InsertAfter wombat.TextToDisplay & vbTab & wombat.Address & vbCrLf
    Next
    ' Debug.Print GetTickCount - starttime
    Application.ScreenUpdating = True
    myDoc.Range.ParagraphFormat.TabStops.Add CentimetersToPoints(7.5), wdAlignTabLeft, wdTabLeaderSpaces 'basic formatting
End Sub
Furthermore, an actual test of your assertion on performance (against a 234 page document with over 8000 hyperlinks) indicates that the contrary is true - performance of jpadie's solution (or at least my variant above) starts to convincingly outstrip the find/replace solution as the number of hyperlinks goes up.
RE: Extract all hyperlinks from a Word 2010 Document
Cheers
Paul Edstein
[MS MVP - Word]
RE: Extract all hyperlinks from a Word 2010 Document
I tried to experiment with storing the targets in a string and then finally inserting into a new document. I tested on a file with 280000 hyperlinks across 8000 pages and got bored after ten minutes (so force quit the app). In the meantime I wrote a PHP app to open the raw XML and retrieve the hyperlinks. That op takes milliseconds...
I know that VBA is not a real language, but I'm still really surprised by how badly optimised it is. Luckily I never have to use it for anything other than the most trivial things.
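The raw-XML route described above can be sketched outside Word. What follows is a minimal Python sketch, not the PHP app mentioned in the post (which was not shared). It assumes the document is saved as .docx, where the addresses of external hyperlinks are stored as relationships in word/_rels/document.xml.rels. The function name hyperlink_targets is made up for illustration. Links stored as HYPERLINK fields, or links in headers and footers (which have their own .rels parts), would not be picked up by this sketch.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the relationships part and the relationship type Word uses for hyperlinks.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def hyperlink_targets(docx_file):
    """Return the Target of every hyperlink relationship of the main document part.

    docx_file may be a path or a binary file-like object (hypothetical helper).
    """
    with zipfile.ZipFile(docx_file) as package:
        rels_xml = package.read("word/_rels/document.xml.rels")
    root = ET.fromstring(rels_xml)
    return [
        rel.get("Target")
        for rel in root.iter(f"{{{RELS_NS}}}Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    ]
```

The list comes back in the order the relationships are stored, which is not necessarily the order the links appear in the text.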
RE: Extract all hyperlinks from a Word 2010 Document
I did
RE: Extract all hyperlinks from a Word 2010 Document
CODE
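'On 64-bit Office, declare this as "Private Declare PtrSafe Function" instead.
Private Declare Function GetTickCount Lib "kernel32" () As Long

Sub ExtractHyperlinks()
    Dim starttime As Long
    Application.ScreenUpdating = False
    starttime = GetTickCount
    With ActiveDocument.Range
        'Step 1: hide everything.
        .Font.Hidden = True
        With .Find
            .ClearFormatting
            .Replacement.ClearFormatting
            .Forward = True
            .Wrap = wdFindContinue
            .Format = True
            .MatchCase = False
            .MatchWholeWord = False
            .MatchWildcards = False
            .MatchSoundsLike = False
            .MatchAllWordForms = False
            'Step 2: unhide all text in the Hyperlink style.
            .Style = "Hyperlink"
            .Text = ""
            .Replacement.Text = ""
            .Replacement.Font.Hidden = False
            .Execute Replace:=wdReplaceAll
            'Step 3: replace each run of hidden text with a paragraph mark.
            .ClearFormatting
            .Font.Hidden = True
            .Replacement.Text = "^p"
            .Execute Replace:=wdReplaceAll
            'Step 4: collapse runs of paragraph marks (this pattern needs a wildcard search).
            .ClearFormatting
            .MatchWildcards = True
            .Text = "[^13]{1,}"
            .Execute Replace:=wdReplaceAll
        End With
    End With
    Debug.Print GetTickCount - starttime
    'Turn screen updating back on.
    Application.ScreenUpdating = True
End Sub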
RE: Extract all hyperlinks from a Word 2010 Document
It isn't really VBA itself that is the culprit with your code; it is the fact that you are using relatively expensive (slow) Word operations: Collapse and InsertParagraph.
RE: Extract all hyperlinks from a Word 2010 Document
I wrote an alternative that just stored the addresses in a string and didn't write it anywhere (so no 'expensive' calls). I quit the app again after 25 minutes running on the same document (8k pages 200k+ hyperlinks).
Ho hum ...
RE: Extract all hyperlinks from a Word 2010 Document
I concede your point re the ultimate aim being to extract the addresses (something I hadn't picked up from pattyjean's last post), whereas my code was designed to preserve the hyperlinks as such.
FWIW, I tested a document containing 100,000 hyperlinks amongst 4,735,000 words spread over 19,003 pages. The optimised loop code to extract the addresses to a new document took 00:04:04, whereas the optimised F/R to delete everything except the hyperlinks took 00:07:53. I also tried an optimised loop to copy the hyperlinks to a new document; I gave up waiting after 01:30:00, by which time only one third of them had been processed.
Cheers
Paul Edstein
[MS MVP - Word]
RE: Extract all hyperlinks from a Word 2010 Document
Well, after trial and error I found out that an easy way to do the same thing is to save the Word document as XML; when you open it in Excel it gives you a clean column named target that easily identifies all the linked documents.
Now that I have this part of the process complete, the next step is to match up the link names with the friendly names (Excel formula =HYPERLINK()). It lets me rename the links to the text name, but copying it back into the Word document is the new challenge for me. Any ideas? I am going to post this in another category if it makes sense to you all.
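One way to get each link's display text alongside its target without the Excel round trip is to read both from the saved file. This is only a sketch in Python, assuming the document is saved as .docx (not the XML format used above) and that the links are ordinary w:hyperlink elements rather than HYPERLINK fields. The function name links_with_text is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_RELS = "http://schemas.openxmlformats.org/package/2006/relationships"


def links_with_text(docx_file):
    """Return (display text, target) pairs for external links, in document order."""
    with zipfile.ZipFile(docx_file) as package:
        body = ET.fromstring(package.read("word/document.xml"))
        rels = ET.fromstring(package.read("word/_rels/document.xml.rels"))

    # Map relationship ids (rId1, rId2, ...) to their targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{PKG_RELS}}}Relationship")
    }

    pairs = []
    for link in body.iter(f"{{{W}}}hyperlink"):
        rel_id = link.get(f"{{{R}}}id")
        if rel_id is None:
            continue  # link to a bookmark inside the document, no external target
        # The visible text may be split across several runs; join the pieces.
        text = "".join(t.text or "" for t in link.iter(f"{{{W}}}t"))
        pairs.append((text, targets.get(rel_id)))
    return pairs
```

Each pair could then be written out as one row, giving the same two columns (text and target) that the Excel step is trying to assemble.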
RE: Extract all hyperlinks from a Word 2010 Document
So you have a set of hyperlinks in Word and, in Excel, a corresponding set of hyperlinks in one column and their 'friendly' names in another, and you want the Word hyperlinks to display the 'friendly' names. Correct? If so, that's easily enough done. A couple of questions, though:
1. Are the hyperlinks in Word & Excel listed in the same order?
2. Are there any duplicates or instances of the same hyperlink with two or more 'friendly' names?
Cheers
Paul Edstein
[MS MVP - Word]
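Both questions can be checked mechanically once the two lists exist. The sketch below is in Python and assumes the Word targets and the Excel rows have already been exported as plain lists; the function names and the sample file names in the usage note are invented for illustration.

```python
from collections import defaultdict


def same_order(word_targets, excel_targets):
    """True if both lists hold the same targets in the same order."""
    return list(word_targets) == list(excel_targets)


def conflicting_names(rows):
    """rows: iterable of (target, friendly_name) pairs.

    Return {target: sorted names} for every target that appears
    with more than one distinct friendly name.
    """
    names = defaultdict(set)
    for target, friendly in rows:
        names[target].add(friendly)
    return {t: sorted(n) for t, n in names.items() if len(n) > 1}
```

For example, rows such as ("a.pdf", "Budget") and ("a.pdf", "Budget 2012") would be reported under "a.pdf", while a target repeated with the same name every time would not.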
RE: Extract all hyperlinks from a Word 2010 Document
Thanks for the response, in answer to your questions I have to give you the whole picture.
There are 91 different Word documents with 2000 attachments in PDF format.
Each Word document contains the hyperlinks, but at the end of the document we want to add a list of evidence with the list of hyperlinks and their friendly names.
All the hyperlinked files are in one folder, with the Word documents outside the folder.
But the final document will be a PDF version with all sets of clickable links. So after the list of evidence is added, the Word document will be saved as PDF. The hyperlinked files are on a drive and will be saved to flash drives.
1. Are the hyperlinks in Word & Excel listed in the same order? Could be; I haven't set it up yet.
2. Are there any duplicates or instances of the same hyperlink with two or more 'friendly' names? No. Each hyperlink might be in multiple documents, but with the same friendly name.
Does that answer some of your questions?
RE: Extract all hyperlinks from a Word 2010 Document
If you're hyperlinking to documents, I think you'll find the hyperlinks will have the full filepaths, including drive letters, etc for the target files. So, when you do your PDF conversion, that's what'll be replicated in the PDF. If you then copy the files to a USB stick or CD and open them on another computer, the hyperlinks will still be looking for the original filepaths on your computer and, in all likelihood, will fail.
As for the "list of evidence with the list of hyperlinks and their friendly names", that suggests some form of table, but it's not clear how the 'list of evidence' entries are to be compiled and matched with the hyperlinks. Also, it seems to me you don't need both the 'hyperlinks and their friendly names'. Rather, you should be able to have the hyperlinks displaying only their friendly names.
Cheers
Paul Edstein
[MS MVP - Word]
RE: Extract all hyperlinks from a Word 2010 Document
If you then copy the files to a USB stick or CD and open them on another computer, the hyperlinks will still be looking for the original filepaths on your computer and, in all likelihood, will fail.
The way we linked them, and it works, is to have the folder of attachments on each flash drive, and in the Word document each one is linked like attachments\filename.pdf. It works for the current links, but the List of Evidence is a different story. I don't want to link each one separately. I used the list of hyperlinks from the Word document, so I already have their names from the other step mentioned above. I match them up with the friendly name for each document in an Excel table and use the HYPERLINK function. The problem is: how do I copy the friendly name with the link to paste into the Word document? It brings over the path of the Excel file instead of the real link. I need some kind of function to keep the PDF together with the friendly name. Any clue? Does this make sense?
RE: Extract all hyperlinks from a Word 2010 Document
Even if your hyperlinked files are on a flash drive, by default they'll include the drive's letter in Word. Put the flash drive into another PC where it gets assigned a different drive letter and the hyperlinks will fail.
It's still not clear what you intend regarding the 'List of Evidence'. It is easy enough to modify the Word hyperlinks so they display the friendly names in the body of the document, rather than the actual paths, whilst hovering over them will display the actual paths. To that end, you don't need a separate 'List of Evidence'. If you want one, though, perhaps what you need is an Index to provide that list.
Cheers
Paul Edstein
[MS MVP - Word]
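Whether a given set of links is tied to a drive letter can be checked from an extracted list of targets. A minimal Python sketch follows, assuming the targets are plain Windows-style paths (not URLs) already pulled out of the document; the function names are hypothetical.

```python
from pathlib import PureWindowsPath


def has_drive_letter(target):
    """True if the target starts with a drive letter such as 'E:'."""
    drive = PureWindowsPath(target).drive
    # A drive letter is reported as two characters, e.g. 'E:'.
    # UNC shares (\\server\share) report a longer drive and are not counted here.
    return len(drive) == 2 and drive.endswith(":")


def drive_bound(targets):
    """Return the targets that would break if the drive letter changed."""
    return [t for t in targets if has_drive_letter(t)]
```

A relative target such as attachments\filename.pdf is not flagged, while one that begins with a drive letter is, so an empty result suggests the links should survive a move to another PC or flash drive.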
RE: Extract all hyperlinks from a Word 2010 Document
We can save the Word document as PDF and keep the links in the PDF, but when we click a hyperlink it takes us to the attachment and, on close, it closes everything.
For more detail, I found this same issue here: http://forums.adobe.com/message/4005350
There are 3 ways we can do this:
1: In the PDF reader, uncheck 'Open cross-document links in same window' under Edit > Preferences > Documents. This works great, but the SACS reviewer would have to follow these steps too. (We are using Adobe X; we don't know what version they would use.)
2: We can ask the SACS reviewer to hold Ctrl and enter to open the PDF in a new window, or
3: Can you deploy a configuration file (autorun) to add to the flash drive so they can just click the link and it opens in a new window?
We want to make this as easy as possible for them to review on a flash drive. Can you give us any advice or help with deployment? Is it possible?
Thanks for any information you can provide. This would have to work in a Mac environment as well.
RE: Extract all hyperlinks from a Word 2010 Document
You have 3000 attachments, or 3000 links?
It's still not clear to me how the hyperlinked content in the body of a given document is intended to relate to the 'index', which apparently uses the 'friendly' name. Doesn't the 'friendly' name get used as the display text in the body also? If not, how is a user meant to recognise which 'friendly' name in the 'index' relates to a given hyperlink in the body?
It's also still not clear as to how the 'index' is to be compiled. Is the idea to go through all the hyperlinks in the body, find the corresponding entries in the Excel workbook, then insert the 'friendly name' hyperlinks into the 'index'? What happens if the same hyperlink is found more than once? Should the 'index' entries be sorted and, if so, how?
PS: I've been away for a few weeks, hence the delay in replying.
Cheers
Paul Edstein
[MS MVP - Word]
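The questions above are still open, but one possible reading can be sketched to make the choices concrete. This Python sketch assumes each target should appear once in the list, that entries are sorted by friendly name, and that targets with no friendly name are reported instead of listed. None of these choices is confirmed in the thread, and the function name build_evidence_list is made up.

```python
def build_evidence_list(body_targets, friendly_names):
    """Build a de-duplicated list of evidence.

    body_targets: hyperlink targets in the order they occur in the body.
    friendly_names: dict mapping target -> friendly name (e.g. from the Excel table).
    Returns (entries, missing): entries is a list of (friendly name, target)
    sorted by name; missing lists targets that have no friendly name.
    """
    entries = []
    missing = []
    seen = set()
    for target in body_targets:
        if target in seen:
            continue  # assumption: a repeated link is listed only once
        seen.add(target)
        if target in friendly_names:
            entries.append((friendly_names[target], target))
        else:
            missing.append(target)
    # Assumption: sort alphabetically by friendly name, ignoring case.
    entries.sort(key=lambda entry: entry[0].lower())
    return entries, missing
```

Changing the sort key, or dropping the sort to keep document order, covers the other answers to the sorting question.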