Get "http://......" address from an HTML file

Status
Not open for further replies.

preethib

Programmer
Jul 20, 2000
39
IN
Hi Teks,

I am trying to extract a list of all the actions (HTTP anchors) from an HTML file into a text file.

HTML file:
.............
<A href=".............
<A href="..........
<A href="...
etc.

I would like to extract just the http locations into a log file which looks like this.

LOG file:
..

I want to do this using a regular expression and a split function, since the same HTTP address occurs twice in each action line. I don't know how to write a regexp that matches special characters such as '&', '?', '=', etc. They are irregular: not all of the HTTP addresses in my list contain them. Do I need to write '\/', '\?', '\=' for every special character? If so, how?

Please help,

Preethi
preethib@yahoo.com
 
This might be what you are looking for.....
[tt]
#!/usr/local/bin/perl -w
$str = 'several lines of text
with a few html anchors like <A HREF="Label for the first anchor</A> and <A HREF=" Mongers</a>
and a little more stuff.';

while ($str =~ /<A HREF="(.*?)">/gs)
    {
    print "Matched $1\n";
    }
[/tt]
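
A slightly fuller sketch along the same lines, in case it is useful. It assumes the anchors may be written as either href or HREF (the question uses lowercase), and that a repeated address should be logged only once. The sub name extract_hrefs and the example.com URLs are made up for illustration; they are not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the unique href values found in $html, in order of first appearance.
# (extract_hrefs is a hypothetical helper name.)
sub extract_hrefs {
    my ($html) = @_;
    my (%seen, @urls);
    # /i accepts href or HREF, /s lets . match newlines, /g repeats the match.
    while ($html =~ /<a\s+href\s*=\s*"(.*?)"/gis) {
        push @urls, $1 unless $seen{$1}++;
    }
    return @urls;
}

# Placeholder input; the URLs are illustrative only.
my $html = <<'END_HTML';
<A href="http://example.com/a?x=1&y=2">first</A>
<a HREF="http://example.com/b">second</a>
<A href="http://example.com/a?x=1&y=2">first again</A>
END_HTML

# One address per line; redirect STDOUT to a file to get the log.
print "$_\n" for extract_hrefs($html);
```

The %seen hash does the de-duplication, so no split is needed. For messy real-world HTML (single quotes, extra attributes before href), a parser module is likely a safer choice than a regex.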

Regex explained:
[tt]
/<A HREF="(.*?)">/gs

(.*?)  - parens capture the match in $1
  .    - any char
  *    - any number of chars
  ?    - match minimally - without '?' the match would run
         from the first '<A' to the last '>'
g      - perform the match repeatedly (globally)
s      - treat newline chars as regular chars
[/tt]
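
To see what the '?' after '.*' changes, here is a small comparison. The sample line and its example.com URLs are placeholders made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical sample line with two anchors; the URLs are placeholders.
my $line = '<A HREF="http://example.com/one">One</A> <A HREF="http://example.com/two">Two</A>';

# Minimal match: (.*?) stops at the first "> it can reach.
my @minimal = $line =~ /<A HREF="(.*?)">/gs;

# Greedy match: (.*) runs on to the last "> in the string.
my @greedy = $line =~ /<A HREF="(.*)">/gs;

print "minimal: $_\n" for @minimal;
print "greedy:  $_\n" for @greedy;
```

The minimal version yields the two addresses separately. The greedy version yields a single capture that swallows everything between the first opening quote and the last closing quote.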

Hope this helps.
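
On the question about special characters: when the address is captured with (.*?), nothing needs escaping, because '.' matches '?', '&' and '=' like any other character. Escaping only matters when a known address is placed inside the pattern itself. Of the characters mentioned, '?' is a metacharacter and '/' needs a backslash only when '/' is the pattern delimiter; '&' and '=' are not special. Perl's \Q...\E (or quotemeta) escapes everything for you. A minimal sketch, with a made-up URL:

```perl
use strict;
use warnings;

# Hypothetical URL containing regex metacharacters (? and .).
my $url  = 'http://example.com/page?id=1&x=2';
my $html = '<A HREF="http://example.com/page?id=1&x=2">Page</A>';

# Capturing with (.*?) needs no escaping at all.
my ($captured) = $html =~ /<A HREF="(.*?)">/s;

# Matching a known URL literally: \Q...\E escapes the metacharacters.
my $escaped_hit = $html =~ /HREF="\Q$url\E"/ ? 1 : 0;

# Without escaping, ? makes the preceding "e" optional instead of matching itself.
my $unescaped_hit = $html =~ /HREF="$url"/ ? 1 : 0;

print "captured: $captured\n";
print "escaped pattern matches:   $escaped_hit\n";
print "unescaped pattern matches: $unescaped_hit\n";
```

Here the escaped pattern matches and the unescaped one does not, because the literal '?' in the text is never consumed.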
 