Get "http://......" address from an HTML file

preethib · Aug 16, 2000

Hi Teks,

I am trying to get a list of all actions (http anchors) from an HTML file into a text file.

HTML file:
.............
<A href="http://www.yahoo.com">http://www.yahoo.com/index/in=1&login.html</A>
.............
<A href="http://www.abc.com">http://www.abc.com/cgi-bin?q=car</A>
..........
<A href="http://www.av.com">http://www.av.com</A>
...
etc.

I would like to extract just the http locations into a log file which looks like this.

LOG file:

http://www.yahoo.com/index/in=1&login.html

http://www.abc.com/cgi-bin?q=car

http://www.av.com

..

I would like to do this using a regular expression and a split function, since the same http address occurs twice in each action line. I don't know how to write a regexp that matches special characters like '&', '?', '=', etc. They are irregular: not every http address in my list has them. I am wondering whether I need to put '\/', '\?', '\=' for all the special characters, or what? And how?

Please help,

Preethi
preethib@yahoo.com

goBoating · Aug 17, 2000

This might be what you are looking for.....
[tt]
#!/usr/local/bin/perl -w
# Sample text containing a couple of HTML anchors, spread over several lines.
my $str = 'several lines of text
with a few html anchors like <A HREF="http://www.cpan.org">
Label for the first anchor</A> and <A HREF="http://www.perl.org">Perl
Mongers</a>
and a little more stuff.';

# Find each anchor in turn; $1 holds the text between the quotes.
while ($str =~ /<A HREF="(.*?)">/gs)
    {
    print "Matched $1\n";
    }
[/tt]
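
As a rough sketch of how the loop above might be adapted to the HTML in the question: that sample writes `href` in lower case and the wanted log lists the link text (the part between `>` and `</A>`), so this variant matches case-insensitively and captures the text rather than the HREF value. The helper name `extract_links` and the sample string are assumptions for illustration, not something from the thread. Note that '&', '?' and '=' need no escaping here, because they only occur in the data being matched, never in the pattern itself.

```perl
#!/usr/local/bin/perl -w
use strict;

# Hypothetical helper (the name is not from the thread): returns the
# link text of every anchor found in $html, in document order.
sub extract_links {
    my ($html) = @_;
    my @links;
    # <A\b[^>]*>  opening tag with any attributes (i: <A HREF or <a href)
    # (.*?)       minimal capture of the text up to the closing </A>
    while ($html =~ /<A\b[^>]*>(.*?)<\/A>/gis) {
        my $text = $1;
        $text =~ s/^\s+//;    # trim leading whitespace
        $text =~ s/\s+$//;    # trim trailing whitespace
        push @links, $text;
    }
    return @links;
}

# Sample input modelled on the question.
my $html = <<'END_HTML';
.............
<A href="http://www.yahoo.com">http://www.yahoo.com/index/in=1&login.html</A>
.............
<A href="http://www.abc.com">http://www.abc.com/cgi-bin?q=car</A>
..........
<A href="http://www.av.com">http://www.av.com</A>
END_HTML

# One link per line; redirect the output to build the log file.
print "$_\n" for extract_links($html);
```

Redirecting the output (for example `perl extract.pl > links.log`, both file names made up here) would give a log like the one asked for. A pattern like this is only a quick fix for simple, regular HTML; anchors with nested tags or unusual quoting would need a real HTML parser.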

Regex explained:
[tt]
/<A HREF="(.*?)">/gs

  (.*?)  - parens catch the match in $1
   .     - any char
    *    - any number (zero or more) of the preceding char
     ?   - match minimally; without the '?' the match would
           run from the first '<A HREF="' to the last '">'
  g      - perform the match repeatedly (globally)
  s      - treat newline chars as regular chars, so '.' matches them too
[/tt]

'hope this helps
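
To make the "match minimally" point from the regex explanation concrete, here is a small sketch comparing `(.*?)` with `(.*)` on a one-line string holding two anchors. The sample string and variable names are made up for illustration.

```perl
#!/usr/local/bin/perl -w
use strict;

my $str = '<A HREF="http://www.cpan.org">CPAN</A> and <A HREF="http://www.perl.org">Perl</A>';

# Minimal match: (.*?) stops at the first '">' after each HREF=",
# so /g finds one URL per anchor.
my @minimal = $str =~ /<A HREF="(.*?)">/gs;

# Greedy match: (.*) runs on to the last '">' in the string,
# so everything between the two anchors is swallowed into one match.
my @greedy = $str =~ /<A HREF="(.*)">/gs;

print "minimal: $_\n" for @minimal;
print "greedy:  $_\n" for @greedy;
```

The minimal version yields the two URLs separately, while the greedy version yields a single capture that starts at the first URL and ends at the second.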

