
Searching all occurrences of images in an HTML page.


valentino (Technical User, US), Aug 30, 2001
Hi all,
I have a bunch of HTML files, and I need to spit out a list of all the image tags within them.

Anyone know how to go about it?

I tried this, but it just doesn't work right.

open(LOOKFILE,"myfile.html") or die ("couldn't open");
while (<LOOKFILE>){
/<img (.*)>/i;
print "$1\n" if $1;
}
close(LOOKFILE);

Gary "thanks in advance" Haran
 

/<img\s+([^>]*)>/i;



Disclaimer:
Beware: Studies have shown that research causes cancer in lab rats.
 
mbaranski,

Ok. That looks good, but can you explain the code for me.

Thanks,
Gary
Gary M. Gordon, LLC
webmaster@garymgordon.com
Certified Web Developer ::
Application Programmer
 
/<img\s+([^>]*)>/i;

Match the start of an img tag, '<img', followed by one or more whitespace characters, '\s+', then capture everything up to the closing '>'. The '/i' makes the match case-insensitive.

HTH. Good one.

Tiz --
Tim <tim@planetedge.co.uk>
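
For anyone following along, here is a minimal sketch of how that pattern might be dropped into a script. The sample lines are made up for illustration (they stand in for the contents of myfile.html), and using /g to collect several tags per line is my own addition, not something from the posts above. It also tests the match itself instead of $1, since $1 is only meaningful after a successful match.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for the lines of an HTML file.
my @lines = (
    '<p>No image here</p>',
    '<IMG SRC="logo.gif" width=2 height=3>',
    '<a href="x.html"><img src="a.png"><img src="b.png"></a>',
);

my @found;
for my $line (@lines) {
    # In list context with /g the match returns every capture on the line,
    # and an empty list when nothing matches, so $1 is never consulted.
    push @found, $line =~ /<img\s+([^>]*)>/gi;
}

print "$_\n" for @found;
```

With these three lines the script should list the attribute strings of the three img tags and skip the paragraph line.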
 
tiz,

Could you break that down for me such as ...

\s+([^>]*)>/


\s+ tells it what?

Then ... character by character after that ... please explain what each piece is doing.

I need more help ... if you can explain this.

Thanks,
Gary
 
Okay,

/<img means the pattern won't match unless it contains <img at least.

\s means a white space character. So, a whitespace character must follow <img which is correct according to the HTML DTD.

+ means one or more times. So, at least one whitespace character (\s+).

([^>]*) searches for the end '>' to confirm that what you have is actually an <img ... > tag.

HTH,

Tiz

--
Tim <tim@planetedge.co.uk>
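
A quick way to see what the \s+ part requires is to try the pattern on a few strings. This is only a sketch; the candidate strings are invented for the test and do not come from the thread.

```perl
use strict;
use warnings;

my $img_re = qr/<img\s+([^>]*)>/i;

# Hypothetical candidates: a space, a tab, several spaces, a different
# tag name, and an img tag with no whitespace at all.
my @candidates = (
    "<img src=x>",
    "<img\tsrc=x>",
    "<img    src=x>",
    "<image src=x>",
    "<img>",
);

my @matched = grep { $_ =~ $img_re } @candidates;

print "matched: $_\n" for @matched;
```

Only the first three candidates match: \s accepts a space or a tab, + allows a run of them, and the last two strings have no whitespace directly after '<img'.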
 
Ok, ... we're getting there. :)

([^>]*)

I kind of understand that whatever is placed between ( and ) is to be remembered.

But can you explain each of the remaining characters, as you did with \s and the rest?

Thanks,
Gary
 
hehehe, okay, you really want this bit by bit don't you!

([^>]*)


the [^>] matches one occurrence of any character inside the brackets [], so this will match one occurrence of > at the end of the <img..> tag if the match is true.

The parentheses () act as a grouping operator, and whatever is matched inside the parentheses can be recalled later using $1, $2 etc. So, ([^>]*) holds the result of [^>]* in $1, for example.

* means zero or more occurrences of the item just before it.

Tiz



--
Tim <tim@planetedge.co.uk>
 
Ok. That was good.

Now ... regarding

([^>]*)


Questions:

1) Why is the * on the inside of the ( )?

2) Are the ( ) really necessary? Or could they have been omitted?

3) I would have thought that the following might have worked.

/<img\s+\w+>/i;

What would be wrong with the above??

PS: I still don't understand the ([^>]*) portion too well. Can you possibly explain this more clearly, at a real beginner's level?

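
The three questions above can be checked with a short experiment. This is a sketch rather than an authoritative answer; the sample tag and the variable names are my own.

```perl
use strict;
use warnings;

my $tag = '<img src="blah" width=2 height=3>';

# 1) Quantifier inside the group: the group captures the whole run.
my ($inside) = $tag =~ /<img\s+([^>]*)>/i;

# Quantifier outside the group: the group is repeated one character at a
# time and ends up holding only the last character it matched.
my ($outside) = $tag =~ /<img\s+([^>])*>/i;

# 2) Without parentheses the pattern still matches, but nothing is captured.
my $matches_without_parens = $tag =~ /<img\s+[^>]*>/i ? 1 : 0;

# 3) \w+ covers only letters, digits and underscore, so it stops at '='
#    and the following '>' is never reached.
my $matches_with_w = $tag =~ /<img\s+\w+>/i ? 1 : 0;

print "inside:  $inside\n";
print "outside: $outside\n";
print "no parens matches: $matches_without_parens\n";
print "\\w+ matches:       $matches_with_w\n";
```

So the parentheses are only needed when you want the attribute text back in $1, and the * belongs inside them so that the whole run is remembered.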
 
Hi Gary,

To make this less confusing, let's change the regexp to:

/<img\s+(.*)>/i;

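This should work fine when the tag's closing '>' is the last '>' on the line. Note that .* is greedy, so if anything containing '>' follows the tag on the same line, $1 will include it too.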

This means we're checking whether the line contains '<img' followed by one or more whitespace characters and then everything up to the ending '>'. "Everything" is represented by .* (similar to DOS wildcards, for example), and because we've asked for everything after img to be stored in $1 (by using the parentheses), we can check $1 to see whether an occurrence has been found. For example:

If the img tag in our html file was:

<img src="blah" width=2 height=3>

$1 would contain:

src="blah" width=2 height=3

Also, this tutorial may help:

Hope this helps. Regexps are quite difficult to understand but very powerful.

Tiz --
Tim <tim@planetedge.co.uk>
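
One thing worth checking before settling on (.*): on a line where something else follows the tag, the two patterns give different results. A small sketch, using a made-up line:

```perl
use strict;
use warnings;

# Hypothetical line with more markup after the img tag.
my $line = '<img src="a.png"><b>text</b>';

# .* is greedy: it grabs as much as it can and only backs off far enough
# to leave one '>' for the end of the pattern, i.e. the LAST '>'.
my ($greedy) = $line =~ /<img\s+(.*)>/i;

# [^>]* cannot cross a '>', so it stops at the FIRST one.
my ($negated) = $line =~ /<img\s+([^>]*)>/i;

print "greedy:  $greedy\n";
print "negated: $negated\n";
```

On the single-tag example in the post the two patterns return the same text; they only diverge when another '>' appears later on the line.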
 
Gary, I gave a character-by-character description of this same regular expression to another person in thread 219-69684. Check that out.

There is a problem with the description above. The part of the re ([^>]*) matches zero or more chars that are NOT >. The ^ as the first character in a character class NEGATES the class. What this does is pick up all characters from the whitespace to the closing > and group them together.

This is a common notation when you are matching strings with a closing delimiter: use a character class of "not the closing delimiter", zero or more times (or one or more times, as required), followed by the closing delimiter.

Meddle not in the affairs of dragons,
For you are crunchy, and good with mustard.
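
The same breakdown can be written into the pattern itself with Perl's /x flag, which allows whitespace and comments inside a regex. This is just an illustrative sketch; the variable names and the second pattern (for the src value) are my own additions that reuse the "not the closing delimiter" idiom.

```perl
use strict;
use warnings;

my $img_re = qr{
    <img        # literal start of the tag
    \s+         # one or more whitespace characters
    (           # start of capture group 1
      [^>]*     #   zero or more characters that are NOT '>'
    )           # end of capture group 1
    >           # the closing delimiter itself
}xi;

my $html = '<IMG src="blah" width=2 height=3>';

# List assignment returns the captures, or an empty list on failure.
my ($attrs) = $html =~ $img_re;

# Same idiom with a double quote as the closing delimiter.
my ($src) = $html =~ /src="([^"]*)"/i;

print "attributes: $attrs\n";
print "src value:  $src\n";
```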
 
I'm not sure, but if I copy the link and paste it here's what results, so try clicking on this:

thread219-69684

Meddle not in the affairs of dragons,
For you are crunchy, and good with mustard.
 
One question ...

What is the purpose (or requirement) of using [ ] to create a character class? What is the use of this, as opposed to just using:

1. /<img\s+^>*>/i;

or

2. /<img\s+(^>*)>/i;


instead of the example shown:

/<img\s+([^>]*)>/i;


WHAT'S THE DIFFERENCE IN THE ABOVE 2 OPTIONS I LISTED??

Gary
 
The NOT character ^ only negates inside a character class, and only as the first character; anywhere else inside the brackets it is just another member of the class. Its purpose is to negate the whole class. A character class lets you specify a group, or a range, or both, of characters, any ONE of which should match (you still need to specify after the class how many times it should match). In this case the class consists of just a single character, but it could also be something like [0-9a-fA-F] (matches one hex digit). The single-character class is used here because it's the only way (that I know of) to say "any character BUT this one".

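In the other two options you asked about, the ^ sits outside a character class, so it does not negate anything. There Perl reads it as the start-of-string anchor, which can never match after '<img\s+', so neither pattern will match an img tag.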
Meddle not in the affairs of dragons,
For you are crunchy, and good with mustard.
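
To see the difference between the bracketed and unbracketed forms, here is a short sketch that runs all three patterns from the question against one tag. The sample strings and variable names are invented for the demonstration, and the hex-digit class is written as [0-9a-fA-F].

```perl
use strict;
use warnings;

my $tag = '<img src="blah">';

# Inside brackets and first: ^ negates the class, and the tag matches.
my $negated_class = $tag =~ /<img\s+([^>]*)>/i ? 1 : 0;

# Outside brackets ^ no longer negates anything; neither variant from
# the question matches this tag.
my $bare_caret    = $tag =~ /<img\s+^>*>/i   ? 1 : 0;
my $grouped_caret = $tag =~ /<img\s+(^>*)>/i ? 1 : 0;

# Inside brackets but not first: ^ is an ordinary member of the class.
my $literal_caret = 'a^b' =~ /a[b^]b/ ? 1 : 0;

# A range-based class: keep only the strings that are one hex digit.
my @hex = grep { /^[0-9a-fA-F]$/ } qw(7 c F g);

print "negated class: $negated_class\n";
print "bare caret:    $bare_caret\n";
print "grouped caret: $grouped_caret\n";
print "literal caret: $literal_caret\n";
print "hex digits:    @hex\n";
```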
 