Meta tag parsing (first try at designing a mod)

Status
Not open for further replies.

spydermonkey

Programmer
May 24, 2004
31
US
This IS reinventing the wheel, but it's a learning process as this is my first attempted module. You call meta_gather($url) much as you would LWP's get($url), but it extracts just the meta tags rather than the whole page source.

This worked FINE when I stored every match using $count, like $array[$count] = "$1::$2";. But after I added all the IF tests to insert things in the proper order (as outlined with the # numbers), no results are displayed (just a bunch of blank lines).

Can someone help me figure out the bug? PerlGoodies is a temp name btw, lol, it will be changed later.

ALSO, how do I export variables back into the script calling it? Printing @meta_results inside the module is just for testing; I need to pass the array back to the script somehow so the user can do whatever they want with it. Can someone help me out with that?

Code:
package PerlGoodies;
use Exporter;
@ISA = 'Exporter';
@EXPORT_OK = qw(meta_gather);
use strict;


sub meta_gather($)
{

use LWP::Simple;
require HTTP::Status;


my($url) = @_;

my $p_content = get($url);

my @meta_results;
my $count = 0;


#1 = description
#2 = keywords
#3 = abstract
#4 = author
#5 = robots
#6 = distribution
#7 = language
#8 = rating
#9 = copyright
#10 = distributor

  while($p_content =~  /<meta\s+name=\"(.+?)\"\s+content=\"(.+?)\">/ig) 
 {
      $count++;


      if($1 =~ /^description$/i)
      {
         $meta_results[1] = "$2";
      }
      elsif($1 =~ /^keywords$/i)
      {
         $meta_results[2] = "$2";
      }
      elsif($1 =~ /^abstract$/i)
      {
         $meta_results[3] = "$2";
      }
      elsif($1 =~ /^author$/i)
      {
         $meta_results[4] = "$2";
      }
      elsif($1 =~ /^robots$/i)
      {
         $meta_results[5] = "$2";
      }
      elsif($1 =~ /^distribution$/i)
      {
         $meta_results[6] = "$2";
      }
      elsif($1 =~ /^language$/i)
      {
         $meta_results[7] = "$2";
      }
      elsif($1 =~ /^rating$/i)
      {
         $meta_results[8] = "$2";
      }
      elsif($1 =~ /^copyright$/i)
      {
         $meta_results[9] = "$2";
      }
      elsif($1 =~ /^distributor$/i)
      {
         $meta_results[10] = "$2";
      }
  }

foreach (@meta_results) { print "$_\n";}

return;
}

1;

__END__
 
Hi Spydermonkey.

Sorry, I don't have more to offer as a solution,
but simply adding a temp variable right after your
count++ statement to hold the value of $2 seems to work.

$count++;
my $a2 = $2;   # "my" is needed because the module has "use strict"

and then use it like this

elsif($1 =~ /^robots$/i)
{
$meta_results[5] = "$a2";
}

Not sure why the initial value of $2 is being lost though.
 
I'd be thinking that the regexes (pattern matches) in the if statements are overwriting the $2 variable

--Paul
 
Paul's right. Each time it goes through the loop
[tt]
while ($p_content =~ /<meta\s+name=\"(.+?)\"\s+content=\"(.+?)\">/ig)
[/tt]
the name attribute will go into $1 and the content into $2 (as you expect). However when it gets to
[tt]
if($1 =~ /^description$/i)
[/tt]
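that's a new pattern match, and as soon as it succeeds the capture variables, including $1 and $2, are reset. The new pattern has no capture groups, so both become undefined, which is why you only see empty lines. Like crackn101 says, you need to copy $2 (and $1) into temp variables next to the [tt]$count++;[/tt] line.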
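
The effect is easy to reproduce in isolation. Here is a minimal sketch (the sample tag and the variable names are made up for illustration) showing that $2 is gone once a second match succeeds, while a copy taken beforehand survives:

```perl
use strict;
use warnings;

# Invented sample input, standing in for the fetched page source.
my $html = '<meta name="robots" content="noindex">';
my ($clobbered, $saved);

if ($html =~ /<meta\s+name="(.+?)"\s+content="(.+?)">/i) {
    # Copy the captures immediately, before any other regex runs.
    my ($name, $content) = ($1, $2);

    if ($1 =~ /^robots$/i) {
        # This match succeeded and has no capture groups,
        # so $2 no longer holds the content attribute.
        $clobbered = defined $2 ? $2 : '(undef)';
        $saved     = $content;
    }
}

print "clobbered: $clobbered\n";
print "saved: $saved\n";
```

Testing $name instead of $1 in the inner if would be the tidier habit, since it no longer depends on what the last match left behind.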

I would make two other observations. First, it seems rather long-winded to have that multi-legged if statement. Why not use a loop, something like this (untested, and note that it numbers the slots from 0 rather than 1)?
Code:
my @meta_results;
my $count = 0;
my @names = ("description",
             "keywords",
             "abstract",
             "author",
             "robots",
             "distribution",
             "language",
             "rating",
             "copyright",
             "distributor");
my $name;
my $content;
my $i;

while($p_content =~  /<meta\s+name=\"(.+?)\"\s+content=\"(.+?)\">/ig) {
   $count++;
   $name = $1;
   $content = $2;

   for ($i = 0; $i <= $#names; $i++) {

      if($name =~ /^$names[$i]$/i) {
         $meta_results[$i] = $content;
         last;  # exit the for loop, as we've found a match
      }
   }
}

Secondly, be aware that there are other ways to write a meta tag that your module won't (currently) pick out...
[tt]
<meta name="language" content="eng" />
<meta name="language" scheme="ISO 639-2/T" content="eng">
<meta content="eng" name="language">
[/tt]
So I'm afraid you need to be more sneaky with your regexes!
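
As a rough sketch of what "more sneaky" might look like: instead of one regex that fixes the attribute order, this grabs each meta tag first and then collects its attributes separately. parse_meta_tags is a hypothetical name, not part of the original module, and the sample HTML is invented. It will still trip on a > inside an attribute value, so a real HTML parser such as HTML::Parser from CPAN is the safer route for arbitrary pages.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash of lower-cased meta name => content.
sub parse_meta_tags {
    my ($html) = @_;
    my %meta;

    # Grab the attribute text of each <meta ...> tag, whatever it contains.
    while ($html =~ /<meta\b([^>]*)>/ig) {
        my $attrs = $1;    # copy at once, before the next regex runs
        my %attr;

        # Collect every key="value" or key='value' pair, in any order.
        while ($attrs =~ /([\w-]+)\s*=\s*(["'])(.*?)\2/g) {
            $attr{ lc $1 } = $3;
        }

        # Skip tags without both attributes (http-equiv tags, for example).
        next unless defined $attr{name} && defined $attr{content};
        $meta{ lc $attr{name} } = $attr{content};
    }
    return %meta;
}

# Invented sample covering the three variations listed above.
my %meta = parse_meta_tags(<<'HTML');
<meta name="language" content="eng" />
<meta name="robots" scheme="x" content="noindex">
<meta content="Chris" name="author">
HTML

print "$_ = $meta{$_}\n" for sort keys %meta;
```

Returning the hash from the sub, rather than printing inside the module, is also one way to hand the results back to the calling script: the caller just assigns the return value, as in the last few lines.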


-- Chris Hunt
 