Reading metatags

Status
Not open for further replies.

redstarfcs1

Programmer
Sep 21, 2005
7
RO
How to read metatags without any perl module?
 
this might help?

Code:
#!/usr/bin/perl

undef $/;
$_ = <DATA>;
$/ = "\n";

@metatags = m/(<meta[^>]+>)/ig;

foreach (@metatags) {
  ($name, $content) = m/name="?([^"]+)"? content="?([^"]+)"?/;
  print "$name\t$content\n";
}

__DATA__
<HEAD>
<TITLE>Stamp Collecting World</TITLE>
<META name="description" content="Everything you wanted to know about stamps, from prices to history.">
<META name="keywords" content="stamps, stamp collecting, stamp history, prices, stamps for sale">
</HEAD>


Kind Regards
Duncan
 
I would maybe do it just a little bit differently:

Code:
#!perl
use strict;

$_ = do { local $/; <DATA> };            # slurp without changing $/ globally
my $len = index(lc($_), '</head>');
$_ = substr $_, 0, $len if $len >= 0;    # keep the whole text if </head> is missing

my @metatags = m/(<meta[^>]+>)/ig;

foreach (@metatags) {
  my ($name, $content) = m/name="?([^"]+)"? content="?([^"]+)"?/;
  print "$name\t$content\n";
}

__DATA__
<HEAD>
<TITLE>Stamp Collecting World</TITLE>
<META name="description" content="Everything you wanted to know about stamps, from prices to history.">
<META name="keywords" content="stamps, stamp collecting, stamp history, prices, stamps for sale">
</HEAD>
<BODY>
bunch of stuff down here we shouldn't need to parse
</BODY>
</HTML>

There is no real need to check for meta tags after the closing </head> tag. Of course I would also just use ruby, could probably do it with a single character. [bigcheeks]
 
Hi

Just as a sad conclusion, Duncan, I sucked a lot with such things. Some people prefer to use single quotes in HTML tag attributes, other ones use different cases in attribute names.
[tt]
<META Name="description" Content='Everything you wanted to know about stamps, from prices to history.'>
<META NAME='keywords' CONTENT="stamps, stamp collecting, stamp history, prices, stamps for sale">
[/tt]
So this usually catches a bit more:
Code:
($sep, $name, $sep, $content) = m/name=(["']?)(.+)\1 content=(["']?)(.+)\3/i;
But it is still useless if [tt]name[/tt] and [tt]content[/tt] are reversed:
[tt]
<META content='Everything you wanted to know about stamps, from prices to history.' name="description">
[/tt]
And there are cases where a [tt]meta[/tt] tag is wrapped across multiple lines.
It is horrible what people can do to poor HTML...

Feherke.
 
Hi

A question: would it be better to write the regular expression something like this, to exclude the currently used quotation mark from the value?
Code:
($sep, $name, $sep, $content) = m/name=(["']?)([red][^\1][/red]+)\1 content=(["']?)([red][^\3][/red]+)\3/i;
This is only theory. Can someone correct it so that it works properly?

Feherke.
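
On the question above: inside a character class, [tt]\1[/tt] is not treated as a backreference, so [tt][^\1][/tt] does not mean "anything except the captured quote". One commonly used alternative is a negative lookahead, [tt](?:(?!\1).)+[/tt], which consumes one character at a time as long as it is not the captured quote. A minimal sketch, assuming the values are always quoted and name comes before content; the sub name parse_meta is made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (name, content) for a single <meta ...> tag,
# or an empty list when the tag does not match.
sub parse_meta {
    my ($tag) = @_;
    # (["']) captures the opening quote; (?:(?!\1).)+ then matches a run
    # of characters that are not that same quote.
    if ($tag =~ /name=(["'])((?:(?!\1).)+)\1\s+content=(["'])((?:(?!\3).)+)\3/is) {
        return ($2, $4);
    }
    return;
}

my ($name, $content) = parse_meta(q{<META name="description" content="It's my site">});
print "$name\t$content\n";
```

Note that the quote is no longer optional here: with an empty capture the lookahead would always fail, so unquoted values need separate handling.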
 
This is really ugly, but it works with the data you provided and it doesn't matter which order the name and content fields are provided.

Code:
local $/;    # undefine the input record separator so <DATA> is read in one go
my ($text) = <DATA>;
my ($headers) = $text =~ m[<head>(.*)</head>]is;

while ($headers =~ m!<META\s+(\w+)\s*=\s*['"]([^'"]+)['"]\s+(\w+)\s*=\s*['"]([^'"]+)['"]!isg) {
    print "$2 - $4", "\n";
}
 
Hi

Yes, rharsh, I was thinking of something similar. But only the quote actually used as the delimiter has to be excluded from the value:
[tt]
<META name="description" content="It's my site">
[/tt]
And the [tt]meta[/tt] tags could have more than two attributes :
[tt]
<META lang="en" name="description" content="It's my site">
<META lang="it" name="description" content="E il mio sito">
[/tt]
Sorry if this seems to go off topic, but I'm really interested in a bulletproof solution (without character-by-character parsing).

Feherke.
 
The most reliable way to do this would be to use a proper tag-aware parser such as HTML::TokeParser::Simple. Here's a quick example to extract the attributes from <meta> tags:
Code:
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser::Simple;

# create a new parser
my $p = HTML::TokeParser::Simple->new( 'test.html' )
   or die "Cannot open test.html: $!";

# just want to keep track of how many meta tags we've seen
my $tagno = 1;

# loop through all the tokens in the page
while ( my $t = $p->get_token ) {
   # if we've found a meta tag
   if ( $t->is_start_tag( 'meta' ) ) {

      # print some information
      print 'Tag #'.$tagno++."\n";

      # get the attributes from the tag
      my $attrs = $t->get_attr;

      # print the attributes
      print "$_ - $attrs->{ $_ }\n" for ( keys %$attrs );

      # print a line to distinguish it from the next tag
      print '-' x 30, "\n";
   }
}
 
Sorry feherke

I don't understand what you mean by:-

Just as a sad conclusion, Duncan, I sucked a lot with such things. Some people prefer to use single quotes in HTML tag attributes, other ones use different cases in attribute names.


Kind Regards
Duncan
 
Hi

Duncan, your code is a good starting point; something like that was my first try too. But it works well only in aseptic laboratory conditions and may fail when parsing an HTML document written by others. My final conclusion was that it is safer to use two regular expressions.

Feherke.
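
One way to read the "two regular expressions" idea, sketched under the assumption that each attribute gets its own pattern so that their order in the tag no longer matters. The sub name meta_pair is hypothetical, and attribute values are assumed to be quoted:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: look up name and content independently of each other.
sub meta_pair {
    my ($tag) = @_;
    # Each pattern captures its own opening quote and reads lazily up to the
    # matching closing quote; /s lets a value span line breaks.
    my $name    = $tag =~ /\bname\s*=\s*(["'])(.*?)\1/is    ? $2 : undef;
    my $content = $tag =~ /\bcontent\s*=\s*(["'])(.*?)\1/is ? $2 : undef;
    return ($name, $content);
}

my $html = <<'HTML';
<META content='Everything you wanted to know about stamps, from prices to history.' name="description">
<META NAME='keywords'
      CONTENT="stamps, stamp collecting, stamp history, prices, stamps for sale">
HTML

# [^>]* also matches newlines, so wrapped tags are picked up too.
for my $tag ($html =~ /(<meta\b[^>]*>)/ig) {
    my ($name, $content) = meta_pair($tag);
    next unless defined $name && defined $content;
    print "$name\t$content\n";
}
```

This still has blind spots: text such as name= inside another attribute's value can confuse it.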
 
Kev - check the usernames - the OP who asked for a non-module solution has disappeared, hence the HTML::TokeParser::Simple recommendation - Gotcha!! :)
 
redstarfcs1

I have spent a bit more time looking into this for you and have come up with this:-

Code:
#!/usr/bin/perl

undef $/;
$_ = <DATA>;
$/ = "\n";

@metatags = m/(<meta[^>]+>)/ig;

foreach (@metatags) {
  if (s/(name|content)=['"]?([^'"]+)['"]? (name|content)=['"]?([^'"]+)['"]?/if(lc($1)eq"name"){$hash{name}=$2;$hash{content}=$4}else{$hash{name}=$4;$hash{content}=$2}/ie) {
    print $hash{name} . "\t" . $hash{content} . "\n"; 
  }
}

__DATA__
<META Name="name1" Content='Everything you wanted to know about stamps, from prices to history.'>
blah blah blah
<META Content="Everything you wanted to know about stamps, from prices to history." Name='name2'>
blah blah
<META NAME='name3' CONTENT="stamps, stamp collecting, stamp history, prices, stamps for sale">
blah blah blah blah
<META CONTENT='stamps, stamp collecting, stamp history, prices, stamps for sale' NAME="name4">


Kind Regards
Duncan
 
The solution above is far from perfect, but at least the order of name and content no longer matters.

It could be adapted to be more flexible.


Kind Regards
Duncan
 
this is a bit better:-

Code:
#!/usr/bin/perl

undef $/;
$_ = <DATA>;
$/ = "\n";

@metatags = m/(<meta[^>]+>)/ig;

foreach (@metatags) {
  if (s/(name|content)=['"]?([^'"]+)['"]? (name|content)=['"]?([^'"]+)['"]?/$hash{lc($1)}=$2;$hash{lc($3)}=$4/ie) {
    print $hash{name} . "\t" . $hash{content} . "\n"; 
  }
}

__DATA__
<META Name="name1" Content='Everything you wanted to know about stamps, from prices to history.'>
blah blah blah
<META Content="Everything you wanted to know about stamps, from prices to history." Name='name2'>
blah blah
<META NAME='name3' CONTENT="stamps, stamp collecting, stamp history, prices, stamps for sale">
blah blah blah blah
<META CONTENT='stamps, stamp collecting, stamp history, prices, stamps for sale' NAME="name4">


Kind Regards
Duncan
 
Hi

Much better Duncan, I like the idea. But there are... details to solve ( I hope this was not rude ) :
[ul]
[li][tt][highlight #eeeeff]<meta name="n" [red]x="x"[/red] content="c">[/highlight][/tt] - anything else between the two captured attributes[/li]
[li][tt][highlight #eeeeff]<meta name="n" content="c[red]'[/red]mon">[/highlight][/tt] - single quote in the value[/li]
[li][tt][highlight #eeeeff]<meta name="n"
content="c">[/highlight][/tt] - wrapped to multiple lines[/li]
[/ul]


Feherke.
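
The three cases listed above can be covered by reading the attributes one at a time instead of matching name and content as a pair. This is only a sketch, not a replacement for a real parser: the sub name meta_attrs is hypothetical, and comments or scripts that happen to contain text looking like a meta tag would still fool it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref mapping lower-cased attribute
# names to their values for one <meta ...> tag.
sub meta_attrs {
    my ($tag) = @_;
    my %attr;
    # One attribute per iteration: a name, '=', then either a quoted value
    # (closed by the same quote that opened it) or a bare unquoted value.
    while ($tag =~ /([\w:-]+)\s*=\s*(?:(["'])(.*?)\2|([^\s"'>]+))/gs) {
        $attr{lc $1} = defined $3 ? $3 : $4;
    }
    return \%attr;
}

for my $tag (
    q{<meta name="n" x="x" content="c">},    # extra attribute in between
    q{<meta name="n" content="c'mon">},      # single quote inside the value
    qq{<meta name="n"\ncontent="c">},        # wrapped to multiple lines
) {
    my $attr = meta_attrs($tag);
    print "$attr->{name}\t$attr->{content}\n";
}
```

Because every attribute lands in the hash, extra ones such as lang are available as well, and the order of name and content stops mattering.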
 