Reading metatags

redstarfcs1 · Sep 21, 2005

How to read metatags without any perl module?

duncdude · Sep 21, 2005

this might help?

Code:

#!/usr/bin/perl

undef $/;
$_ = <DATA>;
$/ = "\n";

@metatags = m/(<meta[^>]+>)/ig;

foreach (@metatags) {
  ($name, $content) = m/name="?([^"]+)"? content="?([^"]+)"?/;
  print "$name\t$content\n";
}

__DATA__
<HEAD>
<TITLE>Stamp Collecting World</TITLE>
<META name="description" content="Everything you wanted to know about stamps, from prices to history.">
<META name="keywords" content="stamps, stamp collecting, stamp history, prices, stamps for sale">
</HEAD>

Kind Regards
Duncan

KevinADC · Sep 22, 2005

I would maybe do it just a little bit differently:

Code:

#!perl
use strict;

$_ = do { local $/; <DATA> };            # slurp everything after __DATA__
my $len = index(lc($_), '</head>');
$_ = substr $_, 0, $len if $len >= 0;    # no </head> found: keep the whole text

my @metatags = m/(<meta[^>]+>)/ig;

foreach (@metatags) {
  my ($name, $content) = m/name="?([^"]+)"? content="?([^"]+)"?/;
  print "$name\t$content\n";
}

__DATA__
<HEAD>
<TITLE>Stamp Collecting World</TITLE>
<META name="description" content="Everything you wanted to know about stamps, from prices to history.">
<META name="keywords" content="stamps, stamp collecting, stamp history, prices, stamps for sale">
</HEAD>
<BODY>
bunch of stuff down here we shouldn't need to parse
</BODY>
</HTML>

There is no real need to check for meta tags after the closing </head> tag. Of course, I would also just use Ruby; it could probably do it with a single character.

feherke · Sep 22, 2005

Hi

Just as a sad conclusion, Duncan, I sucked a lot with such things. Some people prefer to use single quotes in HTML tag attributes, other ones use different cases in attribute names.

<META Name="description" Content='Everything you wanted to know about stamps, from prices to history.'>
<META NAME='keywords' CONTENT="stamps, stamp collecting, stamp history, prices, stamps for sale">

So this usually catches a bit more:

Code:

($sep, $name, $sep, $content) = m/name=(["']?)(.+)\1 content=(["']?)(.+)\3/i;

But it is still useless if name and content are reversed:

<META content='Everything you wanted to know about stamps, from prices to history.' name="description">

And there are cases when a meta tag is wrapped across multiple lines.
It is horrible what people can do to poor HTML...

Feherke.

http://rootshell.be/~feherke/

KevinADC · Sep 22, 2005

all good points feherke!

feherke · Sep 22, 2005

Hi

A question: would it be better to write the regular expression something like this, to exclude the currently used quotation mark from the value?

Code:

($sep, $name, $sep, $content) = m/name=(["']?)([^\1]+)\1 content=(["']?)([^\3]+)\3/i;

This is only theory; it does not work as written. Can someone correct it so that it works properly?

Feherke.

http://rootshell.be/~feherke/
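
On the question above: inside a bracketed character class Perl does not treat \1 as a backreference (it reads it as an octal character escape), so [^\1] cannot mean "anything but the opening quote". One common workaround, sketched here as an illustration and not as anyone's posted solution, is a negative lookahead in front of each character. The helper name name_and_content is made up for this example, and the sketch assumes that both values are quoted and that name comes before content.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns (name, content) for one
# <meta ...> tag, or an empty list if the tag does not match.
sub name_and_content {
    my ($tag) = @_;

    # (?:(?!\1).)+ consumes one character at a time, as long as that
    # character is not the quote captured in group 1 (likewise for group 3).
    # The quotes are mandatory here: with an empty \1 the negative lookahead
    # could never succeed, so unquoted values are not handled.
    if ( $tag =~ m/name=(["'])((?:(?!\1).)+)\1\s+content=(["'])((?:(?!\3).)+)\3/is ) {
        return ( $2, $4 );
    }
    return;
}

my ( $name, $content ) =
    name_and_content(q{<META name="description" content="It's my site">});
print "$name\t$content\n";
```

With the tag shown, this prints "description" and "It's my site" separated by a tab, so the apostrophe inside the double-quoted value survives. It still does nothing for reversed attributes or for extra attributes between name and content.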

rharsh · Sep 22, 2005

This is really ugly, but it works with the data you provided and it doesn't matter which order the name and content fields are provided.

Code:

local $/;    # slurp mode; local restores $/ when the enclosing scope ends
my ($text) = <DATA>;
my ($headers) = $text =~ m[<head>(.*)</head>]is;

while ($headers =~ m!<META\s+(\w+)\s*=\s*['"]([^'"]+)['"]\s+(\w+)\s*=\s*['"]([^'"]+)['"]!isg) {
    print "$2 - $4", "\n";
}

feherke · Sep 22, 2005

Hi

Yes, rharsh, I was thinking of something similar. But only the quote actually used should be excluded from the value:

<META name="description" content="It's my site">

And meta tags can have more than two attributes:

<META lang="en" name="description" content="It's my site">
<META lang="it" name="description" content="E il mio sito">

Sorry if this seems to be going off-topic, but I'm really interested in a bulletproof solution (without character-by-character parsing).

Feherke.

http://rootshell.be/~feherke/

ishnid · Sep 22, 2005

The most reliable way to do this would be to use a proper tag-aware parser such as HTML::TokeParser::Simple. Here's a quick example to extract the attributes from <meta> tags:

Code:

#!/usr/bin/perl -w
use strict;
use HTML::TokeParser::Simple;

# create a new parser
my $p = HTML::TokeParser::Simple->new( 'test.html' );

# just want to keep track of how many meta tags we've seen
my $tagno = 1;

# loop through all the tokens in the page
while ( my $t = $p->get_token ) {
   # if we've found a meta tag
   if ( $t->is_start_tag( 'meta' ) ) {

      # print some information
      print 'Tag #'.$tagno++."\n";

      # get the attributes from the tag
      my $attrs = $t->get_attr;

      # print the attributes
      print "$_ - $attrs->{ $_ }\n" for ( keys %$attrs );

      # print a line to distinguish it from the next tag
      print '-' x 30, "\n";
   }
}

duncdude · Sep 22, 2005

Sorry feherke

I don't understand what you mean by:-

Just as a sad conclusion, Duncan, I sucked a lot with such things. Some people prefer to use single quotes in HTML tag attributes, other ones use different cases in attribute names.

Kind Regards
Duncan

KevinADC · Sep 22, 2005

Nitpick, ishnid:

How to read metatags without any perl module?

sorry, just couldn't resist mate! I'll go back to my cell now.

feherke · Sep 23, 2005

Hi

Duncan, your code is a good starting point; something like that was my first try too. But it works well only in aseptic laboratory conditions and may fail when parsing an HTML document written by others. My final conclusion was that it is safer to use two regular expressions.

Feherke.

http://rootshell.be/~feherke/
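
The post above does not show the two regular expressions, so the following is only one possible reading of that idea: a first expression isolates each meta tag, and a second one walks through its attributes. The sub name meta_attributes and the sample HTML are assumptions made for this sketch. It also assumes that no attribute value contains a ">" character, which is exactly the kind of case where a real parser is the safer choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns one hash reference per
# <meta> tag, mapping lower-cased attribute names to their values.
sub meta_attributes {
    my ($html) = @_;
    my @tags;

    # First expression: isolate each <meta ...> tag. [^>] also matches
    # newlines, so a tag wrapped over several lines is still found.
    while ( $html =~ m/<meta\b([^>]*)>/gi ) {
        my $inside = $1;
        my %attr;

        # Second expression: take the attributes one at a time. A value is
        # double-quoted, single-quoted, or a bare word; only the quote that
        # opened the value can close it.
        while ( $inside =~ m/([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/g ) {
            my ($value) = grep { defined $_ } $2, $3, $4;
            $attr{ lc $1 } = $value;
        }
        push @tags, \%attr;
    }
    return @tags;
}

my $html = <<'END_HTML';
<META lang="en" name="description" content="It's my site">
<meta content='stamps, "rare" stamps' NAME=keywords>
<meta name="author"
      content="Duncan">
END_HTML

for my $tag ( meta_attributes($html) ) {
    next unless defined $tag->{name} && defined $tag->{content};
    print "$tag->{name}\t$tag->{content}\n";
}
```

Because every attribute lands in a hash, the order of name and content, extra attributes such as lang, and mixed upper and lower case stop mattering. Tags without both name and content (for example http-equiv ones) are simply skipped by the loop at the end.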

duncdude · Sep 23, 2005

feherke

wow. you are rude.

Kind Regards
Duncan

KevinADC · Sep 23, 2005

I don't see it as rude, just plainly spoken.

ishnid · Sep 23, 2005

Kev - check the usernames - the OP who asked for a non-module solution has disappeared, hence the HTML::TokeParser::Simple recommendation - Gotcha!!

ChrisHunt · Sep 24, 2005

How to read metatags without any perl module?

Use eyes.

-- Chris Hunt
Webmaster & Tragedian
Extra Connections Ltd

duncdude · Sep 24, 2005

redstarfcs1

I have spent a bit more time looking into this for you and have come up with this:-

Code:

#!/usr/bin/perl

undef $/;
$_ = <DATA>;
$/ = "\n";

@metatags = m/(<meta[^>]+>)/ig;

foreach (@metatags) {
  if (s/(name|content)=['"]?([^'"]+)['"]? (name|content)=['"]?([^'"]+)['"]?/if(lc($1)eq"name"){$hash{name}=$2;$hash{content}=$4}else{$hash{name}=$4;$hash{content}=$2}/ie) {
    print $hash{name} . "\t" . $hash{content} . "\n"; 
  }
}

__DATA__
<META Name="name1" Content='Everything you wanted to know about stamps, from prices to history.'>
blah blah blah
<META Content="Everything you wanted to know about stamps, from prices to history." Name='name2'>
blah blah
<META NAME='name3' CONTENT="stamps, stamp collecting, stamp history, prices, stamps for sale">
blah blah blah blah
<META CONTENT='stamps, stamp collecting, stamp history, prices, stamps for sale' NAME="name4">

Kind Regards
Duncan

duncdude · Sep 24, 2005

The solution above is far from perfect, but at least the order of the name and content attributes no longer matters.

It could be adapted to be more flexible.

Kind Regards
Duncan

duncdude · Sep 24, 2005

this is a bit better:-

Code:

#!/usr/bin/perl

undef $/;
$_ = <DATA>;
$/ = "\n";

@metatags = m/(<meta[^>]+>)/ig;

foreach (@metatags) {
  if (s/(name|content)=['"]?([^'"]+)['"]? (name|content)=['"]?([^'"]+)['"]?/$hash{lc($1)}=$2;$hash{lc($3)}=$4/ie) {
    print $hash{name} . "\t" . $hash{content} . "\n"; 
  }
}

__DATA__
<META Name="name1" Content='Everything you wanted to know about stamps, from prices to history.'>
blah blah blah
<META Content="Everything you wanted to know about stamps, from prices to history." Name='name2'>
blah blah
<META NAME='name3' CONTENT="stamps, stamp collecting, stamp history, prices, stamps for sale">
blah blah blah blah
<META CONTENT='stamps, stamp collecting, stamp history, prices, stamps for sale' NAME="name4">

Kind Regards
Duncan

feherke · Sep 24, 2005

Hi

Much better, Duncan, I like the idea. But there are... details to solve (I hope this was not rude):

- <meta name="n" x="x" content="c"> - anything else between the two captured attributes
- <meta name="n" content="c'mon"> - a single quote in the value
- <meta name="n"
  content="c"> - wrapped across multiple lines

Feherke.

http://rootshell.be/~feherke/
