reg.exp. *? bug?

domster · May 24, 2004

Hi, I've been having trouble extracting elements from XML-ish files using regular expressions. Say I want to extract all <dat1> elements from the following data:

<thing><dat1>This</dat1><dat2>is</dat2><dat1>a</dat1><dat3>bad</dat3><dat1>format</dat1></thing>

I use the code:

while ($thing =~ s/<dat1>.*?<\/dat1>/g) {
[do stuff with dat1 element]
}

I know, in the example above, I could use [^<]* but in some cases, there are nested tags in the elements I'm extracting, so I use .*?. However, for some reason, only the final element is located using this code. Is this a bug in Perl? (I'm using ActiveState Perl v5.8.3) Anyone else discovered this and found a good way around it? Thanks for all your time.

rharsh · May 24, 2004

First, I should say there are lots of modules that will parse XML for you. But this version of your code seems to work.

Code:

my $str = '<thing><dat1>This</dat1><dat2>is</dat2><dat1>a</dat1>';
$str .= '<dat3>bad</dat3><dat1>format</dat1></thing>';

while ($str =~ m/<dat1>.*?<\/dat1>/gc) {
	print $&, "\n";
}

If you want just the stuff between the <dat1> and </dat1> tags, put () around the .*? and change $& to $1.
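A minimal sketch of that change, using the sample string from this thread. The array name @contents is only for illustration:

```perl
use strict;
use warnings;

my $str = '<thing><dat1>This</dat1><dat2>is</dat2><dat1>a</dat1>'
        . '<dat3>bad</dat3><dat1>format</dat1></thing>';

# Collect only the text between <dat1> and </dat1>.
my @contents;
while ($str =~ m/<dat1>(.*?)<\/dat1>/g) {
    push @contents, $1;    # $1 holds what the parentheses captured
}

print join("\n", @contents), "\n";
```

This should print This, a and format on separate lines, without touching $&.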

domster · May 24, 2004

Thanks, that seems to do the trick! You need the m/ operator and the c at the end of the regex. Weird! By the way, I know there are modules for parsing XML, but as this isn't well-formed XML, I haven't found one that would be appropriate - unless you can recommend one?
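For what it's worth, the /c modifier is probably not the part doing the work here. As far as I can tell, the difference is m//g versus s///g: a match with /g in scalar context finds one element per pass and resumes after it, while a substitution with /g handles every element in a single evaluation, which may be why only the last one seemed to be found. A rough sketch on the thread's sample data (the names @found, $copy and $count are illustrative):

```perl
use strict;
use warnings;

my $str = '<thing><dat1>This</dat1><dat2>is</dat2><dat1>a</dat1>'
        . '<dat3>bad</dat3><dat1>format</dat1></thing>';

# m//g in scalar context: one match per pass, resuming after the previous
# match. Note there is no /c modifier here.
my @found;
while ($str =~ m/<dat1>(.*?)<\/dat1>/g) {
    push @found, $1;
}
print scalar(@found), " elements via m//g: @found\n";

# s///g performs every substitution in one evaluation and returns the count,
# so a while loop around it gets only one useful pass.
my $copy  = $str;
my $count = ($copy =~ s/<dat1>.*?<\/dat1>//g);
print "$count substitutions in one pass, leaving: $copy\n";
```

The /c modifier only changes what happens to the search position after a failed match, so it matters mainly when several patterns are run against the same string in turn.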

ishnid · May 24, 2004

You shouldn't be using $& if at all possible. In this situation, it's perfectly avoidable by, as rharsh suggests, putting parens around the .*? and using $1. From perlvar:

$&
The use of this variable anywhere in a program
imposes a considerable performance penalty on all
regular expression matches. See BUGS.
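If the whole matched element is wanted rather than just its contents, one way to get it without $& is to take the match offsets from the built-in @- and @+ arrays. This is only a sketch on the thread's sample data, and the array name @elements is illustrative:

```perl
use strict;
use warnings;

my $str = '<thing><dat1>This</dat1><dat2>is</dat2><dat1>a</dat1>'
        . '<dat3>bad</dat3><dat1>format</dat1></thing>';

my @elements;
while ($str =~ m/<dat1>.*?<\/dat1>/g) {
    # $-[0] is the start offset of the whole match, $+[0] is its end offset.
    push @elements, substr($str, $-[0], $+[0] - $-[0]);
}

print "$_\n" for @elements;
```

Wrapping the whole pattern in parentheses and reading $1 gives the same text and is usually easier to read; the offsets are mostly useful when the positions themselves are needed.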

duncdude · May 24, 2004

Code:

my $str = '<thing><dat1>This</dat1><dat2>is</dat2><dat1>a</dat1><dat3>bad</dat3><dat1>format</dat1></thing>';

my @matches = $str =~ m/(<dat1>[^<]*<\/dat1>)/g;

print join ("\n", @matches);

Kind Regards
Duncan

domster · May 25, 2004

Yes, but no good if there are any other tags nested in the required data.
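To illustrate the difference, here is a small sketch. The sample string is made up for the purpose (the nested <em> tag is not from the real data), and the variable names are illustrative:

```perl
use strict;
use warnings;

# Hypothetical sample: the second <dat1> contains a nested <em> tag.
my $str = '<thing><dat1>This</dat1>'
        . '<dat1>has <em>nested</em> tags</dat1></thing>';

# [^<]* cannot cross any '<', so the element with a nested tag is skipped.
my @strict_hits = $str =~ m/<dat1>([^<]*)<\/dat1>/g;

# .*? grows one character at a time until the first </dat1> is reached.
my @lazy_hits = $str =~ m/<dat1>(.*?)<\/dat1>/g;

print scalar(@strict_hits), " match(es) with [^<]*\n";
print scalar(@lazy_hits),   " match(es) with .*?\n";
```

This should report 1 match for [^<]* and 2 for .*?. If an element can span several lines, the /s modifier would also be needed so that . matches newlines.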

icrf · May 28, 2004

I thought that somewhere in the 5.8.x chain, the penalty of using $& was diminished or removed. $' and $` are still bad, though.

________________________________________
Andrew

duncdude · May 29, 2004

Code:

my $str = '<thing><dat1>This</dat1><dat2>is</dat2><dat1><dat2><dat3>a</dat3></dat2></dat1><dat3>bad</dat3><dat1>format</dat1></thing>';

while ($str =~ m|(<dat(1)>.*?</dat\2>)|g) {
  print "$1\n";
}

Kind Regards
Duncan
