reg.exp. *? bug?

Status
Not open for further replies.

domster

Programmer
Oct 23, 2003
30
GB
Hi, I've been having trouble extracting elements from XML-ish files using regular expressions. Say I want to extract all <dat1> elements from the following data:

<thing><dat1>This</dat1><dat2>is</dat2><dat1>a</dat1><dat3>bad</dat3><dat1>format</dat1></thing>

I use the code:

while ($thing =~ s/<dat1>.*?<\/dat1>/g) {
    # do stuff with the dat1 element
}

I know, in the example above, I could use [^<]* but in some cases, there are nested tags in the elements I'm extracting, so I use .*?. However, for some reason, only the final element is located using this code. Is this a bug in Perl? (I'm using ActiveState Perl v5.8.3) Anyone else discovered this and found a good way around it? Thanks for all your time.
 
First, I should say there are lots of modules that will parse XML for you. That said, this version of your code seems to work.

Code:
my $str = '<thing><dat1>This</dat1><dat2>is</dat2><dat1>a</dat1>';
$str .= '<dat3>bad</dat3><dat1>format</dat1></thing>';

while ($str =~ m/<dat1>.*?<\/dat1>/gc) {
	print $&, "\n";
}

If you want just the stuff between the <dat1> and </dat1> tags, put () around the .*? and change $& to $1.
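For what it's worth, here is a minimal sketch of that change, using the same sample string. The sub name dat1_contents is made up for illustration; only the parentheses and $1 come from the suggestion above.

```perl
use strict;
use warnings;

# Return the text between each <dat1>...</dat1> pair, in document order.
# (dat1_contents is a hypothetical helper name, not from the thread.)
sub dat1_contents {
    my ($str) = @_;
    my @found;
    # /g in scalar context resumes after the previous match on each pass;
    # the parentheses capture only what sits between the tags into $1.
    while ($str =~ m/<dat1>(.*?)<\/dat1>/g) {
        push @found, $1;
    }
    return @found;
}

my $str = '<thing><dat1>This</dat1><dat2>is</dat2><dat1>a</dat1>'
        . '<dat3>bad</dat3><dat1>format</dat1></thing>';

print "$_\n" for dat1_contents($str);
```

If an element can span several lines, adding the /s modifier lets . match newlines as well.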
 
Thanks, that seems to do the trick! You need the m/ operator and the c at the end of the regex. Weird! By the way, I know there are modules for parsing XML, but as this isn't well-formed XML, I haven't found one that would be appropriate - unless you can recommend one?
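A possible explanation, sketched below under one assumption: the original loop was an s///g with an empty replacement (the posted line leaves out the replacement half, so that part is a guess). In that case the behaviour is not a bug, and it is the switch from s/// to m// that matters, not the c modifier.

```perl
use strict;
use warnings;

my $data = '<thing><dat1>This</dat1><dat2>is</dat2><dat1>a</dat1>'
         . '<dat3>bad</dat3><dat1>format</dat1></thing>';

# Assumed form of the original loop: s///g with an empty replacement.
# s///g performs every substitution in a single call and returns the count,
# so the body runs once, after all three elements have already been removed.
my $copy = $data;
my @seen_by_s;
while ($copy =~ s/<dat1>(.*?)<\/dat1>//g) {
    push @seen_by_s, $1;    # $1 reflects the last successful match only
}

# m//g in scalar context matches once per call and remembers its position,
# so the body runs once per element. The /c modifier is not needed for this.
my @seen_by_m;
while ($data =~ m/<dat1>(.*?)<\/dat1>/g) {
    push @seen_by_m, $1;
}

print "s///g loop saw: @seen_by_s\n";
print "m//g loop saw:  @seen_by_m\n";
```

The /c modifier only stops a failed match from resetting the search position, which makes no difference in a loop that ends on the first failure.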
 
You shouldn't be using $& if at all possible. In this situation, it's perfectly avoidable by, as rharsh suggests, putting parens around the .*? and using $1. From perlvar:
$&
The use of this variable anywhere in a program
imposes a considerable performance penalty on all
regular expression matches. See BUGS.
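If the whole matched element is wanted rather than just its contents, one way to get it without $& is to slice it out with the match offsets in @- and @+. A rough sketch, where dat1_elements is a made-up name:

```perl
use strict;
use warnings;

# Return every whole <dat1>...</dat1> element without touching $&.
# (dat1_elements is a hypothetical helper name, not from the thread.)
sub dat1_elements {
    my ($str) = @_;
    my @found;
    while ($str =~ m/<dat1>.*?<\/dat1>/g) {
        # $-[0] and $+[0] are the start and end offsets of the whole match.
        push @found, substr($str, $-[0], $+[0] - $-[0]);
    }
    return @found;
}

print "$_\n" for dat1_elements(
    '<thing><dat1>This</dat1><dat2>is</dat2><dat1>a</dat1>'
  . '<dat3>bad</dat3><dat1>format</dat1></thing>'
);
```

Wrapping the whole pattern in parentheses and reading $1 gives the same result and is usually easier to read.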
 
Code:
my $str = '<thing><dat1>This</dat1><dat2>is</dat2><dat1>a</dat1><dat3>bad</dat3><dat1>format</dat1></thing>';

my @matches = $str =~ m/(<dat1>[^<]*<\/dat1>)/g;

print join ("\n", @matches);


Kind Regards
Duncan
 
Yes, but no good if there are any other tags nested in the required data.
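To make the difference concrete, here is a small comparison on a made-up string in which the second dat1 element wraps a dat2 tag. The sample data is invented for illustration.

```perl
use strict;
use warnings;

# Invented sample: the second <dat1> element contains a nested <dat2> tag.
my $str = '<thing><dat1>This</dat1>'
        . '<dat1><dat2>nested</dat2></dat1>'
        . '<dat1>format</dat1></thing>';

# [^<]* cannot cross the '<' of an inner tag, so the nested element is skipped.
my @negated = $str =~ m/(<dat1>[^<]*<\/dat1>)/g;

# .*? takes as little as possible up to the next </dat1>, inner tags included.
my @lazy = $str =~ m/(<dat1>.*?<\/dat1>)/g;

print scalar(@negated), " matches with [^<]*\n";
print scalar(@lazy),    " matches with .*?\n";
```

Neither pattern copes with a dat1 nested inside another dat1, since .*? stops at the first closing tag it meets.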
 
I thought that somewhere in the 5.8.x chain, the penalty of using $& was diminished or removed. $' and $` are still bad, though.

________________________________________
Andrew
 
Code:
my $str = '<thing><dat1>This</dat1><dat2>is</dat2><dat1><dat2><dat3>a</dat3></dat2></dat1><dat3>bad</dat3><dat1>format</dat1></thing>';

while ($str =~ m|(<dat(1)>.*?</dat\2>)|g) {
  print "$1\n";
}


Kind Regards
Duncan
 