
Counterintuitive Regexp 2

Status
Not open for further replies.

Nebbish

Programmer
Apr 7, 2004
73
US
Hello,

Here is a simple regular expression that isn't working quite like I would expect. I think I know what's going on, but I'm not sure if I like it ;)
Code:
my $string = "/hello/this/is/my/folder/file.txt";
$string =~ /\/(.*?)$/;
print $1;

Notice that I'm grabbing as little as possible with the .*?, and I'm anchoring to the right end of the string. Since it's grabbing as little as possible with a "/" just to the left, I thought this would print just "file.txt". However, it's printing "hello/this/is/my/folder/file.txt".

I think what's happening is it's finding the "/" first, then attempting to find a .*? match after that. Shouldn't Perl keep scanning to see if it can reduce the .*? any more?
 
You are correct on what's happening. Unfortunately it does take a peek at the subtleties of the regexp engine to understand why. I'll try and break it down.

The .*? construct does not force perl to check all the possible matches it can find. If this were the case, regexps would be extremely slow when running on relatively large strings. Rather, it changes the approach the regexp engine takes when trying to find a match. In order to make this example clearer, I'm going to change your regexp a little and try to match another forward slash (rather than the end of the string).

Greedy matching:
$string =~ /\/(.*)\//;

The engine starts at the beginning of the string, looking for the first character it's asked to match (a '/'). It finds it as the first character in the string. It then sees that it must match "as many as possible of anything", followed by another '/'. Since the .* is greedy, it first grabs everything up to the end of the string, then backs up one character at a time looking for a '/' until it hits the slash before file.txt. It has now found a match and returns the result.

Non-greedy matching:
$string =~ /\/(.*?)\//;

Once again, the engine finds the first '/' at the start of the string. This time, however, it's asked to then match "as little as possible of anything", followed by another '/'. Because the match is non-greedy, rather than starting at the end of the string, it continues at the next character (the 'h' in 'hello') and goes forward until it finds a '/'. It finds the one before 'this' and returns that match.

If you apply that logic to your regexp, you'll see why it behaves as it does.
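
To see the three cases side by side, here is a minimal sketch that runs the greedy pattern, the non-greedy pattern, and the original end-anchored pattern against the same string. The variable names ($greedy, $lazy, $anchored) are only for illustration. In every case the match begins at the first '/' in the string; only the way the capture grows or shrinks from there differs.

```perl
use strict;
use warnings;

my $string = "/hello/this/is/my/folder/file.txt";

# Greedy: .* grabs the rest of the string, then backs up to the last '/'.
my ($greedy) = $string =~ /\/(.*)\//;

# Non-greedy: .*? starts empty and grows until the next '/' is reached.
my ($lazy) = $string =~ /\/(.*?)\//;

# Original pattern: the match still begins at the first '/', so .*?
# has to keep growing until $ can match at the end of the string.
my ($anchored) = $string =~ /\/(.*?)$/;

print "greedy:   $greedy\n";      # hello/this/is/my/folder
print "lazy:     $lazy\n";        # hello
print "anchored: $anchored\n";    # hello/this/is/my/folder/file.txt
```

The engine accepts the first starting position at which the whole pattern can succeed, so the non-greedy quantifier never gets a chance to try a later '/'.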

Now you see why some people avoid .* ?

A better regexp for what you're after would state explicitly that the captured part must not contain any '/' characters:
Code:
$string =~ /\/([^\/]*)$/;

Or even better, use the standard File::Basename module:
Code:
use File::Basename;
my $filename = basename($string);
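
As a slightly fuller sketch of the module approach, the example below also uses dirname and fileparse, which File::Basename provides alongside basename. The suffix pattern passed to fileparse is just one reasonable choice for splitting off an extension, not the only one.

```perl
use strict;
use warnings;
use File::Basename qw(basename dirname fileparse);

my $string = "/hello/this/is/my/folder/file.txt";

my $filename = basename($string);    # file.txt
my $dir      = dirname($string);     # /hello/this/is/my/folder

# fileparse splits the path into name, directory, and matched suffix.
my ($name, $path, $suffix) = fileparse($string, qr/\.[^.]*/);

print "filename: $filename\n";
print "dir:      $dir\n";
print "name:     $name, path: $path, suffix: $suffix\n";
```

Unlike a hand-written regexp, the module follows the path conventions of the platform the script is running on.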
 
Ishnid,

Yeah, your solution with the [^\/] is what I ended up doing. Didn't know there was a module available for such things, though...good to know.

Thanks for the rundown on how it works. I didn't realize lazy and greedy just described where the search started.

Nebbish
 
Since the regex is specifically looking for "no forward slashes", it looks neater with pipes as the delimiters at the start and end of the match (as opposed to /, which would require escaping every slash in the pattern):

my $string = "/hello/this/is/my/folder/file.txt";
$string =~ m|([^/]+)$|;
print $1;
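
The two negated-class patterns in this thread are not quite interchangeable. The following sketch compares them on a few inputs; the helper names with_slash and last_segment are made up for the example. The first pattern requires a '/' before the capture, while the second only requires at least one non-slash character at the end of the string.

```perl
use strict;
use warnings;

# Pattern from earlier in the thread: needs a '/' before the capture.
sub with_slash {
    my ($s) = @_;
    return $s =~ /\/([^\/]*)$/ ? $1 : undef;
}

# Pipe-delimited pattern: needs one or more non-slash characters at the end.
sub last_segment {
    my ($s) = @_;
    return $s =~ m|([^/]+)$| ? $1 : undef;
}

for my $path ("/hello/this/is/my/folder/file.txt", "file.txt", "/folder/") {
    my $first  = with_slash($path);
    my $second = last_segment($path);
    printf "%-35s with_slash: %-12s last_segment: %s\n",
        $path,
        defined $first  ? "'$first'"  : "no match",
        defined $second ? "'$second'" : "no match";
}
```

For the full path both return file.txt. For a bare "file.txt" only the pipe-delimited pattern matches, and for "/folder/" the first pattern captures an empty string while the second does not match at all.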



Kind Regards
Duncan
 