substr versus regex

TrojanWarBlade · Sep 15, 2005

Anyone fancy a heated debate?
Well I thought I'd warm the place up a bit! ;-)
I've noticed that some of you guys like to use substr instead of regexes wherever you can. I guess this is generally a performance issue and in this respect I agree.
My issue here is that I think speed is rarely that important or critical in these situations, and I think that regexes are easier for people to understand.
Did I just say that?

Yep!
I don't mean that regexes themselves are easier to understand; I mean that people are used to seeing a regex alter a scalar. In many of the cases I've seen here, "substr" has been used to alter scalars instead.
The problem is that most people don't realise that substr can alter scalars. The general perception is that substr only returns a section of a scalar, an idea probably picked up from other languages where that is indeed the case.
The problem as I see it is that for a (possibly modest and maybe even irrelevant) performance gain you risk creating code that is far less maintainable, not because it's bad but because it is subject to a common misconception.
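
As a rough sketch of the misconception described above (the sample string and variable names are made up for illustration): the three-argument form of substr only reads, the four-argument form modifies the scalar in place, and a substitution does the same job in a form most readers recognise as a modification.

```perl
use strict;
use warnings;

my $string = 'Hello, World';

# Three-argument substr only reads; $string is left untouched.
my $piece = substr($string, 7, 5);

# Four-argument substr replaces the selected span in place
# and returns the text that was replaced.
my $old = substr($string, 7, 5, 'Perl');

# The regex equivalent: skip 7 characters, replace the next 5.
my $copy = 'Hello, World';
$copy =~ s/^(.{7}).{5}/${1}Perl/;

print "$piece | $old | $string | $copy\n";
```

Both $string and $copy end up holding the same text; the difference is only in how obvious the modification is to someone reading the code.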

There, I've said it. I've started the fire so let's see how many of you guys want to fan the flames!

What do you think? Agree? Disagree? We'll see.

Trojan.

duncdude · Sep 15, 2005

I agree. Often I have seen a post where someone asks a question that can be solved, in part, by using either substr or a regex. Almost always it becomes a kind of necessity to use substr if at all possible, with the emphasis on speed over a regex. Very rarely is the data so vast as to require such optimised code (in fact it is usually a few hundred records). Even split gets a hard time, and that can be immensely useful!

And on top of that:-

1) If the OP required blistering performance - why would you use an interpreted language?
2) Machines are so powerful these days that what once would be a speed issue - is unlikely to be so today

Kind Regards
Duncan

TrojanWarBlade · Sep 15, 2005

Duncdude!
You agreed with me!
I was trying to start a fire here and now it's fizzling out! ;-)
Don't worry, I'm sure someone will want to put me in my place.

Trojan.

ishnid · Sep 15, 2005

IMO the decision of which to use is down to whether you're looking for *patterns* of things or *numbers* of things. If you're looking for the "fifth to the seventh character" regardless of what they are, use substr. If the sixth character needs to be numeric, then a regexp is vastly preferable to nested substrs.
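
A small sketch of that distinction, using a made-up sample record: substr when only the position matters, a regexp when the sixth character also has to be a digit.

```perl
use strict;
use warnings;

my $record = 'ABCDE7GHIJ';

# Position only: the fifth to seventh characters, whatever they are.
my $chars = substr($record, 4, 3);

# Pattern: the same positions, but the sixth character must be a digit.
my $matched;
if ($record =~ /^.{4}(.\d.)/) {
    $matched = $1;
}

print "$chars\n";
print defined $matched ? "$matched\n" : "no match\n";
```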

I prefer not to use complex combinations of indexes, rindexes and substrs if I can avoid it. If I want to insert into a string "after the second occurrence of a 'b' character" (for example), I'll use a single regexp rather than using a couple of indexes to find where the 'b' is and then using substr. On the other hand, if I want to insert into a string "at position 4", it'll be substr. Of course, anything you can do with a substr can be achieved using a regexp. I personally find that a single substr is more elegant than a regexp, and the performance gain is no harm either.
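
To make the comparison concrete, here is one possible sketch of the three approaches; the sample string and the '-X-' marker are assumptions chosen purely for illustration.

```perl
use strict;
use warnings;

# Insert after the second 'b' with a single regexp.
my $by_pattern = 'abcabcabc';
$by_pattern =~ s/^((?:[^b]*b){2})/$1-X-/;

# The same insertion with two index calls followed by substr.
my $by_index = 'abcabcabc';
my $first    = index($by_index, 'b');
my $second   = index($by_index, 'b', $first + 1);
substr($by_index, $second + 1, 0) = '-X-' if $second >= 0;

# Insert at a fixed position: a zero-length substr is enough.
my $by_position = 'abcabcabc';
substr($by_position, 4, 0) = '-X-';

print "$by_pattern\n$by_index\n$by_position\n";
```

The first two produce the same string; the third inserts at offset 4 regardless of what characters are there.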

On the issue of the misconception that substr can't be used to alter a string, I'd have two suggestions.

1) Make it clear that a replacement is happening by using the lvalue version of substr:

Code:

substr( $string, 3, 4 ) = 'the string to insert';

2) If the misconception is a result of using other languages, then the person subject to the misconception probably isn't particularly familiar with regexps anyway, since Perl programmers tend to rely on them far more than programmers in other languages (for instance, I *hate* using regexps in Java because their implementation is so frustrating, compared to Perl's).

Of course, that's all IMO - you can use whichever you like, really.

TrojanWarBlade · Sep 15, 2005

I think one of the common dangers with substr is when it's used for breaking up a record.
I've seen many examples where you have a record of fixed width fields:
aaaaaaaabbbbccccccddddeeeeeeeeff
The field widths would then be as follows:
8|4|6|4|8|2
and the field selection code would often be substrs with absolute positions:

Code:

my $field1 = substr($_,0,8);
my $field2 = substr($_,8,4);
my $field3 = substr($_,12,6);
my $field4 = substr($_,18,4);
my $field5 = substr($_,22,8);
my $field6 = substr($_,30,2);

Now if the structure of that record changes and a field is inserted (or indeed removed or altered in length), all the offset values must be changed accurately.
If, on the other hand, we used "unpack" we could use field widths and then only one value would need to be changed.

Code:

my (@fields) = unpack("a8 a4 a6 a4 a8 a2", $_);

Does this make any sense?
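
One way to take the maintenance point a step further, as a sketch only: keep the field widths in a single list and build the unpack template from it, so a layout change means editing one number. The @widths array and $template variable are illustrative names, not part of the code above.

```perl
use strict;
use warnings;

my $record = 'aaaaaaaabbbbccccccddddeeeeeeeeff';

# Field widths live in one place; no absolute offsets anywhere.
my @widths = (8, 4, 6, 4, 8, 2);

# Build the template 'a8 a4 a6 a4 a8 a2' from the widths.
my $template = join ' ', map { "a$_" } @widths;

my @fields = unpack($template, $record);

print join('|', @fields), "\n";
```

Inserting, removing or resizing a field then only touches @widths, whereas the substr version needs every later offset recalculated by hand.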

BTW: with respect to Ishnid's point 2, many people have progressed through more than one language in their lives and I bet you're one of them! Consequently it's not unusual to find people who consider themselves adequate developers and who would still have this misconception.
I have taught many people and come across this problem time and time again, and these people are often perfectly happy with simple regexes since they used them in sed and awk, for example. (Sorry Ishnid, no offence intended here, it's just one of my experiences.)

Trojan.
