Regex confuses me ... 5

tbohon · Aug 22, 2007

Environment is AIX v5.2 with a standard installation of Perl. System engineers and corporate security refuse to allow me to add any modules beyond the basic, standard installation we currently have.

With that out of the way, I was handed - literally on my way out the door today - a 'request' to provide a Perl script by COB tomorrow which will edit an extremely large delimited file. Breaking it apart into segments is no problem and the specs were just emailed to me here at home. However, on looking those specs over I see what I think is a need for some regex ... which means I'm in trouble. Every time in the past that I thought I understood regex, I was proven wrong ... and with the looming deadline, I don't have time to experiment.

Among the requirements for editing are these 4 - the only ones that I don't know how to do:

Determine if a field is all numeric

Determine if a field contains only numbers and spaces (in any order)

Determine if a field contains letters, spaces, periods and commas only (in any order)

Determine if a field contains only numbers and dashes (in a fixed pattern, e.g., SSN, phone number, etc.)

These are demographic fields on company employees, fields such as telephone number (xxx-xxx-xxxx), SSN (xxx-xx-xxxx), name (with Jr., Dr., etc.) and so forth.

I can certainly brute force this by checking each character of each string and looking at patterns, etc. but I'm sure there is a much faster way to do it using regex ... and that's what I need.

However, rather than have someone just provide the code, I'd really like to learn as I go, so I would greatly appreciate an explanation of what's happening within each statement so I can add it to my knowledge base.

Any assistance is greatly appreciated and, as always, thanks in advance for your assistance.

Best,

Tom

"My mind is like a steel whatchamacallit ...

brigmar · Aug 22, 2007

All numeric can be either:
/^\d+$/ or /^\d*$/

^ represents start of the string (your field or record)
\d represents a numeric match (0-9)
+ represents 1 or more (meaning you must have at least one number)
* represents 0 or more (meaning you can allow zero numbers)
$ represents end of the string (it also matches just before a trailing newline, so chomp your input first)

Surrounding the regex with ^ and $ means your regex is 'anchored' against both the start and end of the record.

/\d+/ matches against 'words 1 words'
/^\d+/ won't, but will match '5 gold rings'
/\d+$/ won't match either of the above, but will match 'catch 22'
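
To see the anchors at work, here is a small sketch that runs the three patterns above against the three sample strings. The variable and hash key names are made up for illustration.

```perl
use strict;
use warnings;

# Sample strings from the explanation above.
my @samples = ('words 1 words', '5 gold rings', 'catch 22');

# Record which samples each pattern matches.
my %matched = (
    unanchored => [ grep { /\d+/  } @samples ],  # digits anywhere
    at_start   => [ grep { /^\d+/ } @samples ],  # digits at the start
    at_end     => [ grep { /\d+$/ } @samples ],  # digits at the end
);

for my $name (sort keys %matched) {
    print "$name: ", join(' | ', @{ $matched{$name} }), "\n";
}
```

The unanchored pattern matches all three strings, while each anchored pattern matches only the string that has digits at that end.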

All numbers or spaces:
/^[0-9 ]+$/ (you can again replace the + with a * if you want to match an empty string)

[] represents a character set.
[0-9] is the equivalent of \d
[0-9 ] means number or space
[0-9 ]+ means one or more numbers or spaces

letters, spaces, periods and commas only
/^[A-Za-z .,]+$/

A-Z = Uppercase letter
a-z = Lowercase letter

Pretty self-explanatory
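
As a rough sketch, the three patterns so far could be wrapped in small helper subs like this. The sub names and the sample fields are hypothetical, not part of any spec.

```perl
use strict;
use warnings;

# Hypothetical helpers; each returns 1 when the whole field matches, else 0.
sub is_numeric       { return $_[0] =~ /^\d+$/          ? 1 : 0 }
sub is_digits_spaces { return $_[0] =~ /^[0-9 ]+$/      ? 1 : 0 }
sub is_name_chars    { return $_[0] =~ /^[A-Za-z .,]+$/ ? 1 : 0 }

for my $field ('12345', '12 345', 'Smith, John Jr.', "O'Brien") {
    printf "%-18s numeric=%d digits/spaces=%d name=%d\n", "'$field'",
        is_numeric($field), is_digits_spaces($field), is_name_chars($field);
}
```

Note that "O'Brien" fails the name check because the apostrophe is not in the character set. If real names in the file contain apostrophes or hyphens, those characters would have to be added to the set.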

field contains only numbers and dashes (in a fixed pattern, e.g., SSN, phone number, etc.)

Phone numbers can also contain parentheses for the area code, a '+' for international dialing, and spacing between number groups. They can also have separators that are not dashes (my company uses periods in their format). Oh, and there are also extensions and letter-number substitutions (e.g. 1-888-GETALL3).

But that aside...
SSN = /^\d{3}-\d{2}-\d{4}$/
{3} means 3 of whatever came before (3 digits in this case)
So you can read this as 3 digits, a hyphen, 2 digits, a hyphen, 4 digits

It can also be represented by
/^[0-9]{3}-[0-9]{2}-[0-9]{4}$/
or /^\d\d\d-\d\d-\d\d\d\d$/
or even /^[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]$/

Sometimes, people don't bother with the hyphen as a separator, so you'd want to make the hyphen optional.
You can do that by placing a ? after each hyphen.
The ? is used in the same context as + or * were previously
? = Match zero or one time
* = Match zero or more times
+ = Match one or more times
{n} = Match exactly n times
{n1,n2} = Match between n1 and n2 times
{n1,} = Match at least n1 times ( {1,} is equivalent to + )
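{0,n2} = Match at most n2 times (write the 0 explicitly; older Perls, including 5.8, treat {,n2} as literal text rather than a quantifier)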

so SSN = /^\d{3}-?\d{2}-?\d{4}$/
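
Here is a minimal sketch of what the optional-hyphen pattern accepts and rejects. The test strings are invented for illustration.

```perl
use strict;
use warnings;

# Each hyphen is followed by ?, so it may appear zero or one time.
my $ssn_re = qr/^\d{3}-?\d{2}-?\d{4}$/;

my %result;
for my $candidate ('123-45-6789', '123456789', '123-456789',
                   '12-345-6789', '123--45-6789') {
    $result{$candidate} = $candidate =~ $ssn_re ? 'ok' : 'rejected';
    printf "%-14s %s\n", $candidate, $result{$candidate};
}
```

The first three are accepted and the last two rejected. Because each hyphen is optional on its own, a mixed form such as 123-456789 also passes; if that is not acceptable, test the hyphenated and unhyphenated forms as two separate patterns.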

You can apply the same logic to phone numbers, which are generally in a 3-3-4 pattern

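Basic Phone/Fax = /^\d{3}[ .-]?\d{3}[ .-]?\d{4}$/

This allows the group separator to be either space, period or hyphen, and to be optional. Keep the hyphen first or last inside the brackets: written as [ -.] it would mean the range of characters from space to period, which also lets through things like + and #.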

What about those that surround the area code with parentheses? We use alternation, which, confusingly enough, also uses parentheses

/(6|six) geese a laying/
will match '6 geese a laying' and 'six geese a laying'
() represents your group of options
each option is separated by the vertical line | (think 'OR')

As parentheses are characters that have special meaning in regexes (just like {}+*? previously used), they have to be 'escaped' by prefixing them with a \ when you want to match them literally

/^(\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}$/

The outer, unescaped parentheses are part of the alternation
There are two options in the alternation
Option 1 is \(\d{3}\)
Option 2 is \d{3} (which is as you were before)

Looking at option 1, the backslashes in \( and \) mean to match those characters as opposed to interpreting them as a group (like we have done with the outer parentheses).

I haven't gone into the '1-' prefix used for long-distance or the '+xx' prefix used for international dialling, and of course, phone numbers in different countries are in different formats, but this should be good to get you going.
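
A short sketch of the phone pattern with the optional parenthesized area code. The sample numbers are made up. The literal parentheses are escaped as \( and \), and the hyphen sits last in the separator set so that it is read as a plain hyphen rather than as a range.

```perl
use strict;
use warnings;

# Area code is either (ddd) or ddd; separators are optional space, period or hyphen.
my $phone_re = qr/^(\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}$/;

my @candidates = (
    '555-123-4567', '(555) 123-4567', '555.123.4567', '5551234567',
    '(555-123-4567', '555+123+4567',
);

my %verdict;
for my $phone (@candidates) {
    $verdict{$phone} = $phone =~ $phone_re ? 'ok' : 'rejected';
    printf "%-16s %s\n", $phone, $verdict{$phone};
}
```

The first four are accepted. '(555-123-4567' is rejected because the opening parenthesis has no partner, and '555+123+4567' is rejected because + is not one of the allowed separators.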

KevinADC · Aug 23, 2007

That deserves at least two stars [smile]

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]

tbohon · Aug 23, 2007

I agree on the two stars - what a FANTASTIC explanation of what would otherwise have been something I could use but would never understand.

THANK YOU brigmar - I appreciate it beyond words!

Best,

Tom

"My mind is like a steel whatchamacallit ...

tbohon · Aug 23, 2007

OK - one more 'little' problem.

I'm testing the strings and the comparisons aren't working. I'm guessing it's the !=~ but am not sure.

Here's the code and the output:

Code:

#!/bin/perl

$allnums = "123456789";
$ssn = "123-45-6789";

if ( $allnums !=~ /^\d+$/ )
{
    print "Error:  $allnums is not an all numeric field\n";
}

if ( $ssn !=~ /^\d{3}-\d{2}-\d{4}$/ )
{
    print "Error:  $ssn is not in a valid format\n";
}

output:

$ t.pl
Error:  123456789 is not an all numeric field
Error:  123-45-6789 is not in a valid format

What I'm trying to do is take action only if the field doesn't meet the pattern specification, i.e., if it's OK I'll move on to the next field without doing anything. I could, of course, have an empty 'if true' block but that offends my sensibilities ...

What am I doing wrong here?

Thanks again.

Tom

"My mind is like a steel whatchamacallit ...

brigmar · Aug 23, 2007

Use !~
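
For the record, Perl has no !=~ operator. It reads those three characters as != (numeric not-equal) followed by ~ (bitwise complement), so the regex is matched against $_ instead of your variable, and your variable is then compared numerically with the complemented result. That comparison comes out true here, which is why both errors were printed. A minimal sketch of the same test with !~ (the @errors array and the third check are additions for illustration):

```perl
use strict;
use warnings;

my $allnums = "123456789";
my $ssn     = "123-45-6789";
my @errors;

# !~ is true when the pattern does NOT match.
push @errors, "$allnums is not an all numeric field"
    if $allnums !~ /^\d+$/;
push @errors, "$ssn is not in a valid format"
    if $ssn !~ /^\d{3}-\d{2}-\d{4}$/;

# A deliberately failing check, to show the error path still works.
push @errors, "$ssn is not an all numeric field"
    if $ssn !~ /^\d+$/;

print "Error:  $_\n" for @errors;
```

Only the last check fails, so the script prints a single error line for 123-45-6789 not being all numeric.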

tbohon · Aug 23, 2007

Duh! I feel like the current series of Hyundai commercials ... the 'DUH Sale' ones ...

Thanks again.

Tom

"My mind is like a steel whatchamacallit ...

tbohon · Aug 23, 2007

brigmar:

One final question if I may ...

I'm currently using the following to trim leading and trailing spaces from a string (leaving embedded spaces within the string as they are):

Code:

    $field[0] =~ s/^[ ]+// ;
    $field[0] =~ s/[ ]+$// ;

Is there some way to combine this into one statement? Obviously what I have works and I'm headed forward with that but, now that I'm excited about regexes, I'd like to continue to learn!

Thanks again for all of the help - you've been a lifesaver today.

Best,

Tom

"My mind is like a steel whatchamacallit ...

KevinADC · Aug 23, 2007

Is there some way to combine this into one statement?

There is:

Code:

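$field[0] =~ s/^\s+|\s+$//g;

but don't do it like that: using one regexp in this case is actually less efficient than using two. (Note the /g on the end; without it the substitution stops after the first match, so only the leading whitespace would be removed.) Use \s+ instead of [ ]+ to remove leading and trailing blank spaces. It's clearer to read, and [ ] is not equivalent to \s: [ ] matches only a literal space, while \s also matches tabs and newlines. So do it like this: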

Code:

$field[0] =~ s/^\s+//;
$field[0] =~ s/\s+$//;
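
A small sketch comparing the two approaches on a sample string. The sub names are hypothetical. The one-statement version carries the /g modifier, since a substitution without it stops after its first match.

```perl
use strict;
use warnings;

# Two anchored substitutions, one per end of the string.
sub trim_two_steps {
    my ($s) = @_;
    $s =~ s/^\s+//;   # leading whitespace
    $s =~ s/\s+$//;   # trailing whitespace
    return $s;
}

# One substitution with alternation; /g lets both alternatives match.
sub trim_one_step {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

my $raw = "   Mary had a little lamb.   ";
print "[", trim_two_steps($raw), "]\n";
print "[", trim_one_step($raw), "]\n";
```

Both subs work on a copy, so $raw is left untouched, and both print [Mary had a little lamb.].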

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]

tbohon · Aug 23, 2007

Thanks, Kevin ... efficiency is important, the 'truncated' test file they gave me has 262,350 records in it ... have no idea how large the 'real' file will be.

Appreciate it!

"My mind is like a steel whatchamacallit ...

brigmar · Aug 23, 2007

$field[0] =~ s/^\s*(.*?)\s*$/$1/;

Need explaining?

\s represents whitespace (so will also match on tabs)

Parentheses 'capture' what is inside them into the scalars $1, $2, $3, etc.
The first expression inside parentheses is captured to $1, the second to $2, etc.
The ? after .* makes it non-greedy, so it grabs as little as possible and leaves any trailing whitespace for the final \s* (a plain .* would swallow the trailing spaces as well).
So, we're capturing everything that's not leading or trailing whitespace into $1, and we utilize the 2nd part of the s/// operator... replacing the whole expression with $1
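
Here is a quick sketch of how capture groups and greediness interact when trimming this way. It contrasts a greedy (.*) capture with a non-greedy (.*?) one, and then shows numbered captures on a phone number. All variable names and sample values are for illustration only.

```perl
use strict;
use warnings;

my $greedy = "  Mary had a little lamb.  ";
my $lazy   = $greedy;

$greedy =~ s/^\s*(.*)\s*$/$1/;    # (.*) swallows the trailing spaces too
$lazy   =~ s/^\s*(.*?)\s*$/$1/;   # (.*?) takes as little as the rest allows

print "[$greedy]\n";
print "[$lazy]\n";

# Captures are numbered left to right by their opening parenthesis.
my ($area, $exchange, $line) = ('', '', '');
if ("555-123-4567" =~ /^(\d{3})-(\d{3})-(\d{4})$/) {
    ($area, $exchange, $line) = ($1, $2, $3);
}
print "$area / $exchange / $line\n";
```

The greedy version strips only the leading spaces, because .* runs to the end of the string and leaves nothing for the final \s* to match. The non-greedy version strips both ends.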

Oh, and BTW, I forgot to mention before...

If you're comparing for a non-match, it might make for better readability to use 'unless' instead of 'if'.

Code:

if ( $allnums !~ /^\d+$/ )
{
    print "Error:  $allnums is not an all numeric field\n";
}

can be represented by

Code:

unless ( $allnums =~ /^\d+$/ )
{
    print "Error:  $allnums is not an all numeric field\n";
}

Perl also allows you to put the condition after the statement for more readability (when there is only one statement to run).

So it could be changed again to

Code:

print "Error:  $allnums is not an all numeric field\n" unless ( $allnums =~ /^\d+$/ );

KevinADC · Aug 23, 2007

Using Benchmark shows that two regexps are about 20% faster (on my old computer); your results will vary:

Code:

use Benchmark qw/timethese cmpthese/;

my $string = "    Mary had a little lamb.   ";

my $results = timethese(200000,
        {
            'First' => sub {$string =~ s/^\s*//;$string =~ s/\s*$//;},
            'Second' => sub {$string =~ s/^\s+|\s+$//;}
        },
    );
cmpthese( $results ) ;

------------------------------------------------------------
Core (perl 5.8.8) Modules used:
Benchmark - benchmark running times of Perl code

output:

Code:

Benchmark: timing 200000 iterations of First, Second...
     First:  3 wallclock secs ( 2.48 usr +  0.00 sys =  2.48 CPU) @ 80645.16/s (n=200000)
    Second:  2 wallclock secs ( 3.01 usr +  0.00 sys =  3.01 CPU) @ 66445.18/s (n=200000)
          Rate Second  First
Second 66445/s     --   -18%
First  80645/s    21%     --
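
One caveat with a benchmark of this shape: it reuses a single $string, so after the first iteration there is no whitespace left to strip, and a one-regexp version without /g would not strip the trailing spaces at all. Below is a sketch of a variant that trims a fresh copy on every call and adds /g to the single-regexp version. The sub names are made up, and no timings are claimed here; run it on your own machine to compare.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $template = "    Mary had a little lamb.   ";

sub two_regexps {
    my $s = $template;      # fresh copy, so there is always whitespace to strip
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

sub one_regexp {
    my $s = $template;
    $s =~ s/^\s+|\s+$//g;   # /g needed to strip both ends
    return $s;
}

# Run each sub 200,000 times and print a comparison chart.
cmpthese(200_000, {
    Two => \&two_regexps,
    One => \&one_regexp,
});
```

Copying the string adds the same fixed cost to both subs, so the comparison chart stays meaningful even though the absolute rates will be lower.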

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
