Challenge the experts on string handling 2

awingnut · Jul 26, 2006

I have been racking my brain to come up with an efficient way to parse a string. I am hoping those more experienced than me can come up with something. I suspect a regexp is the answer if I can come up with the right template, but I am not a regexp expert.

I need to parse a string that contains a date (which can have a couple of formats), followed by a location which will have spaces and commas. For example:

$string="1510, Hamm, Westphalia, Germany";

or

$string="1 OCT 1961, New York, N.Y.";

I think the key is to find the year reliably, then separate the 2 strings. What I ultimately need is the following:

$date="1510";
$place="Hamm, Westphalia, Germany";

or

$date="1 OCT 1961";
$place="New York, N.Y.";

The only thing that can be reliably expected is a 4-digit year, which ends the date part, followed by a comma if there is a location or no comma if there is no location. Can someone help? TIA.

ishnid · Jul 26, 2006

Given the data you've posted (plus one I've added for when there's no comma, and therefore no location):

Code:

my @strings = (
    "1510, Hamm, Westphalia, Germany",
    "1 OCT 1961, New York, N.Y.",
    'december 4th, 1904',
);

for (@strings) {

    my ( $date, $location ) = /(.*?\d{4}) ?(?:,(.+)$)?/;

    $location ||= '';    # avoids warnings

    print "Date: [ $date ] Location: [ $location ]\n";
}

I should mention that if the date cannot contain a comma, you can best achieve this by using split.
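A minimal sketch of the split approach, assuming the date part never contains a comma (so it would not cope with 'december 4th, 1904'). The helper name split_date_place is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: assumes the date itself never contains a comma.
sub split_date_place {
    my ($string) = @_;

    # A limit of 2 splits only at the first comma, so commas inside
    # the location are left alone. \s* also drops the space after it.
    my ( $date, $place ) = split /,\s*/, $string, 2;

    $place = '' unless defined $place;    # no comma means no location
    return ( $date, $place );
}

for my $string ( "1510, Hamm, Westphalia, Germany", "1 OCT 1961, New York, N.Y.", "NOV 1939" ) {
    my ( $date, $place ) = split_date_place($string);
    print "Date: [ $date ] Location: [ $place ]\n";
}
```

The limit argument is what keeps "Hamm, Westphalia, Germany" in one piece; without it, split would return four fields.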

awingnut · Jul 26, 2006

Thanks. Let me understand this a little before I try it. First, it is not clear to me (I'm not a perl expert) how the regexp knows what to work on. I guess it is a special feature of the for loop, but I'll need it to work on a variable. Anyway, let me test my meager regexp knowledge:

'.*' means 1 character followed by 0 or more characters. How can this work with just a year? I must be wrong.

'?' Not sure what this does but I guess it is somehow looking for the comma?

'\d{4}' 4 decimal characters

The parens makes this one string assigned to $date

'?' Looking for the comma again but I don't know how

The rest is too cryptic probably because I don't understand the '?' in regexp.

ishnid · Jul 26, 2006

Here it is commented, so hopefully it'll be a little clearer. Any questions, just ask!

Code:

my ( $date, $location ) = /
   (        # start first capturing group (date)
     .*?    # any amount of anything (non-greedy)
     \d{4}  # followed by 4 digits (the year, which ends the date)
   )        # close first capturing group
   \s?      # optional space
   (?:      # start non-capturing group
      ,     # a comma
      (     # start second capturing group (location)
        .+  # any amount of anything
      )     # end second capturing group
      $     # second capturing group must be at the end of the string
   )        # end non-capturing group
   ?        # non-capturing group is optional (no location)
/x;         # end the regexp (allowing comments)

Rieekan · Jul 26, 2006

Have a star for the regexp. That's a good one.

- George

awingnut · Jul 26, 2006

That is great. Thanks for taking the time.

I think one thing that confused me is the '.'. I thought that meant at least 1 character but maybe the following '?' fixes that. The trailing '?' means that it is an optional substring? So what is the difference between '.*?' and just '*'?

The '?:' is what makes it non-capturing? Or is it the ':' and the '?' means it can be optional? If the latter, why is it at the start of the group rather than the end like in the first group? (As you can see I am having trouble with the '?' and reading the docs didn't help :-( )

The '$' then captures everything to the end of the string?

Final question. Where does the string that it is working on come from? Is it @_ by default? Suppose the string is in a variable?

Thanks again.

stevexff · Jul 26, 2006

I'll answer the easy one: the regex matches against $_ by default. Probably best to leave the rest to ishnid :-)

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::PerlDesignPatterns)

ishnid · Jul 26, 2006

I'll answer the last question first.

Code:

my ( $date, $location ) = /(.*?\d{4}) ?(?:,(.+)$)?/;

. . . is shorthand for . . . 

my ( $date, $location ) = $_ =~ /(.*?\d{4}) ?(?:,(.+)$)?/;

The regexp is applied to the special $_ variable by default. Inside the for loop, $_ is set to each element in the @strings array in turn.
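For a string held in a named variable, something along these lines should work. It is only a sketch: the wrapper name parse_date_place is made up, and the check for a failed match is an addition that the original loop does not have.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regexp from this thread.
sub parse_date_place {
    my ($string) = @_;

    # =~ binds the regexp to $string rather than the default $_.
    my ( $date, $place ) = $string =~ /(.*?\d{4}) ?(?:,(.+)$)?/;

    return unless defined $date;    # no four-digit year: match failed
    $place = '' unless defined $place;
    return ( $date, $place );
}

my $string = "1 OCT 1961, New York, N.Y.";
if ( my ( $date, $place ) = parse_date_place($string) ) {
    print "Date: [ $date ] Location: [ $place ]\n";
}
else {
    print "No four-digit year found in [ $string ]\n";
}
```

Note that the captured location keeps the space that follows the comma. Writing ,\s* instead of a bare comma in the pattern would drop it.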

The `$' character doesn't capture anything. It specifies that whatever comes before it must be located at the end of the string (i.e. with nothing else after it).

Normally, the `?' character means "zero or one of whatever is before it". To use that definition, we ignore the concept of non-capturing groups for the moment.

1? == zero or one `1' characters.
and? == an `a', an `n' and an optional (i.e. zero or one) `d' at the end
(and)? == zero or one occurrence of the word `and'. The parentheses there group the characters together.
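The examples above can be checked with a short script. This is just an illustration: the ^ and $ anchors are added so that the whole test string has to match, and the sample strings are made up.

```perl
use strict;
use warnings;

# Each entry: test string, compiled pattern, label for printing.
my @tests = (
    [ 'an',     qr/^and?$/,   'and?'   ],    # match: the 'd' is optional
    [ 'and',    qr/^and?$/,   'and?'   ],    # match
    [ 'a',      qr/^and?$/,   'and?'   ],    # no match: only the 'd' is optional
    [ '',       qr/^(and)?$/, '(and)?' ],    # match: zero occurrences of 'and'
    [ 'and',    qr/^(and)?$/, '(and)?' ],    # match: one occurrence
    [ 'andand', qr/^(and)?$/, '(and)?' ],    # no match: at most one occurrence
);

# 1 for a match, 0 for no match, in the same order as @tests.
my @results = map { $_->[0] =~ $_->[1] ? 1 : 0 } @tests;

for my $i ( 0 .. $#tests ) {
    my ( $text, undef, $label ) = @{ $tests[$i] };
    printf "%-10s against %-8s: %s\n", "'$text'", $label,
        ( $results[$i] ? 'match' : 'no match' );
}
```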

Now for .*? - it's all down to how `greedy' the regexp is. Consider the following string: "abcdabcdabcd":

.* means "any amount of anything" (literally zero or more occurrences of any character: the dot matches any character, the star is the quantity). This is a "greedy" match, which means it matches as much as possible.

a.*d - matches "abcdabcdabcd" (i.e. the whole string, from the first `a' to the last `d')

Putting a `?' afterwards makes it non-greedy, so:
a.*?d - matches "abcd" - from the first `a' to the first `d' following it.
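Here is the same comparison as a runnable sketch, followed by a made-up example ("Box 1234" is invented sample data) showing why it matters for the date regexp: a greedy .* would run on to the last four-digit number in the string.

```perl
use strict;
use warnings;

my $text = "abcdabcdabcd";

my ($greedy) = $text =~ /(a.*d)/;     # .* grabs as much as it can
my ($lazy)   = $text =~ /(a.*?d)/;    # .*? stops at the first 'd' that works

print "greedy: $greedy\n";    # prints "greedy: abcdabcdabcd"
print "lazy:   $lazy\n";      # prints "lazy:   abcd"

# Invented sample where the location also contains four digits.
my $record = "1 OCT 1961, Box 1234, N.Y.";

my ($greedy_date) = $record =~ /(.*\d{4})/;     # runs on to "1234"
my ($lazy_date)   = $record =~ /(.*?\d{4})/;    # stops at "1961"

print "greedy date: $greedy_date\n";    # prints "greedy date: 1 OCT 1961, Box 1234"
print "lazy date:   $lazy_date\n";      # prints "lazy date:   1 OCT 1961"
```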

Finally, we move onto groups. Remember what I had earlier:
(and)?

We used the parentheses to group the three characters together. When we do it like this, it gets captured - this means that it is stored and returned by the regexp. In the regexp I posted, there are two capturing groups, one of which is assigned to $date and the other to $location.

Sometimes we want to group characters but don't want to capture them for later. Here, we use a non-capturing group. To do this, we put `?:' after the opening parenthesis. In this case the meaning of `?' is TOTALLY unrelated to what I've discussed above. So to group the word `and' without capturing it for later, we do:
(?:and)
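A small sketch of the difference, using a made-up sample string. In list context a match returns one value per capturing group, so the non-capturing version hands back one value fewer.

```perl
use strict;
use warnings;

my $text = "salt and pepper";

# Both groups capture, so the match returns two values.
my @both = $text =~ /(\w+) (and) /;

# (?:...) still groups but captures nothing, so only one value comes back.
my @one = $text =~ /(\w+) (?:and) /;

print scalar(@both), " captures: @both\n";
print scalar(@one),  " capture:  @one\n";
```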

Hope that clears some things up for you.

awingnut · Jul 26, 2006

It sure does, and thanks for that explanation. Much clearer than reading the regexp docs. A star for you.

awingnut · Jul 26, 2006

Make that 2 stars.

awingnut · Jul 26, 2006

Sorry but I'm having a problem in the case where there is no location part. It seems that any time I try to use $location, I get an error saying it is uninitialized. How do I handle that to avoid the error? TIA.

ishnid · Jul 26, 2006

The second line in the for loop I originally posted will prevent $location being uninitialised. Can you post the code you're currently using, along with some sample data that's causing problems?

awingnut · Jul 27, 2006

Once again, thanks for your help. I pretty much copied and pasted what you proposed. Here is the code segment in question:

Code:

my $string="NOV 1939";
($date,$place)=$string=~/(.*?\d{4}) ?(?:,(.+)$)?/;
print length($date)," - ",length($place),"\n";

ishnid · Jul 27, 2006

Yes, you're missing the line I had after the regexp:

Code:

$location ||= ''; # avoids warnings
# or in your case:
$place ||= '';

That was put in there to avoid the uninitialised warnings.
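One caveat worth knowing: ||= replaces any false value, including the string '0', not only undef. On Perl 5.10 or later (an assumption about your installation), the defined-or operator //= fills in only undef. A small sketch with made-up variable names:

```perl
use strict;
use warnings;
use 5.010;    # //= needs Perl 5.10 or later

my ( $undef_a, $undef_b ) = ( undef, undef );
my ( $zero_a,  $zero_b )  = ( '0',   '0' );

$undef_a ||= '';    # undef becomes ''
$undef_b //= '';    # undef becomes ''
$zero_a  ||= '';    # '0' is false, so it is replaced by ''
$zero_b  //= '';    # '0' is defined, so it is kept

print "||= on undef: [$undef_a]  //= on undef: [$undef_b]\n";
print "||= on '0':   [$zero_a]  //= on '0':   [$zero_b]\n";
```

For the place strings in this thread the difference is unlikely to matter, since a location of '0' would be unusual.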

awingnut · Jul 27, 2006

Duh! Sorry. I completely missed that. Thanks.
