nested regex

Status
Not open for further replies.

mwpclark

Programmer
Mar 14, 2005
59
US
I am editing a perl module to parse data that has some variations:

Example 1
<div class="text">32779 Kudo Dr<br>Mission, BC V2V 6T5<br>(604) 826-9091</div>

Example 2
<div class="text">Surrey, BC<br>(604) 951-3777</div>

The second example does not have an address or the following <br>.

This regex works for the first example, but creates boo-boos with the second:

'regexp' => 'textb".*?=">(.*?),\n\t+(.*?)</a>.*?\n.*?=">.*?[COLOR=red yellow]<div class="text">(.*?)<br>(.*?), (\w\w)(.*?)<br>(.*?)</div>\n[/color]'

I have made several attempts at a nested expression to find, or NOT to find, the address<br>, but no luck yet.

Also, the missing address does need to be represented as a value to match the numbering system.

'data_names' => ['surname', 'name', 'address', 'city', 'state', 'zipcode', 'phone' ],
'order' => [ '1', '2', '3', '4', '5', '6', '7' ],

Thanks
Mike
 
Attempting to use your own regexps for parsing HTML is a bad idea. You should have a look at one of the proper tag-aware modules such as HTML::TokeParser or (my favourite) HTML::TokeParser::Simple.
 
The way I would do that:
- catch everything between the [tt]div[/tt] tags
- split on [tt]<br>[/tt]
- guess the type of information in each field from its format (easy for a phone number, not necessarily possible for an address or name field)
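
The three steps above could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name parse_div and the two format checks (a North American phone number, and "City, XX" for the city line) are assumptions based on the two sample lines in the question, and anything that matches neither is simply assumed to be the street address.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref with address, city and phone,
# or nothing if no <div class="text"> block is found.
sub parse_div {
    my ($html) = @_;

    # Step 1: catch everything between the div tags.
    my ($inner) = $html =~ m{<div class="text">(.*?)</div>}s
        or return;

    # Step 2: split on <br>.
    my @fields = split /<br>/, $inner;

    # Step 3: guess the type of each field from its format.
    my %rec = (address => '', city => '', phone => '');
    for my $field (@fields) {
        if ($field =~ /^\(\d{3}\) \d{3}-\d{4}$/) {
            $rec{phone} = $field;      # looks like "(604) 826-9091"
        }
        elsif ($field =~ /^[^,]+, [A-Z]{2}\b/) {
            $rec{city} = $field;       # looks like "Mission, BC ..."
        }
        else {
            $rec{address} = $field;    # assumed to be the street address
        }
    }
    return \%rec;
}

for my $html (
    '<div class="text">32779 Kudo Dr<br>Mission, BC V2V 6T5<br>(604) 826-9091</div>',
    '<div class="text">Surrey, BC<br>(604) 951-3777</div>',
) {
    my $rec = parse_div($html) or next;
    print join(' | ', map { "$_=$rec->{$_}" } qw(address city phone)), "\n";
}
```

Because every key starts out as an empty string, a record without a street address still yields an address value, just an empty one.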

Franco
 
Let me elaborate a bit:

[highlight]<div class="text">(.*?)<br>(.*?), (\w\w)(.*?)<br>(.*?)</div>\n[/highlight]

works for

<div class="text">32779 Kudo Dr<br>Mission, BC V2V 6T5<br>(604) 826-9091</div>


[highlight]<div class="text">(.*?), (\w\w)(.*?)<br>(.*?)</div>\n[/highlight]

would work for

<div class="text">Surrey, BC<br>(604) 951-3777</div>

They both begin and end with div tags, but the first has an extra value and <br>.

I am looking for a boolean-type OR regex phrase that will let me grab one or the other, whichever it finds, PLUS would insert a blank (value) for the missing address.

thx
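
One way to get this "one or the other" behaviour in a single pattern is to wrap the address and its <br> in an optional non-capturing group, (?: ... )?. The sketch below is a guess at what that might look like, not a tested drop-in for the module: it covers only the div part of the pattern, the helper name parse_text_div is made up for the example, [^<]* is swapped in for .*? so that a field cannot run across a tag, and \s* is added so the zipcode loses its leading space.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The address and its <br> sit inside an optional non-capturing group,
# so the capture groups keep the same numbering in both cases.
my $re = qr{<div class="text">(?:([^<]*)<br>)?([^<,]*), (\w\w)\s*([^<]*)<br>([^<]*)</div>};

# Hypothetical helper: returns (address, city, state, zipcode, phone),
# or an empty list if the pattern does not match.
sub parse_text_div {
    my ($html) = @_;
    my @fields = $html =~ $re or return;
    # A group that took no part in the match is undef; turn it into ''.
    return map { defined $_ ? $_ : '' } @fields;
}

for my $html (
    '<div class="text">32779 Kudo Dr<br>Mission, BC V2V 6T5<br>(604) 826-9091</div>',
    '<div class="text">Surrey, BC<br>(604) 951-3777</div>',
) {
    my @fields = parse_text_div($html);
    print join(' ', map { "[$_]" } @fields), "\n";
}
```

When the address is missing, the optional group is skipped and its capture is undefined, which the helper turns into an empty string. The address slot therefore stays in place and the remaining fields keep their positions.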
 
The point I'm making is that a regexp is not the best (or even a particularly good) way to go about this type of task, given the high-quality HTML parsers that are available for free on CPAN.

For instance, your regexp would break if somebody put a newline instead of a space between "<div" and "class" (perhaps someone using a text editor that word-wraps in that way). That would be exactly equivalent from an HTML point of view but would not match your regexp. Also, if someone cleans up the code to be XHTML compliant, your <br> tags would become <br/>, which would break your regexp too. Newlines in the addresses could break it as well, depending on how you're using it.
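
To make those failure modes concrete, here is a small sketch that runs the div part of the pattern from the question against the three variations just described. The sample strings and the helper name matches are invented for the illustration, and the pattern is used without the /s modifier, which is an assumption about how the module applies it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The div portion of the pattern from the original post.
my $strict = qr{<div class="text">(.*?)<br>(.*?), (\w\w)(.*?)<br>(.*?)</div>};

# Four renderings of the same record; only the first is the original markup.
my %variant = (
    'original'           => qq{<div class="text">32779 Kudo Dr<br>Mission, BC V2V 6T5<br>(604) 826-9091</div>},
    'newline in tag'     => qq{<div\nclass="text">32779 Kudo Dr<br>Mission, BC V2V 6T5<br>(604) 826-9091</div>},
    'XHTML br'           => qq{<div class="text">32779 Kudo Dr<br/>Mission, BC V2V 6T5<br/>(604) 826-9091</div>},
    'newline in address' => qq{<div class="text">32779\nKudo Dr<br>Mission, BC V2V 6T5<br>(604) 826-9091</div>},
);

# Hypothetical helper: 1 if the strict pattern matches, 0 otherwise.
sub matches {
    my ($html) = @_;
    return $html =~ $strict ? 1 : 0;
}

for my $name (sort keys %variant) {
    printf "%-20s %s\n", $name, matches($variant{$name}) ? 'matches' : 'does NOT match';
}
```

Only the original markup matches. The other three would display identically in a browser but are missed by the pattern.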

 
This isn't the most elegant of solutions, and I haven't looked at the modules that ishnid has suggested. Off the top of my head, though, you could do this:

Code:
#!/usr/bin/perl
use strict;
use warnings;

extract("<div class='text'>32779 Kudo Dr<br>Mission, BC  V2V 6T5<br>(604) 826-9091</div>");
extract("<div class='text'>Mission, BC  V2V 6T5<br>(604) 826-9091</div>");

sub extract {
    my ($div) = @_;

    print "div: $div\n";
    # Strip the opening and closing div tags.
    $div =~ s/<div .*?>//;
    $div =~ s/<\/div>//;

    # The greedy (.*) splits on the last <br>: everything before it is
    # address and city, everything after it is the phone number.
    my ($addr_city, $phone) = $div =~ /(.*)<br>(.*)/
        or return;

    # If another <br> remains, the part before it is the street address;
    # otherwise the address stays empty.
    my $addr = "";
    my $city = $addr_city;
    if ($addr_city =~ /(.*)<br>(.*)/) {
        $addr = $1; $city = $2;
    }

    print "addr: $addr\n";
    print "city: $city\n";
    print "phone: $phone\n\n";
}

--
 