nested regex

mwpclark · Sep 8, 2008

I am editing a perl module to parse data that has some variations:

Example 1
<div class="text">32779 Kudo Dr<br>Mission, BC V2V 6T5<br>(604) 826-9091</div>

Example 2
<div class="text">Surrey, BC<br>(604) 951-3777</div>

The second example does not have an address or the following <br>.

This regex works for the first example, but creates boo-boos with the second:

'regexp' => 'textb".*?=">(.*?),\n\t+(.*?)</a>.*?\n.*?=">.*?<div class="text">(.*?)<br>(.*?), (\w\w)(.*?)<br>(.*?)</div>\n'

I have made several attempts at a nested expression to find, or NOT to find, the address<br>, but no luck yet.

Also, the missing address does need to be represented as a value to match the numbering system.

'data_names' => ['surname', 'name', 'address', 'city', 'state', 'zipcode', 'phone' ],
'order' => [ '1', '2', '3', '4', '5', '6', '7' ],

Thanks
Mike

ishnid · Sep 8, 2008

Attempting to use your own regexps for parsing HTML is a bad idea. You should have a look at one of the proper tag-aware modules such as HTML::TokeParser or (my favourite) HTML::TokeParser::Simple.

prex1 · Sep 8, 2008

The way I would do that:
- catch everything between the div tags
- split on <br>
- guess the type of information for every field from its format (easy for a phone number, not necessarily possible for an address or name field)
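
The three steps above could be sketched roughly as follows. This is only an illustration: the sub name split_div, the hash keys, and the two format patterns (North American phone number, "City, XX" line) are assumptions made for this example and are not taken from the original module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: grab the div body, split on <br>, classify by format.
sub split_div {
    my ($html) = @_;

    # Step 1: catch everything between the div tags.
    my ($inner) = $html =~ m{<div class="text">(.*?)</div>}s or return;

    my %rec = (address => '', city_line => '', phone => '');

    # Step 2: split on <br> (also tolerating <br/> and <br />).
    for my $part (split m{<br\s*/?>}i, $inner) {
        # Step 3: guess the field type from its format.
        if ($part =~ /^\(\d{3}\)\s*\d{3}-\d{4}$/) {
            $rec{phone} = $part;          # looks like (604) 826-9091
        }
        elsif ($part =~ /^[^,]+, [A-Z]{2}\b/) {
            $rec{city_line} = $part;      # looks like "Mission, BC ..."
        }
        else {
            $rec{address} = $part;        # anything else is treated as the address
        }
    }
    return \%rec;
}

my $rec = split_div('<div class="text">Surrey, BC<br>(604) 951-3777</div>');
print "address=[$rec->{address}] city_line=[$rec->{city_line}] phone=[$rec->{phone}]\n";
```

A missing address simply stays an empty string, so the number of fields is the same for both kinds of record.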

Franco

mwpclark · Sep 8, 2008

Let me elaborate a bit:

<div class="text">(.*?)<br>(.*?), (\w\w)(.*?)<br>(.*?)</div>\n

works for

<div class="text">32779 Kudo Dr<br>Mission, BC V2V 6T5<br>(604) 826-9091</div>

<div class="text">(.*?), (\w\w)(.*?)<br>(.*?)</div>\n

would work for

<div class="text">Surrey, BC<br>(604) 951-3777</div>

They both begin and end with div tags, but the first has an extra value and <br>.

I am looking for a boolean-type OR regex phrase that will let me grab one or the other, whichever it finds, PLUS would insert a blank (value) for the missing address.

thx
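
One way to get the "one or the other" behaviour in a single pattern is to wrap the address and its <br> in an optional non-capturing group, (?: ... )?. The sketch below covers only the div portion of the pattern and uses [^<]* in place of .*? so a field cannot run past a tag. The sub name parse_div is made up for the example, and whether the module accepts an empty string for the skipped group is an assumption that would need checking.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The address and its <br> sit inside an optional non-capturing group,
# so the group numbers stay fixed: 1=address 2=city 3=state 4=zipcode 5=phone.
my $re = qr{<div class="text">(?:([^<]*)<br>)?([^<,]*), (\w\w)\s*([^<]*)<br>([^<]*)</div>};

# Hypothetical helper: returns the five fields, or an empty list on no match.
sub parse_div {
    my ($html) = @_;
    my @fields = $html =~ $re or return;
    # A skipped optional group comes back as undef; turn it into a blank value.
    return map { defined $_ ? $_ : '' } @fields;
}

for my $html (
    '<div class="text">32779 Kudo Dr<br>Mission, BC V2V 6T5<br>(604) 826-9091</div>',
    '<div class="text">Surrey, BC<br>(604) 951-3777</div>',
) {
    print join('|', parse_div($html)), "\n";
}
# prints:
# 32779 Kudo Dr|Mission|BC|V2V 6T5|(604) 826-9091
# |Surrey|BC||(604) 951-3777
```

In the second record the first attempt to match the optional group fails (there is no ", " after the only <br>), so the regex engine backtracks and skips it, leaving group 1 undefined.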

ishnid · Sep 8, 2008

The point I'm making is that a regexp is not the best (or even a particularly good) way to go about this type of task, given the high-quality HTML parsers that are available for free on CPAN.

For instance, your regexp would break if somebody put a newline instead of a space between "<div" and "class" (perhaps someone using a text editor that word-wraps in that way). That would be exactly equivalent from an HTML point of view but would not pass your regexp. Also, if someone cleans up the code to be XHTML compliant, your <br> tags would become <br/>, which would break your regexp too. Newlines in the addresses could break it too, since "." does not match a newline unless you use the /s modifier.
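
As a rough illustration of the parser approach, here is a sketch using HTML::TokeParser (a CPAN module from the HTML-Parser distribution, so it has to be installed). It collects the text pieces inside each div with class "text" and ignores how the tags themselves are written. The sub name div_text_parts and the sample input are assumptions for this example only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;   # CPAN: part of the HTML-Parser distribution

# Hypothetical helper: one array ref of text pieces per <div class="text">.
sub div_text_parts {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot parse input";
    my @records;

    while (my $tag = $p->get_tag('div')) {
        # $tag->[1] is the attribute hash of the start tag.
        next unless ($tag->[1]{class} || '') eq 'text';

        my @parts;
        while (my $t = $p->get_token) {
            last if $t->[0] eq 'E' && $t->[1] eq 'div';   # reached </div>
            next unless $t->[0] eq 'T';                    # keep text tokens only
            (my $text = $t->[1]) =~ s/^\s+|\s+$//g;        # trim whitespace
            push @parts, $text if length $text;
        }
        push @records, \@parts;
    }
    return @records;
}

# Newline inside the tag and an XHTML-style <br/>: both are handled.
my $html = qq{<div\nclass="text">Surrey, BC<br/>(604) 951-3777</div>};
for my $parts (div_text_parts($html)) {
    print join(' | ', @$parts), "\n";
}
```

A record with two text pieces has no street address and one with three does, so the caller can pad the missing field before handing the values on.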

sycoogtit · Sep 8, 2008

This isn't the most elegant of solutions, and I haven't looked at the modules that ishnid has suggested. Off the top of my head, though, you could do this:

Code:

#!/usr/bin/perl
use strict;
use warnings;

extract("<div class='text'>32779 Kudo Dr<br>Mission, BC  V2V 6T5<br>(604) 826-9091</div>");
extract("<div class='text'>Mission, BC  V2V 6T5<br>(604) 826-9091</div>");

sub extract {
    my ($div) = @_;

    print "div: $div\n";

    # Strip the surrounding div tags.
    $div =~ s/<div .*?>//;
    $div =~ s/<\/div>//;

    # The greedy (.*) splits on the LAST <br>, so $phone is the final field.
    my ($addr_city, $phone) = $div =~ /(.*)<br>(.*)/
        or return;

    # If a <br> is left over, the first part is the street address;
    # otherwise the address stays blank.
    my $addr = "";
    my $city = $addr_city;
    if ($addr_city =~ /(.*)<br>(.*)/) {
        ($addr, $city) = ($1, $2);
    }

    print "addr: $addr\n";
    print "city: $city\n";
    print "phone: $phone\n\n";
}
