Extracting zip codes from a file? It's pulling the whole line

puns0steel · Jun 12, 2008

I'm brand new to Perl, so any help would be great! I'm using ActiveState on XP. I'm trying to extract only the zip codes from an HTML file and write them to another file, separated by line breaks or commas, so I can put them in a spreadsheet.

Here's the code I'm using (I got it from a friend):

open(INFILE, '<', "alldata.html") or die("Could not open input file: $!\n");
open(OUTFILE, '>', "justzipcodes.html") or die("Could not open output file: $!\n");
my $line;
while ($line = <INFILE>)
{
    if ($line =~ /\b\d{5}(?:[-\s]\d{4})?\b/)
    {
        print OUTFILE $line;
    }
}
close(OUTFILE);
close(INFILE);

It outputs to the file, but it includes the whole line of data that has the zip code, it's all hyperlinked, and there's nothing separating the data, no line breaks or anything. I'd like just a simple 5-digit zip code with no links or anything.

I also need to get rid of duplicates, but I'm guessing that's the next step.

Please help, thanks!

KevinADC · Jun 12, 2008

Sure, that's because you are printing $line (the whole line) instead of the part of it that matched your pattern. Add capturing parentheses around the pattern and print $1 instead. Change these lines to this:

Code:

if ($line =~ /\b(\d{5}(?:[-\s]\d{4})?)\b/)
{
print OUTFILE $1,"\n";

and see if that works. We can work on the duplicates if you get this part working. Or someone will post that for you.
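For anyone reading along, here is a minimal self-contained sketch of the same idea. The helper name first_zip and the sample line are made up for illustration; only the regex comes from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first ZIP-like match in a line, or undef.
sub first_zip {
    my ($line) = @_;
    # The outer parentheses capture the match into $1;
    # the inner (?:...) only groups and captures nothing.
    if ($line =~ /\b(\d{5}(?:[-\s]\d{4})?)\b/) {
        return $1;
    }
    return;
}

my $line = "Sale Hall, GA 30314</td>\n";   # made-up sample line
my $zip  = first_zip($line);
print "$zip\n" if defined $zip;             # prints the match only, not $line
```

If $1 comes out empty, it is worth checking that the outer parentheses really made it into the pattern; without them nothing is captured.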

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]

puns0steel · Jun 12, 2008

Thanks for replying, Kevin. What you said makes sense, but when I make the change I end up with 2100 blank lines. If I take out the "\n" and just leave

print OUTFILE $1;

I get no results. Does that mean it's a problem with the regex? I was getting data with the same regex before I made the change you suggested.

KevinADC · Jun 12, 2008

Post some sample data. I or someone else will take a look and evaluate your regexp against it.

puns0steel · Jun 12, 2008

It's not very pretty, but here it is:

<table border="0" cellpadding="0" cellspacing="0">
<tr valign="top">
<td width="270" style="padding-right:5px;">
<a href="course_options.jhtml?zip=30314&pi=3600046&displayCategory=all&prodid=1708&classType=class&sort=distance&source=co_op1&delivery_type=&firstClassId=">Morehouse College-Sale Hall </a><br>
</td>
<td width="260" style="padding-right:5px;">
Sale Hall, GA 30314</td>
</tr>
</table>

KevinADC · Jun 12, 2008

If this is the zip code part:

zip=30314&

this is the pattern I would try first:

Code:

if (/zip=([\d -]+)&/) {
   print $1;
}

If that proves to be too generous, you can try something like:

Code:

if (/zip=(\d{5}[- ]?\d{0,4})&/) {
   print $1;
}
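As a rough check, a pattern along these lines can be run against the sample line posted above. This is only a sketch: the helper name zip_from_query is invented, the href is abbreviated from the sample, and the pattern is a slightly tighter variant that requires all four extra digits when a separator is present.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the zip= value out of one line of HTML.
sub zip_from_query {
    my ($line) = @_;
    # Requiring "&" after the value also covers the entity form "&amp;",
    # since that begins with "&" as well.
    return $1 if $line =~ /zip=(\d{5}(?:[- ]\d{4})?)&/;
    return;
}

# Abbreviated version of the sample href from the thread.
my $sample = '<a href="course_options.jhtml?zip=30314&pi=3600046&sort=distance">';
my $zip = zip_from_query($sample);
print "$zip\n" if defined $zip;
```

Anchoring on zip= means the plain-text address line ("Sale Hall, GA 30314") is ignored, so each table entry yields one value instead of two.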

puns0steel · Jun 12, 2008

Thanks for the help, Kevin. I ended up using this:

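use strict;
use warnings;

die "Usage: $0 <infile> <outfile>\n" unless @ARGV == 2;

# Three-argument open with an explicit mode for both files
open my $in,  '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
open my $out, '>', $ARGV[1] or die "Cannot open $ARGV[1]: $!";

while (my $l = <$in>) {
    # Capture the digits that follow "zip=" and print one per line
    if ($l =~ m/zip=(\d+)/) {
        print $out $1, "\n";
    }
}

close $in;
close $out or die "Cannot close $ARGV[1]: $!";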

KevinADC · Jun 12, 2008

That won't work for nine-digit zip codes like 12345-6789 (\d+ stops at the hyphen, so only 12345 is captured), but maybe that is not a concern.
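One possible way to handle both the nine-digit form and the duplicates mentioned in the first post is sketched below. The helper name unique_zips and the sample lines are made up; it assumes the zip always follows "zip=" and that the nine-digit form uses a hyphen.

```perl
use strict;
use warnings;

# Hypothetical helper: collect unique zip= values from a list of lines,
# keeping the order in which they were first seen.
sub unique_zips {
    my @lines = @_;
    my (%seen, @zips);
    for my $line (@lines) {
        # /g in a while loop picks up every zip= on the line, not just the first.
        while ($line =~ /zip=(\d{5}(?:-\d{4})?)/g) {
            # %seen counts occurrences; only the first one is kept.
            push @zips, $1 unless $seen{$1}++;
        }
    }
    return @zips;
}

my @lines = (
    'a href="x?zip=30314&pi=1"',        # made-up sample lines
    'a href="x?zip=30314&pi=2"',
    'a href="x?zip=12345-6789&pi=3"',
);
print "$_\n" for unique_zips(@lines);
```

In the script above, the same %seen check could sit around the print to $out instead of collecting into an array.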