extracting zip codes from a file? it's pulling the whole line

Status
Not open for further replies.

puns0steel

Technical User
Jun 12, 2008
4
I'm brand new to Perl, so any help would be great! I'm using ActiveState on XP. I'm trying to extract only the zip codes from an HTML file and put them into another file, separated by line breaks or commas, so I can put them in a spreadsheet.

Here's the code I'm using (I got it from a friend):

open(INFILE, '<', "alldata.html") or die("Could not open input file.\n");
open(OUTFILE, '>', "justzipcodes.html") or die("Could not open output file.\n");
my $line;
while ($line = <INFILE>)
{
    if ($line =~ /\b\d{5}(?:[-\s]\d{4})?\b/)
    {
        print OUTFILE $line;
    }
}
close(OUTFILE);
close(INFILE);


It outputs to the file, but it includes the whole line of data that has the zip code, it's all hyperlinked, and there's nothing separating the data: no line breaks or anything. I'd like just a simple 5-digit zip code with no links or anything.

I also need to get rid of duplicates, but I'm guessing that's the next step.

Please help, thanks!
 
Sure, because you are printing $line instead of the part of the line that matched your pattern. Wrap the pattern in capturing parentheses and print $1 instead. Change these lines to this:

Code:
if ($line =~ /\b(\d{5}(?:[-\s]\d{4})?)\b/)
{
print OUTFILE $1,"\n";

and see if that works. We can work on the duplicates if you get this part working. Or someone will post that for you.
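
To make the difference concrete, here is a minimal sketch of what the capturing parentheses change. The sample line and the helper name extract_zip are made up for illustration, not taken from the poster's file. Without the parentheses the match still succeeds, but $1 is undefined, so printing it with "\n" writes only a blank line.

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the poster's file.
my $line = 'Sale Hall, GA 30314</td>';

# Return the captured zip code, or undef when the text has no match.
sub extract_zip {
    my ($text) = @_;
    return $text =~ /\b(\d{5}(?:[-\s]\d{4})?)\b/ ? $1 : undef;
}

# With capturing parentheses: only the zip code is printed.
my $zip = extract_zip($line);
print "$zip\n" if defined $zip;

# Without capturing parentheses: the line matches, but nothing is captured.
if ($line =~ /\b\d{5}(?:[-\s]\d{4})?\b/) {
    print defined $1 ? "captured: $1\n" : "matched, but \$1 is undefined\n";
}
```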

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Thanks for replying, Kevin. What you said makes sense, but when I make the change I end up with 2100 blank lines. If I take out the "\n" and just leave

print OUTFILE $1;

I get no results. Does that mean it's a problem with the regex? I was getting data with the same regex before I made the change you suggested.
 
Post some sample data. I or someone else will take a look and evaluate your regexp against it.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
it's not very pretty, but here it is:



<table border="0" cellpadding="0" cellspacing="0">
<tr valign="top">
<td width="270" style="padding-right:5px;">
<a href="course_options.jhtml?zip=30314&amp;pi=3600046&amp;displayCategory=all&amp;prodid=1708&amp;classType=class&amp;sort=distance&amp;source=co_op1&amp;delivery_type=&amp;firstClassId=">Morehouse College-Sale Hall </a><br>
</td>
<td width="260" style="padding-right:5px;">
Sale Hall, GA 30314</td>
</tr>
</table>
 
if this is the zip code part:

zip=30314&amp;

this is the pattern I would first try using:

Code:
if (/zip=([\d -]+)&amp;/) {
   print $1;
}

if that proves to be too generous you can try something like:

Code:
if (/zip=(\d{5}[- ]?\d{0,4})&amp;/) {
   print $1;
}
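
As a rough sketch of how the stricter pattern behaves on the sample data, the snippet below runs it with the /g modifier so that every zip= parameter on a line is collected, not just the first. The helper name zips_from_links is hypothetical, and the sample string is a shortened copy of the link posted above.

```perl
use strict;
use warnings;

# Return every zip code found in "zip=...&amp;" query parameters of one line.
sub zips_from_links {
    my ($html) = @_;
    my @zips;
    # /g resumes after the previous match, so repeated links are all found.
    while ($html =~ /zip=(\d{5}[- ]?\d{0,4})&amp;/g) {
        push @zips, $1;
    }
    return @zips;
}

# Shortened version of the sample link from the thread.
my $sample = '<a href="course_options.jhtml?zip=30314&amp;pi=3600046&amp;displayCategory=all">Morehouse College-Sale Hall </a><br>';
print "$_\n" for zips_from_links($sample);
```

Matching on the zip= parameter rather than on any five digits avoids picking up other numbers in the markup, at the cost of depending on the link format staying the same.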

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Thanks for the help, Kevin. I ended up using this:


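use strict;
use warnings;

die "$0 <infile> <outfile>\n" unless @ARGV == 2;

# Three-argument open keeps the file name from being parsed as a mode.
open my $in, '<', $ARGV[0] or die $!;
open my $out, '>', $ARGV[1] or die $!;

while (my $l = <$in>) {
    # Capture the digits that follow "zip=" in each link.
    if ($l =~ m/zip=(\d+)/) {
        print $out $1, "\n";
    }
}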
 
That will only capture the first five digits of a nine-digit zip code like 12345-6789, but maybe that is not a concern.
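
For the duplicates mentioned in the original question, one common approach is a hash that remembers which values have already been seen. This is only a sketch: the helper name unique_in_order and the sample values are made up, and in the real script the check would sit inside the while loop before the print.

```perl
use strict;
use warnings;

# Keep the first occurrence of each value, preserving input order.
sub unique_in_order {
    my %seen;
    # $seen{$_}++ is 0 (false) the first time a value appears, true afterwards.
    return grep { !$seen{$_}++ } @_;
}

# Hypothetical extracted values standing in for the script's output.
my @zips = qw(30314 30303 30314 30332 30303);
print "$_\n" for unique_in_order(@zips);
```

Inside the loop, the same idea reduces to printing only when the hash has not yet seen the captured value, for example with a condition such as !$seen{$1}++ on a %seen hash declared before the loop.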

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 