
Need Urgent Help with Perl Lookup & Substitution Routine

Status
Not open for further replies.

EricSilver

Technical User
Mar 18, 2008
19
US
Hello,

I am a new user having a problem getting what should be a simple routine to work.

What I am doing is opening an address file, and a state name file, so I can change state abbreviations, i.e. “AZ” to full state names, i.e. “Arizona.”

After the address file is opened, the state name file is opened. This file consists of three fields: 1.) Unique identifier; 2.) State abbreviation; 3.) Full state name.

The routine compares the address file state abbreviation data value to the abbreviation field value in the state name file. If it matches, the Address File abbreviation data element is changed to the State Name File full name data element.

If the Address File abbreviation is not in the state name file, I want the routine to print "error" in place of the full state name. Unfortunately, that part is not working. Instead of printing "ERROR" for one Address File record, it prints "ERROR" for all of them. Any assistance would be appreciated. Here is what I have:


## FILE LOCATIONS
$file='/File.txt'; ## THE ADDRESS FILE
$maplocation='/Map.txt'; ## THE STATE NAME LOOKUP FILE
$file2='File2.txt'; ## THE MODIFIED ADDRESS FILE


## OPEN ADDRESS FILE AND ADD CONTENTS TO AN ARRAY
open(FILE,"<$file")||die "Could not open $file";
@file=<FILE>;
close FILE;

## FOR EACH RECORD IN THE ADDRESS FILE, DO THE FOLLOWING
foreach $line (@file) {
@data=split(/\t/,$line);

## CREATE VARIABLES CORRESPONDING TO ADDRESS FILE DATA
## (This step is not really necessary)
$d0=$data[0];
$d1=$data[1];
$d2=$data[2];
$d3=$data[3];
$d4=$data[4];
$d5=$data[5];
$d6=$data[6];
$d7=$data[7];
$d8=$data[8];
$d9=$data[9];
$d10=$data[10];

## OPEN STATE NAMES FILE AND ADD CONTENTS TO AN ARRAY
open(MAP,"<$maplocation");
@entries = <MAP>;
close MAP;

## FOR EACH RECORD IN THE STATE NAMES FILE, DO THE FOLLOWING
foreach $line2 (@entries) {
@fields=split(/,/,$line2);

## COMPARE ADDRESS FILE STATE ABBREVIATION DATA TO STATE FILE ABBREVIATION DATA. (THIS CODE WORKS PERFECTLY)

if ($d8 eq $fields[1]) {$d8=$fields[2]};

## $d8 is the address file abbreviation value; $fields[1]
## is the State File abbreviation value; and $fields[2] is
## the state file full name value.

## IF ADDRESS FILE STATE ABBREVIATION IS NOT PRESENT IN STATE NAME FILE, PRINT ERROR (THIS CODE FAILS):

if ($d8 eq $fields[1]) {$d8=$fields[2]} else {$d8="error"};
}

## INSTEAD OF PRINTING "ERROR" FOR ONE RECORD, IT PRINTS ERROR FOR ALL OF THEM.

## WRITE OUTPUT TO FILE
$line= "$d0"."$d1"."$d2"."$d3"."$d4"."$d5"."$d6"."$d7"."$d8"."$d9"."$d10"."\n";

};

open(DATA,">$file2");
print DATA (@file);
close DATA;
 
... what I prefer:
$data[8] = (exists $states{$data[8]}) ? $states{$data[8]} : 'Error';
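The one-liner above relies on a %states hash that is built once, before the address records are processed. A minimal sketch of that idea follows, assuming in-memory sample lines stand in for Map.txt and that the fields are separated by "::" as in the code later in this thread; lookup_state is a hypothetical helper name, not something from the original replies.

```perl
use strict;
use warnings;

# Sample records standing in for Map.txt (assumed format: id::abbrev::fullname)
my @map_lines = ("1::AZ::Arizona\n", "2::CA::California\n", "3::WY::Wyoming\n");

# Build the lookup hash once, before looping over the address records
my %states;
for my $entry (@map_lines) {
    chomp(my $copy = $entry);
    my ($code, $abbrev, $fullname) = split /::/, $copy;
    $states{$abbrev} = $fullname;
}

# Hypothetical helper: a single lookup per address record,
# returning 'Error' only when the abbreviation is absent from the hash
sub lookup_state {
    my ($abbrev) = @_;
    return exists $states{$abbrev} ? $states{$abbrev} : 'Error';
}

print lookup_state('AZ'), "\n";
print lookup_state('ZZ'), "\n";
```

In the original nested loop, the else branch ran for every map line that did not match, so any match was overwritten by a later line. With the hash, the match/no-match decision is made exactly once per record.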

You are mistakenly assigning $desc to $data[8], which must be the last value from the maplocation file, "Wyoming".

Works perfectly now! Thanks so much for all the good feedback.

I will eventually need to apply this lookup logic to additional files, but I do not anticipate too many problems. :)
 
Out of curiosity, would this also work with wildcards?

For example, suppose I wanted all state abbreviations that begin with "A" (AZ, AL, AK, AR) to return "Arizona" as the $data[8] value.

Would this code accommodate the use of "." or "^" and other wildcard characters?

Code:
$data[8] = (exists $states{$data[8]}) ? $states{$data[8]} : 'Error';
 

Actually, let me clarify that.

If a state abbreviation field had one or more extra characters, i.e. "AZ" was written as "AZX" in the lookup file, and I wanted to make sure it was interpreted as "AZ" using a wildcard, could that work?

I don't see how "^AZ" could be incorporated into this code.

Code:
$data[8] = (exists $states{$data[8]}) ? $states{$data[8]} : 'Error';
 
It sounds possible but I guess the trick is to make it work for all states. Probably the best time to do that though would be while reading in the file that has the state abbreviations, not afterwards.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
It sounds possible but I guess the trick is to make it work for all states. Probably the best time to do that though would be while reading in the file that has the state abbreviations, not afterwards.

Exactly what I am thinking, between lines 5 and 6 below.

Code:
open(MAP,"<$maplocation")||die "Could not open $maplocation";
    while (<MAP>) {
        chomp;
        ($code, $abbrev, $fullname)=split /::/, $_;
        $states{$abbrev}=$fullname;
    }
close MAP;
 
If the only problem is extra characters and not something else:

Code:
        ($code, $abbrev, $fullname)=split /::/, $_;
        $abbrev = substr($abbrev,0,2); 
        $states{$abbrev}=$fullname;

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
By "something else" what do you mean? More than one or two extra characters?

For example, would the code you just submitted apply if the $abbrev field was: "The quick brown fox" and the comparison field was: "The quick brown fox chases cars"?

 
A hash uses key-value pairs. It works by taking the key, running it through a hashing algorithm to produce a number, and then using that number as an index to store the value at a memory location. This explains why hash lookups are so fast, and also why the keys function doesn't guarantee what order they will be returned in. It also means that they don't support any kind of wildcarding as you must have the whole key to support the hash lookup.

Your best bet is to standardise the key in some way before you store it in the hash, and also before you look it up. So for example you could take the first two characters only (as KevinADC's example shows) and convert them to upper case (as I did in my original post) to make the process more robust.
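A minimal sketch of that normalisation, applied both when storing and when looking up. The helper name normalize_key and the sample values are hypothetical, and it assumes the first two characters of every value can safely be used as the key.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce any raw value to a two-character upper-case key
sub normalize_key {
    my ($raw) = @_;
    return uc substr($raw, 0, 2);   # first two characters, upper-cased
}

my %states;
$states{ normalize_key('azx') } = 'Arizona';   # normalise before storing

my $key    = normalize_key('Az123');           # normalise before looking up
my $result = exists $states{$key} ? $states{$key} : 'Error';
print "$result\n";
```

Because the same function is used on both sides, 'azx' in the lookup file and 'Az123' in the address file end up as the same key, 'AZ'.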

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)
 
By "something else" what do you mean? More than one or two extra characters?

By "something else" I meant that you do not need to do anything except get the first two letters for the state abbreviation. The code I posted just returns the first two characters in the $abbrev variable, so if those first two characters cannot safely be used to create the hash key, my suggestion would not work. But if you just have stuff like AZX instead of AZ, then you will be fine.

I would also convert the state "keys" to all lower case or all upper case, as Steve mentions above, to normalize the hash keys so you always know what you are working with: AZ or az for example, instead of some being AZ and some being az or whatever.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
I plan to get back to it later today, using something like this (looks OK here, but won't know for sure until I try it):

Code:
open(MAP,"<$maplocation")||die "Could not open $maplocation";
    while (<MAP>) {
    chomp;
    ($code, $abbrev, $fullname)=split /::/, $_;

     $ab = "$abbrev";
     $ab =~ s/$ab/$ab#WILDCARD CHAR#/; #Everything after the characters in $ab is ignored/considered valid
     $abbrev = $ab;
     $states{$abbrev}=$fullname;
    }
close MAP;
 
Is there a means of editing/deleting posts? In my previous post, I have this backwards:

Code:
$ab = "$abbrev";
$ab =~ s/$ab/$ab#WILDCARD CHAR#/;

Since $abbrev is the reference value, what I have there is wrong. The submitted value ( $data[8] ) is what needs to be wildcarded.
 
You can't delete or edit posts. Make sure to use the "Preview Post" button and check your posts for any errors or changes before finally clicking on the submit button. In the preview screen there is an "Edit Post" button you can use to make edits.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
OK, what am I missing here? The state abbreviations in the lookup file are all uppercase, 2 characters, except Arizona, which, for testing purposes, I have formatted like this:

AZABBCCDDDGGGEE

The below code works fine for all states, except Arizona which is generating an error.

Code:
open(MAP,"<$maplocation")||die "Could not open $maplocation";
    while (<MAP>) {
        chomp;
($code, $abbrev, $fullname)=split /::/, $_;
$states{$abbrev}=$fullname;
 }
close MAP;

open(COLS,"<$file")||die "Could not open $file";
@file=<COLS>;
close COLS;
foreach $line (@file) {
@data=split(/\|\|/,$line);      

########################### WILDCARD

$d8 ="$data[8]";
$d8 =~ s/$abbrev.*//g; 

########################### WILDCARD

$data[8] = (exists $states{$d8}) ? $states{$data[8]} : 'Error';

 
I thought we already cleared up this part of your question.

Code:
        ($code, $abbrev, $fullname)=split /::/, $_;
        $abbrev = substr($abbrev,0,2);
        $states{$abbrev}=$fullname;

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
What I mean is, what if the field length is variable, i.e. the full state name field, where each text string would be a different length, and the ($abbrev,0,2) would not apply? Is there a wildcard that could be used in such situations?

Here is my previous question from April 9th:

For example, would the code you just submitted apply if the $abbrev field was: "The quick brown fox" and the comparison field was: "The quick brown fox chases cars"?
 
Some further clarification:

Say I have a lookup field that contains:
"The Box is Blue" ($abbrev)

I would want to match it exactly, but I know some comparison data ($data[8]) will contain:

"The Box is Blue and Large"

Therefore, I would want $abbrev to be "$abbrev.*" (The Box is Blue.*) so if there are erroneous characters in the comparison string, no error will be generated so long as everything before the ".*" matches.
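One way to express "everything before the .* must match" is an anchored regular expression built from the reference value. This is only a sketch: starts_with is a hypothetical helper name, and \Q...\E is used so that any regex metacharacters inside the reference value are treated literally.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $value begins with $prefix, whatever follows
sub starts_with {
    my ($value, $prefix) = @_;
    # ^ anchors at the start; \Q...\E escapes metacharacters in $prefix
    return $value =~ /^\Q$prefix\E/ ? 1 : 0;
}

my $abbrev = 'The Box is Blue';             # reference value from the lookup file
my $value  = 'The Box is Blue and Large';   # comparison value from the address file

print "match\n" if starts_with($value, $abbrev);
```

No trailing ".*" is needed: without a "$" anchor, the pattern already ignores whatever comes after the matched prefix.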
 
EricSilver, you are mixing things up, which makes your question and your situation unclear.
Let me restate your question to check that I understand it:
-you have a maplocation lookup file where the state abbreviations are all correct: 2 uppercase letters everywhere, and that's fine
-now you have an address file where the state abbreviations may not exactly correspond to those in the lookup
-here you should first decide to what extent you assume they do not correspond (this was the point of Kevin's question above): you can have lowercase letters (simple to solve), the two-letter code embedded in a longer string with extra characters before and after (with possible multiple correspondences), or something else
-let's assume, as you confirmed above, that you take the first two letters as correct (except for the case), and that you expect only extra characters to the right (of any length and type)
Kevin already gave you the answer for this, except that he applied it to the lookup file, because you first told us that the extra characters were in the lookup file.
Now, if my clarification above is correct, you simply have to do something like this (derived from your code and untested):
Code:
open(MAP,"<$maplocation")||die "Could not open $maplocation";
while (<MAP>) {
  chomp;
  ($code, $abbrev, $fullname)=split /::/, $_;
  $states{$abbrev}=$fullname;
}
close MAP;

open(COLS,"<$file")||die "Could not open $file";
@file=<COLS>;
close COLS;
foreach $line (@file) {
  @data=split(/\|\|/,$line);      
  $d8=uc(substr($data[8],0,2));
  $data[8]=(exists $states{$d8})?$states{$d8}:'Error';
}
As already noted above, you cannot use wildcards with the exists function: for more complex corrections to the abbreviations in the address file (e.g. extra characters before and after) you would need to check all the keys in %states, possibly using a regexp.
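A rough sketch of that last idea, checking every key in %states against the address value with a regexp. The sample hash contents and the helper name find_state are hypothetical, and treating zero or multiple hits as 'Error' is just one possible policy for the ambiguous cases mentioned above.

```perl
use strict;
use warnings;

# Sample lookup data (hypothetical subset)
my %states = (AZ => 'Arizona', CA => 'California', NY => 'New York');

# Hypothetical helper: find the one key that occurs anywhere in the raw value
sub find_state {
    my ($raw) = @_;
    my $value = uc $raw;                                    # ignore case
    my @hits  = grep { $value =~ /\Q$_\E/ } sort keys %states;
    # Exactly one hit is unambiguous; anything else is reported as an error
    return @hits == 1 ? $states{ $hits[0] } : 'Error';
}

print find_state('xxAZ99'), "\n";
print find_state('CANY'), "\n";
```

This scans all keys for each record, so it is slower than a direct hash lookup and is only worth using when the simple substr approach cannot be applied.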

Franco
 
I am dropping out of this thread; it has eaten its own tail a couple of times now and continues to just turn circles. All the best.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
My apologies for the confusion; I am asking two separate, but similar questions in the same thread, and not taking into account that some of the "wildcard" functionality I want will occur by default. Thank you Kevin for your good help to this point. Franco, your understanding of the state lookup and abbreviation files is correct. That part already works perfectly.

I understand what to do if the source file values are all 2 characters long and the state lookup file values are not: use substr(xxx,0,2).

But if the source values are not all 2 characters long, things begin to fog up for me.

Example:

Source Values    Lookup Values
-------------    -------------
AZ               AZ
AZ123            AZ
CAXYZ            CA
NY123            NY
NYCCVB           NYC
NYCVB            NYCVB
MICH6789         MICH678

The current substr(x,0,2) code would work fine for the first four source/lookup values on the above list, but will generate errors for the last three. For those, I need to change the substr length in order to match them correctly.

Right now, I am wondering if it is possible for the substr length to be a variable, i.e.,

$data[8] = substr($abbrev,0,$var);

I could then insert code which, before conducting each lookup, counts the length of the target lookup value and makes that length the $var value.
 
To answer part of my own question: the substr length can be a variable, so doing an on-the-fly length change, just before the lookup, is where I will focus my energies, and hopefully get the result I need.
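For what it is worth, here is a sketch of how that variable-length substr could look, using the lookup values from the table above. The helper name match_key is hypothetical, and trying the longest lookup values first is an assumption on my part, made so that NYCVB is not mistaken for NYC or NY.

```perl
use strict;
use warnings;

# Lookup values from the example table, longest first so that the most
# specific value wins (NYCVB before NYC before NY)
my @keys = sort { length($b) <=> length($a) || $a cmp $b }
           qw(AZ CA NY NYC NYCVB MICH678);

# Hypothetical helper: return the lookup value that the source value starts with
sub match_key {
    my ($source) = @_;
    for my $key (@keys) {
        my $len = length $key;    # the substr length is a variable
        return $key if substr($source, 0, $len) eq $key;
    }
    return 'Error';
}

print "$_ => ", match_key($_), "\n"
    for qw(AZ AZ123 CAXYZ NY123 NYCCVB NYCVB MICH6789);
```

The returned key can then be used for the usual %states lookup in place of the fixed two-character substr.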
 