splitting to hash/substr function 1

mikevh · May 8, 2001

I'm reading a tab-delimited file with column headings and
writing out to a fixed-length text file. The code I have
works, but I'm wondering if there isn't a better way to do
some of this.

I want to get each line of input into a hash with the tab-
delimited data fields as values and the column headings as
keys. This makes the code much more self-documenting. The
only way I've been able to do this is to split the column
headings to an array and split each data line to an array,
then use a loop to build the hash from the 2 arrays. Is
there a better way to do this? Like somehow split the
data lines so that the fields become the hash values without
using a loop?

Also, I'm using the substr function to put the data where
it's supposed to be in the fixed-length file. Unless I make
the output record much longer than it really needs to be, I
get the error message "Substring outside of string" when I
run the program. However, it doesn't appear to me that I'm
actually trying to put anything in positions beyond the end
of the string I've defined. Anyone know why this happens and how to get around it without making the output records very long?

Code excerpt follows:

Code:

use constant OUTLEN => 900;

while (chomp($_= <>))
{
    if ($. == 1) {
        @headers = split(/\t/);  #get col. headers       
        next;
    }
    @F = split(/\t/);  #split input data to array

    # Create hash %H with array @F as values and array 
    # @headers as keys.
    for $i (0..$#F) {
        $H{$headers[$i]} = $F[$i];
    }

    $out = " " x OUTLEN . "\n";  #initialize output record
    
    # Put values from hash %H into fixed positions in output
    # file.
    substr($out,0,10)   = $H{PHONE};
    substr($out, 50,5)  = $H{SEQ_ID};
    substr($out,55,15)  = $H{ID};
    substr($out,70,20)  = $H{FNAME};
    substr($out,90,20)  = $H{LNAME};
    
    # etc ...

    # Write output record
    print $out
}

Any comments/suggestions much appreciated.  Thanks.

stillflame · May 8, 2001

instead of substr, you could use the sprintf function ('perldoc -f sprintf'). something along the lines of:

Code:

$out = sprintf("%10s".(" " x 40)."%5s%15s%20s%20s\n", $H{PHONE}, $H{...});

might work out for you. the initial string construction may be a little trickier to get just right, but it'll work out well in the end. i'm actually unsure myself of why you needed to set the string length to 900 using substr...

as for making the hash, you'll have to loop over every header item to do it, but there are ways of doing it that look better. first, the 'for' loop can be turned into a 'foreach' loop that shifts elements off @F. note that this will alter @F, so make a copy if you need to use it later. (you didn't use it later in the code, which is why i suggested this method, but there's reference to more code than is listed, so i thought i'd mention it.) then, the 'foreach' loop can be turned into a 'map' expression (which is, really, just an iterator itself):

Code:

%H = map {$_ => shift(@F)} @headers;

or even better, you could make a new datatype which is accessed like a hash but you read it in as an array, having all the work done behind the scenes, but then again, that's a little bit too OO for something like this.

or is it?... *shameless love of ruby* *:->* "If you think you're too small to make a difference, try spending a night in a closed tent with a mosquito."
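
A hash slice is another way to fill the hash without writing a loop, and it leaves @F untouched. The following is only a sketch: the header and data lines are made-up sample input, not taken from the original file.

```perl
use strict;
use warnings;

# Hypothetical sample input: one header line and one data line
my @headers = split /\t/, "PHONE\tSEQ_ID\tID";
my @F       = split /\t/, "5551234567\t42\tA1";

my %H;
@H{@headers} = @F;   # hash slice: keys come from @headers, values from @F, pairwise

print "$_ = $H{$_}\n" for @headers;
```

If a data line has fewer fields than there are headers, the leftover keys end up with undef values, so a real program may want to check the field count first.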

tsdragon · May 9, 2001

For what you are doing above, you really don't need the headers at all! Since you know the order the fields are coming in, just split them into an array and use them directly from the array, i.e.:

Code:

substr($out,0,10)   = $F[0];
substr($out, 50,5)  = $F[1];
...

WARNING: Using substr as an lvalue like you are doing can cause you problems. Unless the strings you are inserting into $out are exactly the same length as the substrings you are replacing, $out will grow or shrink to accommodate the size of the new string. You need to make sure the strings you are inserting are the same size as the substrings you are replacing!
Tracy Dryden
tracy@bydisn.com

Meddle not in the affairs of dragons,
For you are crunchy, and good with mustard.
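
The shrinking described in this warning is also a plausible cause of the "outside of string" error from the original question: each short value makes the record shorter, until a later offset lies past the end, and an lvalue substr that falls entirely outside the string raises an exception. A small sketch with made-up values and a 20-character record:

```perl
use strict;
use warnings;

my $out = " " x 20;             # 20-character blank record
substr($out, 0, 10) = "555";    # 3 characters replace 10, so the record shrinks by 7
my $len_after = length $out;

# Offset 15 now lies beyond the end of the shortened string, so this assignment dies
my $ok  = eval { substr($out, 15, 5) = "abc"; 1 };
my $err = $ok ? '' : $@;

print "length after first assignment: $len_after\n";
print "second assignment failed: $err" unless $ok;
```

Making the record much longer than needed only hides the problem: the assignments stop dying, but the fields still land in the wrong columns.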

mikevh · May 9, 2001

Thanks, stillflame. I especially like the

Code:

%H = map {$_ => shift(@F)} @headers;

bit. That's a lot cleaner-looking than what I was doing
with the for loop. @F is never used again, so I don't
care about removing all the elements. (That's one of the
reasons I hoped I wouldn't have to explicitly split to it.)

As for your sprintf suggestion, I don't know about that.
There are about 40 fields. (That's what's happening in the
"etc ..." area.) And it's somewhat less readable.

mikevh · May 9, 2001

tsdragon - I know I don't need the hash. The reason for
using it is to make the code more self-documenting, so when
I or someone else look(s?) at it a while down the road, it'll
be apparent that the phone number is going into the first 10
columns of the output and so on. The other way, all you can
see is that the first array element is going into the first
10 columns; you can't tell what's in that element. (And
given say, a week, I know I'll have no clue without researching it, which of course I'd rather not do.)

Thanks for the tip about using substr as an lvalue. According to what you say, if I were to say

Code:

substr($out,0,10) = sprintf("%-10s",$H{PHONE});
substr($out, 50,5)= sprintf("%-5s",$H{SEQ_ID});
substr($out,55,15)= sprintf("%-15s",$H{ID});

and so on, then the string $out wouldn't shrink. Is that
correct? (Haven't had a chance to try it yet.)
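
A quick way to check this is to compare record lengths before and after each assignment. In the sketch below (sample values are invented), "%-10s" pads a short value so the record keeps its length, but a value longer than the field is not cut off, so the record grows. Adding a precision, as in "%-5.5s", also truncates.

```perl
use strict;
use warnings;

my $rec = " " x 30;
substr($rec, 0, 10) = sprintf("%-10s", "5551234");        # short value is padded to 10
my $len_padded = length $rec;                             # record length unchanged

substr($rec, 10, 5) = sprintf("%-5s", "TOOLONGVALUE");    # long value is NOT truncated
my $len_grown = length $rec;                              # record grew by 7

my $rec2 = " " x 30;
substr($rec2, 10, 5) = sprintf("%-5.5s", "TOOLONGVALUE"); # precision cuts it to 5 chars
my $len_fixed = length $rec2;                             # record length unchanged

print "padded: $len_padded, grown: $len_grown, fixed: $len_fixed\n";
```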

tsdragon · May 9, 2001

I'm not completely sure about the sprintf format, but it looks about right.

Another alternative to using the hash is to define variables for each of the subscripts of the array @F. Then you could use $F[$PHONE] and $F[$SEQ_ID]. Still pretty self-documenting, but avoids the hassle with the headers and the hash.

mikevh · May 9, 2001

Yes, I could assign numeric values to separate variables,
e.g.

Code:

($PHONE,$SEQ_ID, (etc, etc)) = (0, 1, (etc, etc));

but that would be about 40 variables; also it breaks if the
order of the columns in the input file changes. With the
hash implementation, the order of the columns doesn't
matter as long as the columns have the correct headings
and the headings remain exactly the same. This is another
very compelling reason for using a hash here.

stillflame · May 10, 2001

i thought of something else that might help you. instead of using your predefined $out string of spaces, you could do it with a concatenation sequence, one for each sprintf you need to do. something like this:

Code:

$out = '';

$out .= sprintf("%-10s", $var1);
$out .= sprintf("%-5s", $var2);
# etc ...

also, for your sprintf format string, to get a string of exactly a certain length, i'd suggest the following:

Code:

sprintf("%10.10s", $var)

that makes a string with a minimum of 10 characters and a maximum of 10 characters (or, exactly 10). the negative sign will only determine which side it'll be justified to, if it's too short and spaces are added for filler (left for negative, right for positive).
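
With about 40 fields, one way to combine the ideas in this thread is to keep the layout in a single table and let a loop do the padding and truncating. This is a sketch under assumptions: the @layout table, the build_record helper, the 110-character record length and the sample row are all hypothetical, and only the five offsets and widths shown in the original excerpt are used.

```perl
use strict;
use warnings;

use constant OUTLEN => 110;   # hypothetical record length for this sketch

# Hypothetical layout table: [ column heading, offset, width ]
my @layout = (
    [ PHONE  =>  0, 10 ],
    [ SEQ_ID => 50,  5 ],
    [ ID     => 55, 15 ],
    [ FNAME  => 70, 20 ],
    [ LNAME  => 90, 20 ],
);

# Hypothetical helper: takes a hash ref (heading => value), returns one output line
sub build_record {
    my ($row) = @_;
    my $out = " " x OUTLEN;
    for my $field (@layout) {
        my ($name, $offset, $width) = @$field;
        my $value = defined $row->{$name} ? $row->{$name} : '';
        # Pad or truncate to exactly $width so $out never changes length
        substr($out, $offset, $width) = sprintf("%-${width}.${width}s", $value);
    }
    return $out . "\n";
}

# Made-up sample row
my %H = (PHONE => '5551234567', SEQ_ID => '42', ID => 'A1',
         FNAME => 'Ada', LNAME => 'Lovelace');
my $record = build_record(\%H);
print $record;
```

Because every replacement string is exactly as wide as the span it replaces, the record stays at OUTLEN characters plus the newline, and the field names remain visible next to their positions.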
