Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chriss Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

manipulating csv file with perl 2

Status
Not open for further replies.

loryluv

Technical User
May 4, 2005
7
CA
Hi all,

I'm a newbie in using Perl. I've got a csv file that is comma delimited. (Basically, there are 10 fields separated by commas on each line. Each field is a string field.) I need to:
1) get rid of the first line of the input file, the 2nd and 5th field for all the records
2) for the 3rd and 8th field, I need to take a substring of that field. The substring will start after the word TAX: and end before the first occurence of either a asterisk or then end of the field (which will be a comma)

I searched on this forum and have started the code....but couldn't get very far with it. I think I'll get a problem later on since I have gotten rid of the comma delimiters, then later for the output I would have to concatenate the fields again with a comma. Any ideas? I hope you can help me. Thanks in advance!

Example:

input file
-----------
Heading 1,2,3,4,5,6,7,8,9,10
aaa,bbbb,ccTAX:code*cccc,dddd,eee,fff,ggggg,hhhhTAX:code2,iii,jjj
Aaaa,Bbb,CcccTAX:code3,Dd,Eeee,Fffff,Gggg,hTAX:code4*hh,Iiii,Jjjjjjj

Output file
-------------
aaa,code,dddd,fff,ggggg,code2,iii,jjj
Aaaa,code3,Dd,Fffff,Gggg,code4,Iiii,Jjjjjjj


Code:
#!/usr/local/bin/perl

open (intax, "taxcoding.csv");
open (outtax, ">taxout.csv");
$_ = <intax>;
$input = <intax>;
while($input){
	@fields = parse_csv($input);
	$fields[$i]
	$_ = $fields[1];
        $_ = $fields[4];
        $input = <intax>;
}
close (intax);
close (outtax); 


sub parse_csv {
    my $text = shift; 
    my @new  = ();
    push(@new, $+) while $text =~ m{
       "([^\"\\]*(?:\\.[^\"\\]*)*)",?
           |  ([^,]+),?
           | ,
       }gx;
       push(@new, undef) if substr($text, -1,1) eq ',';
       return @new;
}
 
Code:
#!perl
use strict;
use warnings;
use Text::CSV_XS;

open (INTAX, "taxcoding.csv") || 
    die qq(Can't open "taxcoding.csv" for input!\n);
open (OUTTAX, ">taxout.csv") ||  
    die qq(Can't open "taxout.csv" for ouput!\n);

my $csv = Text::CSV_XS->new();

while (defined(my $line = <INTAX>)) {
    next if $. == 1;
    chomp $line;
    if (my $status = $csv->parse($line)) {
        my @columns = $csv->fields;
        for (2,7) {
            $columns[$_] =~ s/^.*://;
            $columns[$_] =~ s/\*.*$//;
        }
        print OUTTAX join(",", @columns[0,2,3,5..$#columns]), "\n";
    } else {
        warn qq(Error line $.: $csv->error_input()\n);
    }
}
close (INTAX) || die qq(Can't close "taxcoding.csv" after read!\n);
close (OUTTAX) ||  die qq(Can't close "taxout.csv" after write!\n);
Output with input as posted above:
Code:
aaa,code,dddd,fff,ggggg,code2,iii,jjj
Aaaa,code3,Dd,Fffff,Gggg,code4,Iiii,Jjjjjjj



 
here is a way without using the Text::CSV_XS module but that way would be better if you have the module:

Code:
#!/usr/local/bin/perl
use strict;

open (INTAX, "taxcoding.csv") or die "$!";
open (OUTTAX, ">taxout.csv") or die "$!";
while (<INTAX>){
   next if $. == 1; #skips first line of file
   my @temp = split(/,/);
   @temp = ($temp[0],parse_csv($temp[2]),$temp[3],$temp[5],$temp[6],parse_csv($temp[7]),$temp[8],$temp[9]);
   print OUTTAX join(',',@temp);
}
close(INTAX);
close(OUTTAX);

sub parse_csv {
   $_ = shift;
   my $pos = index($_,'*');
   $_ = substr($_,0,$pos) if $pos > -1; 
   /TAX:(.*)/;
   return($1);
}
 
wow....thanks a lot!

just another question:

How can I make the output file name to be in the following format?

taxout050505.csv where 050505 is date that the csv file is created in the format ddmmyy

Thanks in advance! and Thanks again for all the help

-Lory
 
you could create the date like this:

Code:
use POSIX qw(strftime);

(my $c_date = strftime("%x", localtime(time))) =~ tr/\///d;
my $filename = "taxout$c_date.csv";
open (OUTTAX, ">$filename") or die "$!";
....

the date is mmddyy though instead of ddmmyy but that could be fixed if it's really necessary.
 
This'll give it to you as ddmmyy.
Code:
my ($mday, $mon, $year) = (localtime(time))[3..5];
my $cdate = sprintf "%02d" x 3, $mday, $mon+1, $year%100;
my $filename = "taxout${cdate}.csv";
open (OUTTAX, ">$filename") or die "$!";



 
Thanks to both KevinADC and mikevh!!

However, I ran into a big problem. Some of the fields contain comma and I ran both of your quote and obviously with the split function it splits up the field when it sees the comma and with the CSV_XS module it does the same thing.

Please help!

Thanks again for all the help so far! I really appreciate it.
 
Fields in a .csv file that contain commas are supposed to be surrounded in double quotes. Text::CSV_XS should handle this correctly. So I'm guessing your file is not a standard .csv file.

Can you post a representative sample of your file, surrounded in [ignore]
Code:
[/ignore] tags? If you don't know about [ignore]
Code:
[/ignore] tags, please click on Process TGML below the box where you type your posts for an explanation.

 
Thanks for replying mikevh!

Here's what it looks like....

Code:
Full Name[b],[/b]Initials[b],[/b]Tax Code[b],[/b]Client ID,Email[b],[/b]Company[b],[/b]Address[b],[/b]Contract Code[b],[/b]Phone Number[b],[/b]Secondary Phone
John Smith[b],[/b]J[b],[/b]g50=:x700OHTAX:00029%@JohnSmith;y=80[b],[/b]546728[b],[/b]John.smith@company.com[b],[/b]Company[b],[/b]50 Ford Drive, Toronto[b],[/b]x:=y893700OHTAX:01647[b],[/b](416)245-5015[b],[/b](416)245-2022
Carol Harrison[b],[/b]CG[b],[/b]hs=8:@Cy49haxTAX:495[b],[/b]5542[b],[/b]Carolharrison@comp.com[b],[/b]comp name[b],[/b]324 Ranson St., Toronto, N8G9O5[b],[/b]9sd@=Carol:TAX:228%lof[b],[/b](416)547-6211[b],[/b](416)347-6554

output csv file be:
Code:
John Smith[b],[/b]00029[b],[/b]546728[b],[/b]Company[b],[/b]50 Ford Drive, Toronto[b],[/b]01647[b],[/b](416)245-5015[b],[/b](416)245-2022
Carol Harrison[b],[/b]495[b],[/b]5542[b],[/b]comp name[b],[/b]324 Ranson St., Toronto, N8G9O5[b],[/b]228[b],[/b](416)547-6211[b],[/b](416)347-6554

Thanks again for all the help.
-Lory
 
If you'd supplied more sample data, I could have tested this more thoroughly but it appears to work fine with what you did supply.

Code:
#!/usr/bin/perl -w
use strict;
my $header = <DATA>;
while(<DATA>) {
  chomp;
  my @fields = split /,/, $_, -1;
  $fields[2] =~ s/^.*?TAX:(\d+).*$/$1/;
  $fields[$#fields-2] =~ s/^.*?TAX:(\d+).*$/$1/;
  print join(",", @fields[0,2,3,5 .. $#fields]), "\n";
}
__DATA__
Full Name,Initials,Tax Code,Client ID,Email,Company,Address,Contract Code,Phone Number,Secondary Phone
John Smith,J,g50=:x700OHTAX:00029%@JohnSmith;y=80,546728,John.smith@company.com,Company,50 Ford Drive, Toronto,x:=y893700OHTAX:01647,(416)245-5015,(416)245-2022Carol Harrison,CG,hs=8:@Cy49haxTAX:495,5542,Carolharrison@comp.com,comp name,324 Ranson St., Toronto, N8G9O5,9sd@=Carol:TAX:228%lof,(416)547-6211,(416)347-6554

Trojan


 
The data you posted does not match your original criteria:

2) for the 3rd and 8th field, I need to take a substring of that field. The substring will start after the word TAX: and end before the first occurence of either a asterisk or then end of the field (which will be a comma)

 
Nice input file. The first line contains 9 commas, the second 10, and the third 11. The problem is that the Address field contains embedded commas, and this field is not quoted.

As KevinADC notes, the output you posted does not match your original specifications.

The commas in the Address field are all followed by whitespace, while the true "field separator" commas are not. If we can count on this, then the following works:
Code:
#!perl
use strict;
use warnings;

my ($mday, $mon, $year) = (localtime(time))[3..5];
my $cdate = sprintf "%02d" x 3, $mday, $mon+1, $year%100;

my $infile = "taxcoding.csv";
my $outfile = "taxout${cdate}.csv";

open (INTAX, $infile) || die qq(Can't open "$infile" for input!\n);
open (OUTTAX, ">", $outfile) ||  die qq(Can't open "$outfile" for output!\n);

while (defined(my $line = <INTAX>)) {
    next if $. == 1;
    chomp $line;

    [b]# split on comma not followed by whitespace
    my @columns = split(/,(?!\s+)/, $line);

    # quote columns containing commas
    @columns = quotecol(@columns);[/b]
    for (2,7) {
        $columns[$_] =~ s/^.*://;
        [b]$columns[$_] =~ s/%.*$//;[/b]
    }
    print OUTTAX join(",", @columns [0,2,3,5..$#columns]), "\n";
}

sub quotecol {
    # quote columns containing commas, if not already quoted
    my @columns = @_;
    for (@columns) {
        if (/,/) {
            unless (/^"[^"]*"$/) {
                $_ = qq("$_");
            }
        }
    }
    return @columns;
}
The quotecol sub quotes the Address field, making your output file more of a proper .csv file. If you don't want this field quoted, then just take that out.

Output
Code:
John Smith,00029,546728,Company,"50 Ford Drive, Toronto",01647,(416)245-5015,(416)245-2022
Carol Harrison,495,5542,comp name,"324 Ranson St., Toronto, N8G9O5",228,(416)547-6211,(416)347-6554




 
I'm sorry I meant to say % in my first specifications. But Thanks!! It works fine! Thanks a lot to all that replied! I really appreciated it. Thanks again.

Just wondering, does Perl support Unicode? I've got a couple records that didn't show up at all in the output file. In the input file, the person's name is in french so it had a accented "e" and the whole record just got ignored in the output file. I've tried to remove that "e" character from the name and the record shows up fine in the output.....any ideas??

Thanks again! Hope you all have a great day!
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top