
Extract a column from tab delimited text files and merge

Status
Not open for further replies.

atang84

Programmer
Nov 12, 2008
1
AU
I have to extract a column from 177 huge tab-delimited text files (more than 2MB each) and merge the extracted columns side by side, not appended one after another.

Say I've got:
file1
..
..
file177

Each file has
file1:
A B C D E ...
32.3 34
52.1 352.1
53.65 .
. .
. .
.many rows


file2:
A B C D E ...
52.7 5.1
4.1 12.3
36.3 .
. .
. .
.many rows (20000+ rows)


In each file I want to extract only 1 column say column B.

And the output file is to look like this:

B(file1) B (file2) B(file3) ... B(file177)
34 5.1 .
352.1 12.3 .
. . .
. .
.
(20000+ rows)



The code I've got here merges all the columns into a single row, which I don't want:

#!/usr/bin/perl -w

## path to all files in a directory
@files = <*.txt>;
## output filename in same directory
open (OUTFILE, ">result.txt");

##loop for each file in the array
foreach $file (@files)
{

## read line from each file
open (MYLINE, $file);

## loop each line
while (<MYLINE>)
{

chomp;

## split into columns on \t
@col = split(/\t/, $_);
print OUTFILE "$col[1]\t";

}
close (MYLINE);
}
close (OUTFILE);


Is there a way to solve this?

Thanks



 
I've been trying to do this one for over an hour now! It's easy enough to get the columns side by side using a hash (key = line number, values = the data items from each file), though this might not be the best method. However, the issue I'm having is working out how many tab stops are needed to keep the output file readable.
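A minimal sketch of that line-number-keyed hash idea, using invented in-memory sample data in place of the real 177 files (each array below stands in for one file's lines):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data: each inner array stands in for one input file.
my @files = (
    ["A\tB", "32.3\t34",  "52.1\t352.1"],
    ["A\tB", "52.7\t5.1", "4.1\t12.3"],
);

my %rows;    # key = line number, value = list of column-B cells so far
for my $lines (@files) {
    my $line_no = 0;
    for my $line (@$lines) {
        my @col = split /\t/, $line;
        # Pad with an empty cell when column B is missing on this line.
        push @{ $rows{ $line_no++ } }, defined $col[1] ? $col[1] : '';
    }
}

# Emit rows in line-number order, cells joined by single tabs.
my @out;
for my $n (sort { $a <=> $b } keys %rows) {
    push @out, join "\t", @{ $rows{$n} };
}
print "$_\n" for @out;
```

This sidesteps the tab-counting problem by joining each row's cells with a single tab at output time rather than tracking spacing per file.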

Chris
 
Hi

That looks fundamentally wrong. I would do this:
Code:
#!/usr/bin/perl

open OUT, ">result.txt";

for ($i = 1; $i <= 177; $i++) {
  open $fil[$i], "<file$i.txt";
}

$ok = 1;
while ($ok) {
  for ($i = 1; $i <= 177; $i++) {
    $fil = $fil[$i];
    if (defined($str = <$fil>)) {
      chomp $str;
    } else {
      $str = "";
      $ok = 0;
    }
    print OUT $str, $i < 177 ? "\t" : "\n";
  }
}

for ($i = 1; $i <= 177; $i++) {
  close $fil[$i];
}

close OUT;
( Actually I would do [tt]paste file*.txt > result.txt[/tt] , but that is off-topic. )
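The off-topic [tt]paste[/tt] suggestion can be combined with [tt]cut[/tt] to grab only column B first. A sketch, assuming a POSIX shell and input files matching [tt]file*.txt[/tt] in the current directory:

```shell
# Sketch (not from the thread): extract field 2 (column B) of each input
# file with cut, then paste the extracted columns side by side.
merge_column_b() {
    for f in file*.txt; do
        cut -f2 "$f" > "$f.col"
    done
    paste file*.txt.col > result.txt
    rm -f file*.txt.col
}
```

The temporary [tt].col[/tt] files avoid holding 177 file handles open at once; [tt]paste[/tt] handles the side-by-side merge and the tabs.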


Feherke.
 
Hi

Oops. Sorry, I used my own (one-column :-( ) test file and forgot you want to paste just a specific column. But since you are familiar with [tt]split[/tt], I am sure you will be able to correct it in no time.

( Also ignore my final comment in parentheses. :-( )

Feherke.
 
Okay, I have finished a working script. It probably needs further testing, but as far as I know it does what is required, EXCEPT that it doesn't chomp the newline character off the final column at the end of each line (I'm unsure why this won't work; I'm sure you can fix it, but until then the final column processes unexpectedly). Note that the $max_rows variable MUST be set, so that non-existent data on lines of shorter files is written out as tab spaces (otherwise the structure/readability of the output file is poor). Also note that this method is possibly inefficient (especially if your input files are large); I will try to improve on the script, and I'm sure others may have other suggestions (such as feherke above).

Code:
#! /usr/bin/perl
use strict;

my $dirpath = 'Path/To/Directory/Containing/Input/Files'; #Path to directory containing input files
my $outputpath = 'Path/To/Output/File.txt'; #Path to output file
my $column_to_output = 1; #0=ColumnA, 1=ColumnB, 2=ColumnC etc
my $max_rows = 20000; #Maximum number of rows to read (pads shorter files so columns stay aligned)
my (%data_hash);
#Open directory containing all files and put into @files
opendir(DIR,"$dirpath") || die "Cannot Open Dir: $!";
my @files = grep {!/^\.{1,2}$/} readdir(DIR);
closedir(DIR);
#Foreach @files
foreach my $file (@files){
	#Open the file
	open (INPUT, "<$dirpath/$file") || die "Cannot Open Input File: $!";
	my @lines = <INPUT>;
	close (INPUT);
	# Headings into hash
	#If file hash key exists, append column/filename
	if(exists $data_hash{'0'}){
		$data_hash{'0'} = $data_hash{'0'} . "\t" . "$column_to_output($file)";
	}
	#Else create file hash key and append column/filename
	else{
		$data_hash{'0'} = "$column_to_output($file)";
	}
	#Line number counter
	my $line_no = 0;	
	while($line_no < $max_rows){
		my $line = $lines[$line_no];
		#*****chomp($line);		
		$line_no++;
		#Split line by each column
		my @column = split(/\t/,$line);		
		my $the_column = $column[$column_to_output];
		if($the_column eq ""){
			$the_column = "\t";
		}
		# Lines into hash
		#If hash key exists, append new value
		if(exists $data_hash{$line_no}){		
			$data_hash{$line_no} = $data_hash{$line_no} . "\t" . $the_column;
		}	
		#Else create hash key and append first value
		else{
			$data_hash{$line_no} = $the_column;
		}
	}
}
#Open output file and append hash values
open (OUTPUT, ">>$outputpath") || die "Cannot Open Output File: $!";
foreach (sort keys %data_hash){
	print OUTPUT "$data_hash{$_}\n";
}
close (OUTPUT);
#Script has finished processing
print "Process Complete";
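The commented-out chomp above warns when $line_no runs past the end of a shorter file, because $lines[$line_no] is then undef. A minimal sketch of a defined() guard that lets the newline be stripped safely (sample data invented):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two sample lines standing in for one input file's contents.
my @lines = ("32.3\t34\n", "52.1\t352.1\n");

my @cells;
for my $line_no (0 .. 3) {             # deliberately runs past the end of @lines
    my $line = $lines[$line_no];
    $line = '' unless defined $line;   # guard: chomp(undef) would warn
    chomp $line;
    my @column = split /\t/, $line;
    push @cells, defined $column[1] ? $column[1] : '';
}
print join("|", @cells), "\n";         # prints "34|352.1||"
```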

Chris
 
I can see two methods:
1) open all 177 input files at once and process them line by line
2) treat [tt]result.txt[/tt] as a fixed-record-length file, writing each new line with a length sufficient to hold 177 data items.
I think the first one is better, and is certainly simpler. Here it goes (Zhris, if you want neatly aligned columns, use [tt]printf[/tt]):
Code:
#!/usr/bin/perl -w
my @handles;
unlink "result.txt";   # would be picked up by the glob below if already present
for (<*.txt>) {
  open($handles[@handles], $_);
}
open(OUTFILE, ">result.txt");
my $atleastone = 1;
while ($atleastone) {
  $atleastone = 0;
  for my $op (@handles) {
    if ($_ = readline($op)) {
      my @col = split;
      $col[1] += 0;   # otherwise you print nothing but a \t if column 2 is undef
      print OUTFILE "$col[1]\t";
      $atleastone = 1;
    } else {
      print OUTFILE "0\t";
    }
  }
  print OUTFILE "\n";
}
undef @handles;   # closes all files
close(OUTFILE);
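The [tt]printf[/tt] alignment mentioned above might look like this; a sketch, with the 10-character column width an arbitrary choice and the row values invented:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pad each cell to a fixed width (10 characters, left-justified) so the
# output columns line up visually instead of drifting with tab stops.
my @row = (34, 5.1, 352.1);
my $formatted = join '', map { sprintf "%-10s", $_ } @row;
print "$formatted\n";
```

A trade-off of fixed widths: any value longer than the chosen width breaks the alignment, so the width should exceed the longest expected cell.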

Franco
 
I now also see prex1's method 1 as being the better option, although as feherke mentions, it could exhaust the available number of open file handles.
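One way around the handle limit, sketched below (not from the thread): merge files incrementally, two column sets at a time, so only a couple of handles are ever open. The merge_pair() helper is a hypothetical stand-in that pastes two arrays of lines side by side, padding the shorter one:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: paste two arrays of lines side by side, tab-separated,
# padding the shorter array with empty cells.
sub merge_pair {
    my ($left, $right) = @_;
    my @out;
    my $max = @$left > @$right ? scalar @$left : scalar @$right;
    for my $i (0 .. $max - 1) {
        my $l = defined $left->[$i]  ? $left->[$i]  : '';
        my $r = defined $right->[$i] ? $right->[$i] : '';
        push @out, "$l\t$r";
    }
    return \@out;
}
```

Folding all 177 column sets through merge_pair() one at a time keeps the open-handle count constant, at the cost of re-reading the accumulated result on each pass.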
 