
split text to files 2

Status
Not open for further replies.

3inen

Technical User
May 26, 2005
51
US
Hi!
I have a text file of FASTA sequences that I am trying to split. The sequences come from 4 sources (> 1:, > 2:, > 3:, and > 4:) and are separated into blocks by #. I want to write each block to its own file (file 1, file 2, and so on), but only if that block has sequences from all four sources.

I tried a piece of code, but it is far from working. Could you guys sort it out?

Thanks

existing code

#!/usr/bin/perl

open(DATA, "split.txt") or die "Cannot open split.txt: $!\n";

while (<DATA>) {

$line = $_;
chomp($line);
$lineNum = 1;

if ($line =~ /^>/) {
$Name = $line;
$Name =~ s/^\s{1,}>//;
$Name =~ s/^>\s{1,}//;
$Name =~ s/>//;
$Name =~ s/\s.*$//;
$Name = "$Name" . "\.seq";

if ($numSeqs > 0) {
close(OUT_FILE);
}

open(OUT_FILE, ">$Name");
print OUT_FILE ">$Name\n";

}
else {
#$entries = "$entries" . "$line";
print OUT_FILE "$line\n";
} #end else

++$numSeqs;

} #end while

#print OUT_FILE "$entries\n";
close (OUT_FILE);
close (DATA);

print "entries = $numSeqs\n";

#end Main



Input data

#
> 1:333078-333779
GAATATCCCCATGATCTTTCCCTCAATCGCCCGCTGATAAGTGGGAAGACATCG
GTCGCGCCACACTCGATACCCTGCTCATGGTGGCGCTTGGTCTTCCCTTGGGAAT
> 2:659628-660329
GAATATCCCCATGATCTTCCCCTCAATCGACCTGGACGCTGATAAGTGGAAAGACATCG
GTCGCGCCACACTCGATACCCTGCTCACCCGCATTGGCGCTTGGTCTTCCCTTGGGAAT
> 3:682458-683159
GAATATCCCCATGATCTTCCCCTCAAACCTGGACGTTGATAAGTGGAAAGACATCG
> 4:1630596-1631297
GAATATCCCCATGATCTTTCCCTCAATCGACGCTGATAAGTGGGAAGACAT
GTCGCGCCACACTCGATACCCTGCTCGGCGCTTGGTCTTCCCTTGGGAATC
#
> 1:334683-335218
GGTTGGCGGTGCCGCCCTCGTGCAACCAATCAAGTTTGGTGGCGATGTTG
CACCAACGCTTAGTGTCACCTACTACATCACTAAAAAGTTGAGTTAT
> 2:661233-661768
GGTTGGCGGCGCCGCCCTCGTGCAACCAACAAGTTTGGTGGCGATGTTG
CACCAACGCTAAGTGTAACCTACTACATCAAGGGGCATTACTAAAAAGTT
> 3:681133-681667
AAAATGCAGCACAGAATACTGTCAAGTTTGGTGGCGATGTTG
> 4:1632207-1632742
GGTTGGCGGCGCTGCCCTCGTGCAACCAAAGAAAAGTTTGGTGGCGATGTTG
CACCAACGCTTAGTGTCACCTACTACATCAACGGGTATTACTAAAAAGTTGAG
#
> 1:335667-335823
AATGACCGAAATCAAGGAAGCTTTTGTCCCCCCCAGTGATTGAAGTGCTAGTCG
TTGGCGATACCGTCTCCAAGGGCCAAAGTTTCAACCATGGAAGTACCTTCGTCA
> 2:1731369-1731525
AATGACCGAAATCAAGGAAGCTTTTGTCCCCCCCAGTGATTGAAGTGCTAG
TTGGCGATACCGTCTCCAAGGGCCAAACAACCATGGAAGTACCCTCGTCA
> 3:679065-679221
AATGACCGAAATCAAGGAAGCTTTTGTCGTCCCAGTGATTGAAGTGCTAGTC
TTGGCGATACCGTCTCCAAGGGCCAAAGCAACCATGGAAGTACCCTCGTCA
#



desired output
file 1
> 1:333078-333779
GAATATCCCCATGATCTTTCCCTCAATCGCCCGCTGATAAGTGGGAAGACATCG
GTCGCGCCACACTCGATACCCTGCTCATGGTGGCGCTTGGTCTTCCCTTGGGAAT
> 2:659628-660329
GAATATCCCCATGATCTTCCCCTCAATCGACCTGGACGCTGATAAGTGGAAAGACATCG
GTCGCGCCACACTCGATACCCTGCTCACCCGCATTGGCGCTTGGTCTTCCCTTGGGAAT
> 3:682458-683159
GAATATCCCCATGATCTTCCCCTCAAACCTGGACGTTGATAAGTGGAAAGACATCG
> 4:1630596-1631297
GAATATCCCCATGATCTTTCCCTCAATCGACGCTGATAAGTGGGAAGACAT
GTCGCGCCACACTCGATACCCTGCTCGGCGCTTGGTCTTCCCTTGGGAATC


file 2
> 1:334683-335218
GGTTGGCGGTGCCGCCCTCGTGCAACCAATCAAGTTTGGTGGCGATGTTG
CACCAACGCTTAGTGTCACCTACTACATCACTAAAAAGTTGAGTTAT
> 2:661233-661768
GGTTGGCGGCGCCGCCCTCGTGCAACCAACAAGTTTGGTGGCGATGTTG
CACCAACGCTAAGTGTAACCTACTACATCAAGGGGCATTACTAAAAAGTT
> 3:681133-681667
AAAATGCAGCACAGAATACTGTCAAGTTTGGTGGCGATGTTG
> 4:1632207-1632742
GGTTGGCGGCGCTGCCCTCGTGCAACCAAAGAAAAGTTTGGTGGCGATGTTG
CACCAACGCTTAGTGTCACCTACTACATCAACGGGTATTACTAAAAAGTTGAG

 
Mmmmm, Genetics! Yum. :)
Check this out, see if it does what you want:

Code:
#!/usr/bin/perl -w
use strict;
use diagnostics;
my $all="";       # source numbers seen in the current block, in order
my $output="";    # text of the current block
my $filecount=0;
while(<>) {
  $all .= $1 if(/^>\s*(\d):/);
  if(/^#/) {
    if($all eq "1234") {
      $filecount++;
      open FH, ">file$filecount" or die "Failed to create file$filecount: $!";
      print FH $output;
      close FH;
    }
    # reset at every separator so an incomplete block is not carried over
    $output="";
    $all="";
  } else {
    $output .= $_;
  }
}
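
A possible variation on the script above, as a rough sketch only: it records the sources seen in each block in a hash, so a block counts as complete whatever order its headers come in, and it also picks up a final block that has no closing #. The sub name complete_blocks, the default source list 1..4 and the fileN naming are assumptions for illustration, not something from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split the text into blocks on lines starting with '#' and return only the
# blocks that contain a header for every required source.
# Assumption: headers look like "> 1:333078-333779" (source number, then ':').
sub complete_blocks {
    my ($text, @wanted) = @_;
    @wanted = (1 .. 4) unless @wanted;          # assumed default: four sources
    my @kept;
    for my $block (split /^#.*\n?/m, $text) {
        next unless $block =~ /\S/;             # skip empty chunks
        # collect every source number that appears in a header line
        my %seen = map { $_ => 1 } $block =~ /^>\s*(\d+):/mg;
        push @kept, $block unless grep { !$seen{$_} } @wanted;
    }
    return @kept;
}

# Command-line use (only runs when a file name is given):
#   perl split_blocks.pl split.txt
if (@ARGV) {
    local $/;                                   # slurp the whole input
    my $text = <>;
    my $n = 0;
    for my $block (complete_blocks($text)) {
        $n++;
        open my $fh, '>', "file$n" or die "Cannot create file$n: $!";
        print $fh $block;
        close $fh;
    }
    print "wrote $n file(s)\n";
}
```

Slurping the whole file keeps the code short, which should be fine for inputs of modest size; for very large files the line-by-line loop above is the safer choice.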


I've always been interested in genetic data manipulation, so any info on what this is all about would be very interesting.


Trojan.

 
Just what I wanted, thanks a lot.

I am trying to find the minute differences between these sources, called single nucleotide polymorphisms (SNPs); these are very promising targets for various diagnostic and developmental tools.

Look at the output from this file. The alignment is a little goofed up here, but it carries the message:

1_333078-333779 GAATATCCCCATGATCTTTCCCTCAATCGCCC-----GCTGATAAGTGGGAAGACATCGG
4_1630596-1631297 GAATATCCCCATGATCTTTCCCTCAATCGAC------GCTGATAAGTGGGAAGACAT--G
2_659628-660329 GAATATCCCCATGATCTTCCCCTCAATCGACCTGGACGCTGATAAGTGGAAAGACATCGG
3_682458-683159 GAATATCCCCATGATCTTCCCCTCAAACCTGG---ACGTTGATAAGTGGAAAGACATCG-


1_333078-333779 TCGCGCCACACTCGATACCCTGCTCAT----GGTGGCGCTTGGTCTTCCCTTGGGAAT-
4_1630596-1631297 TCGCGCCACACTCGATACCCTGCTC---------GGCGCTTGGTCTTCCCTTGGGAATC
2_659628-660329 TCGCGCCACACTCGATACCCTGCTCACCCGCATTGGCGCTTGGTCTTCCCTTGGGAAT-
3_682458-683159 -----------------------------------------------------------

But with thousands of sequences like these.....
tek help is required.

Thanks
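
To give a feel for how the differences in an alignment like the one above could be picked out automatically, here is a minimal sketch. It assumes each sequence's aligned rows have already been joined into one string per name, that all strings have the same length, and that '-' marks a gap. The sub name variable_columns and its return shape are made up for illustration, and "a column with more than one distinct base and no gap" is only a simplified stand-in for a real SNP call.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Report alignment columns where the sequences disagree.
# Input:  hash ref of name => aligned sequence (equal lengths, '-' for gaps).
# Output: list of [column (1-based), bases in sorted-name order].
# Columns containing a gap are skipped, so only substitutions are reported.
sub variable_columns {
    my ($aln) = @_;
    my @names = sort keys %$aln;
    return unless @names;
    my $len = length($aln->{ $names[0] });
    for my $n (@names) {
        die "sequence $n has a different length\n"
            if length($aln->{$n}) != $len;
    }
    my @hits;
    for my $i (0 .. $len - 1) {
        my @col = map { uc substr($aln->{$_}, $i, 1) } @names;
        next if grep { $_ eq '-' } @col;        # gap: not a substitution
        my %distinct = map { $_ => 1 } @col;
        push @hits, [ $i + 1, join('', @col) ] if scalar(keys %distinct) > 1;
    }
    return @hits;
}

# Tiny made-up example (not data from the thread)
my %demo = (s1 => 'ACGT-A', s2 => 'ACCT-A', s3 => 'ACGTTA');
printf "column %d: %s\n", @$_ for variable_columns(\%demo);   # prints "column 3: GCG"
```

Column 5 in the example is skipped because two of the three sequences have a gap there; whether gapped columns should count depends on what you want to call a polymorphism.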

 
If there's something specific you're trying to find, I'm always interested in tinkering with complex data searches like that.
It's fun! :-D


Trojan.
 