split text to files 2

3inen · Jun 15, 2005

Hi!
I have a text file of FASTA sequences that I am trying to split. The sequences come from 4 sources (> 1:, > 2:, > 3:, and > 4:) and are separated into blocks by #. I want to make a separate file for each block (file1, file2, and so on), but only if that block has sequences from all four sources.

I tried a piece of code, but it is far from working. Could you guys sort it out?

Thanks

existing code

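#!/usr/bin/perl
use strict;
use warnings;

# Writes every sequence to its own .seq file; it does not yet group by # block.
open(DATA, "<", "split.txt") or die "Cannot open split.txt: $!\n";

my $numSeqs = 0;

while (<DATA>) {

my $line = $_;
chomp($line);

if ($line =~ /^>/) {
# build the file name from the header: strip ">" and anything after the first space
my $Name = $line;
$Name =~ s/^\s*>\s*//;
$Name =~ s/\s.*$//;
$Name = "$Name" . ".seq";

if ($numSeqs > 0) {
close(OUT_FILE);
}

open(OUT_FILE, ">", $Name) or die "Cannot create $Name: $!\n";
print OUT_FILE ">$Name\n";
++$numSeqs;

}
elsif ($numSeqs > 0 and $line !~ /^#/) {
# sequence line: append to the current file (# separators are skipped)
print OUT_FILE "$line\n";
} #end else

} #end while

close (OUT_FILE) if $numSeqs > 0;
close (DATA);

print "entries = $numSeqs\n";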

Input data

#
> 1:333078-333779
GAATATCCCCATGATCTTTCCCTCAATCGCCCGCTGATAAGTGGGAAGACATCG
GTCGCGCCACACTCGATACCCTGCTCATGGTGGCGCTTGGTCTTCCCTTGGGAAT
> 2:659628-660329
GAATATCCCCATGATCTTCCCCTCAATCGACCTGGACGCTGATAAGTGGAAAGACATCG
GTCGCGCCACACTCGATACCCTGCTCACCCGCATTGGCGCTTGGTCTTCCCTTGGGAAT
> 3:682458-683159
GAATATCCCCATGATCTTCCCCTCAAACCTGGACGTTGATAAGTGGAAAGACATCG
> 4:1630596-1631297
GAATATCCCCATGATCTTTCCCTCAATCGACGCTGATAAGTGGGAAGACAT
GTCGCGCCACACTCGATACCCTGCTCGGCGCTTGGTCTTCCCTTGGGAATC
#
> 1:334683-335218
GGTTGGCGGTGCCGCCCTCGTGCAACCAATCAAGTTTGGTGGCGATGTTG
CACCAACGCTTAGTGTCACCTACTACATCACTAAAAAGTTGAGTTAT
> 2:661233-661768
GGTTGGCGGCGCCGCCCTCGTGCAACCAACAAGTTTGGTGGCGATGTTG
CACCAACGCTAAGTGTAACCTACTACATCAAGGGGCATTACTAAAAAGTT
> 3:681133-681667
AAAATGCAGCACAGAATACTGTCAAGTTTGGTGGCGATGTTG
> 4:1632207-1632742
GGTTGGCGGCGCTGCCCTCGTGCAACCAAAGAAAAGTTTGGTGGCGATGTTG
CACCAACGCTTAGTGTCACCTACTACATCAACGGGTATTACTAAAAAGTTGAG
#
> 1:335667-335823
AATGACCGAAATCAAGGAAGCTTTTGTCCCCCCCAGTGATTGAAGTGCTAGTCG
TTGGCGATACCGTCTCCAAGGGCCAAAGTTTCAACCATGGAAGTACCTTCGTCA
> 2:1731369-1731525
AATGACCGAAATCAAGGAAGCTTTTGTCCCCCCCAGTGATTGAAGTGCTAG
TTGGCGATACCGTCTCCAAGGGCCAAACAACCATGGAAGTACCCTCGTCA
> 3:679065-679221
AATGACCGAAATCAAGGAAGCTTTTGTCGTCCCAGTGATTGAAGTGCTAGTC
TTGGCGATACCGTCTCCAAGGGCCAAAGCAACCATGGAAGTACCCTCGTCA
#

desired output
file 1
> 1:333078-333779
GAATATCCCCATGATCTTTCCCTCAATCGCCCGCTGATAAGTGGGAAGACATCG
GTCGCGCCACACTCGATACCCTGCTCATGGTGGCGCTTGGTCTTCCCTTGGGAAT
> 2:659628-660329
GAATATCCCCATGATCTTCCCCTCAATCGACCTGGACGCTGATAAGTGGAAAGACATCG
GTCGCGCCACACTCGATACCCTGCTCACCCGCATTGGCGCTTGGTCTTCCCTTGGGAAT
> 3:682458-683159
GAATATCCCCATGATCTTCCCCTCAAACCTGGACGTTGATAAGTGGAAAGACATCG
> 4:1630596-1631297
GAATATCCCCATGATCTTTCCCTCAATCGACGCTGATAAGTGGGAAGACAT
GTCGCGCCACACTCGATACCCTGCTCGGCGCTTGGTCTTCCCTTGGGAATC

file 2
> 1:334683-335218
GGTTGGCGGTGCCGCCCTCGTGCAACCAATCAAGTTTGGTGGCGATGTTG
CACCAACGCTTAGTGTCACCTACTACATCACTAAAAAGTTGAGTTAT
> 2:661233-661768
GGTTGGCGGCGCCGCCCTCGTGCAACCAACAAGTTTGGTGGCGATGTTG
CACCAACGCTAAGTGTAACCTACTACATCAAGGGGCATTACTAAAAAGTT
> 3:681133-681667
AAAATGCAGCACAGAATACTGTCAAGTTTGGTGGCGATGTTG
> 4:1632207-1632742
GGTTGGCGGCGCTGCCCTCGTGCAACCAAAGAAAAGTTTGGTGGCGATGTTG
CACCAACGCTTAGTGTCACCTACTACATCAACGGGTATTACTAAAAAGTTGAG

TrojanWarBlade · Jun 15, 2005

Mmmmm, Genetics! Yum.

Check this out, see if it does what you want:

Code:

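#!/usr/bin/perl -w
use strict;
use diagnostics;
my $all="";        # source numbers seen in the current block, in order
my $output="";     # text collected for the current block
my $filecount=0;
while(<>) {
  $all .= $1 if(/> (\d):/);
  if(/^#/) {
    # write the block only if it holds sources 1 to 4, in that order
    if($all eq "1234") {
      $filecount++;
      open FH, ">file$filecount" or die "Failed to create file$filecount: $!";
      print FH $output;
      close FH;
    }
    # reset for the next block, even when this one was skipped
    $output="";
    $all="";
  } else {
    $output .= $_;
  }
}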
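
The script above expects the four headers in the order 1, 2, 3, 4 within each block. If the order can vary, a variant along these lines might help. This is only a sketch: `complete_blocks` is a made-up helper name, the demo sequences are invented, and it assumes the whole input fits in memory as one string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the text of every block that contains
# headers from all of the sources listed in @$wanted, in any order.
sub complete_blocks {
    my ($text, $wanted) = @_;
    my @keep;
    for my $block (split /^#[^\n]*\n?/m, $text) {
        next unless $block =~ /\S/;               # skip empty pieces around '#'
        my %seen = map { $_ => 1 } $block =~ /^>\s*(\d+):/mg;
        next if grep { !$seen{$_} } @$wanted;     # at least one source is missing
        push @keep, $block;
    }
    return @keep;
}

# Small in-memory demonstration (invented sequences, not real data).
my $demo = join '',
    "#\n", "> 2:5-9\nACGT\n", "> 1:1-4\nACGA\n", "> 4:7-9\nACG\n", "> 3:2-6\nACGG\n",
    "#\n", "> 1:10-14\nTTGA\n", "> 2:11-15\nTTGC\n", "> 3:12-16\nTTGG\n",
    "#\n";

my @blocks = complete_blocks($demo, [1 .. 4]);
printf "%d complete block(s)\n", scalar @blocks;
```

Each element of `@blocks` could then be written to its own file with the same open/print/close steps as in the script above.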

I've always been interested in genetic data manipulation, so any info on what this is all about would be very interesting.

Trojan.

3inen · Jun 15, 2005

Just what I wanted, thanks a lot.

I am trying to find the minute differences between these sources, called single nucleotide polymorphisms (SNPs); these are very promising targets for various diagnostic and developmental tools.

Look at the alignment produced from this file. It is a little goofed up, but it carries the message:

1_333078-333779   GAATATCCCCATGATCTTTCCCTCAATCGCCC-----GCTGATAAGTGGGAAGACATCGG
4_1630596-1631297 GAATATCCCCATGATCTTTCCCTCAATCGAC------GCTGATAAGTGGGAAGACAT--G
2_659628-660329   GAATATCCCCATGATCTTCCCCTCAATCGACCTGGACGCTGATAAGTGGAAAGACATCGG
3_682458-683159   GAATATCCCCATGATCTTCCCCTCAAACCTGG---ACGTTGATAAGTGGAAAGACATCG-

1_333078-333779   TCGCGCCACACTCGATACCCTGCTCAT----GGTGGCGCTTGGTCTTCCCTTGGGAAT-
4_1630596-1631297 TCGCGCCACACTCGATACCCTGCTC---------GGCGCTTGGTCTTCCCTTGGGAATC
2_659628-660329   TCGCGCCACACTCGATACCCTGCTCACCCGCATTGGCGCTTGGTCTTCCCTTGGGAAT-
3_682458-683159   -----------------------------------------------------------

But with thousands of sequences like these.....
tek help is required.

Thanks
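
For scanning many alignments like the one above, a first pass could simply flag the columns where the aligned rows disagree. The following is a rough sketch, not a real SNP caller: `variable_columns` is a made-up helper name, it assumes all rows have the same length, and it treats `-` as a gap to ignore. The demo rows are the first 30 columns of the four rows shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: given equal-length aligned rows, return the 0-based
# columns in which at least two different bases occur (gaps '-' are ignored).
sub variable_columns {
    my @rows = @_;
    my @cols;
    my $len = length $rows[0];
    for my $i (0 .. $len - 1) {
        my %bases;
        for my $row (@rows) {
            my $c = uc substr($row, $i, 1);
            $bases{$c}++ unless $c eq '-';
        }
        push @cols, $i if keys(%bases) > 1;
    }
    return @cols;
}

# First 30 columns of rows 1, 4, 2 and 3 from the alignment above.
my @aln = (
    "GAATATCCCCATGATCTTTCCCTCAATCGC",
    "GAATATCCCCATGATCTTTCCCTCAATCGA",
    "GAATATCCCCATGATCTTCCCCTCAATCGA",
    "GAATATCCCCATGATCTTCCCCTCAAACCT",
);

my @cols = variable_columns(@aln);
print "variable columns: @cols\n";
```

Columns flagged this way are only candidates; differences next to gaps or in poorly aligned regions would still need checking by eye or with a proper tool.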

TrojanWarBlade · Jun 15, 2005

If there's something specific you're trying to find, I'm always interested in tinkering with complex data searches like that.
It's fun! :-D

Trojan.
