I apologize in advance for this long post, but I REALLY need help!
I have searched the forum in depth trying to find something similar, and I have tried playing with some things I found that I thought might help... but I am just at my wits' end.
Basics -----------------------------------------------
Why:
In reading through the forums, I noticed that some folks like to know why the code is being written. I am working in a lab and trying to apply bioinformatics techniques to get work done faster and more efficiently. Unfortunately, I have never taken a programming course, and I am trying to teach myself Perl in order to get the data I need in the format I need it. I am a complete novice, but I feel I have learned a heck of a lot, since I have been working on this for about 2 months now, with about 35 versions. I tried using modules such as bio:
What:
What I am trying to do is use one text file (a list of gi numbers) to pull the matching records from a second file (a list of protein sequences in FASTA format). The examples I am using are small for testing purposes, but I will need to go through multiple organisms (over 20) with thousands of sequences, and deal with multiple entries for each strain (around 45 gi numbers). Hence I do not want to do it all with copy/paste!
Files --------------------------------------------------
My testing files are as follows:
- list.txt (list)
325676729
325553884
325553882
- t1.txt (sequences)
>gi|325676729|ref|ZP_08156403.1| 6,7-dimethyl-8-ribityllumazine synthase (riboflavin synthase beta chain) [Rhodococcus equi ATCC 33707]
MSGEGRPDLQLGMAKNLKLAIVAGQWHPEISEALVAGAKRVAKQAQIEDPTLVRVAGAIELPVVVQELAK
SHDAVVALGVVIRGGTPHFEYVCDAVTAGLTRVALDEGVPVGNGVLTTDTEKQALDRSGLPGSVEDKGGE
ACAAAIDTAVTLAQLRRKRTGSASR
>gi|325553885|gb|EGD23563.1| glycerone kinase [Rhodococcus equi ATCC 33707]
MLEVLEAVDGSALYRWADACVTGIEKRCDEINDLNVFPVPDADTGTNLLATMRAAVRAAAPLSADERGAD
ASAVARALARGAVTGARGNSGAILSQVLRGVAESTKSHRLDADTFRSALRHASDLAL
>gi|325553884|gb|EGD23562.1| 50S ribosomal protein L28 [Rhodococcus equi ATCC 33707]
MAAVCDVCAKGPGFGKSVSHSHRRTNRRWNPNIQPVRAQVAPGNTKKLNVCTSCLKAGKVVRG
>gi|325553883|gb|EGD23561.1| enoyl-CoA hydratase [Rhodococcus equi ATCC 33707]
MSGAQSFSRLVRRGSRLLVTVAVDGTGLPDRPDRAASVLHVTATEAPSNVRIEDGVLRVAVATAANGTSL
DIDGITEATAALRAAGSDVGAVLLVGDGANFCAGGNVRAFASAERRGEFVGEIATAFHEFVRALDDTTVP
VVAAVHGWAAGAGMSIVCLADIAIGGTSTKLRPAYPSIGFTPDGGMSWTLPRIVGASRAREILLTDAVLN
GEESVRLGLLSRIVEDDQVQDEALRVARTLAAGPTASYAGIKKLFASSRANSLSEQLDAETASISAAADG
PTGREGVDAFVEKRRPDFSSVRNA
>gi|325553882|gb|EGD23560.1| uracil-DNA glycosylase [Rhodococcus equi ATCC 33707]
MAAKALTDLIDPGWAKALAPVEGRIAEMGDFLRAEIAAGRQYLPSGENVLRAFTHPFEDVRVLIVGQDPY
PTPGHAVGLSFSVAPDVRPVPRSLNNIFAEYSRDLGYPTPSNGDLTPWTENGVLLLNRVLTVAPGEAGSH
RRKGWEAVTEQAIRALVERQSPMVAILWGRDAATLKPMLGDVPTIESAHPSPLSASRGFFGSRPFSRANE
LLAELGANQVDWRLP
Output -------------------------------------------------
- The output I want is only the three blocks whose gi numbers are found in the list, so:
>gi|325676729|ref|ZP_08156403.1| 6,7-dimethyl-8-ribityllumazine synthase (riboflavin synthase beta chain) [Rhodococcus equi ATCC 33707]
MSGEGRPDLQLGMAKNLKLAIVAGQWHPEISEALVAGAKRVAKQAQIEDPTLVRVAGAIELPVVVQELAK
SHDAVVALGVVIRGGTPHFEYVCDAVTAGLTRVALDEGVPVGNGVLTTDTEKQALDRSGLPGSVEDKGGE
ACAAAIDTAVTLAQLRRKRTGSASR
>gi|325553884|gb|EGD23562.1| 50S ribosomal protein L28 [Rhodococcus equi ATCC 33707]
MAAVCDVCAKGPGFGKSVSHSHRRTNRRWNPNIQPVRAQVAPGNTKKLNVCTSCLKAGKVVRG
>gi|325553882|gb|EGD23560.1| uracil-DNA glycosylase [Rhodococcus equi ATCC 33707]
MAAKALTDLIDPGWAKALAPVEGRIAEMGDFLRAEIAAGRQYLPSGENVLRAFTHPFEDVRVLIVGQDPY
PTPGHAVGLSFSVAPDVRPVPRSLNNIFAEYSRDLGYPTPSNGDLTPWTENGVLLLNRVLTVAPGEAGSH
RRKGWEAVTEQAIRALVERQSPMVAILWGRDAATLKPMLGDVPTIESAHPSPLSASRGFFGSRPFSRANE
LLAELGANQVDWRLP
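One way the selection described above could be sketched, assuming the list holds one gi number per line and every FASTA header begins with >gi|NUMBER|: load the numbers into a hash, then read the FASTA file once, switching a "keep" flag on or off at each header line. The sample data is shortened and inlined through in-memory filehandles so the sketch runs on its own; the names extract_records, %wanted and $keep are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the FASTA records (header plus sequence lines) whose gi number
# appears in the list. Both arguments are already-open filehandles.
sub extract_records {
    my ($list_fh, $fasta_fh) = @_;

    # Load the wanted gi numbers into a hash for quick lookup.
    my %wanted;
    while (my $id = <$list_fh>) {
        chomp $id;
        $id =~ s/^\s+|\s+$//g;           # tolerate stray whitespace
        $wanted{$id} = 1 if length $id;
    }

    my $out  = '';
    my $keep = 0;                        # true while inside a wanted record
    while (my $line = <$fasta_fh>) {
        if ($line =~ /^>/) {
            # A header line starts a new record; decide whether to keep it.
            $keep = ($line =~ /^>gi\|(\d+)\|/ && $wanted{$1}) ? 1 : 0;
        }
        $out .= $line if $keep;
    }
    return $out;
}

# Shortened stand-in data (assumption: real files follow the same layout).
my $list_text  = "325676729\n325553882\n";
my $fasta_text = ">gi|325676729|ref|ZP_08156403.1| first\nMSGEG\nACAAA\n"
               . ">gi|325553885|gb|EGD23563.1| second\nMLEVL\n"
               . ">gi|325553882|gb|EGD23560.1| third\nMAAKA\n";

open my $list_fh,  '<', \$list_text  or die "Cannot open list: $!";
open my $fasta_fh, '<', \$fasta_text or die "Cannot open FASTA: $!";

my $result = extract_records($list_fh, $fasta_fh);
print $result;
```

To try this on real files, the two in-memory opens could be swapped for open my $list_fh, '<', 'list.txt' and open my $fasta_fh, '<', 't1.txt'. Because the FASTA file is read only once, the approach should not slow down much as the list grows.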
Problems ----------------------------------------------
I seem to have 2 major issues (possibly more?).
1) Nesting a while loop that reads one data set inside another while loop that reads the other data set does not seem to work, i.e.:
while ($list = <list>) {
    while ($seq = <seq>) {
        code;
    }
}
The inner while loop over the sequences seems to run only on the very first iteration of the outer while loop over the list, and never again.
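A likely explanation for this symptom, shown as a small sketch: a filehandle that has been read to end-of-file stays there, so the inner loop has nothing left to read on later passes unless the handle is rewound with seek (or the file is read into an array once, up front). The data is an in-memory stand-in, and @passes and @rewound are illustrative names.

```perl
use strict;
use warnings;

my $seq_text = "line A\nline B\n";
open my $seq_fh, '<', \$seq_text or die "Cannot open in-memory file: $!";

# Without rewinding: only the first outer pass sees any lines.
my @passes;
for my $item (1 .. 3) {
    my $count = 0;
    while (defined(my $line = <$seq_fh>)) {
        $count++;
    }
    push @passes, $count;
}
print "without seek: @passes\n";

# Rewinding to the start before each pass makes every pass see all lines.
my @rewound;
for my $item (1 .. 3) {
    seek($seq_fh, 0, 0) or die "Cannot rewind: $!";
    my $count = 0;
    while (defined(my $line = <$seq_fh>)) {
        $count++;
    }
    push @rewound, $count;
}
print "with seek: @rewound\n";
```

Rewinding works, but re-reading a large sequence file once per list entry is slow; reading the file a single time and looking each header up in a hash avoids the nesting altogether.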
2) Matching the data from the sequence file against the list using regular expressions seems near impossible when the list is stored in an array. I cannot get m// to look at a sequence line and see whether it matches any of the numbers in the list. The other issue with the pattern matching came up after I gave up on using the list directly and just tried to capture the 9-digit number into a variable, so that I could compare that to the list instead of the whole string: when I used brackets (parentheses) to capture the value, it was not being saved to $1, $2, etc.
I have been trying to approach this in many different ways, but nothing seems to work.
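For the second problem, a minimal sketch of capturing the gi number and looking it up, assuming headers of the form >gi|NUMBER|...: inside a pattern a digit is written \d and a literal pipe \|, and the parentheses fill $1 only when the match succeeds. Instead of putting the whole array into the pattern, the captured number is checked against a hash built from the list. The names %in_list and @matched are illustrative.

```perl
use strict;
use warnings;

# List entries as they would come from a file, trailing newlines included.
my @list = ("325676729\n", "325553884\n");
chomp @list;                              # strip the newlines before comparing
my %in_list = map { $_ => 1 } @list;      # gi number => 1

my @headers = (
    '>gi|325676729|ref|ZP_08156403.1| 6,7-dimethyl-8-ribityllumazine synthase',
    '>gi|325553885|gb|EGD23563.1| glycerone kinase',
);

my @matched;
for my $header (@headers) {
    # \| is a literal pipe; (\d+) captures the run of digits into $1.
    if ($header =~ /^>gi\|(\d+)\|/) {
        my $gi = $1;                      # copy $1 before any later match changes it
        push @matched, $gi if $in_list{$gi};
    }
}
print "matched: @matched\n";
```

Comparing two strings for equality uses eq (or == for numbers); a single = is an assignment, not a comparison.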
Code --------------------------------------------------
I have tried many MANY versions; here is one where I am just trying to get problem 1 to work:
#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.
# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$counter = "";
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file
# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;
# Algorithm
@list = <LISTFILE>; # save list data into an array
$counter = 0;
$listcount = @list; # get a count of elements in the array
while ($counter ne $listcount + 1) { # go through each list element
    while ($list = <FAAFILE>) {
        $list = @list[$counter];
        print "$list", "\n string", "\n\n"; # replace this with matching operator if I can get this to work
    } # end while
    $counter = $counter + 1;
    print "Counter is now: ", $counter, "\n";
} # end while
# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;
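For comparison with the counter loop above, a small sketch of two common ways to walk through an array in Perl. It assumes the list has already been read into @list; @by_index and @by_value are illustrative names. A single element is written $list[$i] (with a $), counters are compared numerically with <, and valid indices run from 0 to one less than the element count.

```perl
use strict;
use warnings;

# Stand-in for the contents of list.txt, newlines included.
my @list = ("325676729\n", "325553884\n", "325553882\n");
chomp @list;

# Index-based loop: numeric comparison, indices 0 .. (count - 1).
my @by_index;
for (my $i = 0; $i < @list; $i++) {
    push @by_index, $list[$i];    # $list[$i] is one element of @list
}

# foreach loop: no counter needed at all.
my @by_value;
for my $id (@list) {
    push @by_value, $id;
}

print "by index: @by_index\n";
print "by value: @by_value\n";
```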
More Code --------------------------------------------------
As I have said, I have done many, many versions... I am including several others in case they help.
1 ----------------------------------------------------------
#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.
# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$header = "";
$seq = "";
$Listmatch = 0;
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file
# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;
# Algorithm
while ($line = <FAAFILE>) { # For each line in the FASTA file
    open(FAAFILEOUT, ">>test6.txt"); #open new file for appending
    # Separate header data from sequence data
    @list = <LISTFILE>;
    print @list;
    if ($line =~ /^[>]/i) {
        $header = $line;
        foreach $list(@list) {
            #if it matches...
            chomp $list;
            if ($list =~ m/$line/) {
                print "Gotcha ".$list.$line."\n";
                $listmatch =1;
            } # End if
        } # End foreach
    } # End if
    else { # Separate out sequence
        $seq = $seq.$line;
    } # End else - if not header
    if ($listmatch ==1) {
        print "Do we get here?\n";
        print FAAFILEOUT "$list \n";
        print FAAFILEOUT "$header \n";
        print FAAFILEOUT "$seq \n";
        $listmatch = 0;
    } # End if
} # End while - FAAFILE
# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;
2 ----------------------------------------------------------
#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.
# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$header = "";
$seq = "";
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file
# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;
# Algorithm
# Save entire list as an array.
@list = <LISTFILE>;
print @list." seeing this? \n"; # Test: Is it picking up everything?
while ($line = <FAAFILE>) {
    print @list[0]."\n";
    print @list[1]."\n";
    print @list[2]."\n";
    if ($line =~ m\/>gi/|@list[*]\) {
        Print $line
        open(FAAFILEOUT, ">>test6.txt"); #open new file for appending
        print "Gotcha ".$list.$line."\n";
        print "Do we get here?".$list.$line."\n";
        print FAAFILEOUT "$list \n";
        print FAAFILEOUT "$header \n";
        print FAAFILEOUT "$seq \n";
    } # End if
} # End while
# print quotemeta('>gi|'); gave me the following data: m\/>gi/|@list[.]\
# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;
3 ----------------------------------------------------------
#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.
# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file
# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;
# Algorithm
# Save entire list as an array.
@list = <LISTFILE>;
while ($line = <FAAFILE>) {
    while ($line =~ m\@list[*]\) {
        print "Do we get here?".$list.$line."\n";
    } # End while
} # End while
# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;
4 ----------------------------------------------------------
#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.
# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$listitem = "";
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file
# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;
# Algorithm
# Save entire list as an array.
@list = <LISTFILE>;
foreach $listitem(@list) {
    chomp $listitem;
    print $listitem;
    $line = <FAAFILE>;
    if ($line =~ m\(/d{9})\) {
        print $line, "line here. \n";
        print $1, "\n";
        print "$1 \n";
        print $2, "\n";
        print "$2 \n";
        if ($1 = $listitem) {
            print "Got inside!", $listitem, "\n";
        } # End if
    } # End if
} # End Foreach
# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;
5 ----------------------------------------------------------
#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.
# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$listitem = "";
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file
# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;
# Algorithm
# Save entire list as an array.
@list = <LISTFILE>;
print "@list", "\n";
while ($listitem = <LISTFILE>) {
    print $listitem;
    if ($line =~ m\$listitem\) {
        print $line, "line here. \n";
        print $1, "\n";
        print "$1 \n";
        print $2, "\n";
        print "$2 \n";
        if ($1 = $listitem) {
            print "Got inside!", $listitem, "\n";
        } # End if
    } # End if
} # End while
# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;
6 ----------------------------------------------------------
#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.
# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$listitem = "";
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file
# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;
# Algorithm
# Save entire list as an array.
@list = <LISTFILE>;
foreach $listitem(@list) {
    chomp $listitem;
    print $listitem;
    $/ = undef;
    $all = <FAAFILE>;
    if ($all =~ m\/>gi/|(/d{9})\) {
        print $line, "line here. \n";
        print $1, "ab \n";
        print "$1 ac \n";
        print $2, "ad \n";
        print "$2 ae \n";
        if ($1 = $listitem) {
            print "Got inside!", $listitem, "\n";
        } # End if
    } # End if
} # End foreach
# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;
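Version 6 sets $/ to undef, which makes a single read return the whole file. A related idea, sketched here under the assumption that every record starts with ">" at the beginning of a line, is to set $/ to "\n>" so that each read returns one whole FASTA record (header plus sequence) rather than one line. The data is a made-up stand-in, and @records is an illustrative name.

```perl
use strict;
use warnings;

my $fasta_text = ">gi|1|a| one\nAAA\nCCC\n>gi|2|b| two\nGGG\n";
open my $fh, '<', \$fasta_text or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "\n>";                 # record separator: newline followed by ">"
    while (my $record = <$fh>) {
        chomp $record;                # strips a trailing "\n>" if present
        $record =~ s/^>//;            # the first record still has its leading ">"
        my ($header, @seq_lines) = split /\n/, $record;
        push @records, { header => $header, seq => join('', @seq_lines) };
    }
}                                     # $/ is restored to its old value here

for my $r (@records) {
    print "$r->{header} => $r->{seq}\n";
}
```

Using local inside a block keeps the change to $/ from affecting other reads later in the script, such as reading the list file line by line.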
7 ----------------------------------------------------------
#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.
# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file
# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;
# Algorithm
while ($line = <FAAFILE>) {
    @list = (<LISTFILE>);
    if ($line =~ m\[>gi/|</d{9}>]\) {
        print $1, "1 \n";
        print $list, "\n";
        print "Got here \n";
        print $line, " line \n";
    } # End if
} # End while
# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;
8 ----------------------------------------------------------
#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.
# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file
# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;
# Algorithm
while ($line = <FAAFILE>) {
    @list = (<LISTFILE>);
    if ($line =~ m\(/d/d/d/d/d/d/d/d/d)\) {
        print $1, "1 \n";
        print $list, "\n";
        print "Got here \n";
        print $line, " line \n";
    } # End if
} # End while
# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;
------------------------------------------------------------
There are a lot more, but hopefully these give you an idea of what I was trying for!
Once again, sorry that this is so terribly long...