Need help parsing one text file based on data from a second text file

TinkerTok · Jul 11, 2011

I apologize in advance for this long post, but I REALLY need help!
I have searched the forum in depth trying to find something similar, and I have tried playing with some things I found that I thought might help... but I am just at my wits' end.

Basics -----------------------------------------------

Why:
In reading through the forums I see there are folks who like to know why I am writing the code. I am working in a lab and trying to apply bioinformatics techniques to get work done faster and more efficiently. Unfortunately, I have never taken a programming course and I am trying to teach myself Perl in order to get the data I need in the format I need it. I am a complete novice, but I feel I have learned a heck of a lot, since I have been trying to get through this for about 2 months now with about 35 versions. I tried using modules such as Bio::Perl, but it just does not seem to work (possibly because I am using ActiveState?) despite poring over the tutorials. However, everything about Perl seems to indicate that this SHOULD be a simple task.

What:
What I am trying to do is use one text file (a list of numbers) to pull specific matching data from a second (a list of protein sequences in FASTA format). The examples I am using are small for testing purposes, but I will need to go through multiple organisms (over 20) with thousands of sequences and deal with multiple data from each strain (around 45 gi numbers). Hence the fact that I do not want to do it all with copy/paste!

Files --------------------------------------------------

My testing files are as follows:

- list.txt (list)
325676729
325553884
325553882

- t1.txt (sequences)
>gi|325676729|ref|ZP_08156403.1| 6,7-dimethyl-8-ribityllumazine synthase (riboflavin synthase beta chain) [Rhodococcus equi ATCC 33707]
MSGEGRPDLQLGMAKNLKLAIVAGQWHPEISEALVAGAKRVAKQAQIEDPTLVRVAGAIELPVVVQELAK
SHDAVVALGVVIRGGTPHFEYVCDAVTAGLTRVALDEGVPVGNGVLTTDTEKQALDRSGLPGSVEDKGGE
ACAAAIDTAVTLAQLRRKRTGSASR
>gi|325553885|gb|EGD23563.1| glycerone kinase [Rhodococcus equi ATCC 33707]
MLEVLEAVDGSALYRWADACVTGIEKRCDEINDLNVFPVPDADTGTNLLATMRAAVRAAAPLSADERGAD
ASAVARALARGAVTGARGNSGAILSQVLRGVAESTKSHRLDADTFRSALRHASDLAL
>gi|325553884|gb|EGD23562.1| 50S ribosomal protein L28 [Rhodococcus equi ATCC 33707]
MAAVCDVCAKGPGFGKSVSHSHRRTNRRWNPNIQPVRAQVAPGNTKKLNVCTSCLKAGKVVRG
>gi|325553883|gb|EGD23561.1| enoyl-CoA hydratase [Rhodococcus equi ATCC 33707]
MSGAQSFSRLVRRGSRLLVTVAVDGTGLPDRPDRAASVLHVTATEAPSNVRIEDGVLRVAVATAANGTSL
DIDGITEATAALRAAGSDVGAVLLVGDGANFCAGGNVRAFASAERRGEFVGEIATAFHEFVRALDDTTVP
VVAAVHGWAAGAGMSIVCLADIAIGGTSTKLRPAYPSIGFTPDGGMSWTLPRIVGASRAREILLTDAVLN
GEESVRLGLLSRIVEDDQVQDEALRVARTLAAGPTASYAGIKKLFASSRANSLSEQLDAETASISAAADG
PTGREGVDAFVEKRRPDFSSVRNA
>gi|325553882|gb|EGD23560.1| uracil-DNA glycosylase [Rhodococcus equi ATCC 33707]
MAAKALTDLIDPGWAKALAPVEGRIAEMGDFLRAEIAAGRQYLPSGENVLRAFTHPFEDVRVLIVGQDPY
PTPGHAVGLSFSVAPDVRPVPRSLNNIFAEYSRDLGYPTPSNGDLTPWTENGVLLLNRVLTVAPGEAGSH
RRKGWEAVTEQAIRALVERQSPMVAILWGRDAATLKPMLGDVPTIESAHPSPLSASRGFFGSRPFSRANE
LLAELGANQVDWRLP

Output-------------------------------------------------

- The output I want is only the three blocks that contain the gi numbers found in the list, so:

>gi|325676729|ref|ZP_08156403.1| 6,7-dimethyl-8-ribityllumazine synthase (riboflavin synthase beta chain) [Rhodococcus equi ATCC 33707]
MSGEGRPDLQLGMAKNLKLAIVAGQWHPEISEALVAGAKRVAKQAQIEDPTLVRVAGAIELPVVVQELAK
SHDAVVALGVVIRGGTPHFEYVCDAVTAGLTRVALDEGVPVGNGVLTTDTEKQALDRSGLPGSVEDKGGE
ACAAAIDTAVTLAQLRRKRTGSASR
>gi|325553884|gb|EGD23562.1| 50S ribosomal protein L28 [Rhodococcus equi ATCC 33707]
MAAVCDVCAKGPGFGKSVSHSHRRTNRRWNPNIQPVRAQVAPGNTKKLNVCTSCLKAGKVVRG
>gi|325553882|gb|EGD23560.1| uracil-DNA glycosylase [Rhodococcus equi ATCC 33707]
MAAKALTDLIDPGWAKALAPVEGRIAEMGDFLRAEIAAGRQYLPSGENVLRAFTHPFEDVRVLIVGQDPY
PTPGHAVGLSFSVAPDVRPVPRSLNNIFAEYSRDLGYPTPSNGDLTPWTENGVLLLNRVLTVAPGEAGSH
RRKGWEAVTEQAIRALVERQSPMVAILWGRDAATLKPMLGDVPTIESAHPSPLSASRGFFGSRPFSRANE
LLAELGANQVDWRLP

Problems ----------------------------------------------

I seem to have 2 major issues (possibly more?).

1) Creating a while loop calling one data set nested within another while loop calling the other data set does not seem to work, ie...
while ($list = <list>) {
while ($seq = <seq>) {
code;
}
}
I seem to only get the sequence while on the very first iteration of the list while, and never again.
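
A likely cause of problem 1: a filehandle is read once, from start to end. After the inner while has consumed every line of the sequence file, the handle sits at end-of-file, so on later passes of the outer loop there is nothing left to read. The sketch below shows that behaviour with in-memory filehandles standing in for the real files; the data, the sub name inner_counts and the $rewind flag are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins for list.txt and t1.txt (hypothetical data, opened from strings)
my $list_text = "111\n222\n";
my $seq_text  = ">gi|111|a\nAAA\n>gi|222|b\nCCC\n";

# Count how many sequence lines the inner loop sees for each list entry.
sub inner_counts {
    my ($rewind) = @_;
    open(my $list, '<', \$list_text) or die $!;
    open(my $seq,  '<', \$seq_text)  or die $!;
    my @counts;
    while (my $id = <$list>) {
        seek($seq, 0, 0) if $rewind;   # go back to the start of the sequence data
        my $n = 0;
        while (my $line = <$seq>) {    # reads until end-of-file
            $n++;
        }
        push @counts, $n;
    }
    close($list);
    close($seq);
    return @counts;
}

print "without seek: @{[ inner_counts(0) ]}\n";
print "with seek:    @{[ inner_counts(1) ]}\n";
```

Rewinding with seek makes the nested loops work, but it rereads the whole sequence file once per list entry. Reading the list once and then making a single pass over the sequences is usually the cheaper design.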

2) Trying to match the data from the sequence file against the list using regular expressions seems near impossible when the list is stored in an array. I cannot get m// to look at the sequence and see if it matches any of the numbers in the list. Another issue with the pattern matching is that when I use parentheses to try to capture the specific value found (when I gave up on the list and just wanted a variable holding the 9 digits, so that I could compare those to the list instead of the whole string), nothing was being saved to the $1, $2, etc. variables.
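
On problem 2, two details may be relevant: $1 is only set when the pattern contains parentheses and the match succeeds, and the digit class is written \d (backslash), not /d. Instead of interpolating a whole array into a pattern, one option is to capture the gi number from the header and compare it with each list entry. A minimal sketch, assuming headers look like the ones in t1.txt; the sub name gi_from_header is invented for this example.

```perl
use strict;
use warnings;

# Return the gi number from a FASTA header line, or nothing if there is none.
sub gi_from_header {
    my ($header) = @_;
    # \| is a literal pipe; (\d+) captures the digits into $1
    if ($header =~ /^>gi\|(\d+)\|/) {
        return $1;
    }
    return;
}

my @list   = ("325676729", "325553884", "325553882");
my $header = ">gi|325553884|gb|EGD23562.1| 50S ribosomal protein L28 [Rhodococcus equi ATCC 33707]";

my $gi = gi_from_header($header);
# eq compares strings; a single = would be an assignment
if (defined $gi && grep { $_ eq $gi } @list) {
    print "Header matches list entry $gi\n";
}
```

The grep scans the whole list for every header, which is fine for a handful of numbers but gets slow for long lists.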

I have been trying to approach this in many different ways, nothing seems to work.

Code --------------------------------------------------

I have tried many MANY versions, here is one where I am just trying to get problem 1 to work:

#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.

# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$counter = "";
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file

# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;

# Algorithm
@list = <LISTFILE>; # save list data into an array
$counter = 0;
$listcount = @list; # get a count of elements in the array
while ($counter ne $listcount + 1) { # go through each list element
while ($list = <FAAFILE>) {
$list = @list[$counter];
print "$list", "\n string", "\n\n"; # replace this with matching operator if I can get this to work
} # end while
$counter = $counter + 1;
print "Counter is now: ", $counter, "\n";
} # end while

# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;

More Code --------------------------------------------------

As I have said, I have done many many versions... I will include several others if that might help?

1 ----------------------------------------------------------

#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.

# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$header = "";
$seq = "";
$Listmatch = 0;
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file

# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;

# Algorithm
while ($line = <FAAFILE>) { # For each line in the FASTA file
open(FAAFILEOUT, ">>test6.txt"); #open new file for appending

# Separate header data from sequence data
@list = <LISTFILE>;
print @list;

if ($line =~ /^[>]/i) {
$header = $line;
foreach $list(@list) {
#if it matches...
chomp $list;
if ($list =~ m/$line/) {
print "Gotcha ".$list.$line."\n";
$listmatch =1;
} # End if
} # End foreach
} # End if

else { # Separate out sequence
$seq = $seq.$line;
} # End else - if not header

if ($listmatch ==1) {
print "Do we get here?\n";
print FAAFILEOUT "$list \n";
print FAAFILEOUT "$header \n";
print FAAFILEOUT "$seq \n";
$listmatch = 0;
} # End if

} # End while - FAAFILE

# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;

2 ----------------------------------------------------------

#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.

# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$header = "";
$seq = "";
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file

# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;

# Algorithm
# Save entire list as an array.
@list = <LISTFILE>;
print @list." seeing this? \n"; # Test: Is it picking up everything?
while ($line = <FAAFILE>) {
print @list[0]."\n";
print @list[1]."\n";
print @list[2]."\n";
if ($line =~ m\/>gi/|@list[*]\) {
Print $line
open(FAAFILEOUT, ">>test6.txt"); #open new file for appending
print "Gotcha ".$list.$line."\n";
print "Do we get here?".$list.$line."\n";
print FAAFILEOUT "$list \n";
print FAAFILEOUT "$header \n";
print FAAFILEOUT "$seq \n";
} # End if
} # End while

# print quotemeta('>gi|'); gave me the following data: m\/>gi/|@list[.]\

# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;

3 ----------------------------------------------------------

#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.

# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file

# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;

# Algorithm
# Save entire list as an array.
@list = <LISTFILE>;
while ($line = <FAAFILE>) {
while ($line =~ m\@list[*]\) {
print "Do we get here?".$list.$line."\n";
} # End while
} # End while

# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;

4 ----------------------------------------------------------

#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.

# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$listitem = "";
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file

# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;

# Algorithm

# Save entire list as an array.
@list = <LISTFILE>;
foreach $listitem(@list) {
chomp $listitem;
print $listitem;
$line = <FAAFILE>;
if ($line =~ m$/d{9})$ {
print $line, "line here. \n";
print $1, "\n";
print "$1 \n";
print $2, "\n";
print "$2 \n";
if ($1 = $listitem) {
print "Got inside!", $listitem, "\n";
} # End if
} # End if
} # End Foreach

# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;

5 ----------------------------------------------------------

#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.

# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$listitem = "";
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file

# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;

# Algorithm

# Save entire list as an array.
@list = <LISTFILE>;
print "@list", "\n";
while ($listitem = <LISTFILE>) {
print $listitem;
if ($line =~ m\$listitem\) {
print $line, "line here. \n";
print $1, "\n";
print "$1 \n";
print $2, "\n";
print "$2 \n";
if ($1 = $listitem) {
print "Got inside!", $listitem, "\n";
} # End if
} # End if
} # End while

# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;

6 ----------------------------------------------------------

#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.

# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$listitem = "";
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file

# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;

# Algorithm
# Save entire list as an array.
@list = <LISTFILE>;
foreach $listitem(@list) {
chomp $listitem;
print $listitem;
$/ = undef;
$all = <FAAFILE>;
if ($all =~ m\/>gi/|(/d{9})\) {
print $line, "line here. \n";
print $1, "ab \n";
print "$1 ac \n";
print $2, "ad \n";
print "$2 ae \n";
if ($1 = $listitem) {
print "Got inside!", $listitem, "\n";
} # End if
} # End if
} # End foreach

# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;

7 ----------------------------------------------------------

#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.

# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file

# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;

# Algorithm
while ($line = <FAAFILE>) {
@list = (<LISTFILE>);
if ($line =~ m\[>gi/|</d{9}>]\) {
print $1, "1 \n";
print $list, "\n";
print "Got here \n";
print $line, " line \n";
} # End if
} # End while

# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;

8 ----------------------------------------------------------

#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.

# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$line = ""; # Used while going through text file
$list = ""; # Used while looking for elements of a list in a FASTA file

# Open necessary files
open FAAFILE, $faaname or die $!;
open LISTFILE, $listdata or die $!;

# Algorithm
while ($line = <FAAFILE>) {
@list = (<LISTFILE>);
if ($line =~ m$/d/d/d/d/d/d/d/d/d)$ {
print $1, "1 \n";
print $list, "\n";
print "Got here \n";
print $line, " line \n";
} # End if
} # End while

# Close out files and exit
close(FAAFILE);
close(LISTFILE);
exit;

------------------------------------------------------------

There are a lot more, but hopefully these will give you an idea of what I was trying for!

Once again, sorry that this is so terribly long...

prex1 · Jul 12, 2011

A few comments to start:
-you should post your code (and data) between [code] and [/code] tags, this will make it more readable for us
-I assume, consistently with your code, that each record in the sequences file is a single line with fields separated by |
-it could be important to know if your identifying numbers (I suppose they are fixed length) might start with leading zeroes (I'll assume they don't): in that case perl could drop the leading 0's depending on how they are used in the code
-the simplest solution to your problem (though not the only one) is to use a hash where the keys are your numbers
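
To illustrate the points about leading zeroes and hash keys: hash keys are stored as strings, so an ID read from a file keeps its zeroes as long as it is never used as a number. A small sketch with a made-up ID (the real gi numbers in this thread have no leading zeroes):

```perl
use strict;
use warnings;

my %hashlist;
my $id = "0012345";          # hypothetical ID with leading zeroes, read as text
$hashlist{$id} = 1;          # hash keys are strings, so the zeroes are kept

my $as_number = $id + 0;     # numeric use turns it into 12345

print "string key found\n"    if exists $hashlist{"0012345"};
print "numeric key missing\n" unless exists $hashlist{$as_number};
```
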
This is a simplified version of what you need (untested):

Code:

open(LISTFILE,$listdata)or die $!;
my(%hashlist,$num);
while(<LISTFILE>){
  chomp;
  $hashlist{$_}=1;
}
open(FAAFILE,$faaname)or die $!;
open(FAAFILEOUT, ">>test6.txt")or die $!;
while(<FAAFILE>){
  (undef,$num)=split/|/;
  if(exists$hashlist{$num}){
    print FAAFILEOUT;
  }
}

It is as simple as that!
In the code above I used some Perl tricks that might be difficult for you to understand (particularly the use of the special variable $_), however they make coding so much easier and faster!
More particularly:
while(<LISTFILE>){ reads into $_;
chomp; operates on $_;
split/|/; operates on $_;
print FAAFILEOUT; writes out $_ (that already includes the line terminator)
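
For comparison, here is roughly the same list-reading loop written with a named variable instead of $_. It is only a sketch: the sub name build_lookup, the lexical filehandle and the in-memory stand-in for list.txt are assumptions made for this example, not part of the code above.

```perl
use strict;
use warnings;

# Build a lookup hash from a list of lines, naming the loop variable
# explicitly instead of relying on $_.
sub build_lookup {
    my ($fh) = @_;
    my %lookup;
    while (my $line = <$fh>) {   # like while(<FH>), but reads into $line
        chomp($line);            # like a bare chomp; but on $line
        next if $line eq '';     # skip blank lines
        $lookup{$line} = 1;
    }
    return %lookup;
}

# In-memory stand-in for list.txt
my $text = "325676729\n325553884\n325553882\n";
open(my $fh, '<', \$text) or die $!;
my %hashlist = build_lookup($fh);
close($fh);

print join(",", sort keys %hashlist), "\n";
```

The two styles do the same job; the explicit form is longer but can be easier to follow while learning.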

Franco


TinkerTok · Jul 12, 2011

I had to add a backslash before the | in the split (split/\|/), but other than that this is super helpful! I still need to figure out how to get it to print the entire sequence and not just the headers, but it is finding and matching the headers perfectly and that helps a ton. Thanks a million!
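
Why the split needed changing: the first argument to split is a regular expression, and in a regular expression an unescaped | means "or". The pattern /|/ is therefore an empty alternation that matches between every pair of characters, so the line is split into single characters. Escaping the pipe with a backslash, /\|/, makes it a literal character. A small sketch of the difference, using one of the headers from t1.txt:

```perl
use strict;
use warnings;

my $header = ">gi|325553884|gb|EGD23562.1| 50S ribosomal protein L28";

# Unescaped: | is regex alternation, so the pattern matches the empty string
my @chars  = split /|/, $header;
# Escaped: \| matches a literal pipe character
my @fields = split /\|/, $header;

print "unescaped gives: $chars[1]\n";    # a single character
print "escaped gives:   $fields[1]\n";   # the gi number
```
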

TinkerTok · Jul 12, 2011

Done! Thank you SO much prex1! I have been fighting with this for so long and you totally made my day!

Just in case anyone in a biology lab would need this info in the future and to test posting code correctly, my final test version is:

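Code:
#!/usr/bin/perl -w
#Program to pull out and manipulate specific sequences from one text file based on a given list in another text file.

# Set Variables
# text Input Variables
$faaname = "t1.txt";
$listdata = "list.txt";
$line = ""; # Used while going through text file
$seq = ""; # Sequence lines collected for the current record
$num = ""; # gi number taken from the current header
$bool = 0; # 1 while the current record is one we want

# Algorithm
# Load the gi numbers into a hash for fast lookup
open(LISTFILE,$listdata)or die $!;
my (%hashlist);
while(<LISTFILE>){
chomp;
$hashlist{$_}=1;
}

open(FAAFILE,$faaname)or die $!;
open(FAAFILEOUT, ">>test6.txt")or die $!;
while(<FAAFILE>){
$line = $_;
if ($line =~ m/^>/) { # Header line: starts a new record
if ($bool == 1) { # Write out the previous record's sequence
print FAAFILEOUT $seq;
$bool = 0;
}
$seq = "";
(undef, $num)=split/\|/; # Second field of the header is the gi number
if(exists$hashlist{$num}){
print FAAFILEOUT; # Write the matching header ($_)
$bool = 1;
}
}
else {
$seq = $seq.$line;
}
}
# Write out the last record's sequence if it was a match
if ($bool == 1) {
print FAAFILEOUT $seq;
}

# Close out files and exit
close(FAAFILE);
close(LISTFILE);
close(FAAFILEOUT);
exit;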
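
One possible simplification of the script above, offered as a sketch rather than a replacement: instead of buffering sequence lines in $seq, keep a flag that is updated at each header and print every line while the flag is set. The sub name filter_fasta, the lexical filehandles and the tiny in-memory FASTA data are all assumptions made for this example.

```perl
use strict;
use warnings;

# Copy to $out every FASTA record from $in whose gi number is a key of %$wanted.
sub filter_fasta {
    my ($in, $out, $wanted) = @_;
    my $keep = 0;
    while (my $line = <$in>) {
        if ($line =~ /^>/) {
            # Header: decide whether this record is wanted
            my (undef, $gi) = split /\|/, $line;
            $keep = defined $gi && exists $wanted->{$gi};
        }
        print {$out} $line if $keep;   # header and sequence lines alike
    }
}

# Hypothetical test data, read from and written to strings
my $fasta  = ">gi|1|x| first\nAAAA\nCCCC\n>gi|2|x| second\nGGGG\n>gi|3|x| third\nTTTT\n";
my %wanted = (1 => 1, 3 => 1);
my $result = '';
open(my $in,  '<', \$fasta)  or die $!;
open(my $out, '>', \$result) or die $!;
filter_fasta($in, $out, \%wanted);
close($in);
close($out);
print $result;
```

Because each line is written as soon as it is read, nothing needs to be flushed after the loop, and a matching record at the very end of the file is handled like any other.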
