bioinformatic problem urgent!

(OP)
Hi, I am new to this forum. I have a question about converting DNA code to proteins, which is simple to explain, but I find it hard to get accurate results in code. In a sequence like ATGTACTAT, every 3 non-overlapping letters get replaced by a single letter, for example ATG -> M; TAC -> Y; TAT -> Y. I am very new to AWK. I tried writing the code below, but it doesn't work accurately. Can you fix it? Thanks in advance.



awk 'BEGIN{
c["ATG"]="M";  c["TTT"]="F"; c["TTC"]="F"; c["TTA"]="L"; c["TTG"]="L"; c["CTT"]="L"; c["CTC"]="L"; c["CTA"]="L"; c["CTG"]="L"; c["ATT"]="I"; c["ATC"]="I";
c["ATA"]="I"; c["GTT"]="V"; c["GTC"]="V"; c["GTA"]="V"; c["GTG"]="V"; c["TCT"]="S"; c["TCC"]="S"; c["TCA"]="S"; c["TCG"]="S"; c["CCT"]="P"; c["CCC"]="P";
c["CCA"]="P"; c["CCG"]="P"; c["ACT"]="T"; c["ACC"]="T"; c["ACA"]="T"; c["ACG"]="T"; c["GCT"]="A"; c["GCC"]="A"; c["GCA"]="A"; c["GCG"]="A";c["TAT"]="Y";
c["TAC"]="Y"; c["CAT"]="H"; c["CAC"]="H"; c["CAA"]="Q"; c["CAG"]="Q"; c["AAT"]="N"; c["AAC"]="N"; c["AAA"]="K"; c["AAG"]="K"; c["GAT"]="D"; c["GAC"]="D";
c["GAA"]="E"; c["GAG"]="E"; c["TGT"]="C"; c["TGC"]="C"; c["TGG"]="W"; c["CGT"]=R; c["CGC"]=R; c["CGA"]=R; c["CGG"]=R; c["AGA"]=R; c["AGG"]=R; c["AGT"]="S";
c["AGC"]="S"; c["GGT"]="G"; c["GGC"]="G"; c["GGA"]="G"; c["GGG"]="G";}
{i=1; p=""}
{do {
s=substr($0,i,3)
printf ("%s",s)
{if (c[s]==""){p=p" "} else {p=p c[s]" "}}
i=i+3}
while (s!="")}
{printf("\n%s\n",p)} ' genes_contig0028.txt
 

RE: bioinformatic problem urgent!

Hi

Quote (0210828176):

I tried making a code below but it doesnt work accurately.
Supposing the input is the "ATGTACTAT" you specified, your code gives the result "M Y Y". As far as I understand, that is exactly what you expect.

If I got it wrong, please post some sample input and the desired output.

Maybe also specify which Awk implementation and version you are using.

Feherke.
http://feherke.github.com/

RE: bioinformatic problem urgent!

Just out of curiosity, I tried coding it:

CODE

#! /bin/awk -f
BEGIN{
  c["ATG"]="M"; c["TTT"]="F"; c["TTC"]="F"; c["TTA"]="L"; c["TTG"]="L";
  c["CTT"]="L"; c["CTC"]="L"; c["CTA"]="L"; c["CTG"]="L"; c["ATT"]="I";
  c["ATC"]="I"; c["ATA"]="I"; c["GTT"]="V"; c["GTC"]="V"; c["GTA"]="V";
  c["GTG"]="V"; c["TCT"]="S"; c["TCC"]="S"; c["TCA"]="S"; c["TCG"]="S";
  c["CCT"]="P"; c["CCC"]="P"; c["CCA"]="P"; c["CCG"]="P"; c["ACT"]="T";
  c["ACC"]="T"; c["ACA"]="T"; c["ACG"]="T"; c["GCT"]="A"; c["GCC"]="A";
  c["GCA"]="A"; c["GCG"]="A"; c["TAT"]="Y"; c["TAC"]="Y"; c["CAT"]="H";
  c["CAC"]="H"; c["CAA"]="Q"; c["CAG"]="Q"; c["AAT"]="N"; c["AAC"]="N";
  c["AAA"]="K"; c["AAG"]="K"; c["GAT"]="D"; c["GAC"]="D"; c["GAA"]="E";
  c["GAG"]="E"; c["TGT"]="C"; c["TGC"]="C"; c["TGG"]="W"; c["CGT"]="R";
  c["CGC"]="R"; c["CGA"]="R"; c["CGG"]="R"; c["AGA"]="R"; c["AGG"]="R";
  c["AGT"]="S"; c["AGC"]="S"; c["GGT"]="G"; c["GGC"]="G"; c["GGA"]="G";
  c["GGG"]="G";
}

{
  old_line = $0
  new_line = ""
  char3 = "xxx"
  i = 1
  while (char3) {
    # get 3 chars from line
    char3 = substr(old_line, i, 3)
    if (char3) {
      #printf "char3 = '%s'\n", char3
      if (char3 in c) {
        new_line = new_line c[char3]
        #printf "* new_line = '%s'\n", new_line
      }
      else {
        printf "* Error: key '%s' not found in array c !\n", char3
      }
    }
    # move to the next 3 chars
    i += 3  
  }
  printf "old: '%s' ==> new: '%s'\n", old_line, new_line  
}
For the following example data

CODE

ACTCGCTAT
GCGTGGAAA
TACGAGACT
it outputs

CODE

old: 'ACTCGCTAT' ==> new: 'TRY'
old: 'GCGTGGAAA' ==> new: 'AWK'
old: 'TACGAGACT' ==> new: 'YET'
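
One possible reason for inaccurate results with the code in the original post: there the arginine entries are written as c["CGT"]=R, without quotes. Awk treats a bare R as an uninitialized variable, which evaluates to the empty string, so those six codons would translate to nothing. The small check below is only a sketch to illustrate that behaviour; the variable names unquoted and quoted are made up for the demonstration.

```shell
# A bare R is an uninitialized awk variable, so the array entry becomes "".
unquoted=$(awk 'BEGIN { c["CGT"]=R;   printf "[%s]", c["CGT"] }')

# With quotes, "R" is a string constant and the entry holds the letter R.
quoted=$(awk 'BEGIN { c["CGT"]="R"; printf "[%s]", c["CGT"] }')

echo "unquoted: $unquoted"   # prints: unquoted: []
echo "quoted:   $quoted"     # prints: quoted:   [R]
```

The table in the script above already quotes all six arginine codons, which is why 'ACTCGCTAT' comes out as 'TRY'.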

RE: bioinformatic problem urgent!

Anyway, a simpler way to write the OP's code:

CODE

awk 'BEGIN{
c["ATG"]="M"
c["TTT"]=c["TTC"]="F"
c["TTA"]=c["TTG"]=c["CTT"]=c["CTC"]=c["CTA"]=c["CTG"]="L"
c["ATT"]=c["ATC"]=c["ATA"]="I"
c["GTT"]=c["GTC"]=c["GTA"]=c["GTG"]="V"
c["TCT"]=c["TCC"]=c["TCA"]=c["TCG"]=c["AGT"]=c["AGC"]="S"
c["CCT"]=c["CCC"]=c["CCA"]=c["CCG"]="P"
c["ACT"]=c["ACC"]=c["ACA"]=c["ACG"]="T"
c["GCT"]=c["GCC"]=c["GCA"]=c["GCG"]="A"
c["TAT"]=c["TAC"]="Y"
c["CAT"]=c["CAC"]="H"
c["CAA"]=c["CAG"]="Q"
c["AAT"]=c["AAC"]="N"
c["AAA"]=c["AAG"]="K"
c["GAT"]=c["GAC"]="D"
c["GAA"]=c["GAG"]="E"
c["TGT"]=c["TGC"]="C"
c["TGG"]="W"
c["CGT"]=c["CGC"]=c["CGA"]=c["CGG"]=c["AGA"]=c["AGG"]="R"
c["GGT"]=c["GGC"]=c["GGA"]=c["GGG"]="G"
}
{print;p="";for(i=1;i<=length($0);i+=3)p=p c[substr($0,i,3)]" ";print p}
' genes_contig0028.txt

Hope This Helps, PH.
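
None of the tables in this thread contain the three stop codons (TAA, TAG, TGA), and a line whose length is not a multiple of 3 leaves a partial codon at the end. The following is a rough sketch of one way to handle both, not a definitive version. It assumes the standard genetic code, and the choices made here are assumptions rather than anything from the posts above: the function name translate, "*" for a stop codon, "X" for an unrecognised triplet (for example one containing N), upper-casing the input, and silently dropping an incomplete trailing codon.

```shell
translate() {
  awk 'BEGIN {
    # Standard genetic code, codons enumerated with bases in the order T, C, A, G.
    bases = "TCAG"
    aa = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
    n = 0
    for (i = 1; i <= 4; i++)
      for (j = 1; j <= 4; j++)
        for (k = 1; k <= 4; k++) {
          n++
          c[substr(bases, i, 1) substr(bases, j, 1) substr(bases, k, 1)] = substr(aa, n, 1)
        }
  }
  {
    seq = toupper($0)            # accept lower-case input as well
    p = ""
    # Stop before a trailing fragment shorter than 3 letters.
    for (i = 1; i + 2 <= length(seq); i += 3) {
      codon = substr(seq, i, 3)
      # "in" tests membership without creating an empty array entry.
      p = p ((codon in c) ? c[codon] : "X")
    }
    print p
  }'
}

printf 'ATGTACTAT\nGCGTGGAAATAA\nACTCGCTATGG\n' | translate
```

For these three lines it prints MYY, AWK* and TRY. To run it on the file from the original post, use translate < genes_contig0028.txt.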