Smart questions
Smart answers
Smart people
INTELLIGENT WORK FORUMS
FOR COMPUTER PROFESSIONALS

Member Login




Remember Me
Forgot Password?
Join Us!

Come Join Us!

Are you a
Computer / IT professional?
Join Tek-Tips now!
  • Talk With Other Members
  • Be Notified Of Responses
    To Your Posts
  • Keyword Search
  • One-Click Access To Your
    Favorite Forums
  • Automated Signatures
    On Your Posts
  • Best Of All, It's Free!

Join Tek-Tips
*Tek-Tips's functionality depends on members receiving e-mail. By joining you are opting in to receive e-mail.

Donate Today!

Do you enjoy these
technical forums?
Donate Today! Click Here

Posting Guidelines

Promoting, selling, recruiting, coursework and thesis posting is forbidden.
Jobs from Indeed

Link To This Forum!

Partner Button
Add Stickiness To Your Site By Linking To This Professionally Managed Technical Forum.
Just copy and paste the
code below into your site.

Seek help on Generating multiple 8 base-pair long dependent motif Helpful Member! 

Everwood (TechnicalUser) (OP)
21 Jul 05 15:16
Hi all!

I am going to generating multiple 8 base-pair long dependent motif binding sites using Perl.

The thread is:

(1)First,randomly select 4 pairs of nucleotides as the group A.
 
For example, we randomly select 4 pairs of nucleotides, “CA”, “TG”, “CG”, and “TC” from 16 combinations of two nucleoties :

AA,AC,AG,AT,
CA,CC,CG,CT,
GA,GC,GG,GT,
TA,TC,TG,TT          

      
Then randomly select another 4 pairs of nucleotides, for instance, like “AA”,”CA”,”CT” and “TT” as the group B.

Any pair in group A should be different from the one in group B.

(2)To create the motif binding site. Choose “CA”, the first pair from the group A with 85% probability or “AA” from the group B with 15% probability to occupy the 1st position in the binding site, which means that each pair in the group A will be the dominant pair compared to the one in the group B. The sum of probabilities of corresponding pairs in group A and B will be 1.00 (0.85+0.15=1.00).

Repeat this for the next 3 positions and put 4 pairs of nucleotides together, we will end up with a random binding motif like:

CA --- 1st position                       
CC --- 2nd position                    
CG --- 3rd position                       
TC --- 4th position

CACCCGTC----the 8 bp long binding motif site

We also can form another group A like GG,GT,AT,CG and group
B like GA,AA,AC,CC.

Then we can get a random motif like: GGAAATCG


(3)Repeat this to make another 99 binding motif sites.



I had experience of generating 8 base-pair long indepent motif binding sites with the probability of dominant nucleotide as 0.85. But I don't know how to generate two groups of pairs of nucleotides which are not identical. Thank you very much for suggestions!

Alex



The previous code is:

## $len = length of the sequence - 1
$len = 2;

## $ar[i] letter in position i with probability $ap[i]
## other letters have the same probability (in our case 0.05)
## you have to build the arrays
$ar[0] = "T";
$ap[0] = 0.85;

$ar[1] = "C";
$ap[1] = 0.85;

$ar[2] = "A";
$ap[2] = 0.85;

for ($i=0;$i<=$len;$i++) {

    $ar_other[$i][0] = "A";
    $ar_other[$i][1] = "C";
    $ar_other[$i][2] = "G";

    if ($ar[$i] eq "A") {
        $ar_other[$i][0] = "T";
    } elsif ($ar[$i] eq "C") {
        $ar_other[$i][1] = "T";
    } elsif ($ar[$i] eq "G") {
        $ar_other[$i][2] = "T";
    }

    $p = rand(1);
    #print "$p";
    if ($p<(1-$ap[$i])) {
        $frac = (1-$ap[$i])/3;
        if ($p<(1-$ap[$i]-(2*$frac))) {
                $ran_seq[$i] = $ar_other[$i][0];
        } elsif ($p<(1-$ap[$i]-$frac)) {
                $ran_seq[$i] = $ar_other[$i][1];
        }  else {
                $ran_seq[$i] = $ar_other[$i][2];
        }
    } else {

        $ran_seq[$i] = $ar[$i];

    }

}

## you have your sequence in the array @ran_seq
for ($i=0;$i<=$len;$i++) {
    print "$ran_seq[$i] ";
}
print "\n";


A overly longer code is:

#!/usr/bin/perl
# randommotifs.pl
#
# This program generates randomized instances of pseudomotifs, i.e.
# sequences that are obtained by adding "noise", according to a
specified
# model to a consensus sequence. To keep this reasonably general, we
# proceed through the following steps:
#
#    Define consensus motif and alphabet
#    For the required number of repetitions
#       For each position in motif
#          Define an appropriate probability distribution over the
alphabet
#          Produce a character according to the distribution
#       ...
#    ...
#
# The probability distributions are generated to produce a consensus
character
# with a particular "consensus probability" and all other characters
# with equal probabilites (the noise). This is defined in a single
subroutine
# so it is straightforward to change this for a different model of noise
# (e.g. increase the probabilities for "similar" characters). It's easy
# to become creative here and incorporate (biological) knowledge into
the
# noise-model.
#
# Probablities are then converted into cumulative intervals to chose a
# particular character:
#
# Eg: Probabilities  A: 0.4, C: 0.3, G: 0.2, T: 0.1
#     Intervals      A: 0.4  C: 0.7, G: 0.9, T: 1.0
#
# This example can be pictured as
#
#   0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
#    +----+----+----+----+----+----
+----+----+----+----+
#    |         A         |      C       |    G    | T  |
#    +----+----+----+----+----+----+----+----+----+----+
#
# .. now all we need to do is generate a random number between 0 and 1
# and see in which interval it falls. This then points to the character
to be
# written to output.
#
# B. Steipe, July 2005. Public domain code.
# =====================================================================

use strict;
use warnings;


my $N_Sequences = 100;                # Number of sequences to be
generated
my @Motif = split(//,'TTGACATATAAT'); # This example is simply a
concatenation
                                      # of the -35 and -10 consensus
elements
                                      # of prokaryotic promoters;
replace with
                                      # whatever ...
my @Alphabet = split(//,'ACGT');      # Nucleotides; replace with
whatever ...
my $P_Consensus = 0.85;               # Replace with whatever ...

# ====== Globals ==========================
my @Probabilities;              # Stores the probability of each
character
# ====== Program ==========================


for (my $i=0; $i < $N_Sequences; $i++) {
    for (my $j=0; $j < scalar(@Motif); $j++) {

        loadConsensusCharacter($Motif[$j]);    # Load the character for
this position
        addNoiseToDistribution();              # Modify probabilities
according to noise model
        convertToIntervals();
        print(getRandomCharacter(rand(1.0)));  # Output character
    }
print "\n";
}

exit();

# ====== Subroutines =======================
#
sub loadConsensusCharacter {
    my ($char) = @_;
    my $Found = 'FALSE';

    for (my $i=0; $i < scalar(@Alphabet); $i++) {
        if ( $char eq $Alphabet[$i]) { # found character in i_th
position in Alphabet
            $Probabilities[$i] = 1.0;
            $Found = 'TRUE';
        } else {
            $Probabilities[$i] = 0.0;
        }
    }
    if ($Found eq 'FALSE') {
    die("Panic: Motif-Character\"$char\" was not found in Alphabet.
Aborting.\n");
    }

return();
}

# ==========================================
sub addNoiseToDistribution {

    # This implements a very simple noise model:
    # We work on an array in which one element is 1.0 and
    # all others are 0.0. The element with 1.0 has the same
    # index as the consensus character in the Alphabet.
    #
    # We change that value to the consensus probability and
    # we distribute the remaining probability equally among
    # all other elements.
    #
    # It should be straightforward to implement more elaborate
    # noise models, or use a position specific scoring matrix
    # or something else here.

    my $P_NonConsensus = ( 1.0-$P_Consensus) / (scalar(@Alphabet) - 1);

    for (my $i=0; $i < scalar(@Probabilities); $i++) {
        if ( $Probabilities[$i] == 1.0 ) {     # ... this is the consensus
character
            $Probabilities[$i] = $P_Consensus;
        } else {
            $Probabilities[$i] = $P_NonConsensus;
        }
    }

    return();
}

# ==========================================
sub convertToIntervals {

    my $Sum = 0;

    # Convert the values of the probabilities array to the top
boundaries
    # of intervals having widths proportional to their relative
frequency.
    # Numbers range from 0 to 1.0  ...

    for (my $i=1; $i < scalar(@Probabilities); $i++) {
        $Probabilities[$i] += $Probabilities[$i-1];
    }

    return();
}

# ==========================================
sub getRandomCharacter {

    my ($RandomNumber) = @_;
    my $i=0;

    # Test which interval contains the RandomNumber ...
    # The index of the interval points to the Event we should choose.

    for ($i=0; $i < scalar(@Probabilities); $i++) {
        if ($Probabilities[$i] > $RandomNumber) { last; }
    }

    return($Alphabet[$i]);
}
Helpful Member!  TrojanWarBlade (Programmer)
21 Jul 05 17:16
I don't know if this helps or answers any of your questions but I had a go anyway:

CODE

#!/usr/bin/perl -w
use strict;
my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $probab  = 0.85;

my $i;
my @major   = ();    # Cluster of 4 nucleotides (A)
my @minor   = ();    # Cluster of 4 nucleotides (B)

# Pick 4 random pairs
my @temp    = @fullset;
for($i = 0; $i<4; ++$i) {
  push @major, $temp[int(rand(scalar(@temp)))];
}
print "@major\n";

# Pick 4 random pairs (excluding matches to @major)
   @temp    = @fullset;
for($i = 0; $i<4; ++$i) {
  my $index = int(rand(scalar(@temp)));
  redo unless($major[$i] ne $temp[$index]);
  push @minor, $temp[$index];
}
print "@minor\n";

# Generate resultant set based on previously defined probability
my @result = ();
for($i = 0; $i<4; ++$i) {
  push @result, rand(1.0) > $probab ? $major[$i] : $minor[$i];
}
print "@result\n";
You can obviously loop for 99 times if you want to.

Trojan.
Everwood (TechnicalUser) (OP)
21 Jul 05 17:27
Hi Trojan,

Thank you very much first of all.

What I mentioned the probabilty of 0.85 of pair
in the group A (major group in your case)means
that it can be randomly selected 85 times out of
100 times vs. the pair in group B 15 times out of
100 times.

I found you assigned 0.85 to the $probab in the code.
Can you clarifying this ?

Thank you for help!

Alex
TrojanWarBlade (Programmer)
21 Jul 05 17:28
Slightly modified version here that eliminates the chance that two pairs can be identical in one nucleotide:

CODE

#!/usr/bin/perl -w
use strict;
my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $probab  = 0.85;

my $i;
my @major   = ();    # Cluster of 4 nucleotides (A)
my @minor   = ();    # Cluster of 4 nucleotides (B)

# Pick 4 random pairs
my @temp    = @fullset;
for($i = 0; $i<4; ++$i) {
  my $index = int(rand(scalar(@temp)));
  push @major, $temp[$index];
  splice(@temp, $index, 1);
}
print "@major\n";

# Pick 4 random pairs (excluding matches to @major)
   @temp    = @fullset;
for($i = 0; $i<4; ++$i) {
  my $index = int(rand(scalar(@temp)));
  redo unless($major[$i] ne $temp[$index]);
  push @minor, $temp[$index];
  splice(@temp, $index, 1);
}
print "@minor\n";

# Generate resultant set based on previously defined probability
my @result = ();
for($i = 0; $i<4; ++$i) {
  push @result, rand(1.0) > $probab ? $major[$i] : $minor[$i];
}
print "@result\n";



Trojan.
TrojanWarBlade (Programmer)
21 Jul 05 17:30
Yes, no problem.
The rand(1.0) will generate a random number between 0.0000 and 0.99999 (ish).
Therefore when my code says:

CODE

push @result, rand(1.0) > $probab ? $major[$i] : $minor[$i];
If the rand(1.0) prduces something above 0.85 (say) 0.92364 then yuo get the minor pair, otherwise you get the major pair.
This means that 85% of the time (generally) you get the major pair.

Hope that helps explain.


Trojan.
Everwood (TechnicalUser) (OP)
21 Jul 05 17:31
did not get you here.

"two pairs can be identical in one nucleotide"?

"one nucleotide" means "a position" in a binding site?
Everwood (TechnicalUser) (OP)
21 Jul 05 17:35
Trojan,

Can u show a generated motif here?
TrojanWarBlade (Programmer)
21 Jul 05 17:37
Sorry, genetics is not my thing so I'm trying.

The "splice" that I added was to remove the chance of getting something like this:

AA CG TA CG

It removes the "CG" from the "@temp" list when it's been used so that it cannot be selected again.

I did not know if duplicate pairs were allowed so I gave you both options.


Trojan.
TrojanWarBlade (Programmer)
21 Jul 05 17:38
It's at the end "print @result"

(I hope I'm understanding this enough and not making a fool of myself here!!!)


Trojan.
Everwood (TechnicalUser) (OP)
21 Jul 05 17:41
Hi Trojan,

I would say "duplicate pairs" are not allowed
to appear.

We will have two groups A and B (major and minor here).
Each of them is different from another pair. So the ideal
result should never be like AA CG TA CG.

Regards,
  Alex
TrojanWarBlade (Programmer)
21 Jul 05 17:43
That's what I thought and that's why I gave you the second version.  It's removes that possibility.

The motif is in @result and therefore you may only need to print that.


Trojan.
Everwood (TechnicalUser) (OP)
21 Jul 05 17:44
thank you, trojan!
TrojanWarBlade (Programmer)
21 Jul 05 17:46
Here's a sample run of 100:
CAAGACAA
ACAGTTCA
CCGAATTC
GATACTAA
GTTTCACC
TCATCTCA
CCGCCGAT
TATTGTCA
ACGTAACG
TGAGGCTT
CAAATTAC
TCGACGGG
GGTCAGGT
ATAAGGCA
ACGGGCTG
CTCCTCGT
GTAATCAT
GACAAGAA
AACCGTGA
GTCCACGG
AAGCCGGG
CTTGTCCT
GTTCTTAC
AATGCCGG
TAGCAGGA
GTCTTGAG
AGATCTGA
ATTTGATG
TGTTGGCG
CACTGGTT
AGTGAAGG
CTATGATC
GCGATCTA
AATCCAAT
TGATCGCA
TTTGAGAA
TGGTATCT
TCTGGACG
AACCCTGA
TTCCACCT
CGCTCCTG
GACTAGGG
CCAAGCAT
CAACGTTA
TCCTTAAA
TCCCGCAT
TCGCCATT
AAGCTCTC
GTACCGTT
CCAATCCT
CAATCGTC
AGTCGCTA
TACACGAG
TTCGGTAT
ACCAATCG
TGGTGATA
ACCTATTG
TTCCGTAA
CGGCATCT
AAACATGA
CTTATAAT
CGAAGTGT
CCCCCATG
GACTAGAC
CAAAGTTA
AAGTAGTC
TTACGTCA
GACTGGAT
AATATTGG
CTGGGTAT
CTTCAGGT
TCTGTTGA
AAGTGCGT
ACCTTCGT
TAGAGACG
ACAGTCGA
AACGTCTT
GGCATGAT
TGGTTTAT
TTAACTAG
GTAATGGC
CTTGTCAT
TCTGCACA
ACTCAGAA
CTGACATG
ATTCAACC
TAGTCAAC
GACACGGT
AAGACTGC
ACGAGGCC
GTACGGTC
AGCCAGTG
CGTTGATC
CTTCAAAT
ACGTTAGA
CCAACGTC
TTCCGTGC
TTCGGTTC
TCCAATAC


And here's the final code that created it:

CODE

#!/usr/bin/perl -w
use strict;
my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $probab  = 0.85;

for(my $j=0; $j<99; $j++) {
my $i;
my @major   = ();    # Cluster of 4 nucleotides (A)
my @minor   = ();    # Cluster of 4 nucleotides (B)

# Pick 4 random pairs
my @temp    = @fullset;
for($i = 0; $i<4; ++$i) {
  my $index = int(rand(scalar(@temp)));
  push @major, $temp[$index];
  splice(@temp, $index, 1);
}
#print "@major\n";

# Pick 4 random pairs (excluding matches to @major)
   @temp    = @fullset;
for($i = 0; $i<4; ++$i) {
  my $index = int(rand(scalar(@temp)));
  redo unless($major[$i] ne $temp[$index]);
  push @minor, $temp[$index];
  splice(@temp, $index, 1);
}
#print "@minor\n";

# Generate resultant set based on previously defined probability
my @result = ();
for($i = 0; $i<4; ++$i) {
  push @result, rand(1.0) > $probab ? $major[$i] : $minor[$i];
}
print join("", @result), "\n";
}
Is this what you wanted?

Trojan.
TrojanWarBlade (Programmer)
21 Jul 05 17:47
Whoops, sorry, change the 99 to 100!


Trojan.
Everwood (TechnicalUser) (OP)
21 Jul 05 17:50
yeah,that's it! Wonderful! Appreciated!

Alex
TrojanWarBlade (Programmer)
21 Jul 05 17:52
Cool, glad to help, sorry I didn't understand the terminology properly.  Maybe I'll learn in time.

Good luck.  


Trojan.
KevinADC (TechnicalUser)
22 Jul 05 4:08
this was only an excersize for fun:

CODE

#!perl
use strict;
use Benchmark qw(:all);

my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $probab  = 0.85;
my $reps = 1000;
my $r = timethese($reps, {
   'Kevin' => \&Kevin,
   'Trojan' => \&Trojan,
});
cmpthese($r);

sub Kevin {    
   my %results = ();
   for (0..99) {
      my %base_pairs = ();
      my $motif = '';
      until (scalar keys %base_pairs == 8) {
         my $index = int rand @fullset;
         $base_pairs{$fullset[$index]} = $fullset[$index];
      }
      my @base_pairs = keys %base_pairs;
      my @major = @base_pairs[0..3];
      my @minor = @base_pairs[4..7];
      $motif .= rand(1) > $probab ? $minor[$_] : $major[$_] for (0..3);
      redo if exists $results{$motif};
      $results{$motif} = $motif;
   }
   print "$_\n" for keys %results;
}

sub Trojan {
for(my $j=0; $j<99; $j++) {
   my $i;
   my @major   = ();    # Cluster of 4 nucleotides (A)
   my @minor   = ();    # Cluster of 4 nucleotides (B)
  
   # Pick 4 random pairs
   my @temp    = @fullset;
   for($i = 0; $i<4; ++$i) {
     my $index = int(rand(scalar(@temp)));
     push @major, $temp[$index];
     splice(@temp, $index, 1);
   }
   #print "@major\n";
   
   # Pick 4 random pairs (excluding matches to @major)
   @temp    = @fullset;
   for($i = 0; $i<4; ++$i) {
     my $index = int(rand(scalar(@temp)));
     redo unless($major[$i] ne $temp[$index]);
     push @minor, $temp[$index];
     splice(@temp, $index, 1);
   }
   #print "@minor\n";
   
   # Generate resultant set based on previously defined probability
   my @result = ();
   for($i = 0; $i<4; ++$i) {
     push @result, rand(1.0) > $probab ? $major[$i] : $minor[$i];
   }
   print join("", @result), "\n";
}
}

with the results printed, Trojans code was typically a bit slower than my code (25% approx), but with the results not printed Trojans code was faster than mine (by about the same percentage).
KevinADC (TechnicalUser)
22 Jul 05 4:24
I think you have this reversed Trojan:

CODE

push @result, rand(1.0) > $probab ? $major[$i] : $minor[$i];

should be:

CODE

push @result, rand(1.0) <= $probab ? $major[$i] : $minor[$i];


if the rand() function returns a value greater than the probability factor the minor sequence should be used, not the major sequence, which is what is happening in the first line above.
TrojanWarBlade (Programmer)
22 Jul 05 4:41
Well spotted Kevin, I think you're right there.

Not quite so sure about your algorithm for selecting the pairs though.  Would it be fair to say that with your algorithm, @minor could not possibly contain any pairs that appear anywhere in @major and vice versa?
If that's the case, you'll see from the original supplied example that it should be possible to have the same values, as long as they do not share the same index position.


Trojan.
KevinADC (TechnicalUser)
22 Jul 05 15:11
Trojan,

I admit I was (still am) confused if identical base-pairs could be in major and minor as long they were not in same index. But since there could never be duplicated base-pairs in the motif (I think) it seemed easier just to make sure both arrays contained no duplicate base-pairs to begin with. That might make a difference in the final outcome of the generated list, but honestly I am not sure.

Also, the implication of my benchmarking test is that the join() function is extremely slow and looks like it can drag down the run time of an entire script. You might want to look into that if you use join() often in code.

I might play around with my code a little to see if I can speed it up.

Regards,
Kevin

Reply To This Thread

Posting in the Tek-Tips forums is a member-only feature.

Click Here to join Tek-Tips and talk with other members!

Back To Forum

Close Box

Join Tek-Tips® Today!

Join your peers on the Internet's largest technical computer professional community.
It's easy to join and it's free.

Here's Why Members Love Tek-Tips Forums:

Register now while it's still free!

Already a member? Close this window and log in.

Join Us             Close