×
INTELLIGENT WORK FORUMS
FOR COMPUTER PROFESSIONALS

Are you a
Computer / IT professional?
Join Tek-Tips Forums!
• Talk With Other Members
• Be Notified Of Responses
• Keyword Search
Favorite Forums
• Automated Signatures
• Best Of All, It's Free!

*Tek-Tips's functionality depends on members receiving e-mail. By joining you are opting in to receive e-mail.

Posting Guidelines

Promoting, selling, recruiting, coursework and thesis posting is forbidden.

Seek help on Generating multiple 8 base-pair long dependent motif

Seek help on Generating multiple 8 base-pair long dependent motif

(OP)
Hi all!

I am going to generating multiple 8 base-pair long dependent motif binding sites using Perl.

(1)First,randomly select 4 pairs of nucleotides as the group A.

For example, we randomly select 4 pairs of nucleotides, “CA”, “TG”, “CG”, and “TC” from 16 combinations of two nucleoties :

AA,AC,AG,AT,
CA,CC,CG,CT,
GA,GC,GG,GT,
TA,TC,TG,TT

Then randomly select another 4 pairs of nucleotides, for instance, like “AA”,”CA”,”CT” and “TT” as the group B.

Any pair in group A should be different from the one in group B.

(2)To create the motif binding site. Choose “CA”, the first pair from the group A with 85% probability or “AA” from the group B with 15% probability to occupy the 1st position in the binding site, which means that each pair in the group A will be the dominant pair compared to the one in the group B. The sum of probabilities of corresponding pairs in group A and B will be 1.00 (0.85+0.15=1.00).

Repeat this for the next 3 positions and put 4 pairs of nucleotides together, we will end up with a random binding motif like:

CA --- 1st position
CC --- 2nd position
CG --- 3rd position
TC --- 4th position

CACCCGTC----the 8 bp long binding motif site

We also can form another group A like GG,GT,AT,CG and group
B like GA,AA,AC,CC.

Then we can get a random motif like: GGAAATCG

(3)Repeat this to make another 99 binding motif sites.

I had experience of generating 8 base-pair long indepent motif binding sites with the probability of dominant nucleotide as 0.85. But I don't know how to generate two groups of pairs of nucleotides which are not identical. Thank you very much for suggestions!

Alex

The previous code is:

## $len = length of the sequence - 1$len = 2;

## $ar[i] letter in position i with probability$ap[i]
## other letters have the same probability (in our case 0.05)
## you have to build the arrays
$ar[0] = "T";$ap[0] = 0.85;

$ar[1] = "C";$ap[1] = 0.85;

$ar[2] = "A";$ap[2] = 0.85;

for ($i=0;$i<=$len;$i++) {

$ar_other[$i][0] = "A";
$ar_other[$i][1] = "C";
$ar_other[$i][2] = "G";

if ($ar[$i] eq "A") {
$ar_other[$i][0] = "T";
} elsif ($ar[$i] eq "C") {
$ar_other[$i][1] = "T";
} elsif ($ar[$i] eq "G") {
$ar_other[$i][2] = "T";
}

$p = rand(1); #print "$p";
if ($p<(1-$ap[$i])) {$frac = (1-$ap[$i])/3;
if ($p<(1-$ap[$i]-(2*$frac))) {
$ran_seq[$i] = $ar_other[$i][0];
} elsif ($p<(1-$ap[$i]-$frac)) {
$ran_seq[$i] = $ar_other[$i][1];
}  else {
$ran_seq[$i] = $ar_other[$i][2];
}
} else {

$ran_seq[$i] = $ar[$i];

}

}

## you have your sequence in the array @ran_seq
for ($i=0;$i<=$len;$i++) {
print "$ran_seq[$i] ";
}
print "\n";

A overly longer code is:

#!/usr/bin/perl
# randommotifs.pl
#
# This program generates randomized instances of pseudomotifs, i.e.
# sequences that are obtained by adding "noise", according to a
specified
# model to a consensus sequence. To keep this reasonably general, we
# proceed through the following steps:
#
#    Define consensus motif and alphabet
#    For the required number of repetitions
#       For each position in motif
#          Define an appropriate probability distribution over the
alphabet
#          Produce a character according to the distribution
#       ...
#    ...
#
# The probability distributions are generated to produce a consensus
character
# with a particular "consensus probability" and all other characters
# with equal probabilites (the noise). This is defined in a single
subroutine
# so it is straightforward to change this for a different model of noise
# (e.g. increase the probabilities for "similar" characters). It's easy
# to become creative here and incorporate (biological) knowledge into
the
# noise-model.
#
# Probablities are then converted into cumulative intervals to chose a
# particular character:
#
# Eg: Probabilities  A: 0.4, C: 0.3, G: 0.2, T: 0.1
#     Intervals      A: 0.4  C: 0.7, G: 0.9, T: 1.0
#
# This example can be pictured as
#
#   0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
#    +----+----+----+----+----+----
+----+----+----+----+
#    |         A         |      C       |    G    | T  |
#    +----+----+----+----+----+----+----+----+----+----+
#
# .. now all we need to do is generate a random number between 0 and 1
# and see in which interval it falls. This then points to the character
to be
# written to output.
#
# B. Steipe, July 2005. Public domain code.
# =====================================================================

use strict;
use warnings;

my $N_Sequences = 100; # Number of sequences to be generated my @Motif = split(//,'TTGACATATAAT'); # This example is simply a concatenation # of the -35 and -10 consensus elements # of prokaryotic promoters; replace with # whatever ... my @Alphabet = split(//,'ACGT'); # Nucleotides; replace with whatever ... my$P_Consensus = 0.85;               # Replace with whatever ...

# ====== Globals ==========================
my @Probabilities;              # Stores the probability of each
character
# ====== Program ==========================

for (my $i=0;$i < $N_Sequences;$i++) {
for (my $j=0;$j < scalar(@Motif); $j++) { loadConsensusCharacter($Motif[$j]); # Load the character for this position addNoiseToDistribution(); # Modify probabilities according to noise model convertToIntervals(); print(getRandomCharacter(rand(1.0))); # Output character } print "\n"; } exit(); # ====== Subroutines ======================= # sub loadConsensusCharacter { my ($char) = @_;
my $Found = 'FALSE'; for (my$i=0; $i < scalar(@Alphabet);$i++) {
if ( $char eq$Alphabet[$i]) { # found character in i_th position in Alphabet$Probabilities[$i] = 1.0;$Found = 'TRUE';
} else {
$Probabilities[$i] = 0.0;
}
}
if ($Found eq 'FALSE') { die("Panic: Motif-Character\"$char\" was not found in Alphabet.
Aborting.\n");
}

return();
}

# ==========================================

# This implements a very simple noise model:
# We work on an array in which one element is 1.0 and
# all others are 0.0. The element with 1.0 has the same
# index as the consensus character in the Alphabet.
#
# We change that value to the consensus probability and
# we distribute the remaining probability equally among
# all other elements.
#
# It should be straightforward to implement more elaborate
# noise models, or use a position specific scoring matrix
# or something else here.

my $P_NonConsensus = ( 1.0-$P_Consensus) / (scalar(@Alphabet) - 1);

for (my $i=0;$i < scalar(@Probabilities); $i++) { if ($Probabilities[$i] == 1.0 ) { # ... this is the consensus character$Probabilities[$i] =$P_Consensus;
} else {
$Probabilities[$i] = $P_NonConsensus; } } return(); } # ========================================== sub convertToIntervals { my$Sum = 0;

# Convert the values of the probabilities array to the top
boundaries
# of intervals having widths proportional to their relative
frequency.
# Numbers range from 0 to 1.0  ...

for (my $i=1;$i < scalar(@Probabilities); $i++) {$Probabilities[$i] +=$Probabilities[$i-1]; } return(); } # ========================================== sub getRandomCharacter { my ($RandomNumber) = @_;
my $i=0; # Test which interval contains the RandomNumber ... # The index of the interval points to the Event we should choose. for ($i=0; $i < scalar(@Probabilities);$i++) {
if ($Probabilities[$i] > $RandomNumber) { last; } } return($Alphabet[$i]); } RE: Seek help on Generating multiple 8 base-pair long dependent motif I don't know if this helps or answers any of your questions but I had a go anyway: CODE #!/usr/bin/perl -w use strict; my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT); my$probab  = 0.85;

my $i; my @major = (); # Cluster of 4 nucleotides (A) my @minor = (); # Cluster of 4 nucleotides (B) # Pick 4 random pairs my @temp = @fullset; for($i = 0; $i<4; ++$i) {
push @major, $temp[int(rand(scalar(@temp)))]; } print "@major\n"; # Pick 4 random pairs (excluding matches to @major) @temp = @fullset; for($i = 0; $i<4; ++$i) {
my $index = int(rand(scalar(@temp))); redo unless($major[$i] ne$temp[$index]); push @minor,$temp[$index]; } print "@minor\n"; # Generate resultant set based on previously defined probability my @result = (); for($i = 0; $i<4; ++$i) {
push @result, rand(1.0) > $probab ?$major[$i] :$minor[$i]; } print "@result\n"; You can obviously loop for 99 times if you want to. Trojan. RE: Seek help on Generating multiple 8 base-pair long dependent motif (OP) Hi Trojan, Thank you very much first of all. What I mentioned the probabilty of 0.85 of pair in the group A (major group in your case)means that it can be randomly selected 85 times out of 100 times vs. the pair in group B 15 times out of 100 times. I found you assigned 0.85 to the$probab in the code.
Can you clarifying this ?

Thank you for help!

Alex

RE: Seek help on Generating multiple 8 base-pair long dependent motif

Slightly modified version here that eliminates the chance that two pairs can be identical in one nucleotide:

CODE

#!/usr/bin/perl -w
use strict;
my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $probab = 0.85; my$i;
my @major   = ();    # Cluster of 4 nucleotides (A)
my @minor   = ();    # Cluster of 4 nucleotides (B)

# Pick 4 random pairs
my @temp    = @fullset;
for($i = 0;$i<4; ++$i) { my$index = int(rand(scalar(@temp)));
push @major, $temp[$index];
splice(@temp, $index, 1); } print "@major\n"; # Pick 4 random pairs (excluding matches to @major) @temp = @fullset; for($i = 0; $i<4; ++$i) {
my $index = int(rand(scalar(@temp))); redo unless($major[$i] ne$temp[$index]); push @minor,$temp[$index]; splice(@temp,$index, 1);
}
print "@minor\n";

# Generate resultant set based on previously defined probability
my @result = ();
for($i = 0;$i<4; ++$i) { push @result, rand(1.0) >$probab ? $major[$i] : $minor[$i];
}
print "@result\n";

Trojan.

RE: Seek help on Generating multiple 8 base-pair long dependent motif

Yes, no problem.
The rand(1.0) will generate a random number between 0.0000 and 0.99999 (ish).
Therefore when my code says:

push @result, rand(1.0) > $probab ?$major[$i] :$minor[$i]; If the rand(1.0) prduces something above 0.85 (say) 0.92364 then yuo get the minor pair, otherwise you get the major pair. This means that 85% of the time (generally) you get the major pair. Hope that helps explain. Trojan. RE: Seek help on Generating multiple 8 base-pair long dependent motif (OP) did not get you here. "two pairs can be identical in one nucleotide"? "one nucleotide" means "a position" in a binding site? RE: Seek help on Generating multiple 8 base-pair long dependent motif (OP) Trojan, Can u show a generated motif here? RE: Seek help on Generating multiple 8 base-pair long dependent motif Sorry, genetics is not my thing so I'm trying. The "splice" that I added was to remove the chance of getting something like this: AA CG TA CG It removes the "CG" from the "@temp" list when it's been used so that it cannot be selected again. I did not know if duplicate pairs were allowed so I gave you both options. Trojan. RE: Seek help on Generating multiple 8 base-pair long dependent motif It's at the end "print @result" (I hope I'm understanding this enough and not making a fool of myself here!!!) Trojan. RE: Seek help on Generating multiple 8 base-pair long dependent motif (OP) Hi Trojan, I would say "duplicate pairs" are not allowed to appear. We will have two groups A and B (major and minor here). Each of them is different from another pair. So the ideal result should never be like AA CG TA CG. Regards, Alex RE: Seek help on Generating multiple 8 base-pair long dependent motif That's what I thought and that's why I gave you the second version. It's removes that possibility. The motif is in @result and therefore you may only need to print that. Trojan. RE: Seek help on Generating multiple 8 base-pair long dependent motif (OP) thank you, trojan! RE: Seek help on Generating multiple 8 base-pair long dependent motif Here's a sample run of 100: CAAGACAA ACAGTTCA CCGAATTC GATACTAA GTTTCACC TCATCTCA CCGCCGAT TATTGTCA ACGTAACG TGAGGCTT CAAATTAC TCGACGGG GGTCAGGT ATAAGGCA ACGGGCTG CTCCTCGT GTAATCAT GACAAGAA AACCGTGA GTCCACGG AAGCCGGG CTTGTCCT GTTCTTAC AATGCCGG TAGCAGGA GTCTTGAG AGATCTGA ATTTGATG TGTTGGCG CACTGGTT AGTGAAGG CTATGATC GCGATCTA AATCCAAT TGATCGCA TTTGAGAA TGGTATCT TCTGGACG AACCCTGA TTCCACCT CGCTCCTG GACTAGGG CCAAGCAT CAACGTTA TCCTTAAA TCCCGCAT TCGCCATT AAGCTCTC GTACCGTT CCAATCCT CAATCGTC AGTCGCTA TACACGAG TTCGGTAT ACCAATCG TGGTGATA ACCTATTG TTCCGTAA CGGCATCT AAACATGA CTTATAAT CGAAGTGT CCCCCATG GACTAGAC CAAAGTTA AAGTAGTC TTACGTCA GACTGGAT AATATTGG CTGGGTAT CTTCAGGT TCTGTTGA AAGTGCGT ACCTTCGT TAGAGACG ACAGTCGA AACGTCTT GGCATGAT TGGTTTAT TTAACTAG GTAATGGC CTTGTCAT TCTGCACA ACTCAGAA CTGACATG ATTCAACC TAGTCAAC GACACGGT AAGACTGC ACGAGGCC GTACGGTC AGCCAGTG CGTTGATC CTTCAAAT ACGTTAGA CCAACGTC TTCCGTGC TTCGGTTC TCCAATAC And here's the final code that created it: CODE #!/usr/bin/perl -w use strict; my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT); my$probab  = 0.85;

for(my $j=0;$j<99; $j++) { my$i;
my @major   = ();    # Cluster of 4 nucleotides (A)
my @minor   = ();    # Cluster of 4 nucleotides (B)

# Pick 4 random pairs
my @temp    = @fullset;
for($i = 0;$i<4; ++$i) { my$index = int(rand(scalar(@temp)));
push @major, $temp[$index];
splice(@temp, $index, 1); } #print "@major\n"; # Pick 4 random pairs (excluding matches to @major) @temp = @fullset; for($i = 0; $i<4; ++$i) {
my $index = int(rand(scalar(@temp))); redo unless($major[$i] ne$temp[$index]); push @minor,$temp[$index]; splice(@temp,$index, 1);
}
#print "@minor\n";

# Generate resultant set based on previously defined probability
my @result = ();
for($i = 0;$i<4; ++$i) { push @result, rand(1.0) >$probab ? $major[$i] : $minor[$i];
}
print join("", @result), "\n";
}
Is this what you wanted?

Trojan.

RE: Seek help on Generating multiple 8 base-pair long dependent motif

Whoops, sorry, change the 99 to 100!

Trojan.

RE: Seek help on Generating multiple 8 base-pair long dependent motif

(OP)
yeah,that's it! Wonderful! Appreciated!

Alex

RE: Seek help on Generating multiple 8 base-pair long dependent motif

Cool, glad to help, sorry I didn't understand the terminology properly.  Maybe I'll learn in time.

Good luck.

Trojan.

RE: Seek help on Generating multiple 8 base-pair long dependent motif

this was only an excersize for fun:

CODE

#!perl
use strict;
use Benchmark qw(:all);

my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $probab = 0.85; my$reps = 1000;
my $r = timethese($reps, {
'Kevin' => \&Kevin,
'Trojan' => \&Trojan,
});
cmpthese($r); sub Kevin { my %results = (); for (0..99) { my %base_pairs = (); my$motif = '';
until (scalar keys %base_pairs == 8) {
my $index = int rand @fullset;$base_pairs{$fullset[$index]} = $fullset[$index];
}
my @base_pairs = keys %base_pairs;
my @major = @base_pairs[0..3];
my @minor = @base_pairs[4..7];
$motif .= rand(1) >$probab ? $minor[$_] : $major[$_] for (0..3);
redo if exists $results{$motif};
$results{$motif} = $motif; } print "$_\n" for keys %results;
}

sub Trojan {
for(my $j=0;$j<99; $j++) { my$i;
my @major   = ();    # Cluster of 4 nucleotides (A)
my @minor   = ();    # Cluster of 4 nucleotides (B)

# Pick 4 random pairs
my @temp    = @fullset;
for($i = 0;$i<4; ++$i) { my$index = int(rand(scalar(@temp)));
push @major, $temp[$index];
splice(@temp, $index, 1); } #print "@major\n"; # Pick 4 random pairs (excluding matches to @major) @temp = @fullset; for($i = 0; $i<4; ++$i) {
my $index = int(rand(scalar(@temp))); redo unless($major[$i] ne$temp[$index]); push @minor,$temp[$index]; splice(@temp,$index, 1);
}
#print "@minor\n";

# Generate resultant set based on previously defined probability
my @result = ();
for($i = 0;$i<4; ++$i) { push @result, rand(1.0) >$probab ? $major[$i] : $minor[$i];
}
print join("", @result), "\n";
}
}

with the results printed, Trojans code was typically a bit slower than my code (25% approx), but with the results not printed Trojans code was faster than mine (by about the same percentage).

RE: Seek help on Generating multiple 8 base-pair long dependent motif

I think you have this reversed Trojan:

push @result, rand(1.0) > $probab ?$major[$i] :$minor[$i]; should be: CODE push @result, rand(1.0) <=$probab ? $major[$i] : $minor[$i];

if the rand() function returns a value greater than the probability factor the minor sequence should be used, not the major sequence, which is what is happening in the first line above.

RE: Seek help on Generating multiple 8 base-pair long dependent motif

Well spotted Kevin, I think you're right there.

Not quite so sure about your algorithm for selecting the pairs though.  Would it be fair to say that with your algorithm, @minor could not possibly contain any pairs that appear anywhere in @major and vice versa?
If that's the case, you'll see from the original supplied example that it should be possible to have the same values, as long as they do not share the same index position.

Trojan.

RE: Seek help on Generating multiple 8 base-pair long dependent motif

Trojan,

I admit I was (still am) confused if identical base-pairs could be in major and minor as long they were not in same index. But since there could never be duplicated base-pairs in the motif (I think) it seemed easier just to make sure both arrays contained no duplicate base-pairs to begin with. That might make a difference in the final outcome of the generated list, but honestly I am not sure.

Also, the implication of my benchmarking test is that the join() function is extremely slow and looks like it can drag down the run time of an entire script. You might want to look into that if you use join() often in code.

I might play around with my code a little to see if I can speed it up.

Regards,
Kevin

Red Flag This Post

Please let us know here why this post is inappropriate. Reasons such as off-topic, duplicates, flames, illegal, vulgar, or students posting their homework.

Red Flag Submitted

Thank you for helping keep Tek-Tips Forums free from inappropriate posts.
The Tek-Tips staff will check this out and take appropriate action.

Close Box

Join Tek-Tips® Today!

Join your peers on the Internet's largest technical computer professional community.
It's easy to join and it's free.

Here's Why Members Love Tek-Tips Forums:

• Talk To Other Members
• Notification Of Responses To Questions
• Favorite Forums One Click Access
• Keyword Search Of All Posts, And More...

Register now while it's still free!