Hi all!

I am going to generating multiple 8 base-pair long dependent motif binding sites using Perl.

The thread is:

(1)First,randomly select 4 pairs of nucleotides as the group A.

For example, we randomly select 4 pairs of nucleotides, “CA”, “TG”, “CG”, and “TC” from 16 combinations of two nucleoties :

AA,AC,AG,AT,

CA,CC,CG,CT,

GA,GC,GG,GT,

TA,TC,TG,TT

Then randomly select another 4 pairs of nucleotides, for instance, like “AA”,”CA”,”CT” and “TT” as the group B.

Any pair in group A should be different from the one in group B.

(2)To create the motif binding site. Choose “CA”, the first pair from the group A with 85% probability or “AA” from the group B with 15% probability to occupy the 1st position in the binding site, which means that each pair in the group A will be the dominant pair compared to the one in the group B. The sum of probabilities of corresponding pairs in group A and B will be 1.00 (0.85+0.15=1.00).

Repeat this for the next 3 positions and put 4 pairs of nucleotides together, we will end up with a random binding motif like:

CA --- 1st position

CC --- 2nd position

CG --- 3rd position

TC --- 4th position

CACCCGTC----the 8 bp long binding motif site

We also can form another group A like GG,GT,AT,CG and group

B like GA,AA,AC,CC.

Then we can get a random motif like: GGAAATCG

(3)Repeat this to make another 99 binding motif sites.

I had experience of generating 8 base-pair long indepent motif binding sites with the probability of dominant nucleotide as 0.85. But I don't know how to generate two groups of pairs of nucleotides which are not identical. Thank you very much for suggestions!

Alex

The previous code is:

## $len = length of the sequence - 1

$len = 2;

## $ar[i] letter in position i with probability $ap[i]

## other letters have the same probability (in our case 0.05)

## you have to build the arrays

$ar[0] = "T";

$ap[0] = 0.85;

$ar[1] = "C";

$ap[1] = 0.85;

$ar[2] = "A";

$ap[2] = 0.85;

for ($i=0;$i<=$len;$i++) {

$ar_other[$i][0] = "A";

$ar_other[$i][1] = "C";

$ar_other[$i][2] = "G";

if ($ar[$i] eq "A") {

$ar_other[$i][0] = "T";

} elsif ($ar[$i] eq "C") {

$ar_other[$i][1] = "T";

} elsif ($ar[$i] eq "G") {

$ar_other[$i][2] = "T";

}

$p = rand(1);

#print "$p";

if ($p<(1-$ap[$i])) {

$frac = (1-$ap[$i])/3;

if ($p<(1-$ap[$i]-(2*$frac))) {

$ran_seq[$i] = $ar_other[$i][0];

} elsif ($p<(1-$ap[$i]-$frac)) {

$ran_seq[$i] = $ar_other[$i][1];

} else {

$ran_seq[$i] = $ar_other[$i][2];

}

} else {

$ran_seq[$i] = $ar[$i];

}

}

## you have your sequence in the array @ran_seq

for ($i=0;$i<=$len;$i++) {

print "$ran_seq[$i] ";

}

print "\n";

A overly longer code is:

#!/usr/bin/perl

# randommotifs.pl

#

# This program generates randomized instances of pseudomotifs, i.e.

# sequences that are obtained by adding "noise", according to a

specified

# model to a consensus sequence. To keep this reasonably general, we

# proceed through the following steps:

#

# Define consensus motif and alphabet

# For the required number of repetitions

# For each position in motif

# Define an appropriate probability distribution over the

alphabet

# Produce a character according to the distribution

# ...

# ...

#

# The probability distributions are generated to produce a consensus

character

# with a particular "consensus probability" and all other characters

# with equal probabilites (the noise). This is defined in a single

subroutine

# so it is straightforward to change this for a different model of noise

# (e.g. increase the probabilities for "similar" characters). It's easy

# to become creative here and incorporate (biological) knowledge into

the

# noise-model.

#

# Probablities are then converted into cumulative intervals to chose a

# particular character:

#

# Eg: Probabilities A: 0.4, C: 0.3, G: 0.2, T: 0.1

# Intervals A: 0.4 C: 0.7, G: 0.9, T: 1.0

#

# This example can be pictured as

#

# 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

# +----+----+----+----+----+----

+----+----+----+----+

# | A | C | G | T |

# +----+----+----+----+----+----+----+----+----+----+

#

# .. now all we need to do is generate a random number between 0 and 1

# and see in which interval it falls. This then points to the character

to be

# written to output.

#

# B. Steipe, July 2005. Public domain code.

# =====================================================================

use strict;

use warnings;

my $N_Sequences = 100; # Number of sequences to be

generated

my @Motif = split(//,'TTGACATATAAT'); # This example is simply a

concatenation

# of the -35 and -10 consensus

elements

# of prokaryotic promoters;

replace with

# whatever ...

my @Alphabet = split(//,'ACGT'); # Nucleotides; replace with

whatever ...

my $P_Consensus = 0.85; # Replace with whatever ...

# ====== Globals ==========================

my @Probabilities; # Stores the probability of each

character

# ====== Program ==========================

for (my $i=0; $i < $N_Sequences; $i++) {

for (my $j=0; $j < scalar(@Motif); $j++) {

loadConsensusCharacter($Motif[$j]); # Load the character for

this position

addNoiseToDistribution(); # Modify probabilities

according to noise model

convertToIntervals();

print(getRandomCharacter(rand(1.0))); # Output character

}

print "\n";

}

exit();

# ====== Subroutines =======================

#

sub loadConsensusCharacter {

my ($char) = @_;

my $Found = 'FALSE';

for (my $i=0; $i < scalar(@Alphabet); $i++) {

if ( $char eq $Alphabet[$i]) { # found character in i_th

position in Alphabet

$Probabilities[$i] = 1.0;

$Found = 'TRUE';

} else {

$Probabilities[$i] = 0.0;

}

}

if ($Found eq 'FALSE') {

die("Panic: Motif-Character\"$char\" was not found in Alphabet.

Aborting.\n");

}

return();

}

# ==========================================

sub addNoiseToDistribution {

# This implements a very simple noise model:

# We work on an array in which one element is 1.0 and

# all others are 0.0. The element with 1.0 has the same

# index as the consensus character in the Alphabet.

#

# We change that value to the consensus probability and

# we distribute the remaining probability equally among

# all other elements.

#

# It should be straightforward to implement more elaborate

# noise models, or use a position specific scoring matrix

# or something else here.

my $P_NonConsensus = ( 1.0-$P_Consensus) / (scalar(@Alphabet) - 1);

for (my $i=0; $i < scalar(@Probabilities); $i++) {

if ( $Probabilities[$i] == 1.0 ) { # ... this is the consensus

character

$Probabilities[$i] = $P_Consensus;

} else {

$Probabilities[$i] = $P_NonConsensus;

}

}

return();

}

# ==========================================

sub convertToIntervals {

my $Sum = 0;

# Convert the values of the probabilities array to the top

boundaries

# of intervals having widths proportional to their relative

frequency.

# Numbers range from 0 to 1.0 ...

for (my $i=1; $i < scalar(@Probabilities); $i++) {

$Probabilities[$i] += $Probabilities[$i-1];

}

return();

}

# ==========================================

sub getRandomCharacter {

my ($RandomNumber) = @_;

my $i=0;

# Test which interval contains the RandomNumber ...

# The index of the interval points to the Event we should choose.

for ($i=0; $i < scalar(@Probabilities); $i++) {

if ($Probabilities[$i] > $RandomNumber) { last; }

}

return($Alphabet[$i]);

}