Quick regex question 4

Status
Not open for further replies.

Captainrave

Technical User
Nov 16, 2007
97
GB
I need a regular expression that checks whether the first two characters of a string are the same as its last two characters. The characters can be anything from A-Z and the string can be of any length.

For example:
ATCGTACGAT > I would keep
ATATCTTT > Would be discarded

Is this even possible with a regular expression?
 
In my head this works....
Code:
@data = qw (ATCGTACGAT ATATCTTT);

foreach $ent (@data) {
        $v1 = substr $ent,0,2;
        $v2 = substr $ent,-2,2;
        print "$ent - $v1 -- $v2\n" if ($v1 eq $v2);
}
 
How does that work, PinkeyNBrain? I have very little experience with regular expressions.

This is an extension of what I was working on before. The code I have is:

Code:
while (my$line = <REPEATFILE>){
  my($firstcol)= split /,/, $line;

if($firstcol =~ m/^(..).*(/1)&/){
      print OUTFILE $line;
  }else{
    next;
  }
}

exit;
 
I think I understand it now. I think it is \1 rather than /1.
 
I have now corrected it to the code below. However, I get a blank output file.

Code:
while (my$line = <REPEATFILE>){
  my($firstcol)= split /,/, $line;

if ($firstcol =~ m/^(..).*\1&/){
      print OUTFILE $line;
  }else{
    next;
  }
}

exit;

Not sure why though?
 
Go three keys to the left for your end of line matcher.
Use /^(..).*\1$/
Not /^(..).*\1&/
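
As for why the & version printed nothing: & has no special meaning in a Perl regex, so the pattern demanded a literal ampersand after the repeated pair, and none of the strings contain one. A minimal sketch using the two sample strings from the first post (the sub names ends_like_it_starts and with_typo are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: true when the first two characters
# of the string are repeated as its last two characters.
sub ends_like_it_starts {
    my ($str) = @_;
    return $str =~ /^(..).*\1$/ ? 1 : 0;
}

# Same test with the typo: & is a literal ampersand, not an anchor.
sub with_typo {
    my ($str) = @_;
    return $str =~ /^(..).*\1&/ ? 1 : 0;
}

for my $seq (qw(ATCGTACGAT ATATCTTT)) {
    printf "%-12s correct=%d typo=%d\n",
        $seq, ends_like_it_starts($seq), with_typo($seq);
}
```

Only ATCGTACGAT passes the corrected test, and neither string passes the version with the typo. Note that the two pairs cannot overlap, so a string shorter than four characters never matches.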

I see that in my first reply I used /^(..).*(\1)$/, with extra ()'s at the end. There is no benefit to the extra ()'s unless you also want to see what is being matched by printing $1 and $2. Note that within the re you use \1 to reference a previous match; outside it you use $1 ($2 and so on). I don't know the nuts and bolts as to why, other than that it may have been necessary to get the parser to behave.
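
A short illustration of the two notations, using the pattern with the extra parentheses: \1 refers back to group 1 while the match is still running, and $1 and $2 hold the captured text once it has succeeded. The sub name captured_pairs is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text captured by both groups,
# or an empty list when the string does not match.
sub captured_pairs {
    my ($str) = @_;
    # \1 inside the pattern means "the same text group 1 matched".
    if ($str =~ /^(..).*(\1)$/) {
        # $1 and $2 are only meaningful after a successful match.
        return ($1, $2);
    }
    return;
}

my @pairs = captured_pairs('ATCGTACGAT');
print "group 1: $pairs[0], group 2: $pairs[1]\n" if @pairs;
```

Part of the reason for the backslash form is that a $1 written inside the pattern is interpolated before matching starts, so it would carry whatever an earlier successful match left behind rather than the text captured by this one.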

I'm not the greatest at re's, but because of them (and a number of the other string functions) Perl is one of my favorite languages by far.
 
Works like a charm. Thanks everyone (once again)!

I'm really starting to get a good understanding of regexes now.
 
If the data comes from a file, the regexp works without having to chomp the strings, because $ also matches just before a trailing newline. In max1xs' example, though, chomp would be necessary before using substr; otherwise the last two characters include the trailing newline, if there is one.
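
To make the difference concrete, here is a small sketch that assumes a line read from a file with its newline still attached (the variable names are illustrative only):

```perl
use strict;
use warnings;

# Simulates a line as read from a file, newline still attached.
my $line = "ATCGTACGAT\n";

# substr counts the newline as a character, so the "last two" are "T\n".
my $raw_tail = substr $line, -2, 2;

# In the regex, $ also matches just before a trailing newline.
my $regex_ok = $line =~ /^(..).*\1$/ ? 1 : 0;

# After chomp, substr sees the real last two characters.
chomp(my $clean = $line);
my $clean_tail = substr $clean, -2, 2;

print "regex on raw line: $regex_ok\n";
print "substr after chomp: $clean_tail\n";
```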

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Another quick question. So I have the following code to filter my data:

Code:
while (my$line = <REPEATFILE>){
  my($firstcol)= split /,/, $line;

if ($firstcol =~ m/([acgt]{3})\1{1,99}/xig/){
      print OUTFILE $line;
  }else{
    next;
  }
}

exit;

Basically it is a more efficient way to remove trinucleotide repeats, and it doesn't miss anything (unlike what I was doing before). So it removes patterns like:

AATAATAAT or GCCGCCGCC.

However, it also removes mononucleotide repeats like:

AAAAAAAA or CCCCCCCC (can be any length).

I don't want these removed, so I need a way to ignore elements where all the letters are identical.

Is there a way to implement it so that it goes something like:

Code:
if ($firstcol =~ m/([acgt]{3})\1{1,99}/xig/){
      print OUTFILE $line;

EXCEPT when $firstcol =~ m/a*/xig
       when firstcol =~ m/c*/xig
       when firstcol =~ m/g*/xig
       when firstcol =~ m/t*/xig

How would I code that?

 
Did you want that last slash?
Code:
($firstcol =~ m/([acgt]{3})\1{1,99}/xig/)
                                       ^
Also, since you're using //'s to delimit your re, the leading "m" isn't needed. The leading "m" is for when you want a different delimiter, for example to match something that contains a slash: $myvar =~ m"match/this";
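
A quick sketch of the delimiter point, with a made-up $path value: the three matches below are equivalent, and only the slash-delimited one needs the slash escaped.

```perl
use strict;
use warnings;

my $path = 'match/this/file';

# With slashes as delimiters, every slash in the pattern needs a backslash.
my $with_slashes = $path =~ /match\/this/ ? 1 : 0;

# With m and another delimiter, the slash can be written as-is.
my $with_braces = $path =~ m{match/this} ? 1 : 0;
my $with_quotes = $path =~ m"match/this" ? 1 : 0;

print "$with_slashes $with_braces $with_quotes\n";
```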

As to your question, this may be what you're after
Code:
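# The ^ anchor limits the second test to strings made of one repeated letter,
# so GCCGCCGCC (which merely ends in CC) still passes.
($firstcol =~ /([acgt]{3})\1+$/i) &&
   ($firstcol !~ /^([acgt])\1+$/i)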
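
Another way to express "a trinucleotide repeat, but not a run of a single letter" is a negative lookahead that rejects any repeating unit made of three identical letters. This is only a sketch, assuming the goal is to treat runs of one letter as non-matches wherever they occur in the string; the sub name is_trinucleotide_repeat is hypothetical, and unlike the two-test version it is not anchored to the end of the string.

```perl
use strict;
use warnings;

# Hypothetical helper: true for a repeated three-letter unit
# unless that unit is a single letter three times (AAA, CCC, ...).
sub is_trinucleotide_repeat {
    my ($seq) = @_;
    # (?!([acgt])\1\1) rejects a unit made of one repeated letter;
    # ([acgt]{3})\2+ then requires the unit at least twice in a row.
    return $seq =~ /(?!([acgt])\1\1)([acgt]{3})\2+/i ? 1 : 0;
}

for my $seq (qw(AATAATAAT GCCGCCGCC AAAAAAAA CCCCCCCC)) {
    printf "%s => %d\n", $seq, is_trinucleotide_repeat($seq);
}
```

With the four sample strings from the question, AATAATAAT and GCCGCCGCC match and the two mononucleotide runs do not.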

 