Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations TouchToneTommy on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Protein Amino Acid Sequence Matching... 2

Status
Not open for further replies.

scotttom

IS-IT--Management
Mar 5, 2002
143
US
Hi,

I've got a table of sequences of proteins.

ProteinID AASeq
1 MSTILEKISAIES..........

I've created a SP that passes a search string of up to 100 Protein Characters.

I need to populate tblMatch that has a record for substrings that have matches to the records in tblProtein greater than 20 characters in length.

I'm close, but I am getting duplicate records when I shift one character to the right in the search string.

What I need to do is to look for a 20 character match, expand that until it no longer matches, add that record to tblMatch, but then NOT include the match when my search string shifts one character to the right and THAT is what I am having trouble with.

Also I can't exclude based on ProteinID in tblMatch because I need to search the remaining part of the AA sequence as well for a >20 character match.

Always find great help in organizing my thought here. THanks in advance.

Best,

Scott

match
 
Can you show some more data? Particularly data that is causing you problems. Also include desired results.

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
What have you got so far (you say you're close)? And what version of SQL Server are you using?

This comparison would probably be better made in a lower level language that loops more effectively (IMO)

Hope this helps,

Alex

Ignorance of certain subjects is a great part of wisdom
 
Ok Here is my sp..

Code:
CREATE PROCEDURE spInsertSpecificitesRecords  @SEQ as varchar(8000), @DesignID as int AS


Declare @SearchStr varchar(2000),
 @SearchStrMax varchar(2000),
 @startaa int,
 @endaa int,
 @aaLength int,
 @cntaa int

set @aaLength = LEN(@SEQ)
Set @startaa = 1
Set @endaa = 20
Set @cntaa = 20


while @startaa < @aaLength - 20

Begin

/*while @endaa <= @aaLength

Begin

*/

set @searchstr = SUBSTRING(@SEQ,@startaa,@cntaa)

while 


INSERT INTO tblmatch (ProteinID, DesignID, MatchFrom, Matchto) SELECT tblRDProteins.RDProteinsID, @DesignID,  PATINDEX('%' +@Searchstr + '%', tblRDProteins.FullAASeq), PATINDEX('%' +@Searchstr + '%', tblRDProteins.FullAASeq) + @cntaa
FROM tblRDProteins
WHERE (((tblRDProteins.FullAASeq) LIKE '%' +@Searchstr + '%'))


/*
Set @endaa = @endaa + 1

end
*/
set @startaa = @startaa + 1
set @cntaa = @cntaa +1


end
GO

Here is the query I use:

exec dbo.spInsertSpecificitesRecords 'MVDREQLVQKARLAEQAERYDDMAAAMKNVTELNEPLSNEERNL', 123546

And here are the results I'm getting.
MatchID ProtID Design From To

569 1020 123546 1 21
574 1020 123546 2 23
579 1020 123546 3 25
584 1020 123546 4 27
589 1020 123546 5 29
594 1020 123546 6 31
599 1020 123546 7 33
604 1020 123546 8 35
609 1020 123546 9 37
614 1020 123546 10 39
619 1020 123546 11 41
624 1020 123546 12 43
629 1020 123546 13 45
634 1020 123546 14 47
639 1020 123546 15 49
644 1020 123546 16 51
649 1020 123546 17 53
654 1020 123546 18 55
659 1020 123546 19 57
664 1020 123546 20 59
669 1020 123546 21 61
674 1020 123546 22 63

What I need is this

675 1020 123546 1 63


I reached the point that I can't wrap my brain around.

Thanks so much for the help!

Scott


 
It's difficult for me to tell if this will work for you, but it's at least worth a shot.

Code:
[COLOR=blue]CREATE[/color] [COLOR=blue]PROCEDURE[/color] spInsertSpecificitesRecords  @SEQ [COLOR=blue]as[/color] [COLOR=blue]varchar[/color](8000), @DesignID [COLOR=blue]as[/color] [COLOR=blue]int[/color] [COLOR=blue]AS[/color]


[COLOR=blue]Declare[/color] @SearchStr [COLOR=blue]varchar[/color](2000),
 @SearchStrMax [COLOR=blue]varchar[/color](2000),
 @startaa [COLOR=blue]int[/color],
 @endaa [COLOR=blue]int[/color],
 @aaLength [COLOR=blue]int[/color],
 @cntaa [COLOR=blue]int[/color]

[COLOR=blue]set[/color] @aaLength = LEN(@SEQ)
[COLOR=blue]Set[/color] @startaa = 1
[COLOR=blue]Set[/color] @endaa = 20
[COLOR=blue]Set[/color] @cntaa = 20

[COLOR=blue]Declare[/color] @Temp 
[COLOR=blue]Table[/color]   (ProteinId [COLOR=blue]Int[/color], 
        MatchFrom [COLOR=blue]Int[/color],
        MatchTo [COLOR=blue]Int[/color]
        )

[COLOR=blue]while[/color] @startaa < @aaLength - 20
  [COLOR=blue]Begin[/color]

    [COLOR=blue]set[/color] @searchstr = [COLOR=#FF00FF]SUBSTRING[/color](@SEQ,@startaa,@cntaa)

    [COLOR=blue]INSERT[/color] [COLOR=blue]INTO[/color] @Temp(ProteinId, MatchFrom, MatchTo)
    [COLOR=blue]SELECT[/color]  tblRDProteins.RDProteinsID, 
            [COLOR=#FF00FF]PATINDEX[/color]([COLOR=red]'%'[/color] +@Searchstr + [COLOR=red]'%'[/color], tblRDProteins.FullAASeq), 
            [COLOR=#FF00FF]PATINDEX[/color]([COLOR=red]'%'[/color] +@Searchstr + [COLOR=red]'%'[/color], tblRDProteins.FullAASeq) + @cntaa
    [COLOR=blue]FROM[/color]    tblRDProteins
    [COLOR=blue]WHERE[/color]   (((tblRDProteins.FullAASeq) LIKE [COLOR=red]'%'[/color] +@Searchstr + [COLOR=red]'%'[/color]))

    [COLOR=blue]set[/color] @startaa = @startaa + 1
    [COLOR=blue]set[/color] @cntaa = @cntaa +1

[COLOR=blue]end[/color]

[COLOR=blue]INSERT[/color] [COLOR=blue]INTO[/color] tblmatch (ProteinID, DesignID, MatchFrom, Matchto) 
[COLOR=blue]Select[/color] ProteinId,
       @DesignID,
       [COLOR=#FF00FF]Min[/color](MatchFrom),
       [COLOR=#FF00FF]Max[/color](MatchTo)
[COLOR=blue]From[/color]   @Temp
[COLOR=blue]Group[/color] [COLOR=blue]By[/color] ProteinId

[COLOR=blue]GO[/color]

Notice that most of the code is the same. However, instead of inserting the record in to the permanent table for each iteration through the while loop, the data is inserted to a table variable instead. Then, after the loop is done, the ProteinId, min(MatchFrom) and Max(MatchTo) is inserted in to the real table.

Again, I think this will work for you, but some serious testing is suggested.

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
George,

That is one path my brain was attempting to amble down. Thanks for the guidance.

One challenge I can see though, How can I handle a situation if I actually had a match from 1-30 and 150-190.

I would get a result of 1-190 correct?

Thanks again for your help.

Best Scott
 
Ah... yeah... that makes things a little more complicated. I have an idea that may help. Give me a couple minutes.

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
OK. I was able to come up with something that may work. Again... Serious testing is in order.

Suppose the while loop produced this result set.

Code:
[COLOR=blue]Declare[/color] @Temp
[COLOR=blue]Table[/color]	([COLOR=#FF00FF]RowId[/color] [COLOR=blue]Int[/color] [COLOR=blue]Identity[/color](1,1), ProteinId [COLOR=blue]Int[/color], MatchFrom [COLOR=blue]Int[/color], MatchTo [COLOR=blue]Int[/color], Diff [COLOR=blue]Int[/color])

[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,1 ,21)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,2 ,23)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,3 ,25)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,4 ,27)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,5 ,29)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,6 ,31)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,7 ,33)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,8 ,35)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,9 ,37)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,10,39)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,11,41)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,12,43)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,13,45)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,14,47)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,215,249)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,216,251)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,217,253)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,218,255)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,319,357)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,320,359)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,321,361)
[COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] @Temp(ProteinId, MatchFrom, MatchTo) [COLOR=blue]Values[/color](1020,322,363)

Notice that I added a RowId Integer Identity(1,1) and a Diff column. This will come in handy for the rest of this...

Code:
[COLOR=blue]Update[/color] @Temp [COLOR=blue]Set[/color] Diff = [COLOR=#FF00FF]RowId[/color] - MatchFrom

[COLOR=blue]Select[/color] ProteinId, 
       [COLOR=#FF00FF]Min[/color](MatchFrom) [COLOR=blue]As[/color] MatchFrom,
       [COLOR=#FF00FF]Max[/color](MatchTo) [COLOR=blue]As[/color] MatchTo
[COLOR=blue]From[/color]   @Temp
[COLOR=blue]Group[/color] [COLOR=blue]By[/color] ProteinId, Diff

After the inserts, the @Temp Table will look like this.

[tt][blue]
RowId ProteinId MatchFrom MatchTo Diff
----------- ----------- ----------- ----------- -----------
1 1020 1 21 NULL
2 1020 2 23 NULL
3 1020 3 25 NULL
4 1020 4 27 NULL
5 1020 5 29 NULL
6 1020 6 31 NULL
7 1020 7 33 NULL
8 1020 8 35 NULL
9 1020 9 37 NULL
10 1020 10 39 NULL
11 1020 11 41 NULL
12 1020 12 43 NULL
13 1020 13 45 NULL
14 1020 14 47 NULL
15 1020 215 249 NULL
16 1020 216 251 NULL
17 1020 217 253 NULL
18 1020 218 255 NULL
19 1020 319 357 NULL
20 1020 320 359 NULL
21 1020 321 361 NULL
22 1020 322 363 NULL
[/blue][/tt]

After the Update ([!]Update @Temp Set Diff = RowId - MatchFrom[/!])

[tt][blue]
RowId ProteinId MatchFrom MatchTo Diff
----------- ----------- ----------- ----------- -----------
1 1020 1 21 0
2 1020 2 23 0
3 1020 3 25 0
4 1020 4 27 0
5 1020 5 29 0
6 1020 6 31 0
7 1020 7 33 0
8 1020 8 35 0
9 1020 9 37 0
10 1020 10 39 0
11 1020 11 41 0
12 1020 12 43 0
13 1020 13 45 0
14 1020 14 47 0
15 1020 215 249 -200
16 1020 216 251 -200
17 1020 217 253 -200
18 1020 218 255 -200
19 1020 319 357 -300
20 1020 320 359 -300
21 1020 321 361 -300
22 1020 322 363 -300
[/blue][/tt]

Notice the Diff Column. For each consecutive MatchFrom, the Diff column is the same value. We can use this to group the results, so that the final query produces the right results.

[tt][blue]
ProteinId MatchFrom MatchTo
----------- ----------- -----------
1020 319 363
1020 215 255
1020 1 47
[/blue][/tt]

So... Try this...

Code:
[COLOR=blue]CREATE[/color] [COLOR=blue]PROCEDURE[/color] spInsertSpecificitesRecords  @SEQ [COLOR=blue]as[/color] [COLOR=blue]varchar[/color](8000), @DesignID [COLOR=blue]as[/color] [COLOR=blue]int[/color] [COLOR=blue]AS[/color]


[COLOR=blue]Declare[/color] @SearchStr [COLOR=blue]varchar[/color](2000),
 @SearchStrMax [COLOR=blue]varchar[/color](2000),
 @startaa [COLOR=blue]int[/color],
 @endaa [COLOR=blue]int[/color],
 @aaLength [COLOR=blue]int[/color],
 @cntaa [COLOR=blue]int[/color]

[COLOR=blue]set[/color] @aaLength = LEN(@SEQ)
[COLOR=blue]Set[/color] @startaa = 1
[COLOR=blue]Set[/color] @endaa = 20
[COLOR=blue]Set[/color] @cntaa = 20

[COLOR=blue]Declare[/color] @Temp 
[COLOR=blue]Table[/color]   (
        [COLOR=#FF00FF]RowId[/color] [COLOR=blue]Int[/color] [COLOR=blue]Identity[/color](1,1)
        ProteinId [COLOR=blue]Int[/color], 
        MatchFrom [COLOR=blue]Int[/color],
        MatchTo [COLOR=blue]Int[/color],
        Diff [COLOR=blue]Int[/color]
        )

[COLOR=blue]while[/color] @startaa < @aaLength - 20
  [COLOR=blue]Begin[/color]

    [COLOR=blue]set[/color] @searchstr = [COLOR=#FF00FF]SUBSTRING[/color](@SEQ,@startaa,@cntaa)

    [COLOR=blue]INSERT[/color] [COLOR=blue]INTO[/color] @Temp(ProteinId, MatchFrom, MatchTo)
    [COLOR=blue]SELECT[/color]  tblRDProteins.RDProteinsID, 
            [COLOR=#FF00FF]PATINDEX[/color]([COLOR=red]'%'[/color] +@Searchstr + [COLOR=red]'%'[/color], tblRDProteins.FullAASeq), 
            [COLOR=#FF00FF]PATINDEX[/color]([COLOR=red]'%'[/color] +@Searchstr + [COLOR=red]'%'[/color], tblRDProteins.FullAASeq) + @cntaa
    [COLOR=blue]FROM[/color]    tblRDProteins
    [COLOR=blue]WHERE[/color]   (((tblRDProteins.FullAASeq) LIKE [COLOR=red]'%'[/color] +@Searchstr + [COLOR=red]'%'[/color]))

    [COLOR=blue]set[/color] @startaa = @startaa + 1
    [COLOR=blue]set[/color] @cntaa = @cntaa +1

[COLOR=blue]end[/color]

[COLOR=blue]Update[/color] @Temp [COLOR=blue]Set[/color] Diff = [COLOR=#FF00FF]RowId[/color] - MatchFrom

[COLOR=blue]INSERT[/color] [COLOR=blue]INTO[/color] tblmatch (ProteinID, DesignID, MatchFrom, Matchto) 
[COLOR=blue]Select[/color] ProteinId, 
       @DesignID,
       [COLOR=#FF00FF]Min[/color](MatchFrom) [COLOR=blue]As[/color] MatchFrom,
       [COLOR=#FF00FF]Max[/color](MatchTo) [COLOR=blue]As[/color] MatchTo
[COLOR=blue]From[/color]   @Temp
[COLOR=blue]Group[/color] [COLOR=blue]By[/color] ProteinId, Diff

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
Oh my goodness so very close. I can't tell you how much I appreciate your help. SO now the problem is is that it is searching multiple proteins (which it needs to) and coming up with this in the temp table:

Row Protein From To Diff
1 1020 1 21 0
2 1021 1 21 1
3 1022 1 21 2
4 1023 1 21 3
5 1024 1 21 4
6 1020 2 23 4
7 1021 2 23 5
8 1022 2 23 6
9 1023 2 23 7
10 1024 2 23 8
11 1020 3 25 8
12 1021 3 25 9
13 1022 3 25 10
14 1023 3 25 11
15 1024 3 25 12
16 1020 4 27 12
17 1021 4 27 13
18 1022 4 27 14
19 1023 4 27 15
20 1024 4 27 16
21 1020 5 29 16
22 1021 5 29 17

this is of course adding almost all of the records in tblMatch.

THANK YOU!!!!!

We don't pay you enough!
 
Agreed! Even if you abandon me now you have gotten me so very much further than I would have been.

Scott
 
In fact Denis here is one for helping me out so many time in the past! One silly star is never enough for they help you gents give out.

Scott
 
The problem is that I assumed (incorrectly) that the list would be properly sorted when adding it to the temp table. No problems, though, because we can simply create another temp table, then insert the data the way I expected it to be from the beginning.

Code:
[COLOR=blue]CREATE[/color] [COLOR=blue]PROCEDURE[/color] spInsertSpecificitesRecords  @SEQ [COLOR=blue]as[/color] [COLOR=blue]varchar[/color](8000), @DesignID [COLOR=blue]as[/color] [COLOR=blue]int[/color] [COLOR=blue]AS[/color]


[COLOR=blue]Declare[/color] @SearchStr [COLOR=blue]varchar[/color](2000),
 @SearchStrMax [COLOR=blue]varchar[/color](2000),
 @startaa [COLOR=blue]int[/color],
 @endaa [COLOR=blue]int[/color],
 @aaLength [COLOR=blue]int[/color],
 @cntaa [COLOR=blue]int[/color]

[COLOR=blue]set[/color] @aaLength = LEN(@SEQ)
[COLOR=blue]Set[/color] @startaa = 1
[COLOR=blue]Set[/color] @endaa = 20
[COLOR=blue]Set[/color] @cntaa = 20

[COLOR=blue]Declare[/color] @Temp 
[COLOR=blue]Table[/color]   (
        ProteinId [COLOR=blue]Int[/color], 
        MatchFrom [COLOR=blue]Int[/color],
        MatchTo [COLOR=blue]Int[/color],
        )

[COLOR=blue]while[/color] @startaa < @aaLength - 20
  [COLOR=blue]Begin[/color]

    [COLOR=blue]set[/color] @searchstr = [COLOR=#FF00FF]SUBSTRING[/color](@SEQ,@startaa,@cntaa)

    [COLOR=blue]INSERT[/color] [COLOR=blue]INTO[/color] @Temp(ProteinId, MatchFrom, MatchTo)
    [COLOR=blue]SELECT[/color]  tblRDProteins.RDProteinsID, 
            [COLOR=#FF00FF]PATINDEX[/color]([COLOR=red]'%'[/color] +@Searchstr + [COLOR=red]'%'[/color], tblRDProteins.FullAASeq), 
            [COLOR=#FF00FF]PATINDEX[/color]([COLOR=red]'%'[/color] +@Searchstr + [COLOR=red]'%'[/color], tblRDProteins.FullAASeq) + @cntaa
    [COLOR=blue]FROM[/color]    tblRDProteins
    [COLOR=blue]WHERE[/color]   (((tblRDProteins.FullAASeq) LIKE [COLOR=red]'%'[/color] +@Searchstr + [COLOR=red]'%'[/color]))

    [COLOR=blue]set[/color] @startaa = @startaa + 1
    [COLOR=blue]set[/color] @cntaa = @cntaa +1

[COLOR=blue]end[/color]

[COLOR=blue]Declare[/color] @Temp1
[COLOR=blue]Table[/color]   ([COLOR=#FF00FF]RowId[/color] [COLOR=blue]Int[/color] [COLOR=blue]Identity[/color](1,1),
        ProteinId [COLOR=blue]Int[/color],
        MatchFrom [COLOR=blue]Int[/color],
        MatchTo [COLOR=blue]Int[/color],
        Diff [COLOR=blue]int[/color]
        )

[COLOR=blue]Insert[/color]
[COLOR=blue]Into[/color]   @Temp1(ProteinId, MatchFrom, MatchTo)
[COLOR=blue]Select[/color] ProteinId, MatchFrom, MatchTo
[COLOR=blue]From[/color]   @Temp
[COLOR=blue]Order[/color] [COLOR=blue]By[/color] ProteinId, MatchFrom

[COLOR=blue]Update[/color] @Temp1 [COLOR=blue]Set[/color] Diff = [COLOR=#FF00FF]RowId[/color] - MatchFrom

[COLOR=blue]INSERT[/color] [COLOR=blue]INTO[/color] tblmatch (ProteinID, DesignID, MatchFrom, Matchto) 
[COLOR=blue]Select[/color] ProteinId, 
       @DesignID,
       [COLOR=#FF00FF]Min[/color](MatchFrom) [COLOR=blue]As[/color] MatchFrom,
       [COLOR=#FF00FF]Max[/color](MatchTo) [COLOR=blue]As[/color] MatchTo
[COLOR=blue]From[/color]   @Temp1
[COLOR=blue]Group[/color] [COLOR=blue]By[/color] ProteinId, Diff

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
One star not enough! GIVE THE MAN SOME STARS!!!! He rocks!
 
So... it's working the way you want it to? Glad to hear it, and thanks for the star.

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
George, that is very nice :-D

You truly do rock! Have another....

Alex

Ignorance of certain subjects is a great part of wisdom
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top