
REGEX CHALLENGE 1

Status
Not open for further replies.

Deleted

Technical User
Jul 17, 2003
470
US
Is it possible to write a regex (search and replace with nothing) that matches a pattern by letter order?

The pattern is the word 'script' with its letters split up by other characters or spaces. The data is all in a single string.

It could be written in different ways:

document.write('<' + 's' + 'c' + 'r' + 'i' + 'p' + 't' + '>');

or

document.write('<' + 's' + 'c');
document.write('r' + 'i');
document.write('p' + 't' + '>');

or

document.write('<' + 's' + 'c' + 'x' + 'r' + 'i' + 'y' + 'p' + 't' + '>');

But it should not match when no spaces or other characters separate the letters, like --> script

Thanks in advance..

M. Brooks
X Concepts LLC
 
If I understand your question correctly, this should do what you're looking for. The only question is, what do you want to do with all the blank "+ ''" strings that are littered about in the lines?
Code:
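use strict;
use warnings;

undef $/;                # slurp mode: read all of DATA in one go
my $string = <DATA>;

# Remove the letters s, c, r, i, p, t (in that order), keeping whatever sits between them.
$string =~ s/s(.+?')c(.+?')r(.+?')i(.+?')p(.+?')t/$1$2$3$4$5/sg;
print "$string\n";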

__DATA__
document.write('<' + 's' + 'c' + 'r' + 'i' + 'p' + 't' + '>')
document.write('<' + 's' + 'c');
document.write('r' + 'i');
document.write('p' + 't' + '>');
document.write('<' + 's' + 'c' + 'x' + 'r' + 'i' + 'y' + 'p' + 't' + '>');
script
 
That's a very vague description of the regexp you want. For example, this piece of text contains all the characters you intend to match, which is probably not desirable.
 
I was just about to say exactly the same as ishnid has pointed out - and he's explained the problem beautifully

The letters of 'script' can even be found, in order, within a single dictionary word, never mind across a series of words (as in ish's example):

sacripant
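
To make the over-matching concrete, here is a minimal Perl sketch. The helper name `has_script_letters_in_order` is made up for illustration; it only tests whether the six letters occur in order with anything (or nothing) between them, which is the loosest reading of the request.

```perl
use strict;
use warnings;

# Hypothetical helper: true if s, c, r, i, p, t occur in that order
# anywhere in the text, with anything (or nothing) between them.
sub has_script_letters_in_order {
    my ($text) = @_;
    return $text =~ /s.*c.*r.*i.*p.*t/is ? 1 : 0;
}

for my $sample ('sacripant', 'subscription', 'hello world') {
    my $verdict = has_script_letters_in_order($sample) ? 'matches' : 'no match';
    print "$sample: $verdict\n";
}
```

Both ordinary words match, so a pattern based on letter order alone needs some extra anchor (quotes, `+` signs, the `document.write(` call) to be useful.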


Kind Regards
Duncan
 
"That's a very vague description of the regexp you want. For example, this piece of text contains all the characters you intend to match, which is probably not desirable."

Mainly I intend to match any text that is located within a 'document.write('; everything else should be ignored.

M. Brooks
X Concepts LLC
 
Doesn't the example below contradict your post?
"within a 'document.write(' everything else should be ignored."

document.write('<' + 's' + 'c');
document.write('r' + 'i');
document.write('p' + 't' + '>');

i.e. what in the above delimits the end of the 1st document.write( ???


Kind Regards
Duncan
 
Sorry about that.. Didn't get my coffee yet this morning..

Let me explain what I am trying to do.

I have a script that allows JavaScript to be posted into a web form. The problem with allowing JavaScript is that a hacker can use the 'document.write' function to embed another script within the script. These examples are the most common ways this will be attempted.


<script language="javascript">
document.write('<' + 's' + 'c' + 'r' + 'i' + 'p' + 't' + '>');
</script>

or

<script language=javascript>
document.write('<' + 's' + 'c');
document.write('r' + 'i');
document.write('p' + 't' + '>');
</script>

Any data that exists outside the <script language=javascript> and </script> should be ignored.

Thanks again..

M. Brooks
X Concepts LLC
 
An example of what I would do to process this

<script language="javascript">
document.write('<' + 'x' + 's' + 'c' + 'r' + 'y' + 'i' + 'p' + 't' + 'f' + '>');
</script>

is

$field_data =~ s/document\.write\(.*s.*c.*r.*i.*p.*t.*\);//gi;

Which would leave you with

<script language="javascript">

</script>

I'm still trying to work out a solution to this one, though:

<script language=javascript>
document.write('<' + 's' + 'c');
document.write('r' + 'i');
document.write('p' + 't' + '>');
</script>
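
One possible way to approach the multi-line case, sketched under assumptions rather than offered as the thread's answer: instead of matching the letters in place, collect every single-quoted literal passed to `document.write(...)` in source order, join them, and test the joined text for a script tag. The helper names `written_literals` and `writes_script_tag` are hypothetical, and the sketch assumes arguments are simple single-quoted literals that never contain `);` or an escaped quote.

```perl
use strict;
use warnings;

# Hypothetical helper: concatenate every single-quoted literal passed to
# document.write() in a chunk of JavaScript, in source order.
sub written_literals {
    my ($js) = @_;
    my $out = '';
    while ($js =~ /document\.write\((.*?)\);/gs) {
        my $args = $1;
        while ($args =~ /'([^']*)'/g) {
            $out .= $1;
        }
    }
    return $out;
}

# Hypothetical helper: true if the joined literals contain an opening or
# closing script tag.
sub writes_script_tag {
    my ($js) = @_;
    return written_literals($js) =~ /<\s*\/?\s*script/i ? 1 : 0;
}

my $three_liner = join "\n",
    q{document.write('<' + 's' + 'c');},
    q{document.write('r' + 'i');},
    q{document.write('p' + 't' + '>');};

print writes_script_tag($three_liner) ? "blocked\n" : "allowed\n";
```

Because the literals are joined before testing, it does not matter how the pieces are split across calls. The padded variant with extra 'x' and 'y' letters is deliberately not flagged here, since the text it writes is not a script tag.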

Thank you guys for your assistance.. ;)

M. Brooks
X Concepts LLC
 
no problem dude. no offense taken. :)

the 3 liner is gonna be a tough one though!


Kind Regards
Duncan
 
I figured so.. But I am still going to take a whack at it..

The javascript issue has to do with concatenations using '+'
and the 'document.write()'

Another way I solved this

<script language="javascript">
document.write('<' + 's' + 'c' + 'r' + 'i' + 'p' + 't' + '>');
</script>

is with

$field_data =~ s/document\.write\((.*|\+)s(.*|\+)c(.*|\+)r(.*|\+)i(.*|\+)p(.*|\+)t(.*|\+)\);//gi;

Thanks again Duncan..

M. Brooks
X Concepts LLC
 
Well PerlJunky - I do apologise - it seems that your request can be achieved. It took me hours and is the most involved regex I have ever written. My small brain hurts. But I am overjoyed with the result. I must thank you for asking such a mind-bending question. Hope you like it! :)

Code:
#!/usr/bin/perl

undef $/;

$_ = <DATA>;

s|(<script language=javascript>\n)document.write\('<'(\);)?(\ndocument.write\()?(\W*\+\W*)?'s'(\);)?(\ndocument.write\()?(\W*\+\W*)?'c'(\);)?(\ndocument.write\()?(\W*\+\W*)?'r'(\);)?(\ndocument.write\()?(\W*\+\W*)?'i'(\);)?(\ndocument.write\()?(\W*\+\W*)?'p'(\);)?(\ndocument.write\()?(\W*\+\W*)?'t'(\);)?(\ndocument.write\()?(\W*\+\W*)?'>'(\);)?|$1|g;

print;
print "\n";
print "—" x 50;
print "\n";

__DATA__
<script language=javascript>
document.write('<' + 's' + 'c');
document.write('r' + 'i');
document.write('p' + 't' + '>');
</script>

<script language=javascript>
document.write('<' + 's' + 'c');
document.write('r' + 'i');
document.write('p' + 't' + 'y' + '>');
</script>

<script language=javascript>
document.write('<');
document.write('s');
document.write('c');
document.write('r');
document.write('i');
document.write('p');
document.write('t');
document.write('>');
</script>

<script language=javascript>
document.write('<' + 'l' + 'e' + 'a' + 'v' + 'e' + '>');
</script>

<script language=javascript>
document.write('<' + 's' + 'c' + 'r' + 'i' + 'p' + 't' + '>');
</script>

<script language=javascript>
document.write('<'+'s'+'c'+'r'+'i'+'p'+'t'+'>');
</script>
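
The long substitution above repeats the same three optional groups between each quoted letter, so the pattern can also be assembled from a list, which may be easier to read and adjust. This is a sketch, not Duncan's code: the names `$glue`, `@pieces` and `strip_script_writes` are invented here, the groups are made non-capturing, and the dot in `document.write` is escaped.

```perl
use strict;
use warnings;

# Optional glue allowed between two literals: the end of one call, the
# start of the next call on a new line, or a "+" with non-word padding.
my $glue = qr/(?:\);)?(?:\ndocument\.write\()?(?:\W*\+\W*)?/;

# The quoted pieces that must appear in order: '<', 's', 'c', ..., 't', '>'.
my @pieces = map { quotemeta "'$_'" } '<', split(//, 'script'), '>';
my $body   = join $glue, @pieces;

my $pattern = qr/(<script language=javascript>\n)document\.write\(${body}(?:\);)?/;

# Hypothetical helper: remove the matched document.write sequence and
# keep only the opening script tag line.
sub strip_script_writes {
    my ($html) = @_;
    $html =~ s/$pattern/$1/g;
    return $html;
}

my $sample = "<script language=javascript>\n"
           . "document.write('<' + 's' + 'c');\n"
           . "document.write('r' + 'i');\n"
           . "document.write('p' + 't' + '>');\n"
           . "</script>";

print strip_script_writes($sample), "\n";
```

Changing the word being hunted for then means changing one string rather than editing the whole pattern by hand.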


Kind Regards
Duncan
 
PeRlJuNkY,

Since document.write()'s arguments are not limited to literals, wouldn't there be an infinite number of ways to insert a script?

For instance:

x = 'pt';
y = 'sc';
z = 'ri';
document.write(y + z + x);

Given all the possible transformations and encodings, it doesn't seem like prevention of embedded scripts is possible with this approach. I'm sure there's code out there (maybe CPAN) for totally "defanging" a block of HTML (with or without javascript) so that it appears as it was entered and is not rendered/executed. Converting '<' to '&lt;' and so forth.
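
As a rough illustration of the "defanging" idea, here is a minimal hand-rolled escaper. The name `defang` and the entity table are assumptions made for this sketch; on a real system the `encode_entities` function from the CPAN module HTML::Entities would be the more usual choice.

```perl
use strict;
use warnings;

# Characters that let markup break out of plain text, and their entities.
my %entity = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&#39;',
);

# Hypothetical helper: escape the text so a browser displays it as
# entered instead of rendering or executing it.
sub defang {
    my ($text) = @_;
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

print defang(q{<script>document.write(y + z + x);</script>}), "\n";
```

Escaped this way, the submission is shown as text, so it no longer matters how the word 'script' was assembled.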

Rod Knowlton
IBM Certified Advanced Technical Expert pSeries and AIX 5L

 