A complicated preg_replace() problem....

karnaf · Dec 30, 2003

I've been trying to find a way through this, but it's just getting frustrating... The closest info I've found about this was in thread434-523609 but the problem there is different from mine.

I need to go over an HTML file and highlight words or parts of words in it (kind of what you get in a search engine result like Google's). The problem is, I don't want words within HTML tags, any HTML tag, to be highlighted.

For example,
$text=" let's bring the house down <center>Here we are in the center of town</center>"
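Keywords to search for: br AND center
What I want to get is [thumbsup] (only the "br" in "bring" and the "center" in the sentence highlighted):

$text=" let's bring the house down <center>Here we are in the center of town</center>"
and not [thumbsdown] (the tag names highlighted as well):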

$text="<br>let's bring the house down<br><center>Here we are in the center of town</center>"

The basic idea is that I have an array of the keywords to search for $keys_arr, where all the values are of the form "/$word/si". The replace array $replace_arr has values of the form "$word". The action line is
$text = preg_replace($keys_arr,$replace_arr,$text);
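To make the problem concrete, here is a minimal sketch of the setup described above. The helper name build_highlight_arrays() and the <b>...</b> highlight markup are assumptions for illustration; the post does not show how the arrays are actually built.

```php
<?php
// Hypothetical helper: builds the two arrays described in the post.
// The <b>...</b> markup is an assumed highlight style.
function build_highlight_arrays(array $words): array
{
    $keys_arr = [];
    $replace_arr = [];
    foreach ($words as $word) {
        // preg_quote() escapes regex metacharacters and the "/" delimiter.
        $keys_arr[] = '/' . preg_quote($word, '/') . '/si';
        // $0 is the matched text, so the original letter case is kept.
        $replace_arr[] = '<b>$0</b>';
    }
    return [$keys_arr, $replace_arr];
}

[$keys_arr, $replace_arr] = build_highlight_arrays(['br', 'center']);
$text = "let's bring the house down <center>Here we are in the center of town</center>";

// The plain call also rewrites the tag names, which is the unwanted result.
$naive = preg_replace($keys_arr, $replace_arr, $text);
echo $naive, "\n";
```

With this input the word "center" is wrapped inside <center> and </center> as well, which breaks the markup.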

I'm pretty sure there should be some regular expression solution, but I just can't seem to find it.

Thanks

davidshields · Dec 31, 2003

Do you HAVE to preserve the original HTML? If not, you could have a regexp that strips out the HTML, giving

let's bring the house down Here we are in the center of town

then one to highlight br and center.

Just a thought.

David.
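A minimal sketch of this idea, using PHP's built-in strip_tags() in place of a hand-written stripping regexp (a substitution, not what the post literally proposes). The <b>...</b> markup is an assumed highlight style.

```php
<?php
$text = "let's bring the house down <center>Here we are in the center of town</center>";

// Step 1: drop all tags, keeping only the text.
$plain = strip_tags($text);

// Step 2: highlight the keywords in what is left.
$highlighted = preg_replace(['/br/si', '/center/si'], '<b>$0</b>', $plain);

echo $highlighted, "\n";
```

This only helps when the original markup can be thrown away, since the tags are not put back.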

piti · Dec 31, 2003

If you use "/(?<!\<|\<\/)$word/si" in $keys_arr, it will not touch the HTML tags, but only when the searched $word matches the beginning of the tag name.
E.g. searching for "center" or "cent" will not match <center> or </center>, but searching for "enter" will modify these HTML tags.
If you prepare some algorithm to test whether $word is part of an HTML tag name but not its starting part (e.g. "enter" is part of <center>), then "/(?<!\<|\<\/)([a-z]+)$word/si" in $keys_arr and "\${1}$word" in $replace_arr will do the job.
Hope this helps.
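The first pattern and its limitation can be checked with a short sketch. The function name highlight_lookbehind() and the <b>...</b> markup are assumptions for illustration.

```php
<?php
// Hypothetical wrapper around the lookbehind pattern from the post.
function highlight_lookbehind(string $word, string $text): string
{
    // (?<!<|<\/) rejects a match that directly follows "<" or "</".
    $pattern = '/(?<!<|<\/)' . preg_quote($word, '/') . '/si';
    return preg_replace($pattern, '<b>$0</b>', $text);
}

$html = '<center>in the center</center>';

// "center" starts the tag name, so both tags are left alone.
$ok = highlight_lookbehind('center', $html);

// "enter" sits inside the tag name, so the lookbehind does not protect it.
$broken = highlight_lookbehind('enter', $html);

echo $ok, "\n", $broken, "\n";
```

The lookbehind only inspects the one or two characters right before the match, which is why anything deeper inside a tag (the rest of the name, attribute names, attribute values) is still rewritten.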

piti · Dec 31, 2003

Well, I forgot to mention that, as you can see from the code, it has more limitations than the ones mentioned.
For example, searching for a word that matches a tag parameter (align, height, style, ...) modifies the tag too.

Westbury · Jan 2, 2004

The only thing I can think of is what davidshields suggested. However, if you need to keep all of the HTML tags, then I suggest that you go through the page character by character. Add each character to a buffer, and set a flag to true when you encounter a "<" and to false when you encounter a ">". While the flag is false, check the buffer for the keywords. Finally, when you reach either a space or the end of the file, output the buffer.
It's a long-winded way of doing it, but at least you could customise (or customize, since this seems to be an American site) how the search is performed.

karnaf · Jan 2, 2004

Thanks, but the problem is greater than this..

A. I have to preserve the html page, just modify it.
B. What if we have something like "please check if a<5 and b>6" as the text? How can we tell that "<5 and b>" is not an HTML tag when using preg? We can't use the rule of avoiding anything within <>...

The only solution I see is to code my own function that will parse the text and look explicitly for all known html tags... I just hate to think there is no easier solution, something more elegant...

thanks!

Westbury · Jan 2, 2004

In HTML, the < and > characters in text are coded as &lt; and &gt;, so it shouldn't pick up anything that isn't an HTML tag.
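As a small illustration of that encoding, PHP's htmlspecialchars() produces these entities and htmlspecialchars_decode() reverses them. The sample string is the one from the question above; whether the real pages are already encoded this way is an assumption.

```php
<?php
$raw = 'please check if a<5 and b>6';

// Literal angle brackets become entities, so they cannot be mistaken for tags.
$encoded = htmlspecialchars($raw);

// Decoding gives the original text back.
$decoded = htmlspecialchars_decode($encoded);

echo $encoded, "\n";
```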
My character-by-character search would preserve the HTML; maybe I didn't explain it very well. I will try to put it into a sort of pseudocode:
1. add character to buffer
2. if character is "<" set flag to true
3. if character is ">" set flag to false and output buffer
4. if flag is false then perform string replace
5. if character is " " or EOF then output buffer and clear buffer
6. loop until EOF

This way the str_replace is only used when it is outside of an HTML tag, and everything is output.

Hope that clears it up a little.
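The steps above can be sketched roughly as follows. This is one possible reading of the pseudocode, not Westbury's own code: it flushes the buffer at tag boundaries rather than at every space (so multi-word keywords would still match), and it uses preg_replace with the arrays from the original question instead of str_replace. The function name highlight_outside_tags() is hypothetical.

```php
<?php
// Hypothetical function: walks the input one character at a time and only
// applies the replacement to text collected outside of <...>.
function highlight_outside_tags(string $html, array $keys_arr, $replace): string
{
    $out = '';
    $buffer = '';       // text seen since the last tag
    $in_tag = false;    // true between "<" and ">"
    $len = strlen($html);

    for ($i = 0; $i < $len; $i++) {
        $ch = $html[$i];
        if (!$in_tag && $ch === '<') {
            // A tag starts: highlight and flush the text collected so far.
            $out .= preg_replace($keys_arr, $replace, $buffer);
            $buffer = '';
            $in_tag = true;
            $out .= $ch;
        } elseif ($in_tag) {
            // Inside a tag: copy characters through unchanged.
            $out .= $ch;
            if ($ch === '>') {
                $in_tag = false;
            }
        } else {
            $buffer .= $ch;
        }
    }
    // Flush any text after the last tag.
    return $out . preg_replace($keys_arr, $replace, $buffer);
}

$text = "let's bring the house down <center>Here we are in the center of town</center>";
$result = highlight_outside_tags($text, ['/br/si', '/center/si'], '<b>$0</b>');
echo $result, "\n";
```

Like the pseudocode, this treats every "<" as the start of a tag, so it relies on literal angle brackets in the text being entity-encoded.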

karnaf · Jan 6, 2004

Thanks...

I've found a solution that seems to be working fine, though it hasn't yet been tested in the real world.

I didn't feel like making a Turing machine that goes char by char, so I used other stuff, but in general I used your idea, Westbury. The idea is to cut the text into pieces according to the locations of < and >, then preg_replace the necessary parts.

Here you have the code.
------------------------------------------------------------
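$old_gt = 0;        // Position of the last > handled so far
$old_lt = 0;        // Position of the last < handled so far (not used below)
$text = ' '.$text;  // Pad so that a tag at position 0 is found by the search from offset 1
$res_text = "";
// Find the next < and the > that follows it
while ((($cur_lt = strpos($text, '<', $old_gt+1)) !== false) &&
       (($cur_gt = strpos($text, '>', $cur_lt+1)) !== false))
{
    // If another < appears before that >, the earlier < was plain text (e.g. "a<5"),
    // so move on to the last < before the >
    $tmp_lt = strpos($text, '<', $cur_lt+1);
    while (($tmp_lt !== false) && ($cur_gt > $tmp_lt))
    {
        $cur_lt = $tmp_lt;
        $tmp_lt = strpos($text, '<', $cur_lt+1);
    }
    // Highlight the text in front of the tag, then copy the tag itself untouched
    $res_text .= preg_replace($keys_arr, $replace_arr, substr($text, $old_gt, ($cur_lt-$old_gt)));
    $res_text .= substr($text, $cur_lt, ($cur_gt-$cur_lt));
    $old_gt = $cur_gt;
    $old_lt = $cur_lt;
}
// Highlight whatever follows the last tag. Start from $old_gt: $cur_gt is false or unset
// when the loop stops on an unmatched < or on text without any <
$res_text .= preg_replace($keys_arr, $replace_arr, substr($text, $old_gt));
$text = trim($res_text);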
------------------------------------------------------------
$keys_arr = the words that need to be highlighted
$replace_arr = the highlighted words.

Thanks all.
I was just looking for a more elegant solution, and I would still love to get a better one if any of you can think of one.

karnaf
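One more compact variant of the same cut-into-pieces idea is to let preg_split() do the cutting. This is only a sketch under some assumptions: the function name highlight_text_nodes() and the <b>...</b> markup are hypothetical, the tag pattern is a rough heuristic rather than a real HTML parser, and arrow functions need PHP 7.4 or later.

```php
<?php
// Hypothetical function: split into text and tag pieces, highlight only the text.
function highlight_text_nodes(string $html, array $words): string
{
    if ($words === []) {
        return $html;
    }
    // One alternation pattern, so inserted markup is never searched again.
    $quoted = array_map(fn($w) => preg_quote($w, '/'), $words);
    $pattern = '/' . implode('|', $quoted) . '/si';

    // A "tag" here is "<", optional "/", a letter or "!", then anything up to ">".
    // With PREG_SPLIT_DELIM_CAPTURE the captured tags land at the odd indexes.
    $parts = preg_split('/(<\/?[a-z!][^<>]*>)/i', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Even indexes hold the text between tags.
            $parts[$i] = preg_replace($pattern, '<b>$0</b>', $part);
        }
    }
    return implode('', $parts);
}

$text = "let's bring the house down <center>Here we are in the center of town</center>";
$result = highlight_text_nodes($text, ['br', 'center']);
echo $result, "\n";
```

Because the tag pattern requires a letter (or "!") right after the "<", text such as "a<5 and b>6" is not taken for a tag, although something like "a<b and c>d" still would be.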
