finding in html page

Status
Not open for further replies.

Oostwijk

Technical User
Oct 19, 2003
82
NL
I've got an HTML page in which these lines occur:

&#8226; geavanceerd zoeken &#8226; voorkeuren &#8226; taalhulpmiddelen het web doorzoeken zoeken in <br><br>Gevondenhi<hr>5

I've built a script that searches that HTML page for characters or words defined by a user. The search is done with this line:

if (grep /\Q$words\E/, $htmlFile) { print "Found"; } else { print "Not Found"; }

It works perfectly, except that I also want to be able to search for the diamond character, which is produced by &#8226;

So my question is:
How can I search the HTML page as the browser shows it, rather than the actual HTML source code?

I do hope you understand my question.
 
Can't you just search for &#8226; in the HTML?

Kevin
A+, Network+, MCP
 
Hmm, I previewed my post and everything. It should have had the code for the dot, not the dot itself.



Kevin
A+, Network+, MCP
 
What the browser shows is a myth, an illusion, a figment of our imagination created by programmers who can read Unicode.

You need to parse the page as the Unicode (UTF-8) document it is and search for the Unicode character you are looking for.

I think I answered this earlier or in another forum.

Go to CPAN and search for 'UTF8'. There are a ton of modules available, as well as Perl's built-in 'use utf8' functionality.

It's not easy, but it's the way to go here.
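
One way to picture "searching for the Unicode" is to turn the numeric character references in the HTML source into real characters first, and only then search. The following is a minimal sketch, assuming Perl 5.8 or later; the helper names decode_numeric_entities and text_contains are made up for illustration and only numeric references such as &#8226; are handled. For named entities, the CPAN module HTML::Entities offers decode_entities.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace decimal (&#8226;) and hexadecimal (&#x2022;)
# character references with the characters they stand for, in one pass.
sub decode_numeric_entities {
    my ($html) = @_;
    $html =~ s/&#(?:[xX]([0-9a-fA-F]+)|(\d+));/chr(defined $1 ? hex($1) : $2)/ge;
    return $html;
}

# Hypothetical helper: returns 1 if $needle occurs literally in the
# decoded text, 0 otherwise.
sub text_contains {
    my ($html, $needle) = @_;
    my $text = decode_numeric_entities($html);
    return index($text, $needle) >= 0 ? 1 : 0;
}

my $html = 'google &#8226; geavanceerd zoeken &#8226; voorkeuren &#8226;';

# U+2022 is the character that &#8226; refers to.
print text_contains($html, "\x{2022}") ? "Found\n" : "Not Found\n";
```

Searching the decoded copy means the user can type or paste the character itself, while the original source string stays untouched.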
 
Thanks so far. I've looked at multiple websites regarding utf8, and I think I must use the utf8::decode($string) call.
So I made this little script:

#!/usr/local/bin/perl
use CGI;
use utf8;
$string="Woordjes=google &#8226; geavanceerd zoeken &#8226; voorkeuren &#8226;";
$words="e";
if (grep /\Q$words\E/,utf8::decode($string)){print "Found";}else{print "Not Found";}

Am I doing this right? If $words contained the diamond character, would the grep statement find it in $string?

When I try this script I get the error message:
Undefined subroutine utf8::decode called at line 10
What am I doing wrong?

 
I hope someone can help me out.
Note that when I post this line to the forum:
$string="Woordjes=google • geavanceerd zoeken • voorkeuren •";

the diamond characters are actually HTML code; the forum prints them out as diamond characters.
 
Here is a list of unicode entities for download


What I would do is decode the string, print it out, get the Unicode value for the special character, and then use that code in my greps.
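
The "print it out and get the value" step could look roughly like this. It is only a sketch, assuming the string has already been decoded into characters and a Perl of 5.8 or later; non_ascii_code_points is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: list the code points of all non-ASCII characters
# in an already-decoded string, formatted as U+XXXX.
sub non_ascii_code_points {
    my ($text) = @_;
    return map  { sprintf('U+%04X', ord($_)) }
           grep { ord($_) > 127 }
           split //, $text;
}

my $text = "google \x{2022} geavanceerd zoeken";

# Show which special characters the string contains.
print join(', ', non_ascii_code_points($text)), "\n";

# The reported value can then be used directly in a pattern.
print "Found\n" if $text =~ /\x{2022}/;
```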

As for the error, what version of perl are you running?
 
I've got Perl version 5.6.1.631 and run my script with perlwiz.

When I look into the UTF8.pm file I can see these lines
package utf8;

$utf8::hint_bits = 0x00800000;

our $VERSION = '1.02';

sub import {
$^H |= $utf8::hint_bits;
$enc{caller()} = $_[1] if $_[1];
}

sub unimport {
$^H &= ~$utf8::hint_bits;
}

sub AUTOLOAD {
require "utf8_heavy.pl";
goto &$AUTOLOAD if defined &$AUTOLOAD;
Carp::croak("Undefined subroutine $AUTOLOAD called");
}

1;
__END__

and some explanation of how to work with UTF8. But I don't see any decode/encode section in the lines above.
 
I think utf8 was only fully implemented in Perl 5.8.x. You may want to try upgrading your Perl and see what happens.

A simpler route is to download one of the UTF8 packages from CPAN. That's the less intrusive path.
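
For reference, on a Perl where utf8::decode is available (5.8 and later, as far as I know), the function converts its argument in place and returns only a true/false result, so its return value should not be handed to grep. A minimal sketch under that assumption, with utf8_bytes_contain as a made-up helper name and the input assumed to be raw UTF-8 bytes:

```perl
use strict;
use warnings;

# Hypothetical helper: decode a UTF-8 byte string and search the result.
# Returns 1 if $needle is found, 0 if it is not found or if the bytes
# are not valid UTF-8.
sub utf8_bytes_contain {
    my ($bytes, $needle) = @_;
    my $text = $bytes;                    # work on a copy
    utf8::decode($text) or return 0;      # decodes in place, false on bad input
    return $text =~ /\Q$needle\E/ ? 1 : 0;
}

# U+2022 written out as its three UTF-8 bytes, as it might arrive from a file.
my $bytes = "google \xE2\x80\xA2 geavanceerd zoeken";

print utf8_bytes_contain($bytes, "\x{2022}") ? "Found\n" : "Not Found\n";
```

Note that this applies to pages containing the character itself as UTF-8 bytes; a page that contains the text &#8226; still needs its entities decoded before the search.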
 