finding in html page

Oostwijk · Nov 26, 2003

I've got an HTML page in which these lines occur:

• geavanceerd zoeken • voorkeuren • taalhulpmiddelen het web doorzoeken zoeken in <br><br>Gevondenhi<hr>5

I've built a script that searches for characters/words (defined by a user) on that HTML page. The search is done with the line:

if (grep /\Q$words\E/, $htmlFile) { print "Found"; } else { print "Not Found"; }

It works perfectly, except that I want to be able to search for the diamond character, which is created by the HTML code &#8226;

So my question is:
How can I search the HTML page as the browser shows it, not the actual HTML source code?

I do hope you understand my question.

philote · Nov 26, 2003

Can't you just search for &#8226; in the HTML?
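A minimal sketch of that idea, assuming the page source really contains the literal reference &#8226; and that the user types the same reference as the search term. The sub name contains_literal and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Return 1 if $needle occurs literally in $haystack, 0 otherwise.
# \Q...\E quotes regex metacharacters, so '&', '#' and ';' are matched as-is.
sub contains_literal {
    my ($haystack, $needle) = @_;
    return $haystack =~ /\Q$needle\E/ ? 1 : 0;
}

# Hypothetical sample of the raw HTML source.
my $html  = 'google &#8226; geavanceerd zoeken &#8226; voorkeuren &#8226;';
my $words = '&#8226;';

print contains_literal($html, $words) ? "Found\n" : "Not Found\n";
```

This only works when the search term is written exactly the way the source writes it; it would miss the same character written as &#x2022; or as raw bytes.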

Kevin
A+, Network+, MCP

philote · Nov 26, 2003

Hmm, I previewed my post and everything. It should have had the code for the dot, not the dot itself.

Kevin
A+, Network+, MCP

siberian · Nov 26, 2003

What the browser shows is a myth, an illusion, a figment of our imagination created by programmers who can read Unicode.

You need to parse the page as the Unicode (UTF-8) document it is and search for the Unicode character you are looking for.

I think I answered this earlier or in another forum.

Go to CPAN and search for 'UTF8'. There are a ton of modules available, as well as Perl's built-in 'use utf8' functionality.

It's not easy, but it's the way to go here.
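One way this could look, as a sketch only: if the source holds the numeric character reference &#8226; rather than raw UTF-8 bytes, the reference can be turned into the character it stands for before searching. The sub decode_numeric_refs is a hypothetical helper written for this example, not a library function, and it deliberately ignores named entities such as &bull;.

```perl
use strict;
use warnings;

# Replace decimal (&#8226;) and hexadecimal (&#x2022;) numeric character
# references with the characters they stand for. Named entities are left alone.
sub decode_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/chr(defined $1 ? hex($1) : $2)/ge;
    return $text;
}

# Hypothetical sample of the raw HTML source.
my $html    = 'google &#8226; geavanceerd zoeken &#8226; voorkeuren';
my $visible = decode_numeric_refs($html);

# "\x{2022}" is the BULLET character; 8226 decimal is 2022 hexadecimal.
my $words = "\x{2022}";
print $visible =~ /\Q$words\E/ ? "Found\n" : "Not Found\n";
```

After decoding, the search term is the character itself, so the comparison is against something closer to what the browser displays than to the source markup.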

Oostwijk · Nov 27, 2003

Thanks so far. I've looked at multiple websites regarding utf8, and I think I must use the utf8::decode($string) line.
So I made this little script:

#!/usr/local/bin/perl
use CGI;
use utf8;
$string="Woordjes=google • geavanceerd zoeken • voorkeuren •";
$words="e";
if (grep /\Q$words\E/, utf8::decode($string)) { print "Found"; } else { print "Not Found"; }

Am I doing this right? If $words contained the diamond character, would the grep statement find it in $string?

When I try this script I get the error message:
Undefined subroutine utf8::decode called at line 10
What am I doing wrong?

Oostwijk · Nov 28, 2003

I hope someone can help me out.
Note that when I post this line to the forum:
$string="Woordjes=google • geavanceerd zoeken • voorkeuren •";

the diamond characters are actually HTML code; the forum renders them as diamond characters.

siberian · Nov 28, 2003

Here is a list of Unicode entities for download:

http://www.bebits.com/app/688

What I would do is decode the string, print it out, get the Unicode value for the special character, and then use that code in my greps.

As for the error, what version of Perl are you running?
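The decode-and-inspect step described above might be sketched like this. It assumes the string has already been decoded into characters; the sub name non_ascii_code_points and the sample string are invented for the example.

```perl
use strict;
use warnings;

# Return a "U+XXXX" label for every non-ASCII character in a decoded string.
sub non_ascii_code_points {
    my ($text) = @_;
    return map  { sprintf('U+%04X', ord($_)) }
           grep { ord($_) > 127 }
           split //, $text;
}

# Hypothetical decoded sample containing one BULLET character.
my $decoded = "google \x{2022} geavanceerd zoeken";
print join(', ', non_ascii_code_points($decoded)), "\n";   # prints U+2022

# The reported code point can then be used directly in the pattern.
print $decoded =~ /\x{2022}/ ? "Found\n" : "Not Found\n";
```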

Oostwijk · Nov 28, 2003

I've got Perl version 5.6.1.631 and run my script with perlwiz.

When I look into the UTF8.pm file I can see these lines:
package utf8;

$utf8::hint_bits = 0x00800000;

our $VERSION = '1.02';

sub import {
    $^H |= $utf8::hint_bits;
    $enc{caller()} = $_[1] if $_[1];
}

sub unimport {
    $^H &= ~$utf8::hint_bits;
}

sub AUTOLOAD {
    require "utf8_heavy.pl";
    goto &$AUTOLOAD if defined &$AUTOLOAD;
    Carp::croak("Undefined subroutine $AUTOLOAD called");
}

1;
__END__

and some explanation on how to work with UTF-8. But I don't see any decode/encode section in the lines above.

siberian · Nov 28, 2003

I think utf8 was only fully implemented in Perl 5.8.x. You may want to try upgrading your Perl and see what happens.

A simpler route is to download one of the UTF8 packages from CPAN. That's the less intrusive path.
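For reference, a hedged sketch of how utf8::decode is typically used on a Perl that provides it (5.8 or later, as far as I know). The function converts the string in place and returns a true/false status, so its return value is not the decoded text and should not be passed to grep. The sub contains_bullet and the sample bytes are assumptions made for the example.

```perl
use strict;
use warnings;

# Return 1 if the UTF-8 encoded input contains U+2022 BULLET, 0 otherwise.
sub contains_bullet {
    my ($text) = @_;    # work on a copy, because utf8::decode modifies its argument

    # Older perls such as 5.6.1 do not provide this function.
    die "utf8::decode is not available in this perl\n"
        unless defined &utf8::decode;

    # Decodes in place; a false return means the bytes were not valid UTF-8.
    utf8::decode($text) or return 0;

    return $text =~ /\x{2022}/ ? 1 : 0;
}

# "\xE2\x80\xA2" is the UTF-8 byte sequence for U+2022.
my $string = "google \xE2\x80\xA2 geavanceerd zoeken";
print contains_bullet($string) ? "Found\n" : "Not Found\n";
```

Note that this applies to pages that contain the bullet as raw UTF-8 bytes; a page that spells it as &#8226; is plain ASCII at that point and is unaffected by utf8::decode.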

Oostwijk · Nov 28, 2003

Thanks, I'll try that.
