Encoding Foreign Characters


(OP)
I've had a function on my site for years and it has worked reasonably well at converting foreign characters from a form's textarea field to the corresponding HTML entities. The problem is that the text being inserted comes from a wide variety of sources where I have no control over its actual encoding, so the function occasionally crashes or puts in codes that were not in the original, making the resulting page unreadable. All of the HTML and PHP pages, the database and the connection are already set to UTF-8.

Because the problem was to do with encoding, I added some PHP functions to the WHILE loop that should sort it out. I don't want to replace all characters - only certain ones that exist in a table but I cannot get it to work properly. If I return $CharacterName it gives the last character in the table as expected due to the looping. If I return $Replacement it gives the proper HTML entity as expected but the characters are not being replaced in $BodyText.

I can't change $DBcharacters->next_record() or other database functions as these are custom functions used elsewhere and they are working well but I can change the looping so if anyone has a thought about this, I would appreciate hearing it! It should be something quite simple that I have missed. Thank you.

CODE --> PHP

function cleanHTML($BodyText) {
// replace foreign characters
	$SQL = "SELECT CharacterName FROM charactercodes";
	$DBcharacters->query($SQL);
	while ($DBcharacters->next_record()) {
		$CharacterName = htmlspecialchars_decode(htmlentities($DBcharacters->f("CharacterName"), ENT_QUOTES, "UTF-8"));
		$Replacement = htmlentities($CharacterName);
		$BodyText = str_replace($CharacterName, $Replacement, $BodyText);
	}
  return $BodyText;
} 

RE: Encoding Foreign Characters

(OP)
I have it working, although it's a bit of a kludge because I don't think the WHILE loop is really what's doing the work. This line is, and without it the function does nothing:

CODE --> PHP

$BodyText = htmlspecialchars_decode(htmlentities($BodyText, ENT_QUOTES, "UTF-8")); 

Also, without this line it creates HTML entities for all the ampersands, even though the ampersand is not a character in the charactercodes table. It apparently does not create HTML entities for < and >, so any HTML that happens to be embedded in the text still works.

CODE --> PHP

$BodyText = str_replace("&amp;", "&", $BodyText);

If anyone can tell me what to do to make it work as it should, I would appreciate it:

CODE --> PHP

function cleanHTML($BodyText) {
// replace foreign characters
	$SQL = "SELECT CharacterName FROM charactercodes";
	$DBcharacters->query($SQL);
	$BodyText = htmlspecialchars_decode(htmlentities($BodyText, ENT_QUOTES, "UTF-8"));
	while ($DBcharacters->next_record()) {
		$OriginalCharacter = $DBcharacters->f("CharacterName");
		$CharacterName = htmlspecialchars_decode(htmlentities($OriginalCharacter, ENT_QUOTES, "UTF-8"));
		$Replacement = htmlentities($CharacterName);
		$BodyText = str_replace($CharacterName, $Replacement, $BodyText);
	}
	$BodyText = str_replace("&amp;", "&", $BodyText);
  return $BodyText;
} 

RE: Encoding Foreign Characters

forgive me for hijacking this thread but two similar threads in as many days suggests that other readers might benefit from some quick reminders

Quote:


a wide variety of sources where I have no control of its actual encoding
does that mean that they are not all uploaded over a web form via a text box?
if they do all come in through a web form, then you do have control over the encoding, as you receive it.

if they are uploaded as files, then again you have the ability to detect the encoding at that time, and take steps to normalise it.

if you are storing the uploaded file in raw form (as a blob) then again, you have the ability to detect the encoding of the file and convert it to whatever form your display can handle.

if the text is being extracted from (say) a PDF and then uploaded, then it depends on the process. if you are just taking the text out (no OCR) then you can get the encoding of the PDF at the same time and make appropriate manipulations at that moment. if OCR is involved then you are in more difficulty, although you would still have to extract the encoding of the PDF to tell the OCR suite what character set is being represented on the page. most good OCR suites will then output in whatever encoding you specify, or you can convert the output yourself from the known output encoding of the OCR suite.

the core take home is that at all points you should try to ensure that you do have control over/knowledge of the incoming encoding and take steps at that time to convert and/or preserve knowledge of the encoding. Otherwise you will be forever relying on guess work.
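as a rough illustration of normalising at the point of receipt, here is a minimal sketch. it assumes the mbstring extension is loaded and that anything which is not valid UTF-8 is really Windows-1252; that fallback is purely an assumption for the example, and `normaliseToUtf8` is a made-up name, not something from the original code.

```php
<?php
// Hypothetical helper: make sure text is valid UTF-8 before it is stored.
// Assumes ext-mbstring. The Windows-1252 fallback is a guess that suits
// text pasted from Western European sources; pick whatever fits your data.
function normaliseToUtf8(string $text, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave it alone
    }
    // Not valid UTF-8: reinterpret the bytes using the fallback encoding.
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

// "café £5" as single-byte Windows-1252, as it might arrive from an old source
$fromForm = "caf\xE9 \xA3" . "5";
$clean = normaliseToUtf8($fromForm);

echo bin2hex($clean), "\n"; // inspect the bytes that would be stored
```

the point is only that the decision is made once, where the data arrives, and everything downstream can rely on UTF-8.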

another good lesson for other readers is (apart from transliteration for encoding purposes), don't manipulate data to be stored with non-idempotent actions like htmlspecialchars etc. store in the raw form and manipulate on display. If you must store a manipulated form because your server is too slow for realtime operations, ensure you ALSO store the raw.
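a short sketch of why that matters, using a made-up sample string and assuming the script file itself is saved as UTF-8. escaping the same text twice does not give the same result as escaping it once, so if a stored value has already been escaped you can never be sure how many times.

```php
<?php
$raw = 'Fish & Chips <b>£5</b>';

// Escaping is not idempotent: the second pass re-escapes the
// ampersands that the first pass produced.
$once  = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
$twice = htmlspecialchars($once, ENT_QUOTES, 'UTF-8');

echo $once, "\n";
echo $twice, "\n";

// Store $raw in the database; escape exactly once, at output time.
echo '<p>', htmlspecialchars($raw, ENT_QUOTES, 'UTF-8'), '</p>', "\n";
```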

---

anyway, back to your actual question!

questions for you

1. did you manipulate the data before inserting into the database? if so can you post the manipulation code?
2. did you also store the raw data? if so can you post a before and after version of some troublesome text.

as i decode your function, my understanding is that it does the following (I have put it in a code block to make it 'easier' to read)...

CODE


$BodyText = htmlspecialchars_decode(htmlentities($BodyText, ENT_QUOTES, "UTF-8"));

encode the bodytext with htmlentity substitution. this means that _every_ character in
the text that has an html entity equivalent will be substituted. [consider adding ENT_HTML5 as an
additional flag; also consider using ENT_COMPAT unless you are certain that single quotes
are not being used as double escape characters anywhere]

then with that output reverse out the work that you have just done, so that
&amp; becomes ampersand
&quot; becomes double quotes
&#039; are left untouched and not converted back to single quotes (you have not sent
the quotes flag, and before PHP 8.1 the default for htmlspecialchars_decode is ENT_COMPAT)
&lt; becomes the less than symbol
&gt; becomes the greater than symbol

I don't see the value in that exercise but trust that you have a reason.

$CharacterName = htmlspecialchars_decode(htmlentities($OriginalCharacter, ENT_QUOTES, "UTF-8"));

you then do the same thing on a character by character basis. so taking for example the pound
sign. that will be converted to an html entity and then left intact (as it is not an html
special character).

so $CharacterName at that point will be &pound;

$Replacement = htmlentities($CharacterName);

this confuses me. at this point you are reconverting something that has either just been
converted or just been converted and unconverted. and importantly you are not specifying
a strategy for quote handling nor an output charset. potentially a recipe for disaster.

the reconversion will completely break our pound example, as the ampersand at the start of the entity will itself be converted.

so $CharacterName is (at the moment) &pound;
$Replacement will be &amp;pound;

is that intended? no browser will be able to deal with that to display a pound sign.
it will look like &pound;

next you do a search and replace of $BodyText such that any 'properly' formed &pound;(s)
will be broken.

$BodyText = str_replace("&amp;", "&", $BodyText);
and then lastly, once those transformations are done, you go back through the whole
string and transform those &amp;pound; back to how they should be

BUT you also transform PROPER ampersand character entities back to a pure ampersand.
hopefully with modern browsers and a proper page encoding that won't matter. but it does
rather undo a good part of what you are intending.

I ran this through an interactive php session so you could see what is happening. here is the trace

CODE


php > $OriginalCharacter = '£';
php > $CharacterName = htmlentities($OriginalCharacter, ENT_QUOTES, 'UTF-8');
php > echo $CharacterName ."\n";
&pound;
php > $CharacterName = htmlspecialchars_decode($CharacterName);
php > echo $CharacterName ."\n";
&pound;
php > $Replacement = htmlentities($CharacterName);
php > echo $Replacement ."\n";
&amp;pound;
php >

my suggestion is to take a step back and rearticulate the original problem, then work out how to solve it. Unless I have missed the point, these iterative steps are not the way to go.

and beware - these functions are only safe if both input and output share the same character set. if there is a chance that the input does not share the same character set then this has a great chance of garbling the output.
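for what it's worth, here is how I would sketch "replace only the characters in my table" without the encode/decode round trips. it assumes the body text and the table entries are both valid UTF-8 and that the script file is saved as UTF-8. a plain array stands in for the charactercodes query because I can't see your database class, and `entityMapFor` and `encodeListed` are made-up names.

```php
<?php
// Build a character => entity map once, from the list of characters to replace.
// $characters stands in for the rows of the charactercodes table (assumption).
function entityMapFor(array $characters): array
{
    $map = [];
    foreach ($characters as $char) {
        $entity = htmlentities($char, ENT_QUOTES, 'UTF-8');
        // Skip characters with no named entity, and invalid input
        // (for which htmlentities returns an empty string).
        if ($entity !== $char && $entity !== '') {
            $map[$char] = $entity;
        }
    }
    return $map;
}

// strtr() with an array replaces in a single pass and never rescans its own
// output, so entities and markup already in the text are left alone.
function encodeListed(string $bodyText, array $characters): string
{
    $map = entityMapFor($characters);
    return $map === [] ? $bodyText : strtr($bodyText, $map);
}

echo encodeListed('<b>Café</b> &amp; bar: £5', ['é', '£']), "\n";
```

because the text itself is never passed through htmlentities, there is nothing to decode afterwards and no need for the final str_replace on the ampersands.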
