Find non-Latin sentences in a UTF-8 text file and replace them

dimsis · Apr 15, 2004

How can I separate non-Latin (Greek) characters from the rest of a UTF-8 text file and replace them with a string?
I need to replace each non-Latin sentence as a whole, not every character or word, so the function / routine / algorithm must have the logic to recognize a full run of non-Latin characters as one sentence and replace it.

<tr valign="top" bgcolor="ffffff">
<td bgcolor="446699">Here is the first non latin chars</td>
</tr>
<tr>
<td>Here are some more non latin chars</td>
</tr>

etc...
I need code to "scan" for all the non-Latin (UTF-8 encoded) sentences and replace each one with whatever I want. For example:
<tr valign="top" bgcolor="ffffff">
<td bgcolor="446699"><REPLACED WITH FIRST CUSTOM TAG></td>
</tr>
<tr>
<td><REPLACED WITH SECOND CUSTOM TAG></td>
</tr>

Thanks in advance,
Dimitris

chiph · Apr 18, 2004

This is rather difficult in VB6 (piece of cake in VB.NET).

The problem is that you're starting with a multi-byte UTF-8 file. You need to call the Win32 API MultiByteToWideChar to convert the contents into a UTF-16 string (the kind that VB uses internally). You can then compare the contents of the string against your Greek characters (checking for full sentences, etc.) and replace them.

You'd then call the other Win32 encoding conversion function, WideCharToMultiByte, to convert your new string back to UTF-8 before writing it to your file.

I've never had any luck calling those particular functions from within VB; I've always had to write an ATL C++ DLL to make the call (other people seem to be able to do it -- I guess they have the magic touch).

Also, don't expect super-fast speed. VB's memory allocation for moving around large strings is not the fastest in the world.
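
For readers who only want to see the logic, the decode / replace / re-encode steps described above can be sketched outside VB. The following is a minimal Python illustration, not VB6 code and not a wrapper around the Win32 calls: `bytes.decode` and `str.encode` stand in for MultiByteToWideChar and WideCharToMultiByte. The function name `replace_greek_runs`, the choice of Unicode ranges (Greek and Coptic, Greek Extended) and the rule that a "sentence" is a run of Greek letters joined by whitespace or basic punctuation are all assumptions made for the example.

```python
import re

# Assumption: "Greek" means the Greek and Coptic block plus Greek Extended.
GREEK = "\u0370-\u03FF\u1F00-\u1FFF"

# A run starts and ends with a Greek letter; whitespace and basic
# punctuation are allowed only between Greek letters.
GREEK_RUN = re.compile(rf"[{GREEK}](?:[{GREEK}\s.,;:!?-]*[{GREEK}])?")


def replace_greek_runs(data: bytes, replacement: str) -> bytes:
    """Decode UTF-8 bytes, replace every Greek run, re-encode as UTF-8."""
    text = data.decode("utf-8")                # bytes -> Unicode string
    text = GREEK_RUN.sub(replacement, text)    # whole run, not char by char
    return text.encode("utf-8")                # Unicode string -> bytes


if __name__ == "__main__":
    sample = "<td>Καλημέρα κόσμε, τι κάνεις</td><td>Hello</td>".encode("utf-8")
    print(replace_greek_runs(sample, "[GREEK]").decode("utf-8"))
```

Digits and trailing punctuation are deliberately left outside the run in this sketch, so a sentence containing numbers would be split into two runs; the character class would need widening for real data.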

Chip H.

If you want to get the best response to a question, please check out FAQ222-2244 first

dimsis · Apr 18, 2004

Thank you, Chip, for your suggestions.

If anyone has a VB example using these APIs, or any other solution, please post a reply.

Thanks in advance,
Dimitris

dimsis · Apr 18, 2004

Also...
if someone can post an example (WITHOUT using UTF-8 characters), it would help me a lot.
So the example I need is how to replace only the NON-LATIN sentences in my string from VB.

Example:
<tr valign="top" bgcolor="ffffff">
<td bgcolor="446699">Here is the first non latin chars</td>
</tr>
<tr>
<td>Here are some more non latin chars</td>
</tr>

etc...
I need code to "scan" for all the non-Latin (UTF-8 encoded) sentences and replace each one with whatever I want. For example:
<tr valign="top" bgcolor="ffffff">
<td bgcolor="446699"><REPLACED WITH FIRST CUSTOM TAG></td>
</tr>
<tr>
<td><REPLACED WITH SECOND CUSTOM TAG></td>
</tr>
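
The replacement asked for above, where each non-Latin sentence gets its own tag, could look roughly like the following. This is a Python sketch of the logic rather than VB code, and it works on a string that is already decoded, so no UTF-8 handling is involved; the Greek sample text is written with `\u` escapes to keep the source pure ASCII. The name `tag_greek_runs`, the tag format and the Unicode ranges treated as "Greek" are assumptions for illustration.

```python
import re

# Assumption: "non-Latin" means Greek here (Greek and Coptic, Greek Extended).
GREEK = "\u0370-\u03FF\u1F00-\u1FFF"

# One "sentence" = Greek letters joined by whitespace or basic punctuation.
GREEK_RUN = re.compile(rf"[{GREEK}](?:[{GREEK}\s.,;:!?-]*[{GREEK}])?")


def tag_greek_runs(text, tag_format="<CUSTOM TAG {n}>"):
    """Replace each Greek run with a numbered tag.

    Returns the new text and the list of original runs, in order,
    so the caller can map tag n back to originals[n - 1].
    """
    originals = []

    def substitute(match):
        originals.append(match.group(0))
        return tag_format.format(n=len(originals))

    return GREEK_RUN.sub(substitute, text), originals


if __name__ == "__main__":
    html = ('<td bgcolor="446699">\u0393\u03b5\u03b9\u03b1 \u03c3\u03bf\u03c5</td>\n'
            '<td>\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1</td>')
    new_html, found = tag_greek_runs(html)
    print(new_html)
    print(len(found), "runs replaced")
```

Because the markup itself is plain ASCII, the tags and attributes are never matched; only the text between them is replaced.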

happyabc · Apr 19, 2004

Hey, what's all this "UTF" stuff anyway?

dimsis · Apr 19, 2004

Unicode is well on the way to replacing ASCII, ISO 8859 and EUC at all levels. It allows you to handle not only text in practically any script and language used on this planet; it also provides you with a comprehensive set of mathematical and technical symbols to simplify scientific information exchange.

With the UTF-8 encoding, Unicode can be used in a convenient and backwards compatible way in environments that, like Unix, were designed entirely around ASCII. UTF-8 is the way in which Unicode is used under Unix, Linux, and similar systems. It is now time to make sure that you are well familiar with it and that your software supports UTF-8 smoothly.

Link:

http://www.cl.cam.ac.uk/~mgk25/unicode.html
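
To make the "backwards compatible" point concrete, here is a small Python sketch (the helper names are made up for the example). It shows that ASCII text encodes to the same bytes in UTF-8 as in ASCII, while each Greek letter in the sample takes two bytes, all of them 0x80 or above. That is why the ASCII markup in the file survives untouched when the text is read as UTF-8.

```python
def same_as_ascii(text: str) -> bool:
    """True if the UTF-8 bytes of an ASCII-only string equal its ASCII bytes."""
    return text.encode("utf-8") == text.encode("ascii")


def utf8_profile(text: str):
    """Return (character count, byte count, True if every byte is >= 0x80)."""
    encoded = text.encode("utf-8")
    return len(text), len(encoded), all(b >= 0x80 for b in encoded)


if __name__ == "__main__":
    print(same_as_ascii("<td>Hello</td>"))
    # Five Greek letters written as escapes, so this source stays pure ASCII.
    print(utf8_profile("\u0391\u03b8\u03ae\u03bd\u03b1"))
```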

happyabc · Apr 19, 2004

Thanks for a good link.
