Find non-Latin sentences in a UTF-8 text file and replace them

Status
Not open for further replies.

dimsis

Programmer
Aug 6, 2000
76
GR
How can I separate non-Latin characters (Greek) in a UTF-8 text file and replace them with a string?
I need to replace each whole non-Latin sentence, not every character or word, so the function / routine / algorithm must have the logic to recognize a full run of non-Latin characters as one sentence and replace it.

<tr valign="top" bgcolor="ffffff">
<td bgcolor="446699">Here is the first non latin chars</td>
</tr>
<tr>
<td>Here are some more non latin chars</td>
</tr>

etc.
I need code to "scan" all the UTF-8 sentences and replace each one with whatever I want, for example:
<tr valign="top" bgcolor="ffffff">
<td bgcolor="446699"><REPLACED WITH FIRST CUSTOM TAG></td>
</tr>
<tr>
<td><REPLACED WITH SECOND CUSTOM TAG></td>
</tr>

Thanks in advance,
Dimitris
 
This is rather difficult in VB6 (piece of cake in VB.NET).

The problem is that you're starting with a multi-byte UTF-8 file. You need to call the Win32 API MultiByteToWideChar (with the CP_UTF8 code page) to convert the contents into a UTF-16 string (the kind that VB uses internally). You can then compare the contents of the string against your Greek characters (checking for full sentences, etc.) and replace them.

You'd then call the other encoding conversion function from Win32, WideCharToMultiByte, to convert your new string back to UTF-8 before writing it to your file.
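The decode, replace, re-encode round trip described above can be sketched in a language-neutral way. This is only a sketch in Python, not VB6 code: the name replace_greek_runs, the choice of Unicode ranges, and the rule that a "sentence" is a run of Greek letters joined by spaces and basic punctuation are all assumptions made for illustration. The sample text is written with \u escapes so the source contains no literal Greek characters.

```python
import re

# Assumption: "non-Latin" means Greek here, i.e. the Greek and Coptic block
# (U+0370-U+03FF) plus Greek Extended (U+1F00-U+1FFF).
GREEK = "\u0370-\u03ff\u1f00-\u1fff"

# A run starts and ends on a Greek letter; in between it may contain
# whitespace and basic punctuation, so a whole sentence is one match.
GREEK_RUN = re.compile(rf"[{GREEK}](?:[{GREEK}\s.,;:!?'-]*[{GREEK}])?")


def replace_greek_runs(data, tags):
    """Decode UTF-8 bytes, replace each Greek run with the next tag, re-encode."""
    text = data.decode("utf-8")      # UTF-8 bytes -> Unicode string
    remaining = iter(tags)

    def substitute(match):
        # Use the next tag; keep the original text if the tags run out.
        return next(remaining, match.group(0))

    return GREEK_RUN.sub(substitute, text).encode("utf-8")


if __name__ == "__main__":
    sample = (
        '<td bgcolor="446699">\u0393\u03b5\u03b9\u03b1 \u03c3\u03bf\u03c5</td>\n'
        "<td>\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1</td>\n"
    ).encode("utf-8")
    print(replace_greek_runs(sample, ["<TAG1>", "<TAG2>"]).decode("utf-8"))
```

Because a run must end on a Greek letter, a trailing full stop or the whitespace before a closing tag is left in place rather than swallowed by the replacement.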

I've never had any luck calling those particular functions from within VB. I've always had to write an ATL C++ DLL to make the call (other people seem to be able to do it -- I guess they have the magic touch).

Also, don't expect super-fast speed. VB's memory allocation for moving around large strings is not the fastest in the world.

Chip H.


 
Thank you Chip for your suggestions,

If someone has a VB example using these APIs, or any other solution, please post a reply.

Thanks in advance,
Dimitris
 
Also...
If someone can post an example (WITHOUT using UTF-8 characters), it would help me a lot.
So the example I need is how to replace only the NON-LATIN sentences in my string from VB.

Example:
<tr valign="top" bgcolor="ffffff">
<td bgcolor="446699">Here is the first non latin chars</td>
</tr>
<tr>
<td>Here are some more non latin chars</td>
</tr>

etc.
I need code to "scan" all the UTF-8 sentences and replace each one with whatever I want, for example:
<tr valign="top" bgcolor="ffffff">
<td bgcolor="446699"><REPLACED WITH FIRST CUSTOM TAG></td>
</tr>
<tr>
<td><REPLACED WITH SECOND CUSTOM TAG></td>
</tr>
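One way to find the non-Latin sentences without typing any Greek into the source is to test code points numerically. The sketch below is in Python rather than VB6, and the names is_greek and find_greek_runs, the two Unicode ranges, and the set of "joining" punctuation are assumptions chosen for illustration. The character-by-character loop is meant to carry over to a VB6 loop using Mid$ and AscW, though that translation is not shown or tested here.

```python
def is_greek(ch):
    # Assumption: Greek and Coptic (U+0370-U+03FF) and Greek Extended
    # (U+1F00-U+1FFF) are the only ranges that count as "non-Latin".
    code = ord(ch)
    return 0x0370 <= code <= 0x03FF or 0x1F00 <= code <= 0x1FFF


def find_greek_runs(text):
    """Return (start, end) index pairs, end exclusive, one per Greek run."""
    joiners = " .,;:!?'-"   # characters allowed inside a sentence
    runs = []
    i, n = 0, len(text)
    while i < n:
        if not is_greek(text[i]):
            i += 1
            continue
        start = i
        last = i            # index of the last Greek letter seen so far
        i += 1
        # Joiners extend the run only when more Greek follows them.
        while i < n and (is_greek(text[i]) or text[i] in joiners):
            if is_greek(text[i]):
                last = i
            i += 1
        runs.append((start, last + 1))
        i = last + 1
    return runs


if __name__ == "__main__":
    line = "<td>\u03b1\u03b2\u03b3 \u03b4\u03b5</td>"
    for start, end in find_greek_runs(line):
        print(start, end, repr(line[start:end]))
```

With the run boundaries known, the replacement itself is plain string slicing: keep the text before start, append the custom tag, and continue from end.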
 
Unicode is well on the way to replacing ASCII, ISO 8859 and EUC at all levels. It allows you to handle not only text in practically any script and language used on this planet, it also provides you with a comprehensive set of mathematical and technical symbols to simplify scientific information exchange.

With the UTF-8 encoding, Unicode can be used in a convenient and backwards compatible way in environments that, like Unix, were designed entirely around ASCII. UTF-8 is the way in which Unicode is used under Unix, Linux, and similar systems. It is now time to make sure that you are well familiar with it and that your software supports UTF-8 smoothly.
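The backwards compatibility mentioned above is easy to see at the byte level. As a small illustration (Python is used only for the demonstration, and utf8_hex is a hypothetical helper name), ASCII characters such as HTML markup keep their single-byte ASCII values in UTF-8, while a Greek letter becomes a two-byte sequence:

```python
def utf8_hex(text):
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))


if __name__ == "__main__":
    print(utf8_hex("<td>"))     # markup: one byte per character, same as ASCII
    print(utf8_hex("\u03b1"))   # Greek small alpha: two bytes
```

For the file in this thread, that means the HTML tags are stored exactly as they would be in ASCII, and only the Greek text uses multi-byte sequences.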
