Problem with Eastern European languages in reports

(OP)
Hi,
I have a VFP9 web application where I want to add support for the Czech language. It all works fine except for the reports. If Czech words are entered via the web interface, they are stored in the table like this:


When I list these words again in the web browser, it looks perfectly fine, like this:


The problem is in the reports, where the contents of the table are listed as stored, except for a couple of letters, for example "Š", that are actually shown correctly:


My current codepage is 1252 and I have tried other codepages, STRCONV(), etc., but I can never get VFP to show the original texts again. Is information lost once the words are written to the table? If so, how come the web browser is able to show it correctly? Any hints on how to get the reports to show the original texts?

BR,
Micael

RE: Problem with Eastern European languages in reports

The web page likely is UTF-8. It doesn't matter what codepage your DBF is; looking into it you see some garbled characters, but they still make up the bytes representing the correct UTF-8 characters, so there's no problem.
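To illustrate what those garbled bytes are (a sketch, assuming the current codepage is 1252): "Š" is Unicode codepoint U+0160, which UTF-8 encodes as the two bytes 0xC5 0xA0; viewed as 1252 characters, those two bytes are exactly the kind of garbage you see in the DBF, yet converting them back restores the original:

CODE --> VFP

* Sketch (assumes current codepage 1252): the UTF-8 bytes behind "Š"
? STRCONV("Š", 9)                  && ANSI "Š" -> UTF-8; two bytes, garbled in a 1252 display
? ASC(STRCONV("Š", 9))             && 197 (0xC5), the first UTF-8 byte
? ASC(SUBSTR(STRCONV("Š", 9), 2))  && 160 (0xA0), the second UTF-8 byte
* Converting back restores the original character:
? STRCONV(STRCONV("Š", 9), 11)     && "Š" again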

The report output shows that your web page input also contains some HTML entities; those will only be displayed correctly in a browser.

If you want to print the data, it has to limit itself to the DBF encoding, and that has to match the current codepage; or you make use of report controls that are capable of showing your data correctly; or you don't print normal reports but output HTML pages to print. There's a lot you can do in CSS to make printable HTML, see https://developer.mozilla.org/en-US/docs/Web/CSS/P...

If you intend to stay with VFP reports, you have to clean your data so it a) has no HTML entities and b) conforms to the codepage choice you make. 1252 is Western European; you have a better choice with 1250, the Eastern European codepage supported on Eastern European Windows versions.

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: Problem with Eastern European languages in reports

(OP)
Aha, I didn't realize that the data stored in the tables contains HTML entities. That's probably because the data is HTML-encoded when it is sent via POST from the browser. I guess I need to make sure that the original text is stored. I will try the 1250 codepage. The languages I need to support are Swedish, Finnish, Norwegian, English and Czech.

I need to stay with the VFP reports as they are pretty complex and would take a lot of time to convert to something else.

Thank you for this input.

RE: Problem with Eastern European languages in reports

I guess the HTML entities already come from the browser's POST request; getting only characters of a given codepage is maybe not strictly possible from HTML input controls. When users paste in something containing UTF-8 and the webpage uses, say, latin1, that might cause some characters to be converted to HTML entities so the webpage can still display them. FoxPro doesn't share that strategy.

Essentially, numeric HTML entities are those &#xyz; codes where xyz is a decimal number, so CHR(xyz) for values in the range 0-255. It's not guaranteed to be the character really entered when you're not using the same encoding as the webpage. Besides, there are named HTML entities like &gt; for the greater-than character. Since there are many places where the codes can change, this surely will not be straightforward. If your server-side script uses functions for sanitizing user input, the HTML entities might come from there, to circumvent characters used in SQL injection attacks, at the price of them only being suitable for web page output; in that case you should instead parameterize insert queries.

You also don't tell whether your web scripts store directly into DBFs or whether this goes through a MySQL database; in the latter case there can already be two further encoding changes, from web to MySQL and then from MySQL to DBF.

It may be best to work with UTF-8 all the way, including how you get the text into VFP variables, and then use STRCONV() to convert from UTF-8 to the current codepage with STRCONV(utf8text, 11).
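As a sketch of that roundtrip (variable names are hypothetical, and it assumes the incoming POST body really is UTF-8):

CODE --> VFP

* Hypothetical sketch: convert once on input, once on output
LOCAL lcUtf8, lcAnsi
lcUtf8 = m.lcPostBuffer            && raw bytes from the POST request, assumed UTF-8
lcAnsi = STRCONV(m.lcUtf8, 11)     && UTF-8 -> current ANSI codepage; store this in the DBF
* ...and when sending DBF contents back to the browser:
? STRCONV(m.lcAnsi, 9)             && ANSI -> UTF-8 for the web page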

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: Problem with Eastern European languages in reports

(OP)
I use winsock and take care of the POST buffer and store directly to a VFP table. Here you see how the text začátečníkům looks in the debugger, as an example:


How can I treat the texts as UTF-8 all the way, like you suggest? The text above is stored as "začátečníkům" as it is now. Should I sanitize the input before storing?

RE: Problem with Eastern European languages in reports

You have to make the webpage work in UTF-8 encoding. If that's not under your control, you have to dig into decoding this. You need to know what encoding the webpage uses.

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: Problem with Eastern European languages in reports

(OP)
I have full control over the web page, and when looking in Chrome, the code page is interpreted as 1252 even though I say UTF-8 in the HTTP header:


Changing to storing strings as UTF-8 is too risky, as I have about 1500 strings and probably do string matching here and there. I'm thinking of storing the Czech strings including the HTML entities (as that works fine in the web environment) and stripping the strings on the fly only when producing reports.

RE: Problem with Eastern European languages in reports

Mibosoft,

You can post-process the HTML entities and prepare the VFP data to be displayed using Eastern European encoding.

This is a small function to help you with that (not fully tested, but I think it's fairly OK):

CODE --> VFP

LOCAL Encoding AS Integer

m.Encoding = 238

? HTMLNumEntityToANSI("FBŠ SLAVIA Plze&#328;", m.Encoding) Font "Arial", 12, m.Encoding
? HTMLNumEntityToANSI("nápov&#283;d&#283;", m.Encoding) Font "Arial", 12, m.Encoding

FUNCTION HTMLNumEntityToANSI (Source AS String, ANSIEncoding AS Integer)

	LOCAL Encoded AS String
	LOCAL NumEntity AS String
	LOCAL Codepoint AS String

	m.Encoded = m.Source
	m.NumEntity = STREXTRACT(m.Encoded, "&#", ";", 1, 4)
	DO WHILE !EMPTY(m.NumEntity)

		m.Codepoint = BINTOC(VAL(SUBSTR(m.NumEntity, 3)), "2RS") && this is the UNICODE codepoint for the entity
		m.Encoded = STRTRAN(m.Encoded, m.NumEntity, STRCONV(m.Codepoint, 6, m.ANSIEncoding, 2)) && make it ANSI, if possible

		m.NumEntity = STREXTRACT(m.Encoded, "&#", ";", 1, 4)
	ENDDO

	RETURN m.Encoded

ENDFUNC 

RE: Problem with Eastern European languages in reports

(OP)
This is awesome! Then I can store the strings including HTML entities and use this function to return strings used in my reports. Thanks a lot!

RE: Problem with Eastern European languages in reports

The encoding isn't just specified in the HTML; the server sends a header specifying Windows-1252, obviously.

This is something you'd need to address in the webserver configuration, defining the HTTP Content-Type header sent with any HTTP response.

https://www.w3.org/International/articles/http-cha...

Likewise, look into the enctype attribute of HTML forms.

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: Problem with Eastern European languages in reports

(OP)
I have written the server myself using winsock directly, and I can change the header to UTF-8, but if I do that, texts stored in tables containing the Swedish letters åäö will not be shown correctly.

I'm a bit confused here. What do I need to do to use UTF8 all the way from storing texts in tables, using in the web application, in reports, etc?

RE: Problem with Eastern European languages in reports

Of course you also need to convert your ANSI output to UTF-8; just sending the header does no automatic conversion. The server has no idea what the original codepage is, so you have to feed it UTF-8 and expect UTF-8 back.

The big advantage of using UTF-8 on the complete roundtrip on the browser side is that when users copy in texts or enter anything, it is UTF-8 anyway. UTF-8 allows the most input.

Only part of it can be converted to codepage 1250, but you won't have HTML entities (like &#999;) in the incoming data, so you spare the conversion of those; you only need the STRCONV(input, 11) to the then-current codepage 1250.

You should program in 1250 and run in that codepage, which happens automatically on Eastern European Windows or, if not, via a config.fpw CODEPAGE=1250 setting.
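For reference, the config.fpw line is just one setting (a sketch; see the VFP help on configuration files for the other options):

CODE --> VFP

* config.fpw (sketch): start the VFP session in the Eastern European codepage
CODEPAGE=1250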

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: Problem with Eastern European languages in reports

(OP)
So I need to store all of my current strings as STRCONV(ANSI-STRING, 9) in the table to be UTF-8 and use STRCONV(TABLE-STRING, 11) in reports etc. A problem is that LOWER()/UPPER() does not work on a UTF-8 string, which I sometimes use today in web layouts; the result is not shown correctly in the browser.

RE: Problem with Eastern European languages in reports

Mibosoft,

You may use the Windows API to perform such transformations on UTF-8 strings, just make them UNICODE before calling the relevant API functions.

This example takes Dostoevsky's name in Russian in UTF-8, displays it using an ANSI encoding, then in upper case, and finally in lower case.

CODE --> VFP

CLEAR

DECLARE INTEGER CharUpperBuffW IN User32 AS W32_Unicode_Upper ;
	STRING @ UnicodeString, INTEGER StringLength
DECLARE INTEGER CharLowerBuffW IN User32 AS W32_Unicode_Lower ;
	STRING @ UnicodeString, INTEGER StringLength

LOCAL Dostoevsky AS String

LOCAL DostoevskyUTF8 AS String
LOCAL DostoevskyUNICODE AS String

m.DostoevskyUTF8 = STRCONV("d094d0bed181d182d0bed0b5d0b2d181d0bad0b8d0b92c20d0a4d191d0b4d0bed18020d09cd0b8d185d0b0d0b9d0bbd0bed0b2d0b8d187", 16)

? m.DostoevskyUTF8

m.Dostoevsky = STRCONV(m.DostoevskyUTF8, 11, 204, 2)

? m.Dostoevsky Font "Arial", 12, 204

m.DostoevskyUNICODE = STRCONV(m.DostoevskyUTF8, 12)

? STRCONV(Unicode_Upper(m.DostoevskyUNICODE), 6, 204, 2) Font "Arial", 12, 204
? STRCONV(Unicode_Lower(m.DostoevskyUNICODE), 6, 204, 2) Font "Arial", 12, 204

FUNCTION Unicode_Upper (Source AS String)

	LOCAL StrBuffer AS String

	m.StrBuffer = m.Source
	W32_Unicode_Upper(@m.StrBuffer, INT(LEN(m.StrBuffer) / 2))

	RETURN m.StrBuffer

ENDFUNC

FUNCTION Unicode_Lower (Source AS String)

	LOCAL StrBuffer AS String

	m.StrBuffer = m.Source
	W32_Unicode_Lower(@m.StrBuffer, INT(LEN(m.StrBuffer) / 2))

	RETURN m.StrBuffer

ENDFUNC 

RE: Problem with Eastern European languages in reports

No, you don't have to store data in UTF-8; you just have to make the conversion on the way from DBF to the web and then reconvert to ANSI when the data comes in. You avoid HTML entities, so you have one less problem, and the web world works best in UTF-8.

Atlopes also gave you some workarounds for working with UTF-8 in VFP, but you don't need to go that far.

If you think of HTML entities as the minor problem: the character in your DBF for a certain ASC() value coming in via an HTML entity may NOT be what the user entered. You can't really force the web browser to work in your DBF's ANSI codepage; yes, there are also those content encodings, and browsers still support them, but you already see the problems you have with that.

If users really entered valid characters of the charset the HTML page is set to, then the form submission would send them 1:1 and not convert them to HTML entities. Do you really think users type it in that way?

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: Problem with Eastern European languages in reports

(OP)
I have a languages table with one column per language (swe, eng, fin, nor and cze). I just did this test:
1) I transformed all these strings for all languages to UTF-8 with STRCONV(<lang>, 9).
2) I changed my server to put UTF-8 in the HTTP header (Content-Type: text/html; charset=utf-8).
3) In my reports I use STRCONV(<text from lang table>, 11) when I fetch texts from the languages table, to display them right.

Isn't this the cleanest solution?

I'm thinking of using a boolean to decide if the text-fetching function shall return the string as it is (UTF-8) or as STRCONV(<text from lang table>, 11). Then I don't need to change all my reports, just set this boolean before generating them.
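A sketch of such a text-fetching function (table, column, and parameter names here are hypothetical, not from the actual application):

CODE --> VFP

* Hypothetical sketch: one lookup function, a flag decides web (UTF-8) vs. report (ANSI)
FUNCTION GetLangText (tcKey AS String, tlForReport AS Logical)

	LOCAL lcUtf8 AS String

	* "lang" table and "cze"/"langkey" columns are assumptions;
	* the cze column is assumed to hold UTF-8 in this scheme
	SELECT cze FROM lang WHERE langkey == m.tcKey INTO CURSOR crsText
	lcUtf8 = crsText.cze
	USE IN crsText

	RETURN IIF(m.tlForReport, STRCONV(m.lcUtf8, 11), m.lcUtf8)

ENDFUNC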

RE: Problem with Eastern European languages in reports

If your app has to handle multiple languages that you can't display with the same ANSI codepage, you're confronted with further problems. Then working as well as you can in Unicode throughout maybe is a solution.

If I assume in step 1) you mean you actually store UTF-8 in each language column, then a BROWSE will only show correctly the parts of the texts that are the 26 Latin letters. If you make no conversion when you output this to the web and convert to ANSI in reports, that works, but you have the disadvantages of working with such strings that you mentioned yourself.

I don't know how I'd work with this; the string functions for double-byte characters don't handle UTF-8 or any Unicode variant, they only handle some ANSI codepages with double-byte characters. Steven Black has described very well what effort it is to not go full Unicode but do it the ANSI way and enable, for example, Japanese as the language for non-Unicode programs in Windows (http://stevenblack.com/intlasia/); the Eastern European languages still are in the single-byte character sets.

You're using the right STRCONV() parameters, nothing against that, but with UTF-8 you do introduce some double-byte characters that don't translate into the most general 1252 codepage. So you can't keep the advantage of not needing AT_C(), SUBSTRC(), etc., the double-byte character string functions, but you also can't use them on UTF-8 strings. You just will have no fun with BROWSE of the texts and editing them.

Staying with ANSI, you'd have to go the harder route of having separate tables for the languages in different codepages, and your forms would still just use one current codepage, reports likewise. Regarding point 3): STRCONV() will work on the assumption that <text from lang table> is in the DBF codepage if you specify the text as table.field; when you first copy the text into a string variable, it will work on the assumption that the variable contains the current application codepage. So for that conversion into the codepage you need for the language, you will need to ensure the application has the right setting, and then it won't support all languages, just the ones reports can print with the current codepage. STRCONV() has additional parameters to set the target codepage, which is helpful especially if the source string is Unicode or UTF-8, but the report will work on the current codepage anyway. Strings don't get a marker or metadata saying what codepage they are.

Because of that, you cannot support all languages in a single application session. You can only override the usage of the Windows system codepage when you specify CODEPAGE=... in a config.fpw; CPCURRENT() will then tell you that, but there is no SET CODEPAGE to let reports run in different codepages.

For that reason, overall, I think I'd split the languages DBF into separate DBFs for each language, use the appropriate DBF codepage for each language, and then at startup offer switching to the language CPCURRENT() supports, the codepage your application process is set to. And then don't store UTF-8; be able to work normally inside VFP, including reports, and only convert and reconvert when transitioning to the web, even if the web is your main frontend.

There is a situation that differs slightly and which I recently tried for the first time: ISAPI (see thread184-1797409: Accessing DBF's Remotely). When you embed your web output into Apache or IIS via the foxisapi.dll and write an EXE COM server for the web page outputs, the way the foxisapi.dll works is to create a new instance of your COM server for each request. If you manage to make that happen with the correct codepage (I have no idea how; for example, a PHP helper script would need to swap out different config.fpw files for the COM server EXE before it gets called), you can switch the codepage used by VFP and VFP reports on every web request made.

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: Problem with Eastern European languages in reports

(OP)
Thank you Olaf for all the input on this. Now I'm leaning toward keeping my strings as ANSI anyway and storing the CZE strings with the HTML entities included. I thought of using the HTMLNumEntityToANSI() function above, written by atlopes, for reports (non-web environment).

atlopes, could you explain:
Your example uses it like this:
? HTMLNumEntityToANSI("FBŠ SLAVIA Plze&#328;", m.Encoding) Font "Arial", 12, m.Encoding

If I use just the function call in my reports, without the "Font "Arial", 12, m.Encoding" part, some letters are not transformed correctly. For example, "ě" is displayed as "ì". How do I solve this? I use Times New Roman in all of my reports. Do I have to hard-code that?

RE: Problem with Eastern European languages in reports

(OP)
I should also mention that the language table is only for the strings that are part of the application/system. Users also enter texts, for example team and player names. This means that Swedish, English, Finnish, Norwegian and Czech users are entering strings into the very same database tables, and these texts shall of course look OK together. For the web part, this is no problem with all strings in ANSI and Czech strings with HTML entities. To make the Swedish åäö display correctly, I let the server use "Content-Type: text/html; charset=iso-8859-1" in the HTTP header.

RE: Problem with Eastern European languages in reports

Mibosoft,

The key factor is the character set value. It can be set as part of the Font clause in a ? statement (the m.Encoding in the example you presented), or as the FontCharSet property in any control that uses a font to display data, or indirectly as the assigned script in a font / GETFONT() selection.
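For instance, the same charset value plugs into a form control like this (control name and values are hypothetical, just to show where the property lives):

CODE --> VFP

* Sketch: the same character set value, applied to a control instead of a ? clause
thisform.txtTeam.FontName = "Times New Roman"
thisform.txtTeam.FontCharSet = 238    && EASTEUROPE_CHARSET, matching codepage-1250 data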

You may store data in your tables that matches different ANSI character sets as long as you a) are able to identify the correct character set to display or process a particular character-based field of your table; b) do not expect to mix character sets in the same field; and c) inhibit code page translation in the affected fields.

In the image below you can see a grid that represents these data:

CODE --> VFP

CREATE CURSOR test (col1 Varchar(200) NOCPTRANS, col2 Integer)

INSERT INTO test VALUES ("Hugo, Victor", 0)
INSERT INTO test VALUES ("ÇÈæ ÇáØíÈ ãÊäÈí", 178)
INSERT INTO test VALUES ("Äîñòî¼åâñêè, Ô¼îäîð", 204) 



The simultaneous display of different scripts that can be observed relies on the Grid's dynamic capabilities: each script has its own textbox control, set with its own FontCharSet property.

RE: Problem with Eastern European languages in reports

(OP)
Thank you atlopes. It works when I check "use font script" for a specific field in a report. Do I have to go through every report and every field, or is there any way to set this globally/by default for all of my reports and forms?

RE: Problem with Eastern European languages in reports

The default always follows the current codepage. How does that map to FontCharSet? Like this:

Codepage of operating system			Assigned character set
1250 (Central Europe)				EASTEUROPE_CHARSET (238)
1252 (Latin I)					DEFAULT_CHARSET (1)
1251 (Cyrillic)					RUSSIAN_CHARSET (204)
1253 (Greek)					GREEK_CHARSET (161)
1254 (Turkish)					TURKISH_CHARSET (162)
1257 (Baltic)					BALTIC_CHARSET (186)
1258 (Vietnam)					VIETNAMESE_CHARSET (163)
874 (Thai)					THAI_CHARSET (222)
932 (Japanese Shift-JIS)			SHIFTJIS_CHARSET (128)
936 (Simplified Chinese)			GB2312_CHARSET (134)
950 (Traditional Chinese Big5)			CHINESEBIG5_CHARSET (136)
949 (Korean)					HANGEUL_CHARSET (129) 

Which I found on https://www.softvision.de/fileadmin/user_upload/so...

So the major way VFP works, without FontCharSet, covers most cases where you only need the codepage your OS uses by default for ANSI programs. FontCharSet enables you to break out of that, as long as the font used (like MS Arial) supports several codepages.

So if you don't want to go through all reports for the different languages and encodings, don't set FontCharSet, but look for a way to start a VFP session in the codepage you need for a user, instead of overriding it with FontCharSet.

If you want a report to switch dynamically, the only way is to not set FontCharSet, or you need to hack into the FRX at runtime every time.

You would be redoing what the INTL Toolkit has already done. A report with predefined label captions will not change to other languages fully anyway.



The simplest question perhaps is: do you really need one instance of your EXE running and using two languages needing two codepages? Otherwise, split your data up into the different codepages, start the app in the codepage necessary, and you don't need to fiddle with FontCharSets; you only need to transition between UTF-8 and the current codepage.

If you want to support the locale of Windows users, the simplest way is to only react to CPCURRENT(). Some codepages are equivalent to one very specific language, for example Greek, while in some codepages like 1252 you could support multiple languages. But if your app should work in one specific language anyway, there would be no need to make use of that special feature.

The only transition of worlds you have is between desktop and web.

That would mean dedicated DBFs and reports for each language. And that would mean no master report you can switch to another language with just a langID switch; but for the purpose of developing a single report only, you could create the different language reports in a build step instead of maintaining all of them.

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: Problem with Eastern European languages in reports

Mibosoft,

Unfortunately, as far as I know, there is no easy way to support different ANSI code pages simultaneously in the same report. The built-in dynamics features lack character set support. You may try to multiply the text controls in your report by the supported code pages and set appropriate Print When conditions based on the value of the language field of your table or cursor.

RE: Problem with Eastern European languages in reports

(OP)
I think I will manage now by using a call to HTMLNumEntityToANSI() for (selected) strings in reports. The rest of the application is web-based, and there it works fine with strings with HTML entities.

Thank you both Olaf and atlopes for great help!

RE: Problem with Eastern European languages in reports

Just a last note: I tried what the effect of config.fpw CODEPAGE=n is. You get CPCURRENT() being the codepage defined in the config, e.g. one for Eastern European languages, but _SCREEN.FontCharSet, for example, is still 1. The latter, I think, will only change when the OS also is configured for an Eastern European language.

So overall, this just seems like something you can't test without really switching the OS too, or actively setting the FontCharSet property of things. If you develop on a Czech Windows, I guess _SCREEN.FontCharSet and the value for controls will default to 238.

I can, of course, use atlopes' approach, or INTL, or Western Windows codepages to display many languages, but these could also simply be defaults that you only really need to set when you want to support multiple languages.

The only hurdle is: once you design your report in a Western European Windows version, it will stay with FontCharSet=1; this doesn't change the way theme-related colors change with the OS settings.

The HTML entities you still have and now want to handle with the conversion function, well, they do come from the web browser; when exactly they come in, and whether that already happens on the sending side in the browser or in transport, I don't know. We don't know your web code; it would need changes to create a UTF-8 page in the first place and also to let the HTML form send back UTF-8. You showed us the console screenshot where document.characterSet is "windows-1252". Nothing changes as long as that stays 1252.

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: Problem with Eastern European languages in reports

I played a bit with foxisapi and it boils down to:

1. When you manage to get document.characterSet to be "UTF-8", form data is actually not converted to HTML entities; you get UTF-8.

2. When you set no enctype in your form tag, it defaults to urlencoded, which means the charset used is, more or less, ASCII only. Any characters that fall into the category of special characters come over in the URL-encoded form of a percent sign followed by hexadecimal. For example, the Cyrillic Д comes back as %D0%94. That's quite similar to HTML entities, just a bit shorter because of being hexadecimal, and still a conversion you need to invert.
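Inverting that URL encoding in VFP could look like this sketch (function name is mine, untested; the decoded result is still UTF-8 bytes, so a STRCONV(..., 11) would follow):

CODE --> VFP

* Sketch: decode %XX sequences (and + for space) back to raw bytes
FUNCTION UrlDecode (tcText AS String)

	LOCAL lcOut AS String, lnI AS Integer, lcCh AS String

	lcOut = ""
	lnI = 1
	DO WHILE m.lnI <= LEN(m.tcText)
		lcCh = SUBSTR(m.tcText, m.lnI, 1)
		IF m.lcCh == "%" AND m.lnI + 2 <= LEN(m.tcText)
			* turn the two hex digits into one byte, e.g. "%D0" -> CHR(208)
			lcOut = m.lcOut + CHR(EVALUATE("0x" + SUBSTR(m.tcText, m.lnI + 1, 2)))
			lnI = m.lnI + 3
		ELSE
			lcOut = m.lcOut + IIF(m.lcCh == "+", " ", m.lcCh)
			lnI = m.lnI + 1
		ENDIF
	ENDDO

	RETURN m.lcOut

ENDFUNC

So UrlDecode("%D0%94") would give back the two UTF-8 bytes of Д, ready for STRCONV().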

3. If you set enctype="multipart/form-data" you get, well, multipart form data. That looks more complicated than the usual POST body but is actually easy to parse, as each form parameter comes in its own section and on separate lines, i.e. you can use ALINES() in VFP, for example. And the actual value is unchanged, which means UTF-8.

And then all you need to do is STRCONV() those parsed-out parameter values to the current, or perhaps better the desired, codepage (UTF-8 to DBCS).
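For example, pulling one value out of a multipart body could be sketched like this (the field name "teamname", the variable m.lcBody, and the exact CRLF layout are assumptions about a typical multipart part):

CODE --> VFP

* Sketch: each multipart section looks roughly like
*   Content-Disposition: form-data; name="teamname"<CRLF><CRLF>value<CRLF>--boundary
LOCAL lcCRLF, lcValue
lcCRLF = CHR(13) + CHR(10)
lcValue = STREXTRACT(m.lcBody, 'name="teamname"' + m.lcCRLF + m.lcCRLF, m.lcCRLF + "--")
? STRCONV(m.lcValue, 11)    && the value arrives as UTF-8; convert to the current codepage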

In the end, what bytes you get depends not only on the document charset, which forces some of the inputs to be changed to HTML entities, but also on the URL encoding HTML forms do by default. Without any conversion effort, the urlencoding alone limits text to English.

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: Problem with Eastern European languages in reports

Finally, I changed the advanced region option to use Czech for non-Unicode software to see whether that changes FontCharSet as I'd assumed.

It doesn't do what I expected it to do.

When I start VFP9 normally (without a config.fpw), CPCURRENT() is still 1252. I guess that's set deeper in the system or the VFP installation. Also, _SCREEN.FontCharSet initializes as 1.
And changing the codepage in config.fpw also only changes CPCURRENT(), not the FontCharSet. I may expect too much of this.

But there is a very fortunate effect that may come with this option: VFP has now become UTF-8 capable for me!!


OK, that's German, but the important part is that there now is a Win10 beta feature that seems to promise UTF-8 support for ANSI applications (like VFP). You get there in the Region settings:


UTF-8 doesn't come in via the clipboard from Unicode-capable applications, but once you have it in a DBF you can copy it within VFP from a browse to source code and print it, too:


To get there, I pasted 漢字編碼方法, Кириллица, and čeština into a Notepad++ editor and saved that as a file, which I loaded into a char field via FILETOSTR() without any further STRCONV().

So that means you may be able to work fully in UTF-8, but only on Win10 with that beta feature.

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name
