Usefull Functions & Procedures

Validating a string is UTF-8 encoded by Olaf Doschke
Posted: 7 Feb 17 (Edited 7 Feb 17)

I just made a function expecting UTF-8 input. It's easy to guarantee UTF-8 with utf8string=STRCONV(ansistring,9), but if you don't know whether the initial string is Ansi, you can't simply convert this way. You might double convert a string that already is UTF-8 into something that is also UTF-8, but no longer encodes what actually should be encoded.
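
To make the double conversion problem concrete, here is a minimal sketch. It assumes the current codepage is 1252, and the variable names are only for illustration. STRCONV(...,15) is used merely to show the bytes as hex.

CODE --> VFP

LOCAL lcAnsi, lcOnce, lcTwice
lcAnsi  = CHR(0xE4)               && "ä" as a single Ansi byte (codepage 1252 assumed)
lcOnce  = STRCONV(m.lcAnsi, 9)    && Ansi to UTF-8, the intended conversion
lcTwice = STRCONV(m.lcOnce, 9)    && converting again treats the UTF-8 bytes as Ansi characters

? STRCONV(m.lcAnsi, 15)           && hex of the Ansi string
? STRCONV(m.lcOnce, 15)           && hex of the UTF-8 string
? STRCONV(m.lcTwice, 15)          && hex after the double conversion

With codepage 1252 the second output should be C3A4, while the third is a longer byte sequence that is still well-formed UTF-8 but decodes to "Ã¤" rather than "ä". That is why the conversion has to be guarded by a check of the input.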

One nice feature of UTF-8 is that its lowest 128 characters are identical to ASCII, and the typical Ansi codepage 1252 shares them as well. That covers the digits, most punctuation characters and the unaccented letters of normal texts in most Latin-script languages.

The function below is not the final truth about a string's encoding: as just said, it will also see any ASCII string using only bytes below 0x80 as UTF-8, which of course is only half the truth. It will prevent you from double UTF-8 converting a string, which is mainly why I wrote it. The idea is based on the German Wikipedia article on UTF-8 encoding on the one hand, and on the wish to write something faster than what Universal Thread offers as Download ID 40215 on the other. The way Jose Enrique Llopis implements his validation function makes use of expensive bit shift operations to check the codepoint validity of multi-byte characters. The Wikipedia article handles that by simply checking some corner values of the first and second byte, which are much easier and faster checks.
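
Why a corner value of the second byte is enough can be seen by decoding two neighbouring 3 byte sequences. The helper Utf8Codepoint3 below is hypothetical and only for illustration. It uses bit shifts on purpose to show the codepoint, which is exactly the work the validation can avoid.

CODE --> VFP

? Utf8Codepoint3(''+0hE09FBF)  && 2047 = 0x7FF, fits into 11 bits, so 2 bytes would do: overlong
? Utf8Codepoint3(''+0hE0A080)  && 2048 = 0x800, the first codepoint really needing 3 bytes

* Hypothetical helper, for illustration only: codepoint of a 3 byte sequence.
* Layout: 1110xxxx 10xxxxxx 10xxxxxx, that is 4 + 6 + 6 = 16 codepoint bits.
Function Utf8Codepoint3()
   LParameters tcBytes
   RETURN BITLSHIFT(BITAND(ASC(SUBSTR(m.tcBytes, 1, 1)), 0x0F), 12) ;
        + BITLSHIFT(BITAND(ASC(SUBSTR(m.tcBytes, 2, 1)), 0x3F), 6) ;
        + BITAND(ASC(SUBSTR(m.tcBytes, 3, 1)), 0x3F)
EndFunc

So with first byte 0xE0, any second byte below 0xA0 means an overlong encoding, and comparing that one byte against 0xA0 replaces the whole decoding.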

CODE --> VFP

Function  ValidateUtf8()
   LParameters tcString
   * This function is independent of current codepage and handles tcString as a binary string value.
   * So we act on tcString with single byte functions LEN, SUBSTR etc. only!
   * We want to examine the single byte values, not let VFP detect some other codepage's
   * double byte characters or such. Also refer to 
   * ANSI, DBCS, and Unicode in VB: https://msdn.microsoft.com/en-us/library/aa261360(v=vs.60).aspx

   LOCAL lnStringBytes, lnCounter, lnFirstByte, lnSecondByte, lnAdditionalBytes, lnSecondary
   lnStringBytes = LEN(m.tcString)
   FOR lnCounter = 1 TO m.lnStringBytes
       lnFirstByte = ASC(SUBSTR(m.tcString, m.lnCounter, 1))
       IF m.lnFirstByte<0x80 && that's ok and identical to ASCII
          LOOP
       ENDIF
       IF m.lnFirstByte>0xF4 
          * >0xF4 would either mean 4 bytes with a disallowed codepoint > 0x10FFFF
          * or longer characters, in any case not allowed.
          RETURN -1 && too high first byte (todo: more exact distinction -2 for codepoint > 0x10FFFF)
       Endif

       * codepoint validity
       IF InList(m.lnFirstByte, 0xC0, 0xC1)
          * C0 or C1 would mean 2 byte char with 11 codepoint bits.
          * First byte has 5 codepoint bits (110 +5bits).
          * Second byte has 6 codepoint bits (10 +6bits).
          * With first byte C0 (110 0000 0) or C1 (110 0000 1) 
          * the upper 4 of 11 codepoint bits would be 0, leaving 7 codepoint bits.
          * That's codepoint values 0x00-0x7f and should be encoded with 1 byte!
          RETURN -2 && codepoint error
       Endif
       DO Case
          CASE BITAND(m.lnFirstByte, 0xE0)=0xC0 && starts with 110, 2 Bytes Character
             IF m.lnCounter = m.lnStringBytes
                RETURN -3 && exceeding length
             ENDIF
             * First byte not C0 or C1 already checked,
             * normal check of 2nd byte 
             lnCounter = m.lnCounter + 1
             lnSecondByte = ASC(SUBSTR(m.tcString, m.lnCounter, 1))
             IF BITAND(m.lnSecondByte,0xC0)#0x80
                RETURN -4 && wrong secondary byte value
             Endif
          CASE BITAND(m.lnFirstByte, 0xF0)=0xE0 && starts with 1110, 3 Bytes Character
             IF m.lnCounter+2 > m.lnStringBytes
                RETURN -3 && exceeding length
             ENDIF
             * check of 2nd byte
             lnCounter = m.lnCounter + 1
             lnSecondByte = ASC(SUBSTR(m.tcString, m.lnCounter, 1))
             IF BITAND(m.lnSecondByte,0xC0)#0x80
                RETURN -4 && wrong secondary byte value
             Endif
             * codepoint validity
             IF m.lnFirstByte = 0xE0 AND NOT BETWEEN(m.lnSecondByte,0xA0,0xBF)
                * with first byte 0xE0 second byte must be 0xA0 to 0xBF 
                * for similar reasons as stated above for 2 byte characters.
                RETURN -2 && codepoint error
             ENDIF
             IF m.lnFirstByte = 0xED AND NOT BETWEEN(m.lnSecondByte,0x80,0x9F)
                * with first byte 0xED second byte must be 0x80 to 0x9F,
                * otherwise the codepoint would be a UTF-16 surrogate (U+D800 to U+DFFF).
                RETURN -2 && codepoint error
             ENDIF
             * check of 3rd byte
             m.lnCounter = m.lnCounter + 1
             IF BITAND(ASC(SUBSTR(m.tcString,m.lnCounter,1)),0xC0)#0x80
                RETURN -4 && wrong secondary byte value
             Endif
          CASE BITAND(m.lnFirstByte, 0xF8)=0xF0 && starts with 11110, 4 Bytes Character
             IF m.lnCounter+3 > m.lnStringBytes
                RETURN -3 && exceeding length
             Endif        
             * check of 2nd byte
             lnCounter = m.lnCounter + 1
             lnSecondByte = ASC(SUBSTR(m.tcString, m.lnCounter, 1))
             IF BITAND(m.lnSecondByte,0xC0)#0x80
                RETURN -4 && wrong secondary byte value
             Endif               
             * codepoint validity
             IF m.lnFirstByte = 0xF0 AND NOT BETWEEN(m.lnSecondByte,0x90,0xBF)
                * with first byte 0xF0 second byte must be 0x90 to 0xBF 
                * for similar reasons as stated above for 2-3 byte characters.
                RETURN -2 && codepoint error
             ENDIF
             IF m.lnFirstByte = 0xF4 AND NOT BETWEEN(m.lnSecondByte,0x80,0x8F)
                * with first byte 0xF4 second byte must be 0x80 to 0x8F,
                * otherwise the codepoint would exceed the maximum of 0x10FFFF.
                RETURN -2 && codepoint error
             ENDIF
             * check of byte 3 and 4
             FOR lnSecondary = 1 TO 2
                IF BITAND(ASC(SUBSTR(m.tcString,m.lnCounter+m.lnSecondary,1)),0xC0)#0x80
                   RETURN -4 && wrong secondary byte value
                Endif
             ENDFOR 
             m.lnCounter = m.lnCounter + 2
          OTHERWISE
             * First byte is 0x80-0xBF, a secondary byte where a start byte is expected.
             * (First bytes >0xF4 are already rejected above.)
             RETURN -1 && wrong first byte
       ENDCASE
   ENDFOR 
   
   * Checks for exceeding m.lnStringBytes are already done inside the loop.
   * But to make sure: after a VFP for loop (with default step 1) 
   * the loop variable has to be exactly upper value+1, neither lower nor higher.
   RETURN IIF(m.lnCounter = m.lnStringBytes+1,0,-3) && 0 = ok, -3 = exceeding length
EndFunc 

Use this to test (expected results in brackets):

CODE --> VFP

* check coverage of all branches by logging with COVERAGE 
* and examining the log with coverage profiler (Tools menu)
* SET COVERAGE TO (ADDBS(GETENV("TEMP"))+"coverage.log")
Clear

* test cases, covering all code execution branches (trying to)
* with both valid and invalid UTF8 strings.
lcString = ''+0hF4808080
?    '4 byte char corner case1 (0)', ValidateUtf8( lcString )

lcString = ''+0hF48FBFBF
?    '4 byte char corner case2 (0)', ValidateUtf8( lcString )

lcString = ''+0hF480808080
?    'first byte 80 after allowed 4 byte char (-1)', ValidateUtf8( lcString )

lcString = ''+0hF8808080
?    'too long char1 (-1)', ValidateUtf8( lcString )

lcString = ''+0hFC808080
?    'too long char2 (-1)', ValidateUtf8( lcString )

lcString = ''+0hC080
?    'first byte c0 (-2)', ValidateUtf8( lcString )

lcString = ''+0hC180
?    'first byte c1 (-2)', ValidateUtf8( lcString )

lcString = ''+0hC2
?    'two byte char with only 1 byte (-3)', ValidateUtf8( lcString )

lcString = ''+0hE07f80
?    'three byte char with bad 2nd byte (-4)', ValidateUtf8( lcString )

lcString = ''+0hE080
?    'three byte char with only 2 byte (-3)', ValidateUtf8( lcString )

lcString = ''+0hf080
?    'four byte char with only 2 byte (-3)', ValidateUtf8( lcString )

lcString = ''+0hf08080
?    'four byte char with only 3 byte (-3)', ValidateUtf8( lcString )

lcString = ''+0hf080807f
?    'four byte char (f0) with invalid codepoint and bad 4th byte (-2)', ValidateUtf8( lcString )

lcString = ''+0he09f80
?    'three byte char with invalid codepoint (-2)', ValidateUtf8( lcString )

lcString = ''+0he0bf80
?    'three byte char with valid codepoint (0)', ValidateUtf8( lcString )

lcString = ''+0heda080
?    'three byte char with invalid codepoint (-2)', ValidateUtf8( lcString )

lcString = ''+0hed9f80
?    'three byte char with valid codepoint (0)', ValidateUtf8( lcString )

lcString = ''+0hf0908080
?    'four byte char (f0) with valid codepoint (0)', ValidateUtf8( lcString )

lcString = ''+0hf0bf8080
?    'four byte char (f0) with valid codepoint (0)', ValidateUtf8( lcString )

lcString = ''+0hf08f8080
?    'four byte char (f0) with invalid codepoint (-2)', ValidateUtf8( lcString )

lcString = ''+0hf4808080
?    'four byte char (f4) with valid codepoint (0)', ValidateUtf8( lcString )

lcString = ''+0hf48f8080
?    'four byte char (f4) with valid codepoint (0)', ValidateUtf8( lcString )

lcString = ''+0hf4908080
?    'four byte char (f4) with invalid codepoint (-2)', ValidateUtf8( lcString )

lcString = ''+0hf4bf8080
?    'four byte char (f4) with invalid codepoint (-2)', ValidateUtf8( lcString )

lcString = ''+0he1bfc0
?    'three byte char with invalid 3rd byte (-4)', ValidateUtf8( lcString )

lcString = ''+0hf3c08080
?    'four byte char with invalid 2nd byte (-4)', ValidateUtf8( lcString )

lcString = ''+0hf380c080
?    'four byte char with invalid 3rd byte (-4)', ValidateUtf8( lcString )

lcString = ''+0hf38080c0
?    'four byte char with invalid 4th byte (-4)', ValidateUtf8( lcString )
*SET COVERAGE TO 
* now check the log via coverage profiler. 

Final usage for converting a string to 'definite' UTF-8. 'Definite' is meant in a loose sense here: not every initial string value can be turned into readable UTF-8, so don't expect too much.

CODE --> VFP

LOCAL lcSomestring, lcUTF8String 
lcSomestring = "..." &&any string value read from anywhere 

lcUTF8String = IIf(ValidateUtf8(lcSomestring )=0,lcSomestring ,StrConv(lcSomestring ,9)) 

What this does NOT do: take Unicode input and output the UTF-8 representation of the same string (visually). If you input Unicode, the final string will either be the same (if the Unicode representation has no byte combinations that are invalid when interpreted as UTF-8), or will be somehow converted to "UTF-8". The final string then doesn't have any illegal byte combinations for UTF-8, but it surely will not be readable the way it would be after converting to Ansi via STRCONV(unicode,6), for example. The usability is limited, but it will reliably prevent double UTF-8 conversion, for example turning the German umlauts and sz ligature "äöüß" into "Ã¤Ã¶Ã¼ÃŸ" instead of the correct "äöüß".
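
The guard can also be wrapped into a small function. This is only a sketch: the name EnsureUtf8 is hypothetical, ValidateUtf8() from above has to be available (same PRG or SET PROCEDURE), and codepage 1252 is assumed for the sample bytes.

CODE --> VFP

LOCAL lcAnsi, lcOnce, lcTwice
lcAnsi  = CHR(0xE4) + CHR(0xF6) + CHR(0xFC) + CHR(0xDF)  && "äöüß" in codepage 1252 (assumed)
lcOnce  = EnsureUtf8(m.lcAnsi)    && not valid UTF-8, so it gets converted
lcTwice = EnsureUtf8(m.lcOnce)    && already valid UTF-8, so it is passed through unchanged
? m.lcOnce == m.lcTwice           && .T., no double conversion

* Hypothetical wrapper around ValidateUtf8()
Function EnsureUtf8()
   LParameters tcString
   RETURN IIF(ValidateUtf8(m.tcString) = 0, m.tcString, STRCONV(m.tcString, 9))
EndFunc

Applying the wrapper any number of times gives the same result as applying it once, which is the point of validating first.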

Error numbers returned from the function (as explained in the source code):
-1: invalid first byte (too high, above 0xF4, or a secondary byte 0x80-0xBF where a start byte is expected)
-2: codepoint invalidity (invalidity of the net character code besides UTF-8 "mark" bits)
-3: exceeding length (the expected character length exceeds the string length; this can only happen at the end of the full string, a character cut short in the middle shows up as -4 instead)
-4: wrong secondary byte value (all secondary bytes after the start byte of a UTF-8 char must begin with bits 10, range 0x80-0xBF = 10000000-10111111)
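
For logging it may help to turn the return value into text. The helper Utf8ErrorText below is hypothetical and just maps the values listed above. ValidateUtf8() has to be available.

CODE --> VFP

? Utf8ErrorText(ValidateUtf8(''+0hC080))   && prints: codepoint error

* Hypothetical helper: readable text for the return values of ValidateUtf8()
Function Utf8ErrorText()
   LParameters tnResult
   DO CASE
      CASE m.tnResult = 0
         RETURN "valid UTF-8"
      CASE m.tnResult = -1
         RETURN "invalid first byte"
      CASE m.tnResult = -2
         RETURN "codepoint error"
      CASE m.tnResult = -3
         RETURN "exceeding length"
      CASE m.tnResult = -4
         RETURN "wrong secondary byte value"
      OTHERWISE
         RETURN "unknown result"
   ENDCASE
EndFunc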

Bye, Olaf.

PS: Come back later. After extensive performance tests the function might be improved to not even use BITAND operations.

Notice: Text files saved as UTF-8 by Notepad begin with a Byte Order Mark (BOM, the bytes EF BB BF). These three bytes are themselves a well-formed UTF-8 sequence (the codepoint U+FEFF), so this function does not reject them, but the BOM is not part of the net UTF-8 content, which only begins after it. So the BOM should be removed before the string is passed in here or processed further. The same applies to other editors using this BOM mechanism. By the way: a BOM that was not removed is often also a root problem of XML validation.
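
Removing a leading BOM before validation can be sketched like this. The file name is hypothetical, and ValidateUtf8() has to be available.

CODE --> VFP

LOCAL lcContent, lcBOM
lcBOM     = CHR(0xEF) + CHR(0xBB) + CHR(0xBF)   && UTF-8 byte order mark
lcContent = FILETOSTR("sample.txt")             && hypothetical file name
IF LEFT(m.lcContent, 3) == m.lcBOM
   lcContent = SUBSTR(m.lcContent, 4)           && keep only the content after the 3 BOM bytes
ENDIF
? ValidateUtf8(m.lcContent)                     && 0 if the remaining content is valid UTF-8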
