Validating a string is UTF-8 encoded

by Olaf Doschke
I just wrote a function that expects UTF-8 input. It is easy to guarantee UTF-8 with utf8string=STRCONV(ansistring,9), but if you don't know whether the initial string is Ansi, you can't simply convert this way. You might convert a string that already is UTF-8 a second time, and the result is still formally UTF-8 but no longer encodes the characters it should.
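
To make the double conversion concrete, here is a small sketch. It assumes the current code page is 1252 and builds the Ansi string from CHR() values, so it does not depend on how the source file is saved. The variable names are only for illustration.

[code VFP]LOCAL lcAnsi, lcUtf8, lcDouble
* "äöüß" in code page 1252 (assumption: 1252 is the current code page)
lcAnsi   = CHR(228) + CHR(246) + CHR(252) + CHR(223)
lcUtf8   = STRCONV(m.lcAnsi, 9)   && Ansi -> UTF-8, 2 bytes per character here
lcDouble = STRCONV(m.lcUtf8, 9)   && treats every UTF-8 byte as an Ansi character again

? LEN(m.lcAnsi), LEN(m.lcUtf8), LEN(m.lcDouble)  && 4 8 16
[/code]

The doubled length is the visible symptom: every byte of the first conversion was encoded once more as a character of its own.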

One nice feature of UTF-8 is that its lowest 128 characters are identical to ASCII. The typical Ansi codepage 1252 shares that range too, and it covers the letters, digits and most punctuation of normal texts in most Latin-script languages.

The function below is not the final truth about a string's encoding: as just said, it also sees any ASCII string without bytes of 0x80 and above as UTF-8, which of course is only half the truth. It will keep you from converting a string to UTF-8 twice, which is mainly why I wrote it. The idea is based on the German Wikipedia article on UTF-8 encoding on the one side, and on the wish to write something faster than what Universal Thread offers as Download ID 40215 on the other side. The way Jose Enrique Llopis implements his validation function makes use of expensive bit shift operations to check the codepoint validity of multi byte characters. The Wikipedia article handles that by simply checking some corner values of the first and second byte, which are much easier and faster checks.
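
The corner values follow from the bit layout. As a rough illustration (not part of the validation function), the snippet below assembles the codepoint of a few sequences with exactly the kind of bit operations the corner checks avoid. The helper name Utf8Codepoint is made up for this sketch, and it assumes its input already is a structurally well-formed 2 to 4 byte sequence in a single byte code page.

[code VFP]? Utf8Codepoint(CHR(0xC1)+CHR(0xBF))            && 127: fits in 1 byte, so C0/C1 never start a valid character
? Utf8Codepoint(CHR(0xC2)+CHR(0x80))            && 128: smallest real 2 byte character
? Utf8Codepoint(CHR(0xE0)+CHR(0x9F)+CHR(0xBF))  && 2047: still fits in 2 bytes, so E0 needs A0-BF next
? Utf8Codepoint(CHR(0xED)+CHR(0xA0)+CHR(0x80))  && 55296 = 0xD800: surrogates start, so ED needs 80-9F next
? Utf8Codepoint(CHR(0xF4)+CHR(0x90)+CHR(0x80)+CHR(0x80))  && 1114112 = 0x110000: above the maximum 0x10FFFF

FUNCTION Utf8Codepoint
LPARAMETERS tcBytes
* Hypothetical helper: no validation, input must be one 2-4 byte sequence.
LOCAL lnResult, lnIndex, lnLen
lnLen = LEN(m.tcBytes)
* Start byte carries 5, 4 or 3 codepoint bits for 2, 3 or 4 byte characters.
lnResult = BITAND(ASC(LEFT(m.tcBytes, 1)), BITRSHIFT(0xFF, m.lnLen + 1))
FOR lnIndex = 2 TO m.lnLen
    * Every secondary byte adds its lower 6 bits.
    lnResult = BITOR(BITLSHIFT(m.lnResult, 6), BITAND(ASC(SUBSTR(m.tcBytes, m.lnIndex, 1)), 0x3F))
ENDFOR
RETURN m.lnResult
ENDFUNC
[/code]

Each printed value sits right at a boundary, which is why comparing the second byte against a fixed range is enough and no codepoint has to be computed during validation.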

[code VFP]Function ValidateUtf8()
LParameters tcString
* This function is independent of the current codepage and handles tcString as a binary string value.
* So we act on tcString with single byte functions LEN, SUBSTR etc. only!
* We want to examine the single byte values, not let VFP detect some other codepage's
* double byte characters or such. Also refer to
* ANSI, DBCS, and Unicode in VB: https://msdn.microsoft.com/en-us/library/aa261360(v=vs.60).aspx

LOCAL lnStringBytes, lnCounter, lnFirstByte, lnSecondByte, lnAdditionalBytes, lnSecondary
lnStringBytes = LEN(m.tcString)
FOR lnCounter = 1 TO m.lnStringBytes
    lnFirstByte = ASC(SUBSTR(m.tcString, m.lnCounter, 1))
    IF m.lnFirstByte<0x80 && that's ok and identical to ASCII
        LOOP
    ENDIF
    IF m.lnFirstByte>0xF4
        * >0xF4 would either mean 4 bytes with a codepoint >= 0x140000,
        * above the Unicode maximum 0x10FFFF, or longer characters, in any case not allowed.
        RETURN -1 && too high first byte (todo: more exact distinction -2 for codepoint>0x10FFFF)
    Endif

    * codepoint validity
    IF InList(m.lnFirstByte, 0xC0, 0xC1)
        * C0 or C1 would mean 2 byte char with 11 codepoint bits.
        * First byte has 5 codepoint bits (110 +5bits).
        * Second byte has 6 codepoint bits (10 +6bits).
        * With first byte C0 (110 0000 0) or C1 (110 0000 1)
        * the upper 4 of 11 codepoint bits would be 0, leaving 7 codepoint bits.
        * That's codepoint values 0x00-0x7f and should be encoded with 1 byte!
        RETURN -2 && codepoint error
    Endif
    DO Case
        CASE BITAND(m.lnFirstByte, 0xE0)=0xC0 && starts with 110, 2 Bytes Character
            IF m.lnCounter = m.lnStringBytes
                RETURN -3 && exceeding length
            ENDIF
            * First byte not C0 or C1 already checked,
            * normal check of 2nd byte
            lnCounter = m.lnCounter + 1
            lnSecondByte = ASC(SUBSTR(m.tcString, m.lnCounter, 1))
            IF BITAND(m.lnSecondByte,0xC0)#0x80
                RETURN -4 && wrong secondary byte value
            Endif
        CASE BITAND(m.lnFirstByte, 0xF0)=0xE0 && starts with 1110, 3 Bytes Character
            IF m.lnCounter+2 > m.lnStringBytes
                RETURN -3 && exceeding length
            ENDIF
            * check of 2nd byte
            lnCounter = m.lnCounter + 1
            lnSecondByte = ASC(SUBSTR(m.tcString, m.lnCounter, 1))
            IF BITAND(m.lnSecondByte,0xC0)#0x80
                RETURN -4 && wrong secondary byte value
            Endif
            * codepoint validity
            IF m.lnFirstByte = 0xE0 AND NOT BETWEEN(m.lnSecondByte,0xA0,0xBF)
                * with first byte 0xE0 second byte must be 0xA0 to 0xBF
                * for similar reasons as stated above for 2 byte characters.
                RETURN -2 && codepoint error
            ENDIF
            IF m.lnFirstByte = 0xED AND NOT BETWEEN(m.lnSecondByte,0x80,0x9F)
                * with first byte 0xED second byte must be 0x80 to 0x9F,
                * otherwise the codepoint is a UTF-16 surrogate (0xD800-0xDFFF).
                RETURN -2 && codepoint error
            ENDIF
            * check of 3rd byte
            m.lnCounter = m.lnCounter + 1
            IF BITAND(ASC(SUBSTR(m.tcString,m.lnCounter,1)),0xC0)#0x80
                RETURN -4 && wrong secondary byte value
            Endif
        CASE BITAND(m.lnFirstByte, 0xF8)=0xF0 && starts with 11110, 4 Bytes Character
            IF m.lnCounter+3 > m.lnStringBytes
                RETURN -3 && exceeding length
            Endif
            * check of 2nd byte
            lnCounter = m.lnCounter + 1
            lnSecondByte = ASC(SUBSTR(m.tcString, m.lnCounter, 1))
            IF BITAND(m.lnSecondByte,0xC0)#0x80
                RETURN -4 && wrong secondary byte value
            Endif
            * codepoint validity
            IF m.lnFirstByte = 0xF0 AND NOT BETWEEN(m.lnSecondByte,0x90,0xBF)
                * with first byte 0xF0 second byte must be 0x90 to 0xBF
                * for similar reasons as stated above for 2-3 byte characters.
                RETURN -2 && codepoint error
            ENDIF
            IF m.lnFirstByte = 0xF4 AND NOT BETWEEN(m.lnSecondByte,0x80,0x8F)
                * with first byte 0xF4 second byte must be 0x80 to 0x8F,
                * otherwise the codepoint exceeds the maximum 0x10FFFF.
                RETURN -2 && codepoint error
            ENDIF
            * check of byte 3 and 4
            FOR lnSecondary = 1 TO 2
                IF BITAND(ASC(SUBSTR(m.tcString,m.lnCounter+m.lnSecondary,1)),0xC0)#0x80
                    RETURN -4 && wrong secondary byte value
                Endif
            ENDFOR
            m.lnCounter = m.lnCounter + 2
        OTHERWISE
            * First byte >0xF4 is already checked above, so this branch is reached
            * by a first byte 0x80-0xBF: a secondary byte where a start byte is expected.
            RETURN -1 && wrong first byte
    ENDCASE
ENDFOR

* Checks for exceeding m.lnStringBytes are already done above.
* But to make sure, after a VFP for loop (with default step 1)
* the loop variable has to be exactly upper value+1, neither lower nor higher.
RETURN IIF(m.lnCounter = m.lnStringBytes+1,0,-3) && 0 = ok, -3 = exceeding length
EndFunc
[/code]

Use this to test (expected results in brackets):
[code VFP]* check coverage of all branches by logging with COVERAGE
* and examining the log with coverage profiler (Tools menu)
* SET COVERAGE TO (ADDBS(GETENV("TEMP"))+"coverage.log")
Clear

* test cases, covering all code execution branches (trying to)
* with both valid and invalid UTF8 strings.
lcString = ''+0hF4808080
? '4 byte char corner case1 (0)', ValidateUtf8( lcString )

lcString = ''+0hF48FBFBF
? '4 byte char corner case2 (0)', ValidateUtf8( lcString )

lcString = ''+0hF480808080
? 'first byte 80 after allowed 4 byte char (-1)', ValidateUtf8( lcString )

lcString = ''+0hF8808080
? 'too long char1 (-1)', ValidateUtf8( lcString )

lcString = ''+0hFC808080
? 'too long char2 (-1)', ValidateUtf8( lcString )

lcString = ''+0hC080
? 'first byte c0 (-2)', ValidateUtf8( lcString )

lcString = ''+0hC180
? 'first byte c1 (-2)', ValidateUtf8( lcString )

lcString = ''+0hC2
? 'two byte char with only 1 byte (-3)', ValidateUtf8( lcString )

lcString = ''+0hE07f80
? 'three byte char with bad 2nd byte (-4)', ValidateUtf8( lcString )

lcString = ''+0hE080
? 'three byte char with only 2 byte (-3)', ValidateUtf8( lcString )

lcString = ''+0hf080
? 'four byte char with only 2 byte (-3)', ValidateUtf8( lcString )

lcString = ''+0hf08080
? 'four byte char with only 3 byte (-3)', ValidateUtf8( lcString )

lcString = ''+0hf080807f
? 'four byte char (f0) with invalid codepoint and bad 4th byte (-2)', ValidateUtf8( lcString )

lcString = ''+0he09f80
? 'three byte char with invalid codepoint (-2)', ValidateUtf8( lcString )

lcString = ''+0he0bf80
? 'three byte char with valid codepoint (0)', ValidateUtf8( lcString )

lcString = ''+0heda080
? 'three byte char with invalid codepoint (-2)', ValidateUtf8( lcString )

lcString = ''+0hed9f80
? 'three byte char with valid codepoint (0)', ValidateUtf8( lcString )

lcString = ''+0hf0908080
? 'four byte char (f0) with valid codepoint (0)', ValidateUtf8( lcString )

lcString = ''+0hf0bf8080
? 'four byte char (f0) with valid codepoint (0)', ValidateUtf8( lcString )

lcString = ''+0hf08f8080
? 'four byte char (f0) with invalid codepoint (-2)', ValidateUtf8( lcString )

lcString = ''+0hf4808080
? 'four byte char (f4) with valid codepoint (0)', ValidateUtf8( lcString )

lcString = ''+0hf48f8080
? 'four byte char (f4) with valid codepoint (0)', ValidateUtf8( lcString )

lcString = ''+0hf4908080
? 'four byte char (f4) with invalid codepoint (-2)', ValidateUtf8( lcString )

lcString = ''+0hf4bf8080
? 'four byte char (f4) with invalid codepoint (-2)', ValidateUtf8( lcString )

lcString = ''+0he1bfc0
? 'three byte char with invalid 3rd byte (-4)', ValidateUtf8( lcString )

lcString = ''+0hf3c08080
? 'four byte char with invalid 2nd byte (-4)', ValidateUtf8( lcString )

lcString = ''+0hf380c080
? 'four byte char with invalid 3rd byte (-4)', ValidateUtf8( lcString )

lcString = ''+0hf38080c0
? 'four byte char with invalid 4th byte (-4)', ValidateUtf8( lcString )
*SET COVERAGE TO
* now check the log via the coverage profiler.[/code]

Final usage for converting a string to 'definite' UTF-8 ('definite' only in a loose sense: not every initial string value can be turned into readable UTF-8, so don't expect too much):

[code VFP]LOCAL lcSomestring, lcUTF8String
lcSomestring = "..." &&any string value read from anywhere

lcUTF8String = IIF(ValidateUtf8(lcSomestring)=0, lcSomestring, STRCONV(lcSomestring,9))[/code]

What this does NOT do: take Unicode input (in STRCONV terms, that is UTF-16) and output the UTF-8 representation of the same visible text. If you pass in Unicode, the final string will either be the same (if the bytes of the Unicode representation contain no combination that is invalid when interpreted as UTF-8), or it will be run through the Ansi to UTF-8 conversion. The final string then has no byte combinations that are illegal in UTF-8, but it will not be readable the way it would be after converting to Ansi via STRCONV(unicode,6), for example. The usability is limited, but it will prevent double UTF-8 conversion. For example, the German umlauts and sz ligature "äöüß" look like "Ã¤Ã¶Ã¼ÃŸ" when their UTF-8 bytes are viewed as Ansi. A second conversion would turn them into "ÃƒÂ¤ÃƒÂ¶ÃƒÂ¼ÃƒÅ¸" instead of leaving the correct "äöüß".
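
A quick way to see both outcomes for Unicode input, as a sketch only: it assumes ValidateUtf8 from above is in scope (for example via SET PROCEDURE) and that the current code page is 1252.

[code VFP]* STRCONV(..., 5) produces the "Unicode" (UTF-16) bytes of an Ansi string
? ValidateUtf8(STRCONV("ab", 5))      && 0: bytes 61 00 62 00 are all below 0x80, so the string passes unchanged
? ValidateUtf8(STRCONV(CHR(228), 5))  && -3: bytes E4 00, E4 announces 3 bytes but only 2 exist
[/code]

In the first case the Unicode string is kept as it is, in the second it would go through STRCONV(...,9). Neither result is the readable UTF-8 text, which is the limitation described above.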

Error numbers returned from the function (as explained in the source code):
-1: wrong first byte (too high, that is above 0xF4, or a secondary byte 0x80-0xBF where a start byte is expected)
-2: codepoint invalidity (invalidity of the net character code besides the UTF-8 "mark" bits)
-3: exceeding length (the start byte announces more bytes than the string has left, so this can only happen at the end of the full string; a character cut short in the middle of a string shows up as -4 instead, because the start byte of the next character is then read as a secondary byte)
-4: wrong secondary byte value (all secondary bytes after the start byte of a UTF-8 char must begin with the bits [highlight #FCE94F]10[/highlight], range 0x80-0xBF = [highlight #FCE94F]10[/highlight]000000-[highlight #FCE94F]10[/highlight]111111)
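
If you want to log or display the result, a small lookup along these lines might help. This is only a sketch: the function name Utf8ErrorText is made up here, and it assumes ValidateUtf8 from above is in scope.

[code VFP]? Utf8ErrorText(ValidateUtf8(CHR(0xC2)))  && exceeding length (2 byte start byte at the end of the string)

FUNCTION Utf8ErrorText
LPARAMETERS tnResult
* Hypothetical helper translating the return values listed above into text.
DO CASE
    CASE m.tnResult = 0
        RETURN "ok"
    CASE m.tnResult = -1
        RETURN "wrong first byte"
    CASE m.tnResult = -2
        RETURN "codepoint error"
    CASE m.tnResult = -3
        RETURN "exceeding length"
    CASE m.tnResult = -4
        RETURN "wrong secondary byte value"
    OTHERWISE
        RETURN "unknown result " + TRANSFORM(m.tnResult)
ENDCASE
ENDFUNC
[/code]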

Bye, Olaf.

PS: Come back later. After extensive performance tests the function might be improved to not even use BITAND operations.

Notice: Text files saved as UTF-8 by Notepad start with a Byte Order Mark (BOM, the bytes EF BB BF). The BOM is itself a well-formed three byte sequence, so this function does not reject it, but it is not part of the text. The net UTF-8 content only begins after the BOM, so remove it before processing the content any further. The same applies to other editors using this BOM mechanism. By the way: a BOM that was not removed is often also the root problem in XML validation.
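
Removing a leading BOM before passing the content on could look roughly like this. The file name is a placeholder, a single byte code page such as 1252 is assumed, and ValidateUtf8 from above has to be in scope.

[code VFP]LOCAL lcContent
lcContent = FILETOSTR("sample.txt")  && hypothetical file saved as UTF-8 by Notepad
* EF BB BF is the UTF-8 byte order mark
IF LEFT(m.lcContent, 3) == CHR(0xEF) + CHR(0xBB) + CHR(0xBF)
    lcContent = SUBSTR(m.lcContent, 4)  && keep only the net UTF-8 content
ENDIF
? ValidateUtf8(m.lcContent)
[/code]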