Class UTF8 Readme
Discussion:
UTF-8 is a widely accepted character encoding scheme. Its genius lies in two
special characteristics. First, it encompasses ASCII (7-bit) encoding without
any changes, making it backward-compatible with the overwhelming majority of
western data sets, both modern and ancient. Second, it is self-describing:
the lead byte of each character says how many bytes the character occupies,
so no special programming is required to use it. UTF-8 is also expansive
enough to encode every code point in Unicode, and with it the characters of
essentially every written human language.
UTF-8 characters may "collide" with extended-ASCII (often loosely called ANSI)
because extended-ASCII uses one-byte characters above code point 7F. In UTF-8
the high-order bit of a byte is significant: a byte with that bit set is
always part of a multi-byte sequence. UTF-8 therefore uses a different
(multi-byte) encoding for the ANSI characters in the range from 80 to FF
(128 to 255). For example, the copyright symbol, a small letter "c" in a
circle, sits at ANSI code point hexadecimal A9 (169). In UTF-8 the same
symbol is represented by the two-byte sequence C2 A9.
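The arithmetic behind the C2 A9 example can be sketched in a few lines of PHP.
This is a minimal illustration, assuming the input byte is ISO-8859-1 (where
the byte value equals the Unicode code point); latin1_byte_to_utf8() is a
hypothetical helper name, not a method of this class.

```php
<?php
// Sketch: encode one ISO-8859-1 byte as UTF-8.
// latin1_byte_to_utf8() is a hypothetical helper, not part of the UTF8 class.
function latin1_byte_to_utf8(string $byte): string
{
    $cp = ord($byte);       // in ISO-8859-1 the byte value is the code point
    if ($cp < 0x80) {
        return $byte;       // ASCII passes through unchanged
    }
    // Lead byte 110xxxxx carries the top bits of the code point;
    // continuation byte 10xxxxxx carries the low six bits.
    return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
}

$copyright = latin1_byte_to_utf8("\xA9");
echo strtoupper(bin2hex($copyright)), "\n";   // C2A9
```

The two-byte form covers every code point from 80 to 7FF, which is why the
whole 80 to FF range doubles in size when converted.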
The overwhelming majority of UTF-8 errors arise when extended-ASCII characters
are passed to algorithms that expect UTF-8. Many European accented letters
and common symbols are represented in ISO-8859-1 via the one-byte range from
hex 80 to hex FF. As raw single bytes, these characters are not valid in
UTF-8 encoded XML or JSON. They must either be converted to entities or
converted to UTF-8 multi-byte characters.
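A short sketch of the failure and of the entity route, assuming the input is
ISO-8859-1. The helper latin1_to_entities() is hypothetical and exists only
for this illustration; json_encode() and the PCRE /u modifier are standard
PHP.

```php
<?php
// Hypothetical helper: replace each ISO-8859-1 high byte with a numeric entity.
function latin1_to_entities(string $s): string
{
    return preg_replace_callback(
        '/[\x80-\xFF]/',
        static function (array $m): string {
            return '&#' . ord($m[0]) . ';';
        },
        $s
    );
}

$latin1 = "caf\xE9";                         // "café" as ISO-8859-1 bytes

$isUtf8 = preg_match('//u', $latin1) === 1;  // false: /u rejects ill-formed UTF-8
$json   = json_encode($latin1);              // false: json_encode() needs UTF-8

$safe = latin1_to_entities($latin1);
echo $safe, "\n";                            // caf&#233;
echo json_encode($safe), "\n";               // "caf&#233;"
```

Entities keep the data pure ASCII, which is safe everywhere but only
meaningful to consumers that decode entities; converting to UTF-8 is usually
the more portable choice.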
PHP native functions exist to convert between extended-ASCII and UTF-8 (and
other encoding schemes), but these native functions cannot tell which
encoding their input actually uses. They trust whatever the caller says or
assumes. It is our obligation as programmers to know the encoding scheme of
any data we receive, and to produce our data in a well-identified and
predictable encoding scheme. The best and most widely accepted scheme is
UTF-8.
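The following sketch shows what happens when a native conversion is applied
to data that is already UTF-8. It assumes that either the mbstring or the
iconv extension is loaded; latin1_to_utf8_native() is a hypothetical wrapper
written for this illustration.

```php
<?php
// Convert with whichever native extension is available.
// Assumption: mbstring or iconv is loaded in this PHP build.
function latin1_to_utf8_native(string $s): string
{
    return function_exists('mb_convert_encoding')
        ? mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1')
        : iconv('ISO-8859-1', 'UTF-8', $s);
}

$latin1 = "\xE9";                           // "é" as ISO-8859-1
$once   = latin1_to_utf8_native($latin1);   // correct conversion
$twice  = latin1_to_utf8_native($once);     // wrong: input was already UTF-8

echo bin2hex($once), "\n";                  // c3a9
echo bin2hex($twice), "\n";                 // c383c2a9
```

The second call raises no error. It simply re-encodes each byte of the valid
two-byte sequence as a separate character, which is the familiar "Ã©"
garbling.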
Since PHP 5.6 the "default_charset" configuration setting has defaulted to
UTF-8, so UTF-8 is now PHP's default character encoding.
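To see what a particular installation assumes, the setting can be read at
runtime. This is a small sketch; a local php.ini may override the default,
in which case the value can be pinned as shown.

```php
<?php
// Report the charset PHP assumes when none is given explicitly.
$charset = ini_get('default_charset');
echo 'default_charset = ', $charset, "\n";

// Pin it at runtime if the local configuration overrides the default.
if (strcasecmp((string) $charset, 'UTF-8') !== 0) {
    ini_set('default_charset', 'UTF-8');
}
```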
Operation:
This class constructor receives three arguments: (1) a string, (2) a boolean
telling whether to attempt to decode ISO-8859-1 (default FALSE), (3) a
boolean telling whether to remove any Byte-Order Mark (default TRUE). The
constructor returns an object containing the string and a validity indicator.
If the string fails UTF-8 validation, the offsets of the failures may be
provided in an array in the "error" property. The byte length and character
count are also returned. If the "error" property is empty, the "str" property
is valid UTF-8, and the byte length and character count are probably
accurate. However, if the class is given data of unknown encoding and is
asked to decode ISO-8859-1, garbled output may occur. This is an unavoidable
artifact of converting between character encodings without knowing the
encoding the data started in.
UTF-8 does not require or benefit from a Byte-Order Mark, yet some programs
(e.g., Microsoft Notepad) will still put a BOM into their files. By default,
this class removes the unnecessary and unwanted BOM(s), if any, from the
input strings.
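The behaviour described in this section can be approximated by the stand-in
below. It is a hypothetical sketch, not the source of this class: the name
Utf8Sketch, the property names "bytes" and "chars", the use of byte offsets
in "error", and the rule that ISO-8859-1 decoding is attempted only when
validation fails are all assumptions made for illustration. Only "str" and
"error" are names taken from the description above.

```php
<?php
// Hypothetical stand-in for illustration only; NOT the README's UTF8 class.
final class Utf8Sketch
{
    public string $str = '';
    public array $error = [];   // byte offsets that failed validation (assumed)
    public int $bytes = 0;      // assumed property name
    public int $chars = 0;      // assumed property name

    // One well-formed UTF-8 sequence, anchored at the current offset.
    private const SEQ = '/\G(?:[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})/';

    public function __construct(string $str, bool $decodeLatin1 = false, bool $stripBom = true)
    {
        if ($stripBom) {
            // Remove every leading UTF-8 BOM (EF BB BF).
            while (strncmp($str, "\xEF\xBB\xBF", 3) === 0) {
                $str = substr($str, 3);
            }
        }
        $this->scan($str);
        if ($decodeLatin1 && $this->error !== []) {
            // Assumption: treat every byte as ISO-8859-1 and re-encode it.
            $this->scan(self::latin1ToUtf8($str));
        }
    }

    // Walk the string one sequence at a time, recording bad byte offsets.
    private function scan(string $str): void
    {
        $this->str   = $str;
        $this->error = [];
        $this->bytes = strlen($str);
        $this->chars = 0;
        for ($i = 0; $i < $this->bytes;) {
            if (preg_match(self::SEQ, $str, $m, 0, $i) === 1) {
                $i += strlen($m[0]);
                $this->chars++;
            } else {
                $this->error[] = $i++;
            }
        }
    }

    private static function latin1ToUtf8(string $s): string
    {
        return preg_replace_callback('/[\x80-\xFF]/', static function (array $m): string {
            $cp = ord($m[0]);
            return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
        }, $s);
    }
}

$ok    = new Utf8Sketch("\xEF\xBB\xBFcaf\xC3\xA9"); // BOM + UTF-8 "café"
$bad   = new Utf8Sketch("caf\xE9");                 // ISO-8859-1 "café"
$fixed = new Utf8Sketch("caf\xE9", true);           // same bytes, decoding on

echo $ok->bytes, ' ', $ok->chars, "\n";             // 5 4
echo implode(',', $bad->error), "\n";               // 3
echo bin2hex($fixed->str), "\n";                    // 636166c3a9
```

Note that the decoding path is a guess about the input. Bytes that are not
valid UTF-8 and are not ISO-8859-1 either will still come out garbled.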
A method of the class, "extended_ascii_to_utf8()", provides a conversion
that is more accurate than the native PHP functions.
See the "demo" script for examples.
References:
https://www.joelonsoftware.com/articles/Unicode.html (Old but wonderful)
https://iconoun.com/articles/collisions/ (My take on the issues)
https://stackoverflow.com/a/11709412 (Tony Ferrara did good work here)
https://www.unicode.org/versions/Unicode11.0.0/
https://www.unicode.org/reports/tr36/#Ill-Formed_Subsequences
http://php.net/manual/en/book.mbstring.php
http://php.net/manual/en/function.utf8-encode.php
http://php.net/manual/en/function.chr.php
http://www.asciitable.com/
http://en.wikipedia.org/wiki/UTF-8
http://php.net/manual/en/function.mb-detect-encoding.php#112391