### Documentation for the "Demojibakefier" class.

Intended to normalise the character encoding of a given string to a preferred character encoding when the given string's byte sequences don't match the expectations of the preferred character encoding. Useful in cases where a block of data might conceivably be composed of several different unspecified, unknown encodings.

#### Why the name?

When a byte sequence doesn't conform to the expectations of a particular character encoding, and an attempt is made to render that byte sequence into readable characters using that particular character encoding, it can sometimes result in the appearance of generic replacement characters and "mojibake" (文字化け).

Wikipedia excerpt:

> Mojibake means "character transformation" in Japanese. The word is composed of 文字 (moji, IPA: [mod͡ʑi]), "character" and 化け (bake, IPA: [bäke̞], pronounced "bah-keh"), "transform".

Related trivia: The word "emoji" has similar etymology.

"Demojibakefier" is a play on the word "mojibake", so named because ideally, it should eliminate, or at least reduce, the occurrence of replacement characters, mojibake, etc.

#### What does it do?

Let's start with some sample code to reproduce a potential use-case (for the purpose of the sample code, please assume that it uses UTF-8 encoding).
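The original sample code isn't preserved in this copy of the page. As a stand-in, here is a minimal sketch of the same idea, assuming PHP's mbstring extension is available. The placeholder text and variable names are hypothetical, and the sketch uses only PHP built-ins rather than the Demojibakefier itself.

```php
<?php
// Hypothetical placeholder text (not the text used by the original sample).
// This source file is assumed to be saved as UTF-8.
$utf8 = 'これはテストです。';

// Convert a copy of the text from UTF-8 to SHIFT-JIS ("SJIS" in mbstring).
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');

// Output both. When the page is served as UTF-8, the first line is readable,
// while the second is rendered as replacement characters and mojibake.
echo $utf8 . "\n\n" . $sjis . "\n";
```

Because the second string's bytes are SHIFT-JIS, anything that tries to interpret them as UTF-8 has no sensible characters to show for them.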
When executing the above sample code via a browser request, it should produce something like this (the latter part being completely unintelligible):

> [Sample output not recoverable in this copy: the original UTF-8 placeholder text, followed by the same text converted to SHIFT-JIS and rendered as mojibake.]

In the case of our sample code, we already know that the latter uses SHIFT-JIS (because we're the ones that converted from UTF-8 to SHIFT-JIS), meaning that we could easily just convert it back to UTF-8 ourselves.

Let's try the same thing again, but this time, we'll pretend that we don't know which character encoding we've converted the placeholder text to. We'll pretend that the only thing we know is that everything should be using UTF-8. We'll use the Demojibakefier to try to automatically convert it back to UTF-8, without the need for us to specify which character encoding we're converting from.
This time, it should produce something like this (note that the output of guard is the same as our original UTF-8 text):

> [Sample output not recoverable in this copy: the original UTF-8 placeholder text, then the same text as SHIFT-JIS mojibake, then the output of guard, which matches the original UTF-8 text.]

#### How to use:
#### Demojibakefier's constructor.

To use the Demojibakefier, you'll first need to instantiate it.
Demojibakefier's constructor optionally accepts one parameter: the character encoding that it should use whenever trying to normalise data. When omitted, UTF-8 will be used. After you've instantiated the Demojibakefier, you can start demojibakefying data by using the instance's methods.

#### supported method.

Returns an array of all the character encoding types known to and supported by the Demojibakefier.
Character encoding types currently known and supported by the Demojibakefier:

- UTF-8
- UTF-16BE
- UTF-16LE
- ISO-8859-1
- CP1252
- ISO-8859-2
- ISO-8859-3
- ISO-8859-4
- ISO-8859-5
- ISO-8859-6
- ISO-8859-7
- ISO-8859-8
- ISO-8859-9
- ISO-8859-10
- ISO-8859-11
- ISO-8859-13
- ISO-8859-14
- ISO-8859-15
- ISO-8859-16
- CP1250
- CP1251
- CP1253
- CP1254
- CP1255
- CP1256
- CP1257
- CP1258
- GB18030
- GB2312
- BIG5
- SHIFT-JIS
- JOHAB
- UCS-2
- UTF-32BE
- UTF-32LE
- UCS-4
- CP437
- CP737
- CP775
- CP850
- CP852
- CP855
- CP857
- CP860
- CP861
- CP862
- CP863
- CP864
- CP865
- CP866
- CP869
- CP874
- KOI8-RU
- KOI8-R
- KOI8-U
- KOI8-F
- KOI8-T
- CP037
- CP500
- CP858
- CP875
- CP1026

Note that the reliability of the Demojibakefier's ability to normalise strings, and of using it to convert a string between two particular character encoding types, can vary significantly, depending on the character encoding types in question and the length and nature of the string in question, among other factors.

Note also that the Demojibakefier doesn't possess the same qualities as a linguistic translator, and isn't designed to test the intelligibility of strings beyond the conformity of their byte sequences to the various character encoding types supported by the class, or beyond the few rudimentary heuristics that it implements (such as the comparative likelihood of particular byte sequences occurring within the kinds of texts that typically utilise particular character encoding types).
This means that an entirely unintelligible string could be regarded as already conformant, and therefore not be normalised, as long as its byte sequence conforms to the instance's target character encoding. Conversely, the Demojibakefier could theoretically produce an entirely unintelligible string when the provided string doesn't conform to the instance's target character encoding, but does conform to one or more of the other character encoding types supported by the class, passes all heuristics, and happens to read unintelligibly in the character encoding types that it conforms to.

#### checkConformity method.

Checks for byte sequences that shouldn't normally appear in a specified character encoding (the second parameter) as a means of roughly guessing whether the string (the first parameter) likely conforms to the specified character encoding. The second parameter is optional, defaulting to the instance's default character encoding when omitted (the character encoding provided to the constructor at instantiation, or UTF-8 when none was provided). Returns true when the string conforms (per specs), or false otherwise.
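To illustrate what a byte-level conformity check can and can't tell you, here is a minimal sketch built on PHP's `mb_check_encoding` (mbstring assumed). The helper name `looksConformant` is hypothetical and isn't part of the Demojibakefier, whose own checks may well differ.

```php
<?php
// Hypothetical helper: approximates a conformity check using mbstring.
// It only tests whether the bytes are valid in the given encoding.
function looksConformant(string $string, string $encoding = 'UTF-8'): bool
{
    return mb_check_encoding($string, $encoding);
}

$utf8 = "caf\u{E9}";  // "café" encoded as UTF-8 (é = 0xC3 0xA9).
$latin1 = "caf\xE9";  // "café" encoded as ISO-8859-1 (é = 0xE9).

var_dump(looksConformant($utf8));                 // Valid UTF-8.
var_dump(looksConformant($latin1));               // Not valid UTF-8.
var_dump(looksConformant($utf8, 'ISO-8859-1'));   // Valid bytes, wrong reading.
```

The last call shows the caveat described above: every byte sequence is structurally valid ISO-8859-1, so the UTF-8 string "conforms" to it even though it would read as "cafÃ©" in that encoding. Conformity alone says nothing about intelligibility.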
#### weigh method.

Attempts to apply weighting to potential character encoding candidates based on the frequency/occurrence of specific byte sequences, and the lack thereof, within a string. This method is private and thus shouldn't be called by the implementation.

#### dropVariants method.

Drops candidates belonging to encodings that are outdated subsets or variants of other encodings with valid candidates. This method is private and thus shouldn't be called by the implementation.
#### shannonEntropy method.

Calculates the Shannon entropy of a string (the sole accepted parameter). This method isn't used by any current versions of the Demojibakefier, but its use is planned for a future version.
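For readers unfamiliar with the measure, here is a minimal standalone sketch of Shannon entropy over a string. It assumes entropy is computed per byte and expressed in bits (base 2); the documentation doesn't state either detail, and the function name `shannonEntropySketch` is hypothetical rather than the class's own implementation.

```php
<?php
// Hypothetical standalone function, not the Demojibakefier's implementation.
// Assumes byte-level symbols and base-2 logarithms (bits per byte).
function shannonEntropySketch(string $string): float
{
    $length = strlen($string);
    if ($length === 0) {
        return 0.0; // An empty string carries no information.
    }
    $entropy = 0.0;
    // count_chars() in mode 1 returns counts only for bytes that occur.
    foreach (count_chars($string, 1) as $count) {
        $probability = $count / $length;
        $entropy -= $probability * log($probability, 2);
    }
    return $entropy;
}

echo shannonEntropySketch('aaaa') . "\n"; // One distinct byte.
echo shannonEntropySketch('abcd') . "\n"; // Four equally likely bytes.
```

A string made of a single repeated byte has zero entropy, while a string whose bytes are evenly spread over many values has high entropy.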
#### normalise method.

Attempts to normalise a string (the sole accepted parameter). Returns the normalised string, or returns the string verbatim when it can't be normalised reliably and confidently, when its byte sequence already conforms to the target character encoding, or when it is empty.
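The return rules described above can be sketched roughly as follows. This is not the Demojibakefier's actual algorithm, which also weighs candidates with heuristics. The function name `normaliseSketch`, the short candidate list, and the "exactly one valid candidate" rule are all assumptions made for illustration, using only PHP's mbstring functions.

```php
<?php
// Hypothetical sketch; NOT the Demojibakefier's actual algorithm.
// The candidate list and the "exactly one match" rule are assumptions.
function normaliseSketch(
    string $string,
    string $target = 'UTF-8',
    array $candidates = ['SJIS', 'EUC-JP']
): string {
    // Empty or already conformant: return verbatim.
    if ($string === '' || mb_check_encoding($string, $target)) {
        return $string;
    }
    // Collect every candidate encoding in which the bytes are valid.
    $valid = [];
    foreach ($candidates as $candidate) {
        if (mb_check_encoding($string, $candidate)) {
            $valid[] = $candidate;
        }
    }
    // No candidate, or more than one: not confident, so return verbatim.
    if (count($valid) !== 1) {
        return $string;
    }
    return mb_convert_encoding($string, $target, $valid[0]);
}

$original = 'これはテストです。';
$sjis = mb_convert_encoding($original, 'SJIS', 'UTF-8');
var_dump(normaliseSketch($sjis) === $original);
```

Returning the input untouched whenever the guess is ambiguous trades coverage for safety: a wrong conversion would replace one kind of mojibake with another.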
When

#### guard method.

The Demojibakefier heavily relies upon PHP's

#### Last member.

The

Example usage:

CIDRAM and phpMussel do something similar on the front-end logs page, to inform users when log entry fields have been transformed by the Demojibakefier.

#### Len member.

The
Last Updated: 22 May 2019 (2019.05.22).