https://wiki.whatwg.org/index.php?title=Web_Encodings&diff=3970
Web Encodings
2009-08-23T15:40:29Z
<p>Smontagu: /* Firefox */ Updated to Firefox 3.5.2 source; removed aliases that appear in the properties file but are not implemented</p>
<hr />
<div>My scratchpad for encoding-related notes.<br />
<br />
== Goals ==<br />
<br />
* Document existing practices for<br />
** Supported encodings<br />
** Supported aliases<br />
** Supported matching algorithm<br />
* Converge the various algorithms in use<br />
* Get the new rules implemented<br />
<br />
== Current Implementations ==<br />
<br />
Do implementations differ per platform? Opera might behave slightly differently on Mac.<br />
<br />
=== Data ===<br />
<br />
Integrate this awesome data somehow:<br />
<br />
* http://coq.no/character-tables/mime/en<br />
* http://coq.no/character-tables/mime/iso-2022/en<br />
* http://coq.no/character-tables/mime/euc/en<br />
* http://coq.no/character-tables/mime/locale-specific/en<br />
<br />
=== Opera ===<br />
<br />
==== Matching ====<br />
<br />
UTS22 and ASCII lowercasing.<br />
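<br />
A minimal sketch of that matching, assuming "UTS22" refers to the charset alias matching rule in UTS #22 (drop everything except ASCII letters and digits, lowercase, then drop each 0 that is not preceded by a digit). The function names and the four-entry alias excerpt are illustrative only and are not Opera's code:<br />
<br />
```python
import string
from typing import Optional

_ASCII_ALNUM = set(string.ascii_letters + string.digits)

def uts22_key(label: str) -> str:
    """Reduce a charset label to a comparison key (one reading of UTS #22 matching)."""
    kept = []
    for ch in label:
        if ch not in _ASCII_ALNUM:
            continue  # punctuation, spaces and non-ASCII characters are ignored
        ch = ch.lower()  # safe: ch is known to be ASCII at this point
        # Drop a zero unless the previously kept character is a digit.
        if ch == "0" and not (kept and kept[-1].isdigit()):
            continue
        kept.append(ch)
    return "".join(kept)

# Hypothetical excerpt of the alias table below, keyed by normalized label.
OPERA_ALIASES = {
    "utf8": "utf-8",
    "latin1": "iso-8859-1",
    "iso885911987": "iso-8859-1",
    "sjis": "shift_jis",
}

def opera_lookup(label: str) -> Optional[str]:
    """Return the encoding for a label, or None if the label is unknown."""
    return OPERA_ALIASES.get(uts22_key(label))
```
<br />
Under this reading, "UTF-8", "utf8" and "u.t.f-008" all reduce to the key utf8, while "utf-80" reduces to utf80 and does not match. This would also explain why the aliases in the table below contain no hyphens or underscores.<br />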
<br />
==== Encodings ====<br />
<br />
{|border=1 cellpadding=4 cellspacing=0<br />
!| Encoding<br />
!| Aliases<br />
!| Decoded As<br />
!| Notes<br />
|-<br />
| big5<br />
| big5, cnbig5, csbig5<br />
|<br />
|<br />
|-<br />
| big5-hkscs<br />
| big5hkscs<br />
|<br />
|<br />
|-<br />
| euc-jp<br />
| cseucpkdfmtjapanese, eucjp, extendedunixcodepackedformatforjapanese<br />
|<br />
|<br />
|-<br />
| euc-kr<br />
| cseuckr, csksc56011987, euckr, isoir149, korean, ksc5601, ksc56011987, ksc56011989, windows949<br />
|<br />
|<br />
|-<br />
| euc-tw<br />
| euctw<br />
|<br />
|<br />
|-<br />
| gb18030<br />
| gb18030<br />
|<br />
|<br />
|-<br />
| gbk<br />
| chinese, cngb, cp936, csgb2312, csiso58gb231280, euccn, gb2312, gb231280, gbk, isoir58, ms936, windows936<br />
|<br />
|<br />
|-<br />
| hz-gb-2312<br />
| hzgb2312<br />
|<br />
|<br />
|-<br />
| ibm866<br />
| 866, cp866, csibm866, ibm866<br />
|<br />
|<br />
|-<br />
| iso-2022-cn<br />
| iso2022cn<br />
|<br />
|<br />
|-<br />
| iso-2022-jp<br />
| csiso2022jp, iso2022jp<br />
|<br />
|<br />
|-<br />
| iso-2022-jp-1<br />
| iso2022jp1<br />
|<br />
|<br />
|-<br />
| iso-2022-kr<br />
| csiso2022kr, iso2022kr<br />
|<br />
|<br />
|-<br />
| iso-8859-1<br />
| cp819, csisolatin1, ibm819, iso88591, iso885911987, isoir100, l1, latin1<br />
| windows-1252<br />
|<br />
|-<br />
| iso-8859-2<br />
| csisolatin2, iso88592, iso885921987, isoir101, l2, latin2<br />
|<br />
|<br />
|-<br />
| iso-8859-3<br />
| csisolatin3, iso88593, iso885931988, isoir109, l3, latin3<br />
|<br />
|<br />
|-<br />
| iso-8859-4<br />
| csisolatin4, iso88594, iso885941988, isoir110, l4, latin4<br />
|<br />
|<br />
|-<br />
| iso-8859-5<br />
| csisolatincyrillic, cyrillic, iso88595, iso885951988, isoir144<br />
|<br />
|<br />
|-<br />
| iso-8859-6<br />
| arabic, asmo708, csiso88596e, csisolatinarabic, ecma114, iso88596, iso885961987, iso88596e, isoir127<br />
|<br />
|<br />
|-<br />
| iso-8859-6-i<br />
| csiso88596i, iso88596i<br />
|<br />
|<br />
|-<br />
| iso-8859-7<br />
| csisolatingreek, ecma118, elot928, greek, greek8, iso88597, iso885971987, isoir126<br />
|<br />
|<br />
|-<br />
| iso-8859-8<br />
| csiso88598e, csisolatinhebrew, hebrew, iso88598, iso885981988, iso88598e, isoir138, visual<br />
|<br />
|<br />
|-<br />
| iso-8859-8-i<br />
| csiso88598i, iso88598i<br />
|<br />
|<br />
|-<br />
| iso-8859-9<br />
| csisolatin5, iso88599, iso885991989, isoir148, l5, latin5<br />
|<br />
|<br />
|-<br />
| iso-8859-10<br />
| csisolatin6, iso885910, iso8859101992, isoir157, l6, latin6<br />
|<br />
|<br />
|-<br />
| iso-8859-11<br />
| iso885911, tis620, tis6202533, windows874<br />
|<br />
|<br />
|-<br />
| iso-8859-13<br />
| iso885913<br />
|<br />
|<br />
|-<br />
| iso-8859-14<br />
| iso885914, iso8859141998, isoceltic, isoir199, l8, latin8<br />
|<br />
|<br />
|-<br />
| iso-8859-15<br />
| iso885915, latin9<br />
|<br />
|<br />
|-<br />
| iso-8859-16<br />
| iso885916, iso8859162001, isoir226, l10, latin10<br />
|<br />
|<br />
|-<br />
| koi8-r<br />
| cskoi8r, koi8r<br />
|<br />
|<br />
|-<br />
| koi8-u<br />
| koi8u<br />
|<br />
|<br />
|-<br />
| macintosh<br />
| csmacintosh, mac, macintosh, macroman<br />
|<br />
| Likely disabled.<br />
|-<br />
| shift_jis<br />
| cp932, csshiftjis, cswindows31j, ms932, mskanji, shiftjis, sjis, windows31j<br />
|<br />
|<br />
|-<br />
| tcvn<br />
| tcvn, viettcvn<br />
|<br />
|<br />
|-<br />
| us-ascii<br />
| ansix341968, ansix341986, ascii, cp367, csascii, csinvariant, csiso646basic1983, ibm367, invariant, iso646basic1983, iso646irv1991, iso646us, isoir6, ref, us, usascii<br />
| windows-1252<br />
|<br />
|-<br />
| utf-16<br />
| csunicode, csunicode11, csunicodeascii, iso10646j1, iso10646ucs2, iso10646ucsbasic, utf16<br />
|<br />
|<br />
|-<br />
| utf-16be<br />
| utf16be<br />
|<br />
|<br />
|-<br />
| utf-16le<br />
| utf16le<br />
|<br />
|<br />
|-<br />
| utf-8<br />
| utf8<br />
|<br />
|<br />
|-<br />
| viscii<br />
| csviscii, viscii<br />
|<br />
|<br />
|-<br />
| windows-1250<br />
| cp1250, microsoftcp1250, windows1250<br />
|<br />
|<br />
|-<br />
| windows-1251<br />
| cp1251, microsoftcp1251, windows1251<br />
|<br />
|<br />
|-<br />
| windows-1252<br />
| cp1252, microsoftcp1252, windows1252<br />
|<br />
|<br />
|-<br />
| windows-1253<br />
| cp1253, microsoftcp1253, windows1253<br />
|<br />
|<br />
|-<br />
| windows-1254<br />
| cp1254, microsoftcp1254, windows1254<br />
|<br />
|<br />
|-<br />
| windows-1255<br />
| cp1255, microsoftcp1255, windows1255<br />
|<br />
|<br />
|-<br />
| windows-1256<br />
| cp1256, microsoftcp1256, windows1256<br />
|<br />
|<br />
|-<br />
| windows-1257<br />
| cp1257, microsoftcp1257, windows1257<br />
|<br />
|<br />
|-<br />
| windows-1258<br />
| cp1258, microsoftcp1258, windows1258<br />
|<br />
|<br />
|-<br />
| windows-sami-2<br />
| samiws2, windowssami2, ws2<br />
|<br />
|<br />
|-<br />
| x-mac-ce<br />
| macce<br />
|<br />
| Likely disabled.<br />
|-<br />
| x-mac-cyrillic<br />
| maccyrillic<br />
|<br />
| Likely disabled.<br />
|-<br />
| x-mac-greek<br />
| macgreek<br />
|<br />
| Likely disabled.<br />
|-<br />
| x-mac-turkish<br />
| macturkish<br />
|<br />
| Likely disabled.<br />
|-<br />
| x-vps<br />
| vps<br />
|<br />
|<br />
|}<br />
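<br />
The "Decoded As" column can be read as an override from the labelled encoding to the decoder that is actually used. A hedged sketch of the effect for the two rows that have an entry, using Python's cp1252 and latin-1 codecs as stand-ins for the browser decoders (the dictionary and function names are hypothetical):<br />
<br />
```python
# Rows of the table above whose "Decoded As" column is filled in.
DECODED_AS = {
    "iso-8859-1": "windows-1252",
    "us-ascii": "windows-1252",
}

# Python codec names for the encodings used in this sketch (Python-specific detail).
PYTHON_CODEC = {
    "windows-1252": "cp1252",
    "iso-8859-1": "latin-1",
}

def effective_encoding(encoding: str) -> str:
    """Return the encoding whose decoder is used for the given encoding."""
    return DECODED_AS.get(encoding, encoding)

def decode_as_browser(data: bytes, encoding: str) -> str:
    """Decode data the way the table suggests, not the way the label says."""
    name = effective_encoding(encoding)
    return data.decode(PYTHON_CODEC.get(name, name))

sample = b"\x93quoted\x94"
browser_text = decode_as_browser(sample, "iso-8859-1")  # curly quotes (U+201C, U+201D)
strict_text = sample.decode("latin-1")                  # C1 control characters
```
<br />
The difference only shows up for bytes 0x80 to 0x9F. Python's cp1252 codec rejects a few unassigned bytes such as 0x81; how browsers treat those bytes is not covered here.<br />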
<br />
=== Firefox ===<br />
<br />
==== Matching ====<br />
<br />
ASCII lowercasing.<br />
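<br />
A minimal sketch of this kind of lookup, assuming the label is lowercased over A-Z only and then compared exactly against the alias list. The five-entry excerpt is taken from the table below; the function names are hypothetical, and whether Firefox also trims whitespace is not modelled:<br />
<br />
```python
import string
from typing import Optional

# Map only A-Z to a-z; str.lower() would also change non-ASCII letters.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def ascii_lower(label: str) -> str:
    """Lowercase ASCII letters and leave every other character untouched."""
    return label.translate(_ASCII_LOWER)

# Hypothetical excerpt of the Firefox alias table.
FIREFOX_ALIASES = {
    "latin1": "ISO-8859-1",
    "iso_8859-1": "ISO-8859-1",
    "x-sjis": "Shift_JIS",
    "utf8": "UTF-8",
    "utf-8": "UTF-8",
}

def firefox_lookup(label: str) -> Optional[str]:
    """Return the encoding for a label, or None if the label is unknown."""
    return FIREFOX_ALIASES.get(ascii_lower(label))
```
<br />
With this scheme "Latin1" and "X-SJIS" match, but "latin-1" does not, because punctuation is significant and every spelling has to be listed as its own alias.<br />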
<br />
==== Encodings ====<br />
<br />
{|border=1 cellpadding=4 cellspacing=0<br />
!| Encoding<br />
!| Aliases<br />
!| Decoded As<br />
!| Notes<br />
|-<br />
| armscii-8<br />
| armscii-8<br />
|<br />
|<br />
|-<br />
| Big5<br />
| big5, csbig5, x-x-big5, zh_tw-big5<br />
|<br />
|<br />
|-<br />
| Big5-HKSCS<br />
| big5-hkscs<br />
|<br />
|<br />
|-<br />
| EUC-JP<br />
| cseucjpkdfmtjapanese, euc-jp, x-euc-jp<br />
|<br />
|<br />
|-<br />
| EUC-KR<br />
| 5601, csksc56011987, csueckr, euc-kr, iso-ir-149, korean, ks_c_5601-1989, ksc5601, ksc_5601<br />
| x-windows-949<br />
|<br />
|-<br />
| gb18030<br />
| gb18030<br />
|<br />
|<br />
|-<br />
| GB2312<br />
| chinese, csgb2312, csiso58gb231280, gb2312, gb_2312, gb_2312-80, iso-ir-58, zh_cn.euc<br />
| x-gbk<br />
|<br />
|-<br />
| GEOSTD8<br />
| geostd8<br />
|<br />
| Does not seem to work.<br />
|-<br />
| HZ-GB-2312<br />
| hz-gb-2312<br />
|<br />
|<br />
|-<br />
| IBM850<br />
| 850, cp850, csIBM850, ibm850<br />
|<br />
| csIBM850 not recognised.<br />
|-<br />
| IBM852<br />
| 852, cp852, csIBM852, ibm852<br />
|<br />
| csIBM852 not recognised.<br />
|-<br />
| IBM855<br />
| 855, cp855, csIBM855, ibm855<br />
|<br />
| csIBM855 not recognised.<br />
|-<br />
| IBM857<br />
| 857, cp857, csIBM857, ibm857<br />
|<br />
| csIBM857 not recognised.<br />
|-<br />
| IBM862<br />
| 862, cp862, csIBM862, ibm862<br />
|<br />
| csIBM862 not recognised.<br />
|-<br />
| IBM864<br />
| 864, cp864, csIBM864, ibm-864, ibm864<br />
|<br />
| csIBM864 not recognised.<br />
|-<br />
| IBM864i<br />
| 864i, cp864i, csibm864i, ibm-864i, ibm864i<br />
|<br />
|<br />
|-<br />
| IBM866<br />
| 866, cp-866, cp866, csIBM866, ibm866<br />
|<br />
| csIBM866 not recognised.<br />
|-<br />
| ISO-2022-CN<br />
| iso-2022-cn, iso-2022-cn-ext<br />
|<br />
|<br />
|-<br />
| ISO-2022-JP<br />
| csiso2022jp, csiso2022jp2, iso-2022-jp, iso-2022-jp-2<br />
|<br />
|<br />
|-<br />
| ISO-2022-KR<br />
| csiso2022kr, iso-2022-kr<br />
|<br />
|<br />
|-<br />
| ISO-8859-1<br />
| cp819, csisolatin1, ibm819, iso-8859-1, iso-ir-100, iso8859-1, iso88591, iso_8859-1, l1, latin1<br />
| windows-1252<br />
|<br />
|-<br />
| ISO-8859-10<br />
| csisolatin6, iso-8859-10, iso-ir-157, iso8859-10, iso885910, l6, latin6<br />
|<br />
|<br />
|-<br />
| ISO-8859-11<br />
| iso-8859-11, iso8859-11, iso885911<br />
| windows-874<br />
|<br />
|-<br />
| ISO-8859-13<br />
| iso-8859-13, iso8859-13, iso885913<br />
|<br />
|<br />
|-<br />
| ISO-8859-14<br />
| iso-8859-14, iso8859-14, iso885914<br />
|<br />
|<br />
|-<br />
| ISO-8859-15<br />
| iso-8859-15, iso8859-15, iso885915, iso_8859-15<br />
|<br />
|<br />
|-<br />
| ISO-8859-16<br />
| iso-8859-16<br />
|<br />
|<br />
|-<br />
| ISO-8859-2<br />
| csisolatin2, iso-8859-2, iso-ir-101, iso8859-2, iso88592, iso_8859-2, l2, latin2<br />
|<br />
|<br />
|-<br />
| ISO-8859-3<br />
| csisolatin3, iso-8859-3, iso-ir-109, iso8859-3, iso88593, iso_8859-3, l3, latin3<br />
|<br />
|<br />
|-<br />
| ISO-8859-4<br />
| csisolatin4, iso-8859-4, iso-ir-110, iso8859-4, iso88594, iso_8859-4, l4, latin4<br />
|<br />
|<br />
|-<br />
| ISO-8859-5<br />
| csisolatincyrillic, cyrillic, iso-8859-5, iso-ir-144, iso8859-5, iso88595, iso_8859-5<br />
|<br />
|<br />
|-<br />
| ISO-8859-6<br />
| arabic, asmo-708, csisolatinarabic, ecma-114, iso-8859-6, iso-ir-127, iso8859-6, iso88596, iso_8859-6<br />
|<br />
|<br />
|-<br />
| ISO-8859-6-E<br />
| csiso88596e, iso-8859-6-e<br />
|<br />
|<br />
|-<br />
| ISO-8859-6-I<br />
| csiso88596i, iso-8859-6-i<br />
|<br />
|<br />
|-<br />
| ISO-8859-7<br />
| csisolatingreek, ecma-118, elot_928, greek, greek8, iso-8859-7, iso-ir-126, iso8859-7, iso88597, iso_8859-7, sun_eu_greek<br />
|<br />
|<br />
|-<br />
| ISO-8859-8<br />
| csisolatinhebrew, hebrew, iso-8859-8, iso-ir-138, iso8859-8, iso88598, iso_8859-8, visual<br />
|<br />
|<br />
|-<br />
| ISO-8859-8-E<br />
| csiso88598e, iso-8859-8-e<br />
|<br />
|<br />
|-<br />
| ISO-8859-8-I<br />
| csiso88598i, iso-8859-8-i, iso-8859-8i<br />
|<br />
|<br />
|-<br />
| ISO-8859-9<br />
| csisolatin5, iso-8859-9, iso-ir-148, iso8859-9, iso88599, iso_8859-9, l5, latin5<br />
|<br />
|<br />
|-<br />
| ISO-IR-111<br />
| csiso111ecmacyrillic, ecma-cyrillic, iso-ir-111<br />
|<br />
|<br />
|-<br />
| KOI8-R<br />
| koi8-r<br />
|<br />
|<br />
|-<br />
| KOI8-U<br />
| koi8-u<br />
|<br />
|<br />
|-<br />
| Shift_JIS<br />
| csshiftjis, ms_kanji, shift-jis, shift_jis, windows-31j, x-sjis<br />
|<br />
|<br />
|-<br />
| T.61-8bit<br />
| csiso103t618bit, iso-ir-103, t.61, t.61-8bit<br />
|<br />
|<br />
|-<br />
| TIS-620<br />
| tis-620, tis620<br />
| windows-874<br />
|<br />
|-<br />
| us-ascii<br />
| 646, ansi_x3.4-1968, ascii, us-ascii<br />
| windows-1252<br />
|<br />
|-<br />
| UTF-16<br />
| utf-16<br />
|<br />
| Recognized as BE or LE by BOM or byte sniffing<br />
|-<br />
| UTF-16BE<br />
| csunicode, csunicode11, csunicodeascii, csunicodelatin1, iso-10646, iso-10646-j-1, iso-10646-ucs-2, iso-10646-ucs-basic, iso-10646-unicode-latin1, utf-16be, x-iso-10646-ucs-2-be<br />
|<br />
|<br />
|-<br />
| UTF-16LE<br />
| utf-16le, x-iso-10646-ucs-2-le<br />
|<br />
|<br />
|-<br />
| UTF-32<br />
| utf-32<br />
|<br />
| Recognized as BE or LE by BOM or byte sniffing<br />
|-<br />
| UTF-32BE<br />
| iso-10646-ucs-4, utf-32be, x-iso-10646-ucs-4-be<br />
|<br />
|<br />
|-<br />
| UTF-32LE<br />
| utf-32le, x-iso-10646-ucs-4-le<br />
|<br />
|<br />
|-<br />
| UTF-7<br />
| csunicode11utf7, unicode-1-1-utf-7, unicode-2-0-utf-7, utf-7, x-unicode-2-0-utf-7<br />
|<br />
|<br />
|-<br />
| UTF-8<br />
| unicode-1-1-utf-8, utf-8, utf8<br />
|<br />
|<br />
|-<br />
| VISCII<br />
| csviscii, viscii<br />
|<br />
|<br />
|-<br />
| windows-1250<br />
| cp1250, windows-1250, x-cp1250<br />
|<br />
|<br />
|-<br />
| windows-1251<br />
| ansi-1251, cp1251, windows-1251, x-cp1251<br />
|<br />
|<br />
|-<br />
| windows-1252<br />
| cp1252, windows-1252, x-cp1252<br />
|<br />
|<br />
|-<br />
| windows-1253<br />
| cp1253, windows-1253, x-cp1253<br />
|<br />
|<br />
|-<br />
| windows-1254<br />
| cp1254, windows-1254, x-cp1254<br />
|<br />
|<br />
|-<br />
| windows-1255<br />
| cp1255, windows-1255, x-cp1255<br />
|<br />
|<br />
|-<br />
| windows-1256<br />
| cp1256, windows-1256, x-cp1256<br />
|<br />
|<br />
|-<br />
| windows-1257<br />
| cp1257, windows-1257, x-cp1257<br />
|<br />
|<br />
|-<br />
| windows-1258<br />
| cp1258, windows-1258, x-cp1258<br />
|<br />
|<br />
|-<br />
| windows-874<br />
| ibm874, windows-874<br />
|<br />
|<br />
|-<br />
| windows-936<br />
| windows-936<br />
|<br />
|<br />
|-<br />
| x-euc-tw<br />
| cns11643, x-euc-tw, zh_tw-euc<br />
|<br />
|<br />
|-<br />
| x-gbk<br />
| gbk, x-gbk<br />
|<br />
|<br />
|-<br />
| x-imap4-modified-utf7<br />
| x-imap4-modified-utf7<br />
|<br />
|<br />
|-<br />
| x-johab<br />
| x-johab<br />
|<br />
|<br />
|-<br />
| x-mac-arabic<br />
| x-mac-arabic<br />
|<br />
|<br />
|-<br />
| x-mac-ce<br />
| x-mac-ce<br />
|<br />
|<br />
|-<br />
| x-mac-croatian<br />
| x-mac-croatian<br />
|<br />
|<br />
|-<br />
| x-mac-cyrillic<br />
| x-mac-cyrillic<br />
|<br />
|<br />
|-<br />
| x-mac-devanagari<br />
| x-mac-devanagari<br />
|<br />
|<br />
|-<br />
| x-mac-farsi<br />
| x-mac-farsi<br />
|<br />
|<br />
|-<br />
| x-mac-greek<br />
| x-mac-greek<br />
|<br />
|<br />
|-<br />
| x-mac-gujarati<br />
| x-mac-gujarati<br />
|<br />
|<br />
|-<br />
| x-mac-gurmukhi<br />
| x-mac-gurmukhi<br />
|<br />
|<br />
|-<br />
| x-mac-hebrew<br />
| x-mac-hebrew<br />
|<br />
|<br />
|-<br />
| x-mac-icelandic<br />
| x-mac-icelandic<br />
|<br />
|<br />
|-<br />
| x-mac-roman<br />
| csMacintosh, mac, macintosh, x-mac-roman<br />
|<br />
| csMacintosh not recognised.<br />
|-<br />
| x-mac-romanian<br />
| x-mac-romanian<br />
|<br />
|<br />
|-<br />
| x-mac-turkish<br />
| x-mac-turkish<br />
|<br />
|<br />
|-<br />
| x-mac-ukrainian<br />
| x-mac-ukrainian<br />
|<br />
|<br />
|-<br />
| x-user-defined<br />
| x-user-defined<br />
|<br />
|<br />
|-<br />
| x-viet-tcvn5712<br />
| x-viet-tcvn5712<br />
|<br />
|<br />
|-<br />
| x-viet-vps<br />
| x-viet-vps<br />
|<br />
|<br />
|-<br />
| x-windows-949<br />
| ks_c_5601-1987, x-windows-949<br />
|<br />
|<br />
|}<br />
<br />
Table generated from <http://mxr.mozilla.org/mozilla1.9.1/source/intl/uconv/src/charsetalias.properties> (corresponds to Firefox 3.5.2).<br />
<br />
Aliases (used for parsing, apparently not for serialisation) are scattered around in a large number of files; cf. <http://mxr.mozilla.org/firefox/source/intl/uconv/ucvlatin/nsISO885911ToUnicode.cpp> for the mapping from ISO-8859-11 to windows-874.<br />
<br />
8-bit encodings (excluding UTFs, CJK encodings and T.61) were tested with <http://coq.no/X/charset5/tests8bit.html> in Firefox 3.5.1 on OS X. The pass/fail results should not be taken too seriously yet, especially for the more obscure encodings.<br />
<br />
'''Missing:''' Mappings for CJK encodings. Information about the encodings that are listed in charsetalias.properties but apparently do not work as expected.<br />
<br />
'''Bugs:''' Filed <https://bugzilla.mozilla.org/show_bug.cgi?id=512060> for the labels marked 'not recognised' in the table above. The lack of support for these is clearly accidental rather than deliberate, though it also suggests that these labels are not widely used. In most other cases, research and deliberation will be needed to distinguish bugs from features.<br />
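<br />
The UTF-16 and UTF-32 rows in the table above note that byte order is resolved "by BOM or byte sniffing". A hedged sketch of the BOM half only, using the BOM constants from Python's codecs module; the function name is hypothetical and the byte-sniffing fallback is deliberately left out:<br />
<br />
```python
import codecs
from typing import Optional

# BOMs to test for each endianness-neutral label.
_BOMS = {
    "UTF-16": (
        (codecs.BOM_UTF16_BE, "UTF-16BE"),
        (codecs.BOM_UTF16_LE, "UTF-16LE"),
    ),
    "UTF-32": (
        (codecs.BOM_UTF32_BE, "UTF-32BE"),
        (codecs.BOM_UTF32_LE, "UTF-32LE"),
    ),
}

def resolve_endianness(encoding: str, data: bytes) -> Optional[str]:
    """Pick the BE or LE variant from a leading BOM, or None if there is no BOM."""
    for bom, variant in _BOMS.get(encoding, ()):
        if data.startswith(bom):
            return variant
    return None  # a real implementation would fall back to byte sniffing here
```
<br />
Only the BOMs belonging to the labelled family are checked, which sidesteps the fact that the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.<br />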
<br />
=== Chrome ===<br />
<br />
FIXME<br />
<br />
=== Internet Explorer ===<br />
<br />
Needs sorting out:<br />
<br />
* http://blogs.msdn.com/shawnste/archive/2009/08/18/alternate-encoding-names-recognized-by-net-ie.aspx<br />
* http://msdn.microsoft.com/en-us/library/system.text.encoding.getencodings.aspx<br />
* http://web.archive.org/web/20080204211015/http://www.hitachi-to.co.jp/prod/prod_2/inter/emk/help/TextEncoder/CodePage.htm<br />
<br />
=== Safari ===<br />
<br />
Apple’s reference <http://developer.apple.com/documentation/Carbon/Conceptual/ProgWithTECM/tecmgr_encnames/tecmgr_encnames.html#//apple_ref/doc/uid/TP40000932-CH204-TPXREF103> is incomplete or at least severely out of date even as a description of the Text Encoding Conversion Manager (TECM).<br />
<br />
Querying TECM programmatically gives labels for more encodings, but the set of encodings does not match what Safari currently supports.<br />
<br />
According to webkit/WebCore/platform/text/TextCodecICU.cpp, WebKit now uses ICU <http://site.icu-project.org/>, supplemented by:<br />
<br />
* additional aliases (webkit/WebCore/platform/text/TextCodecICU.cpp)<br />
* additional encodings (webkit/WebCore/platform/text/mac/mac-encodings.txt), possibly implemented using TECM, at least on the Mac<br />
* a list of official IANA labels (webkit/WebCore/platform/text/mac/character-sets.txt)<br />
* probably a few more sources which I have not noticed<br />
<br />
ICU 4.2’s icu/source/data/mappings/convrtrs.txt or <http://demo.icu-project.org/icu-bin/convexp> lists encodings and labels not supported in Safari 4.0 on Leopard, and webkit/WebCore/platform/text/TextCodecICU.cpp mentions that Tiger included ICU 3.2.<br />
<br />
'''Does encoding support in Safari depend on OS and OS version, and will Safari ultimately support everything in ICU 4.2?'''</div>
Smontagu