Web Encodings
Revision as of 14:44, 11 September 2009
Attempt at fixing the Web encoding problem.
Goals
- Document existing practices by describing for each browser
- The list of supported encodings.
- The list of supported labels for those encodings.
- The matching algorithm for labels.
- Converge the various algorithms in use by
- Defining a list of encodings everyone has to support. Browsers must not support more encodings than on that list.
- Defining a list of supported labels for those encodings. Browsers must not support more labels than on that list.
- Defining the matching algorithm. (HTML5 has been updated with a better one now.)
- Get the new rules implemented
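The "closed list" idea in the goals above can be sketched as a small registry. This is only an illustration: the three encodings and their labels are picked from the tables further down this page, and `SUPPORTED`, `build_label_index` and `resolve` are hypothetical names, not a proposed list or API.

```python
# Hypothetical excerpt of a closed list: encoding -> labels.
# The entries are illustrative only, not a proposal.
SUPPORTED = {
    "utf-8": ["utf-8", "utf8"],
    "koi8-r": ["koi8-r", "cskoi8r", "koi8r"],
    "euc-kr": ["euc-kr", "cseuckr", "korean"],
}


def build_label_index(supported):
    """Invert encoding -> labels into label -> encoding.

    Raises ValueError if two encodings claim the same label, since a
    label must identify exactly one encoding.
    """
    index = {}
    for encoding, labels in supported.items():
        for label in labels:
            if label in index:
                raise ValueError(
                    f"label {label!r} claimed by {index[label]!r} and {encoding!r}"
                )
            index[label] = encoding
    return index


LABEL_INDEX = build_label_index(SUPPORTED)


def resolve(label):
    """Return the encoding for an already-normalised label, or None.

    None means "not on the list", i.e. the label must not be supported.
    """
    return LABEL_INDEX.get(label)
```

Normalising the label before the lookup is the job of the matching algorithm and is deliberately left out here.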
Documenting more exactly the encoding (Unicode stream + encoding -> byte stream) and decoding (byte stream + encoding -> Unicode stream) algorithms for each encoding and getting that implemented interoperably would also be great.
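As a rough illustration of the two directions, the sketch below uses Python's built-in codecs as a stand-in. That is an assumption made for the example only: Python's mapping tables are not necessarily the ones any browser implements, which is exactly why the algorithms would need documenting.

```python
def encode(text: str, encoding: str) -> bytes:
    """Unicode stream + encoding -> byte stream."""
    return text.encode(encoding)


def decode(data: bytes, encoding: str) -> str:
    """Byte stream + encoding -> Unicode stream."""
    return data.decode(encoding)


# The same character yields different bytes under different encodings.
latin1_bytes = encode("é", "iso-8859-1")  # one byte
utf8_bytes = encode("é", "utf-8")         # two bytes

# Decoding only round-trips when the same encoding is used in both directions.
round_trip = decode(latin1_bytes, "iso-8859-1")
mismatch = decode(latin1_bytes, "koi8-r")  # valid, but a different character
```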
Current Implementations
Does this differ per platform? Opera might differ a bit on Mac.
Data
Integrate this awesome data somehow:
- http://coq.no/character-tables/mime/en
- http://coq.no/character-tables/mime/iso-2022/en
- http://coq.no/character-tables/mime/euc/en
- http://coq.no/character-tables/mime/locale-specific/en
Opera
Matching
UTS22 matching, plus stripping of leading x characters. (This is the behaviour for now; the plan is to switch to stripping leading and trailing whitespace and then matching ASCII case-insensitively.)
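A minimal sketch of UTS22 charset alias matching, as I understand the three steps in UTS #22 (drop everything except ASCII letters and digits, lowercase, then delete each 0 not preceded by a digit). The `opera_style_key` helper is a guess at how the "leading x" stripping combines with it; the order of the two steps and both function names are assumptions, not taken from Opera's code.

```python
import re


def uts22_normalise(label: str) -> str:
    """Normalise a label following the UTS #22 alias matching steps."""
    # Step 1: delete everything except a-z, A-Z and 0-9.
    s = re.sub(r"[^A-Za-z0-9]", "", label)
    # Step 2: map A-Z to a-z (the string is pure ASCII at this point).
    s = s.lower()
    # Step 3: left to right, delete each 0 that is not preceded by a digit.
    # "Preceded" is checked against the output so far, so runs of leading
    # zeros disappear completely ("008" -> "8").
    out = []
    for ch in s:
        if ch == "0" and not (out and out[-1].isdigit()):
            continue
        out.append(ch)
    return "".join(out)


def opera_style_key(label: str) -> str:
    """UTS22 normalisation followed by removal of leading 'x' characters.

    Assumption: stripping happens after normalisation and removes every
    leading 'x', which is one reading of "strips leading x characters".
    """
    return uts22_normalise(label).lstrip("x")
```

Under these assumptions `x-mac-ce` yields `macce` and `ISO_8859-1:1987` yields `iso885911987`, which agrees with the label forms listed in the table below.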
Encodings
Encoding | Labels | Decoded As | Notes |
---|---|---|---|
big5 | big5, cnbig5, csbig5 | ||
big5-hkscs | big5hkscs | ||
euc-jp | cseucpkdfmtjapanese, eucjp, extendedunixcodepackedformatforjapanese | ||
euc-kr | cseuckr, csksc56011987, euckr, isoir149, korean, ksc5601, ksc56011987, ksc56011989, windows949 | ||
euc-tw | euctw | ||
gb18030 | gb18030 | ||
gbk | chinese, cngb, cp936, csgb2312, csiso58gb231280, euccn, gb2312, gb231280, gbk, isoir58, ms936, windows936 | ||
hz-gb-2312 | hzgb2312 | ||
ibm866 | 866, cp866, csibm866, ibm866 | ||
iso-2022-cn | iso2022cn | ||
iso-2022-jp | csiso2022jp, iso2022jp | ||
iso-2022-jp-1 | iso2022jp1 | ||
iso-2022-kr | csiso2022kr, iso2022kr | ||
iso-8859-1 | cp819, csisolatin1, ibm819, iso88591, iso885911987, isoir100, l1, latin1 | windows-1252 | |
iso-8859-2 | csisolatin2, iso88592, iso885921987, isoir101, l2, latin2 | ||
iso-8859-3 | csisolatin3, iso88593, iso885931988, isoir109, l3, latin3 | ||
iso-8859-4 | csisolatin4, iso88594, iso885941988, isoir110, l4, latin4 | ||
iso-8859-5 | csisolatincyrillic, cyrillic, iso88595, iso885951988, isoir144 | ||
iso-8859-6 | arabic, asmo708, csiso88596e, csisolatinarabic, ecma114, iso88596, iso885961987, iso88596e, isoir127 | ||
iso-8859-6-i | csiso88596i, iso88596i | ||
iso-8859-7 | csisolatingreek, ecma118, elot928, greek, greek8, iso88597, iso885971987, isoir126 | ||
iso-8859-8 | csiso88598e, csisolatinhebrew, hebrew, iso88598, iso885981988, iso88598e, isoir138, visual | ||
iso-8859-8-i | csiso88598i, iso88598i | ||
iso-8859-9 | csisolatin5, iso88599, iso885991989, isoir148, l5, latin5 | ||
iso-8859-10 | csisolatin6, iso885910, iso8859101992, isoir157, l6, latin6 | ||
iso-8859-11 | iso885911, tis620, tis6202533, windows874 | Actually implemented as windows-874 | |
iso-8859-13 | iso885913 | ||
iso-8859-14 | iso885914, iso8859141998, isoceltic, isoir199, l8, latin8 | ||
iso-8859-15 | iso885915, latin9 | ||
iso-8859-16 | iso885916, iso8859162001, isoir226, l10, latin10 | ||
koi8-r | cskoi8r, koi8r | ||
koi8-u | koi8u | ||
macintosh | csmacintosh, mac, macintosh, macroman | Likely disabled. | |
shift_jis | cp932, csshiftjis, cswindows31j, ms932, mskanji, shiftjis, sjis, windows31j | ||
tcvn | tcvn, viettcvn | ||
us-ascii | ansix341968, ansix341986, ascii, cp367, csascii, csinvariant, csiso646basic1983, ibm367, invariant, iso646basic1983, iso646irv1991, iso646us, isoir6, ref, us, usascii | windows-1252 | |
utf-16 | csunicode, csunicode11, csunicodeascii, iso10646j1, iso10646ucs2, iso10646ucsbasic, utf16 | ||
utf-16be | utf16be | ||
utf-16le | utf16le | ||
utf-8 | utf8 | ||
viscii | csviscii, viscii | ||
windows-1250 | cp1250, microsoftcp1250, windows1250 | ||
windows-1251 | cp1251, microsoftcp1251, windows1251 | ||
windows-1252 | cp1252, microsoftcp1252, windows1252 | ||
windows-1253 | cp1253, microsoftcp1253, windows1253 | ||
windows-1254 | cp1254, microsoftcp1254, windows1254 | ||
windows-1255 | cp1255, microsoftcp1255, windows1255 | ||
windows-1256 | cp1256, microsoftcp1256, windows1256 | ||
windows-1257 | cp1257, microsoftcp1257, windows1257 | ||
windows-1258 | cp1258, microsoftcp1258, windows1258 | ||
windows-sami-2 | samiws2, windowssami2, ws2 | ||
x-mac-ce | macce | Likely disabled. | |
x-mac-cyrillic | maccyrillic | Likely disabled. | |
x-mac-greek | macgreek | Likely disabled. | |
x-mac-turkish | macturkish | Likely disabled. | |
x-vps | vps |
Firefox
Matching
ASCII lowercasing.
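To make "ASCII lowercasing" concrete, here is a small sketch contrasting it with full Unicode lowercasing. The function name is made up for the example; it is not a Mozilla API.

```python
def ascii_lowercase(label: str) -> str:
    """Lowercase only A-Z and leave every other character untouched."""
    return "".join(
        chr(ord(c) + 0x20) if "A" <= c <= "Z" else c
        for c in label
    )


plain = ascii_lowercase("Shift_JIS")

# U+212A KELVIN SIGN looks like "K". Unicode lowercasing folds it to "k",
# ASCII lowercasing does not, so the label below stays distinct from "koi8-r".
tricky = "\u212aOI8-R"
unicode_folded = tricky.lower()
ascii_folded = ascii_lowercase(tricky)
```

Note that nothing else is normalised: hyphens, underscores and whitespace all remain significant, unlike with UTS22 matching.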
Encodings
Encoding | Labels | Decoded As | Notes |
---|---|---|---|
armscii-8 | armscii-8 | ||
Big5 | big5, csbig5, x-x-big5, zh_tw-big5 | ||
Big5-HKSCS | big5-hkscs | ||
EUC-JP | cseucjpkdfmtjapanese, euc-jp, x-euc-jp | ||
EUC-KR | 5601, csksc56011987, csueckr, euc-kr, iso-ir-149, korean, ks_c_5601-1989, ksc5601, ksc_5601 | x-windows-949 | This converter is asymmetric. In the ToUnicode direction, it is generous and acts as windows-949. It also supports 8-byte sequences for the 8,822 Hangul syllables not encoded as precomposed forms in KS X 1001. In the FromUnicode direction, it is strict and generates 8-byte sequences for those 8,822 Hangul syllables instead of the 2-byte sequences used in windows-949. |
gb18030 | gb18030 | ||
GB2312 | chinese, csgb2312, csiso58gb231280, gb2312, gb_2312, gb_2312-80, iso-ir-58, zh_cn.euc | x-gbk | |
GEOSTD8 | geostd8 | Does not seem to work. | |
HZ-GB-2312 | hz-gb-2312 | ||
IBM850 | 850, cp850, csIBM850, ibm850 | csIBM850 not recognised. | |
IBM852 | 852, cp852, csIBM852, ibm852 | csIBM852 not recognised. | |
IBM855 | 855, cp855, csIBM855, ibm855 | csIBM855 not recognised. | |
IBM857 | 857, cp857, csIBM857, ibm857 | csIBM857 not recognised. | |
IBM862 | 862, cp862, csIBM862, ibm862 | csIBM862 not recognised. | |
IBM864 | 864, cp864, csIBM864, ibm-864, ibm864 | csIBM864 not recognised. | |
IBM864i | 864i, cp864i, csibm864i, ibm-864i, ibm864i | ||
IBM866 | 866, cp-866, cp866, csIBM866, ibm866 | csIBM866 not recognised. | |
ISO-2022-CN | iso-2022-cn, iso-2022-cn-ext | ||
ISO-2022-JP | csiso2022jp, csiso2022jp2, iso-2022-jp, iso-2022-jp-2 | ||
ISO-2022-KR | csiso2022kr, iso-2022-kr | ||
ISO-8859-1 | cp819, csisolatin1, ibm819, iso-8859-1, iso-ir-100, iso8859-1, iso88591, iso_8859-1, l1, latin1 | windows-1252 | |
ISO-8859-10 | csisolatin6, iso-8859-10, iso-ir-157, iso8859-10, iso885910, l6, latin6 | ||
ISO-8859-11 | iso-8859-11, iso8859-11, iso885911 | windows-874 | |
ISO-8859-13 | iso-8859-13, iso8859-13, iso885913 | ||
ISO-8859-14 | iso-8859-14, iso8859-14, iso885914 | ||
ISO-8859-15 | iso-8859-15, iso8859-15, iso885915, iso_8859-15 | ||
ISO-8859-16 | iso-8859-16 | ||
ISO-8859-2 | csisolatin2, iso-8859-2, iso-ir-101, iso8859-2, iso88592, iso_8859-2, l2, latin2 | ||
ISO-8859-3 | csisolatin3, iso-8859-3, iso-ir-109, iso8859-3, iso88593, iso_8859-3, l3, latin3 | ||
ISO-8859-4 | csisolatin4, iso-8859-4, iso-ir-110, iso8859-4, iso88594, iso_8859-4, l4, latin4 | ||
ISO-8859-5 | csisolatincyrillic, cyrillic, iso-8859-5, iso-ir-144, iso8859-5, iso88595, iso_8859-5 | ||
ISO-8859-6 | arabic, asmo-708, csisolatinarabic, ecma-114, iso-8859-6, iso-ir-127, iso8859-6, iso88596, iso_8859-6 | ||
ISO-8859-6-E | csiso88596e, iso-8859-6-e | ||
ISO-8859-6-I | csiso88596i, iso-8859-6-i | ||
ISO-8859-7 | csisolatingreek, ecma-118, elot_928, greek, greek8, iso-8859-7, iso-ir-126, iso8859-7, iso88597, iso_8859-7, sun_eu_greek | ||
ISO-8859-8 | csisolatinhebrew, hebrew, iso-8859-8, iso-ir-138, iso8859-8, iso88598, iso_8859-8, visual | ||
ISO-8859-8-E | csiso88598e, iso-8859-8-e | ||
ISO-8859-8-I | csiso88598i, iso-8859-8-i, iso-8859-8i | ||
ISO-8859-9 | csisolatin5, iso-8859-9, iso-ir-148, iso8859-9, iso88599, iso_8859-9, l5, latin5 | ||
ISO-IR-111 | csiso111ecmacyrillic, ecma-cyrillic, iso-ir-111 | ||
KOI8-R | koi8-r | ||
KOI8-U | koi8-u | ||
Shift_JIS | csshiftjis, ms_kanji, shift-jis, shift_jis, windows-31j, x-sjis | ||
T.61-8bit | csiso103t618bit, iso-ir-103, t.61, t.61-8bit | ||
TIS-620 | tis-620, tis620 | windows-874 | |
us-ascii | 646, ansi_x3.4-1968, ascii, us-ascii | windows-1252 | |
UTF-16 | utf-16 | Recognized as BE or LE by BOM or byte sniffing | |
UTF-16BE | csunicode, csunicode11, csunicodeascii, csunicodelatin1, iso-10646, iso-10646-j-1, iso-10646-ucs-2, iso-10646-ucs-basic, iso-10646-unicode-latin1, utf-16be, x-iso-10646-ucs-2-be | ||
UTF-16LE | utf-16le, x-iso-10646-ucs-2-le | ||
UTF-32 | utf-32 | Recognized as BE or LE by BOM or byte sniffing | |
UTF-32BE | iso-10646-ucs-4, utf-32be, x-iso-10646-ucs-4-be | ||
UTF-32LE | utf-32le, x-iso-10646-ucs-4-le | ||
UTF-7 | csunicode11utf7, unicode-1-1-utf-7, unicode-2-0-utf-7, utf-7, x-unicode-2-0-utf-7 | ||
UTF-8 | unicode-1-1-utf-8, utf-8, utf8 | ||
VISCII | csviscii, viscii | ||
windows-1250 | cp1250, windows-1250, x-cp1250 | ||
windows-1251 | ansi-1251, cp1251, windows-1251, x-cp1251 | ||
windows-1252 | cp1252, windows-1252, x-cp1252 | ||
windows-1253 | cp1253, windows-1253, x-cp1253 | ||
windows-1254 | cp1254, windows-1254, x-cp1254 | ||
windows-1255 | cp1255, windows-1255, x-cp1255 | ||
windows-1256 | cp1256, windows-1256, x-cp1256 | ||
windows-1257 | cp1257, windows-1257, x-cp1257 | ||
windows-1258 | cp1258, windows-1258, x-cp1258 | ||
windows-874 | ibm874, windows-874 | ||
windows-936 | windows-936 | ||
x-euc-tw | cns11643, x-euc-tw, zh_tw-euc | ||
x-gbk | gbk, x-gbk | ||
x-imap4-modified-utf7 | x-imap4-modified-utf7 | ||
x-johab | x-johab | ||
x-mac-arabic | x-mac-arabic | ||
x-mac-ce | x-mac-ce | ||
x-mac-croatian | x-mac-croatian | ||
x-mac-cyrillic | x-mac-cyrillic | ||
x-mac-devanagari | x-mac-devanagari | ||
x-mac-farsi | x-mac-farsi | ||
x-mac-greek | x-mac-greek | ||
x-mac-gujarati | x-mac-gujarati | ||
x-mac-gurmukhi | x-mac-gurmukhi | ||
x-mac-hebrew | x-mac-hebrew | ||
x-mac-icelandic | x-mac-icelandic | ||
x-mac-roman | csMacintosh, mac, macintosh, x-mac-roman | csMacintosh not recognised. | |
x-mac-romanian | x-mac-romanian | ||
x-mac-turkish | x-mac-turkish | ||
x-mac-ukrainian | x-mac-ukrainian | ||
x-user-defined | x-user-defined | ||
x-viet-tcvn5712 | x-viet-tcvn5712 | ||
x-viet-vps | x-viet-vps | ||
x-windows-949 | ks_c_5601-1987, x-windows-949 |
Table generated from <http://mxr.mozilla.org/mozilla1.9.1/source/intl/uconv/src/charsetalias.properties> (corresponds to Firefox 3.5.2).
Aliases (used for parsing, apparently not for serialisation) are scattered around in a large number of files; cf. <http://mxr.mozilla.org/firefox/source/intl/uconv/ucvlatin/nsISO885911ToUnicode.cpp> for the mapping from ISO-8859-11 to windows-874.
8-bit encodings (excluding UTFs, CJK encodings and T.61) were tested in Firefox 3.5.1 on OS X using <http://coq.no/X/charset5/tests8bit.html>. The fail/pass results should not be taken too seriously yet, especially for the more obscure encodings.
Bugs: Filed <https://bugzilla.mozilla.org/show_bug.cgi?id=512060> for the labels marked 'not recognised' in the table above since the lack of support for these is clearly accidental rather than deliberate (though it seems to suggest that these particular labels are not particularly widely used). In most other cases, research and deliberation will be needed to distinguish between bugs and features.
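The UTF-16 and UTF-32 rows in the table above say the byte order is recognised "by BOM or byte sniffing". The BOM half of that can be sketched as follows; the byte-sniffing fallback is not described on this page, so it is left out. The function name and the returned labels are choices made for the example, not Mozilla's.

```python
import codecs

# UTF-32 has to be checked first: the UTF-32LE BOM (FF FE 00 00) begins
# with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]


def endianness_from_bom(data: bytes):
    """Return the concrete encoding implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

One consequence of checking UTF-32 first is that a UTF-16LE stream whose first character after the BOM is U+0000 would be taken for UTF-32LE. Whether that matters in practice is an open question for this sketch.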
Internet Explorer
Matching
Strips leading and trailing whitespace and then does ASCII(?) case-insensitive matching. (Matches HTML5.)
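A sketch of this style of matching, assuming "whitespace" means the HTML5 space characters (space, tab, LF, FF, CR) and that the comparison really is ASCII-only; the page itself marks the latter with a question mark. `labels_match` is a hypothetical helper name.

```python
import string

# Assumption: the HTML5 "space characters".
ASCII_WHITESPACE = " \t\n\f\r"

# Translation table that folds A-Z to a-z and nothing else.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def normalise(label: str) -> str:
    """Strip surrounding ASCII whitespace, then ASCII-lowercase."""
    return label.strip(ASCII_WHITESPACE).translate(_ASCII_LOWER)


def labels_match(candidate: str, known_label: str) -> bool:
    """Compare two labels after normalisation."""
    return normalise(candidate) == normalise(known_label)
```

Compared with UTS22 matching this is much stricter: `  UTF-8` matches `utf-8`, but `utf8` does not, so every spelling that should work has to be listed as its own label.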
Encodings
Encoding | Labels | Decoded As | Notes |
---|---|---|---|
(???) | cp930 | (code page: 50930) | |
(???) | cp933 | (code page: 50933) | |
(???) | cp935 | (code page: 50935) | |
(???) | cp937 | (code page: 50937) | |
(???) | cp939 | (code page: 50939) | |
(???) | x-cp21027 | (code page: 21027) | |
(???) | x-cp50229 | (code page: 50229) | |
(???) | x-ebcdic-japaneseanduscanada | (code page: 50931) | |
asmo-708 | asmo-708 | (code page: 708) | |
big5 | big5, big5-hkscs, cn-big5, csbig5, x-x-big5 | (code page: 950) | |
cp1025 | cp1025 | (code page: 21025) EBCDIC Cyrillic Multilingual | |
cp866 | cp866, ibm866 | (code page: 866) | |
cp875 | cp875 | (code page: 875) EBCDIC Greece | |
csiso2022jp | csiso2022jp | (code page: 50221) | |
dos-720 | dos-720 | (code page: 720) | |
dos-862 | cp862, dos-862, ibm862 | (code page: 862) | |
euc-cn | euc-cn, x-euc-cn | (code page: 51936) | |
euc-jp | cseucpkdfmtjapanese, euc-jp, extended_unix_code_packed_format_for_japanese, iso-2022-jpeuc, x-euc, x-euc-jp | (code page: 51932) | |
euc-kr | cseuckr, euc-kr, iso-2022-kr-8, iso-2022-kr-8bit | (code page: 51949) | |
gb18030 | gb18030 | (code page: 54936) | |
gb2312 | chinese, cn-gb, csgb2312, csgb231280, csiso58gb231280, gb2312, gb2312-80, gb231280, gbk, gb_2312-80, iso-ir-58 | (code page: 936) | |
hz-gb-2312 | hz-gb-2312 | (code page: 52936) | |
ibm-thai | csibmthai, ibm-thai | (code page: 20838) EBCDIC Thailand | |
ibm00858 | ccsid00858, cp00858, cp858, ibm00858, pc-multilingual-850+euro | (code page: 858) | |
ibm00924 | ccsid00924, cp00924, ebcdic-latin9--euro, ibm00924 | (code page: 20924) EBCDIC Latin 9 | |
ibm01047 | ibm01047 | (code page: 1047) EBCDIC Latin 1/Open Systems | |
ibm01140 | ccsid01140, cp01140, ebcdic-us-37+euro, ibm01140 | (code page: 1140) EBCDIC USA, Canada, etc. ECECP | |
ibm01141 | ccsid01141, cp01141, ebcdic-de-273+euro, ibm01141 | (code page: 1141) EBCDIC Austria, Germany ECECP | |
ibm01142 | ccsid01142, cp01142, ebcdic-dk-277+euro, ebcdic-no-277+euro, ibm01142 | (code page: 1142) EBCDIC Denmark, Norway ECECP | |
ibm01143 | ccsid01143, cp01143, ebcdic-fi-278+euro, ebcdic-se-278+euro, ibm01143 | (code page: 1143) EBCDIC Finland, Sweden ECECP | |
ibm01144 | ccsid01144, cp01144, ebcdic-it-280+euro, ibm01144 | (code page: 1144) EBCDIC Italy ECECP | |
ibm01145 | ccsid01145, cp01145, ebcdic-es-284+euro, ibm01145 | (code page: 1145) EBCDIC Spain, Latin America (Spanish) | |
ibm01146 | ccsid01146, cp01146, ebcdic-gb-285+euro, ibm01146 | (code page: 1146) EBCDIC UK ECECP | |
ibm01147 | ccsid01147, cp01147, ebcdic-fr-297+euro, ibm01147 | (code page: 1147) EBCDIC France ECECP | |
ibm01148 | ccsid01148, cp01148, ebcdic-international-500+euro, ibm01148 | (code page: 1148) EBCDIC International ECECP | |
ibm01149 | ccsid01149, cp01149, ebcdic-is-871+euro, ibm01149 | (code page: 1149) EBCDIC Iceland ECECP | |
ibm037 | cp037, csibm037, ebcdic-cp-ca, ebcdic-cp-nl, ebcdic-cp-us, ebcdic-cp-wt, ibm037 | (code page: 37) EBCDIC USA/Canada - CECP | |
ibm1026 | cp1026, csibm1026, ibm1026 | (code page: 1026) EBCDIC Latin #5 - Turkey | |
ibm273 | cp273, csibm273, ibm273 | (code page: 20273) EBCDIC Germany F.R./Austria - CECP | |
ibm277 | csibm277, ebcdic-cp-dk, ebcdic-cp-no, ibm277 | (code page: 20277) EBCDIC Denmark, Norway - CECP | |
ibm278 | cp278, csibm278, ebcdic-cp-fi, ebcdic-cp-se, ibm278 | (code page: 20278) EBCDIC Finland, Sweden - CECP | |
ibm280 | cp280, csibm280, ebcdic-cp-it, ibm280 | (code page: 20280) EBCDIC Italy - CECP | |
ibm284 | cp284, csibm284, ebcdic-cp-es, ibm284 | (code page: 20284) EBCDIC Spain/Latin America - CECP | |
ibm285 | cp285, csibm285, ebcdic-cp-gb, ibm285 | (code page: 20285) EBCDIC United Kingdom - CECP | |
ibm290 | cp290, csibm290, ebcdic-jp-kana, ibm290 | (code page: 20290) EBCDIC Japanese (Katakana) Extended | |
ibm297 | cp297, csibm297, ebcdic-cp-fr, ibm297 | (code page: 20297) EBCDIC France - CECP | |
ibm420 | cp420, csibm420, ebcdic-cp-ar1, ibm420 | (code page: 20420) EBCDIC Arabic Bilingual | |
ibm423 | cp423, csibm423, ebcdic-cp-gr, ibm423 | (code page: 20423) EBCDIC Greece - 183 | |
ibm424 | cp424, csibm424, ebcdic-cp-he, ibm424 | (code page: 20424) EBCDIC Israel (Hebrew) | |
ibm437 | 437, cp437, cspc8codepage437, ibm437 | (code page: 437) | |
ibm500 | cp500, csibm500, ebcdic-cp-be, ebcdic-cp-ch, ibm500 | (code page: 500) EBCDIC International #5 | |
ibm737 | ibm737 | (code page: 737) | |
ibm775 | ibm775 | (code page: 775) | |
ibm850 | cp850, ibm850 | (code page: 850) | |
ibm852 | cp852, ibm852 | (code page: 852) | |
ibm855 | cp855, ibm855 | (code page: 855) | |
ibm857 | cp857, ibm857 | (code page: 857) | |
ibm860 | cp860, ibm860 | (code page: 860) | |
ibm861 | cp861, ibm861 | (code page: 861) | |
ibm863 | cp863, ibm863 | (code page: 863) | |
ibm864 | cp864, ibm864 | (code page: 864) | |
ibm865 | cp865, ibm865 | (code page: 865) | |
ibm869 | cp869, ibm869 | (code page: 869) | |
ibm870 | cp870, csibm870, ebcdic-cp-roece, ebcdic-cp-yu, ibm870 | (code page: 870) EBCDIC Latin 2, Multilingual | |
ibm871 | cp871, csibm871, ebcdic-cp-is, ibm871 | (code page: 20871) EBCDIC Iceland | |
ibm880 | cp880, csibm880, ebcdic-cyrillic, ibm880 | (code page: 20880) EBCDIC Cyrillic, Multilingual | |
ibm905 | cp905, csibm905, ebcdic-cp-tr, ibm905 | (code page: 20905) EBCDIC Latin 3 | |
iso-2022-jp | iso-2022-jp | (code page: 50220) | |
iso-2022-kr | csiso2022kr, iso-2022-kr, iso-2022-kr-7, iso-2022-kr-7bit | (code page: 50225) | |
iso-8859-1 | cp819, csisolatin1, ibm819, iso-8859-1, iso-ir-100, iso8859-1, iso_8859-1, iso_8859-1:1987, l1, latin1 | (code page: 28591) | |
iso-8859-13 | iso-8859-13 | (code page: 28603) | |
iso-8859-15 | csisolatin9, iso-8859-15, iso_8859-15, l9, latin9 | (code page: 28605) | |
iso-8859-2 | csisolatin2, iso-8859-2, iso-ir-101, iso8859-2, iso_8859-2, iso_8859-2:1987, l2, latin2 | (code page: 28592) | |
iso-8859-3 | csisolatin3, iso-8859-3, iso-ir-109, iso_8859-3, iso_8859-3:1988, l3, latin3 | (code page: 28593) | |
iso-8859-4 | csisolatin4, iso-8859-4, iso-ir-110, iso_8859-4, iso_8859-4:1988, l4, latin4 | (code page: 28594) | |
iso-8859-5 | csisolatincyrillic, cyrillic, iso-8859-5, iso-ir-144, iso_8859-5, iso_8859-5:1988 | (code page: 28595) | |
iso-8859-6 | arabic, csisolatinarabic, ecma-114, iso-8859-6, iso-ir-127, iso_8859-6, iso_8859-6:1987 | (code page: 28596) | |
iso-8859-7 | csisolatingreek, ecma-118, elot_928, greek, greek8, iso-8859-7, iso-ir-126, iso_8859-7, iso_8859-7:1987 | (code page: 28597) | |
iso-8859-8 | csisolatinhebrew, hebrew, iso-8859-8, iso-8859-8 visual, iso-ir-138, iso_8859-8, iso_8859-8:1988, logical, visual | (code page: 28598) | |
iso-8859-8-i | iso-8859-8-i | (code page: 38598) | |
iso-8859-9 | csisolatin5, iso-8859-9, iso-ir-148, iso_8859-9, iso_8859-9:1989, l5, latin5 | (code page: 28599) | |
johab | johab | (code page: 1361) | |
koi8-r | cskoi8r, koi, koi8, koi8-r, koi8r | (code page: 20866) | |
koi8-u | koi8-ru, koi8-u | (code page: 21866) | |
ks_c_5601-1987 | csksc56011987, iso-ir-149, korean, ks-c-5601, ks-c5601, ksc5601, ksc_5601, ks_c_5601, ks_c_5601-1987, ks_c_5601-1989, ks_c_5601_1987 | (code page: 949) | |
macintosh | macintosh | (code page: 10000) | |
shift_jis | csshiftjis, cswindows31j, ms_kanji, shift-jis, shift_jis, sjis, windows-31j, x-ms-cp932, x-sjis | (code page: 932) | |
unicodefffe | unicodefffe, utf-16be | (code page: 1201) | |
us-ascii | ansi_x3.4-1968, ansi_x3.4-1986, ascii, cp367, csascii, ibm367, iso-ir-6, iso646-us, iso_646.irv:1991, us, us-ascii | (code page: 20127) | |
utf-16 | iso-10646-ucs-2, ucs-2, unicode, utf-16, utf-16le | (code page: 1200) | |
utf-7 | csunicode11utf7, unicode-1-1-utf-7, unicode-2-0-utf-7, utf-7, x-unicode-1-1-utf-7, x-unicode-2-0-utf-7 | (code page: 65000) | |
utf-8 | unicode-1-1-utf-8, unicode-2-0-utf-8, utf-8, x-unicode-1-1-utf-8, x-unicode-2-0-utf-8 | (code page: 65001) | |
windows-1250 | windows-1250, x-cp1250 | (code page: 1250) | |
windows-1251 | windows-1251, x-cp1251 | (code page: 1251) | |
windows-1252 | windows-1252, x-ansi | (code page: 1252) | |
windows-1253 | windows-1253 | (code page: 1253) | |
windows-1254 | windows-1254 | (code page: 1254) | |
windows-1255 | windows-1255 | (code page: 1255) | |
windows-1256 | cp1256, windows-1256 | (code page: 1256) | |
windows-1257 | windows-1257 | (code page: 1257) | |
windows-1258 | windows-1258 | (code page: 1258) | |
windows-874 | dos-874, iso-8859-11, tis-620, windows-874 | (code page: 874) | |
x-chinese-cns | x-chinese-cns | (code page: 20000) | |
x-chinese-eten | x-chinese-eten | (code page: 20002) | |
x-cp20001 | x-cp20001 | (code page: 20001) | |
x-cp20003 | x-cp20003 | (code page: 20003) | |
x-cp20004 | x-cp20004 | (code page: 20004) | |
x-cp20005 | x-cp20005 | (code page: 20005) | |
x-cp20261 | x-cp20261 | (code page: 20261) | |
x-cp20269 | x-cp20269 | (code page: 20269) | |
x-cp20936 | x-cp20936 | (code page: 20936) | |
x-cp20949 | x-cp20949 | (code page: 20949) | |
x-cp50227 | x-cp50227 | (code page: 50227) | |
x-ebcdic-koreanextended | x-ebcdic-koreanextended | (code page: 20833) | |
x-europa | x-europa | (code page: 29001) | |
x-ia5 | irv, x-ia5 | (code page: 20105) | |
x-ia5-german | din_66003, german, x-ia5-german | (code page: 20106) | |
x-ia5-norwegian | norwegian, ns_4551-1, x-ia5-norwegian | (code page: 20108) | |
x-ia5-swedish | sen_850200_b, swedish, x-ia5-swedish | (code page: 20107) | |
x-iscii-as | x-iscii-as | (code page: 57006) | |
x-iscii-be | x-iscii-be | (code page: 57003) | |
x-iscii-de | x-iscii-de | (code page: 57002) | |
x-iscii-gu | x-iscii-gu | (code page: 57010) | |
x-iscii-ka | x-iscii-ka | (code page: 57008) | |
x-iscii-ma | x-iscii-ma | (code page: 57009) | |
x-iscii-or | x-iscii-or | (code page: 57007) | |
x-iscii-pa | x-iscii-pa | (code page: 57011) | |
x-iscii-ta | x-iscii-ta | (code page: 57004) | |
x-iscii-te | x-iscii-te | (code page: 57005) | |
x-mac-arabic | x-mac-arabic | (code page: 10004) | |
x-mac-ce | x-mac-ce | (code page: 10029) | |
x-mac-chinesesimp | x-mac-chinesesimp | (code page: 10008) | |
x-mac-chinesetrad | x-mac-chinesetrad | (code page: 10002) | |
x-mac-croatian | x-mac-croatian | (code page: 10082) | |
x-mac-cyrillic | x-mac-cyrillic | (code page: 10007) | |
x-mac-greek | x-mac-greek | (code page: 10006) | |
x-mac-hebrew | x-mac-hebrew | (code page: 10005) | |
x-mac-icelandic | x-mac-icelandic | (code page: 10079) | |
x-mac-japanese | x-mac-japanese | (code page: 10001) | |
x-mac-korean | x-mac-korean | (code page: 10003) | |
x-mac-romanian | x-mac-romanian | (code page: 10010) | |
x-mac-thai | x-mac-thai | (code page: 10021) | |
x-mac-turkish | x-mac-turkish | (code page: 10081) | |
x-mac-ukrainian | x-mac-ukrainian | (code page: 10017) |
Needs sorting out:
- http://blogs.msdn.com/shawnste/archive/2009/08/18/alternate-encoding-names-recognized-by-net-ie.aspx
- http://msdn.microsoft.com/en-us/library/system.text.encoding.getencodings.aspx
- http://web.archive.org/web/20080204211015/http://www.hitachi-to.co.jp/prod/prod_2/inter/emk/help/TextEncoder/CodePage.htm
Safari
Matching
UTS22
Encodings
Apple’s reference <http://developer.apple.com/documentation/Carbon/Conceptual/ProgWithTECM/tecmgr_encnames/tecmgr_encnames.html#//apple_ref/doc/uid/TP40000932-CH204-TPXREF103> is incomplete or at least severely out of date even as a description of the Text Encoding Conversion Manager (TECM).
Querying TECM programmatically gives labels for more encodings, but the set of encodings does not match what Safari currently supports.
According to webkit/WebCore/platform/text/TextCodecICU.cpp, WebKit now uses ICU <http://site.icu-project.org/> with additional aliases (webkit/WebCore/platform/text/TextCodecICU.cpp), additional encodings (webkit/WebCore/platform/text/mac/mac-encodings.txt) possibly implemented using TECM at least on the Mac, a list of official IANA labels (webkit/WebCore/platform/text/mac/character-sets.txt) and probably a few more which I have not noticed.
ICU 4.2’s icu/source/data/mappings/convrtrs.txt or <http://demo.icu-project.org/icu-bin/convexp> lists encodings and labels not supported in Safari 4.0 on Leopard, and webkit/WebCore/platform/text/TextCodecICU.cpp mentions that Tiger included ICU 3.2.
Does encoding support in Safari depend on OS and OS version, and will Safari ultimately support everything in ICU 4.2? -- The Windows edition comes with ICU but does not support some of the encodings the Mac edition does.
The version of ICU used by Mac Safari depends on the OS version. For instance, 10.5 (Leopard) comes with ICU 3.6(?) and Snow Leopard probably comes with 4.0(?). Windows Safari currently ships with ICU 4.0.
Integrate notes from othermaciej in http://krijnhoetmer.nl/irc-logs/whatwg/20090909 Also: http://trac.webkit.org/browser/trunk/WebCore/platform/text
Chrome
Similar to Safari with some customizations in ICU alias tables. Chrome 3.0 has ICU 3.8 plus customizations for EUC-JP (to match IE/Firefox). For EUC-KR and GBK, we use different mapping tables than used by Safari (which just uses ICU's default tables for them). ISO-8859-16 is also added.
Thoughts (Anne)
If it can be agreed that all non-UTF-8 and non-UTF-16 encodings are legacy encodings, I personally would not mind advocating that we drop support for US-ASCII and ISO-8859-1 completely in favor of Windows-1252 (and do the same for similar situations). That is, the US-ASCII and ISO-8859-1 labels would simply map to Windows-1252. This should simplify code a little bit as well.
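A small sketch of what that mapping would mean in practice, using Python's codecs as a stand-in for a browser's decoders. The `LEGACY_REDIRECTS` table and `decode_with_redirect` function are hypothetical names for the illustration.

```python
# Hypothetical redirect table: requested encoding -> encoding actually used.
LEGACY_REDIRECTS = {
    "us-ascii": "windows-1252",
    "iso-8859-1": "windows-1252",
}


def decode_with_redirect(data: bytes, encoding: str) -> str:
    """Decode data, treating redirected encodings as their target."""
    effective = LEGACY_REDIRECTS.get(encoding, encoding)
    return data.decode(effective)


# Bytes 0x93 and 0x94 are curly quotes in windows-1252 but C1 control
# characters in ISO-8859-1 proper.
data = b"\x93Hi\x94"
strict_latin1 = data.decode("iso-8859-1")                 # control characters
as_redirected = decode_with_redirect(data, "iso-8859-1")  # curly quotes
```

Pure ASCII input decodes identically either way, so the redirect only changes the result for bytes 0x80 and above.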
I also think that we should ban UTF-7, UTF-32 and all EBCDIC encodings. This is already mostly done by HTML5.