Web Encodings

From WHATWG Wiki
Revision as of 13:34, 6 January 2011 by Ms2ger (talk | contribs) (→‎Encodings: Add link)

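An attempt at fixing the Web encoding problem: browsers differ in which encodings they support, which labels they accept for those encodings, and how they match labels.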

Goals

  • Document existing practices by describing, for each browser:
    • The list of supported encodings.
    • The list of supported labels for those encodings.
    • The matching algorithm for labels.
  • Converge the various algorithms in use by:
    • Defining a list of encodings everyone has to support. Browsers must not support more encodings than on that list.
    • Defining a list of supported labels for those encodings. Browsers must not support more labels than on that list.
    • Defining the matching algorithm. (HTML5 has been updated with a better one now.)
  • Get the new rules implemented

Documenting more exactly the encoding (Unicode stream + encoding -> byte stream) and decoding (byte stream + encoding -> Unicode stream) algorithms for each encoding and getting that implemented interoperably would also be great.
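
As a rough illustration of the two directions described above, here is a minimal Python sketch. It relies on Python's built-in codecs, which are only a stand-in and are not claimed to match any browser's tables; the helper names `encode` and `decode` are hypothetical.

```python
def encode(text: str, encoding: str) -> bytes:
    """Unicode stream + encoding -> byte stream (strict: fail on unmappable input)."""
    return text.encode(encoding, errors="strict")


def decode(data: bytes, encoding: str) -> str:
    """Byte stream + encoding -> Unicode stream (strict: fail on invalid bytes)."""
    return data.decode(encoding, errors="strict")


sample = "Привет"
for name in ("koi8-r", "windows-1251", "utf-8"):
    raw = encode(sample, name)
    # A well-defined encoding must round-trip everything it can represent.
    assert decode(raw, name) == sample
    print(name, raw.hex())
```

The same text yields different bytes under each encoding, which is why interoperable mapping tables matter: a decoder using a slightly different table silently produces different characters.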

Current Implementations

Does this differ per platform? Opera might differ a bit on Mac.

Data

Integrate this awesome data somehow:

Opera

Matching

UTS22 matching, and additionally strips leading x characters. (The plan is to switch to stripping leading and trailing whitespace and then doing ASCII case-insensitive matching.)
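
The following Python sketch shows one reading of this matching. It assumes the UTS #22 loose-matching rule is: drop everything except ASCII letters and digits, lowercase, then delete each 0 that is not preceded by a digit. It also assumes that the leading x characters are stripped after that normalisation; the order Opera actually uses is not documented here. The function names are hypothetical.

```python
import re


def uts22_key(label: str) -> str:
    """Normalise a label per one reading of UTS #22 loose matching."""
    # Keep only ASCII letters and digits, then lowercase.
    s = re.sub(r"[^A-Za-z0-9]", "", label).lower()
    out = []
    for ch in s:
        # Delete each "0" that is not preceded by a (kept) digit.
        if ch == "0" and not (out and out[-1].isdigit()):
            continue
        out.append(ch)
    return "".join(out)


def opera_key(label: str) -> str:
    """Assumed Opera behaviour: UTS #22 key with leading 'x' characters removed."""
    return uts22_key(label).lstrip("x")


for label in ("ISO_8859-1:1987", "ANSI_X3.4-1968", "x-mac-ce", "u.t.f-008"):
    print(label, "->", opera_key(label))
```

Under these assumptions the keys come out in the same punctuation-free form as the labels listed in the Opera table below (for example `macce` and `ansix341968`).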

Encodings

Encoding Labels Decoded As Notes
7-bit us-ascii ansix341968, ansix341986, ascii, cp367, csascii, csinvariant, csiso646basic1983, ibm367, invariant, iso646basic1983, iso646irv1991, iso646us, isoir6, ref, us, usascii windows-1252
DOS ibm866 866, cp866, csibm866, ibm866
ISO iso-8859-1 cp819, csisolatin1, ibm819, iso88591, iso885911987, isoir100, l1, latin1 windows-1252
iso-8859-2 csisolatin2, iso88592, iso885921987, isoir101, l2, latin2
iso-8859-3 csisolatin3, iso88593, iso885931988, isoir109, l3, latin3
iso-8859-4 csisolatin4, iso88594, iso885941988, isoir110, l4, latin4
iso-8859-5 csisolatincyrillic, cyrillic, iso88595, iso885951988, isoir144
iso-8859-6 arabic, asmo708, csiso88596e, csisolatinarabic, ecma114, iso88596, iso885961987, iso88596e, isoir127
iso-8859-6-i csiso88596i, iso88596i
iso-8859-7 csisolatingreek, ecma118, elot928, greek, greek8, iso88597, iso885971987, isoir126
iso-8859-8 csiso88598e, csisolatinhebrew, hebrew, iso88598, iso885981988, iso88598e, isoir138, visual
iso-8859-8-i csiso88598i, iso88598i
iso-8859-9 csisolatin5, iso88599, iso885991989, isoir148, l5, latin5
iso-8859-10 csisolatin6, iso885910, iso8859101992, isoir157, l6, latin6
iso-8859-13 iso885913
iso-8859-14 iso885914, iso8859141998, isoceltic, isoir199, l8, latin8
iso-8859-15 iso885915, latin9
iso-8859-16 iso885916, iso8859162001, isoir226, l10, latin10
Win iso-8859-11 iso885911, tis620, tis6202533, windows874 Actually implemented as windows-874
windows-1250 cp1250, microsoftcp1250, windows1250
windows-1251 cp1251, microsoftcp1251, windows1251
windows-1252 cp1252, microsoftcp1252, windows1252
windows-1253 cp1253, microsoftcp1253, windows1253
windows-1254 cp1254, microsoftcp1254, windows1254
windows-1255 cp1255, microsoftcp1255, windows1255
windows-1256 cp1256, microsoftcp1256, windows1256
windows-1257 cp1257, microsoftcp1257, windows1257
windows-1258 cp1258, microsoftcp1258, windows1258
windows-sami-2 samiws2, windowssami2, ws2
Mac macintosh csmacintosh, mac, macintosh, macroman Likely disabled.
x-mac-ce macce Likely disabled.
x-mac-cyrillic maccyrillic Likely disabled.
x-mac-greek macgreek Likely disabled.
x-mac-turkish macturkish Likely disabled.
Misc. koi8-r cskoi8r, koi8r
koi8-u koi8u
tcvn tcvn, viettcvn
viscii csviscii, viscii
x-vps vps


big5 big5, cnbig5, csbig5
big5-hkscs big5hkscs
euc-jp cseucpkdfmtjapanese, eucjp, extendedunixcodepackedformatforjapanese
euc-kr cseuckr, csksc56011987, euckr, isoir149, korean, ksc5601, ksc56011987, ksc56011989, windows949
euc-tw euctw
gb18030 gb18030
gbk chinese, cngb, cp936, csgb2312, csiso58gb231280, euccn, gb2312, gb231280, gbk, isoir58, ms936, windows936
hz-gb-2312 hzgb2312
iso-2022-cn iso2022cn
iso-2022-jp csiso2022jp, iso2022jp
iso-2022-jp-1 iso2022jp1
iso-2022-kr csiso2022kr, iso2022kr
shift_jis cp932, csshiftjis, cswindows31j, ms932, mskanji, shiftjis, sjis, windows31j
utf-16 csunicode, csunicode11, csunicodeascii, iso10646j1, iso10646ucs2, iso10646ucsbasic, utf16
utf-16be utf16be
utf-16le utf16le
utf-8 utf8

Firefox

Matching

ASCII lowercasing.
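
A minimal Python sketch of ASCII-only lowercasing follows. The point it illustrates is that only A to Z are folded; a general Unicode lowercase would also fold characters such as U+212A KELVIN SIGN to "k" and could make a non-ASCII label match an ASCII one. The name `ascii_lower` is hypothetical, and no whitespace stripping is assumed.

```python
# Translation table that maps only the ASCII letters A-Z to a-z.
_ASCII_LOWER = {c: c + 32 for c in range(ord("A"), ord("Z") + 1)}


def ascii_lower(label: str) -> str:
    """Lowercase ASCII letters only; every other character is left untouched."""
    return label.translate(_ASCII_LOWER)


print(ascii_lower("ISO-8859-1"))
# The Kelvin sign is not an ASCII letter, so it must survive unchanged.
print(ascii_lower("\u212aOI8-R") == "\u212aoi8-r")
```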

Encodings

See https://wiki.mozilla.org/I18n:Charset_Aliases

Encoding Labels Decoded As Notes
7-bit us-ascii 646, ansi_x3.4-1968, ascii, us-ascii windows-1252
DOS IBM850 850, cp850, csIBM850, ibm850 csIBM850 not recognised.
IBM852 852, cp852, csIBM852, ibm852 csIBM852 not recognised.
IBM855 855, cp855, csIBM855, ibm855 csIBM855 not recognised.
IBM857 857, cp857, csIBM857, ibm857 csIBM857 not recognised.
IBM862 862, cp862, csIBM862, ibm862 csIBM862 not recognised.
IBM864 864, cp864, csIBM864, ibm-864, ibm864 csIBM864 not recognised.
IBM864i 864i, cp864i, csibm864i, ibm-864i, ibm864i
IBM866 866, cp-866, cp866, csIBM866, ibm866 csIBM866 not recognised.
ISO ISO-8859-1 cp819, csisolatin1, ibm819, iso-8859-1, iso-ir-100, iso8859-1, iso88591, iso_8859-1, l1, latin1 windows-1252
ISO-8859-2 csisolatin2, iso-8859-2, iso-ir-101, iso8859-2, iso88592, iso_8859-2, l2, latin2
ISO-8859-3 csisolatin3, iso-8859-3, iso-ir-109, iso8859-3, iso88593, iso_8859-3, l3, latin3
ISO-8859-4 csisolatin4, iso-8859-4, iso-ir-110, iso8859-4, iso88594, iso_8859-4, l4, latin4
ISO-8859-5 csisolatincyrillic, cyrillic, iso-8859-5, iso-ir-144, iso8859-5, iso88595, iso_8859-5
ISO-8859-6 arabic, asmo-708, csisolatinarabic, ecma-114, iso-8859-6, iso-ir-127, iso8859-6, iso88596, iso_8859-6
ISO-8859-6-E csiso88596e, iso-8859-6-e
ISO-8859-6-I csiso88596i, iso-8859-6-i
ISO-8859-7 csisolatingreek, ecma-118, elot_928, greek, greek8, iso-8859-7, iso-ir-126, iso8859-7, iso88597, iso_8859-7, sun_eu_greek
ISO-8859-8 csisolatinhebrew, hebrew, iso-8859-8, iso-ir-138, iso8859-8, iso88598, iso_8859-8, visual
ISO-8859-8-E csiso88598e, iso-8859-8-e
ISO-8859-8-I csiso88598i, iso-8859-8-i, iso-8859-8i
ISO-8859-9 csisolatin5, iso-8859-9, iso-ir-148, iso8859-9, iso88599, iso_8859-9, l5, latin5
ISO-8859-10 csisolatin6, iso-8859-10, iso-ir-157, iso8859-10, iso885910, l6, latin6
ISO-8859-11 iso-8859-11, iso8859-11, iso885911 windows-874
ISO-8859-13 iso-8859-13, iso8859-13, iso885913
ISO-8859-14 iso-8859-14, iso8859-14, iso885914
ISO-8859-15 iso-8859-15, iso8859-15, iso885915, iso_8859-15
ISO-8859-16 iso-8859-16
Win windows-874 ibm874, windows-874
windows-1250 cp1250, windows-1250, x-cp1250
windows-1251 ansi-1251, cp1251, windows-1251, x-cp1251
windows-1252 cp1252, windows-1252, x-cp1252
windows-1253 cp1253, windows-1253, x-cp1253
windows-1254 cp1254, windows-1254, x-cp1254
windows-1255 cp1255, windows-1255, x-cp1255
windows-1256 cp1256, windows-1256, x-cp1256
windows-1257 cp1257, windows-1257, x-cp1257
windows-1258 cp1258, windows-1258, x-cp1258
Mac x-mac-arabic x-mac-arabic
x-mac-ce x-mac-ce
x-mac-croatian x-mac-croatian
x-mac-cyrillic x-mac-cyrillic
x-mac-devanagari x-mac-devanagari
x-mac-farsi x-mac-farsi
x-mac-greek x-mac-greek
x-mac-gujarati x-mac-gujarati
x-mac-gurmukhi x-mac-gurmukhi
x-mac-hebrew x-mac-hebrew
x-mac-icelandic x-mac-icelandic
x-mac-roman csMacintosh, mac, macintosh, x-mac-roman csMacintosh not recognised.
x-mac-romanian x-mac-romanian
x-mac-turkish x-mac-turkish
x-mac-ukrainian x-mac-ukrainian
Misc. armscii-8 armscii-8
GEOSTD8 geostd8 Does not seem to work.
ISO-IR-111 csiso111ecmacyrillic, ecma-cyrillic, iso-ir-111
KOI8-R koi8-r
KOI8-U koi8-u
T.61-8bit csiso103t618bit, iso-ir-103, t.61, t.61-8bit
TIS-620 tis-620, tis620 windows-874
VISCII csviscii, viscii
x-user-defined x-user-defined
x-viet-tcvn5712 x-viet-tcvn5712
x-viet-vps x-viet-vps


Big5 big5, csbig5, x-x-big5, zh_tw-big5
Big5-HKSCS big5-hkscs
EUC-JP cseucjpkdfmtjapanese, euc-jp, x-euc-jp
EUC-KR 5601, csksc56011987, csueckr, euc-kr, iso-ir-149, korean, ks_c_5601-1989, ksc5601, ksc_5601 x-windows-949 This converter is asymmetric. In the ToUnicode direction, it is generous and acts as Windows-949. It also supports 8-byte sequences for 8,822 Hangul syllables not encoded as precomposed forms in KS X 1001. In the FromUnicode direction, it is strict and generates 8-byte sequences for those 8,822 Hangul syllables instead of the 2-byte sequences of windows-949.
gb18030 gb18030
GB2312 chinese, csgb2312, csiso58gb231280, gb2312, gb_2312, gb_2312-80, iso-ir-58, zh_cn.euc x-gbk
HZ-GB-2312 hz-gb-2312
ISO-2022-CN iso-2022-cn, iso-2022-cn-ext
ISO-2022-JP csiso2022jp, csiso2022jp2, iso-2022-jp, iso-2022-jp-2
ISO-2022-KR csiso2022kr, iso-2022-kr
Shift_JIS csshiftjis, ms_kanji, shift-jis, shift_jis, windows-31j, x-sjis
windows-936 windows-936
x-euc-tw cns11643, x-euc-tw, zh_tw-euc
x-gbk gbk, x-gbk
x-johab x-johab
x-windows-949 ks_c_5601-1987, x-windows-949
UTF-16 utf-16 Recognized as BE or LE by BOM or byte sniffing
UTF-16BE csunicode, csunicode11, csunicodeascii, csunicodelatin1, iso-10646, iso-10646-j-1, iso-10646-ucs-2, iso-10646-ucs-basic, iso-10646-unicode-latin1, utf-16be, x-iso-10646-ucs-2-be
UTF-16LE utf-16le, x-iso-10646-ucs-2-le
UTF-32 utf-32 Recognized as BE or LE by BOM or byte sniffing
UTF-32BE iso-10646-ucs-4, utf-32be, x-iso-10646-ucs-4-be
UTF-32LE utf-32le, x-iso-10646-ucs-4-le
UTF-7 csunicode11utf7, unicode-1-1-utf-7, unicode-2-0-utf-7, utf-7, x-unicode-2-0-utf-7
UTF-8 unicode-1-1-utf-8, utf-8, utf8
x-imap4-modified-utf7 x-imap4-modified-utf7

Table generated from <http://mxr.mozilla.org/mozilla1.9.1/source/intl/uconv/src/charsetalias.properties> (corresponds to Firefox 3.5.2).

Aliases (used for parsing, apparently not for serialisation) are scattered around in a large number of files; cf. <http://mxr.mozilla.org/firefox/source/intl/uconv/ucvlatin/nsISO885911ToUnicode.cpp> for the mapping from ISO-8859-11 to windows-874.

8-bit encodings (excluding UTFs, CJK encodings and T.61) tested using <http://coq.no/X/charset5/tests8bit.html> (fail/pass should not be taken too seriously yet, especially not for more obscure encodings), Firefox version 3.5.1, OS X.

Bugs: Filed <https://bugzilla.mozilla.org/show_bug.cgi?id=512060> for the labels marked 'not recognised' in the table above since the lack of support for these is clearly accidental rather than deliberate (though it seems to suggest that these particular labels are not particularly widely used). In most other cases, research and deliberation will be needed to distinguish between bugs and features.
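
The table above notes that the unqualified UTF-16 and UTF-32 labels are resolved to BE or LE "by BOM or byte sniffing". The BOM part can be sketched in Python as below. This is only an illustration, not Firefox's code: the byte-sniffing heuristics are not covered, and the `fallback` parameter and function names are assumptions.

```python
# BOM signatures, longest first, so that the UTF-32LE BOM (FF FE 00 00)
# is not mistaken for the UTF-16LE BOM (FF FE).
BOMS = (
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)


def sniff_bom(data: bytes):
    """Return (codec, bom_length) if data starts with a known BOM, else None."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            return codec, len(bom)
    return None


def decode_with_bom(data: bytes, fallback: str = "utf-16-le") -> str:
    """Decode using the BOM if present; otherwise use the (assumed) fallback."""
    hit = sniff_bom(data)
    if hit is None:
        return data.decode(fallback)
    codec, skip = hit
    return data[skip:].decode(codec)


print(decode_with_bom(b"\xfe\xff" + "hi".encode("utf-16-be")))
```

Checking the four-byte signatures first has a known cost: UTF-16LE text that begins with a BOM followed by U+0000 is indistinguishable from a UTF-32LE BOM.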

Internet Explorer

Matching

Strips leading and trailing whitespace and then does ASCII(?) case-insensitive matching. (Matches HTML5.)
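
A minimal Python sketch of this matching, assuming "whitespace" means the HTML5 space characters (space, tab, LF, FF, CR) and that case folding is ASCII-only, which is the part the "(?)" above leaves open. The lookup table is a small, hand-picked excerpt of the IE labels listed below, and `match_label` is a hypothetical name.

```python
HTML_WHITESPACE = " \t\n\f\r"
_ASCII_LOWER = {c: c + 32 for c in range(ord("A"), ord("Z") + 1)}

# Excerpt only; the full label list is in the table below.
IE_LABELS = {
    "x-ansi": "windows-1252",
    "windows-1252": "windows-1252",
    "koi": "koi8-r",
    "koi8-r": "koi8-r",
    "unicode-1-1-utf-8": "utf-8",
    "utf-8": "utf-8",
}


def match_label(label: str):
    """Strip surrounding whitespace, fold ASCII case, then look the label up."""
    key = label.strip(HTML_WHITESPACE).translate(_ASCII_LOWER)
    return IE_LABELS.get(key)


print(match_label("  X-ANSI\n"))  # surrounding whitespace and case are ignored
print(match_label("utf 8"))       # internal whitespace is not removed
```

Unlike UTS22 matching, punctuation is significant here, so a label such as `utf8` only matches if it is listed explicitly.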

Encodings

Encoding Labels Decoded As Notes
7-bit us-ascii ansi_x3.4-1968, ansi_x3.4-1986, ascii, cp367, csascii, ibm367, iso-ir-6, iso646-us, iso_646.irv:1991, us, us-ascii (code page: 20127)
x-ia5 irv, x-ia5 (code page: 20105) Most significant bit ignored.
x-ia5-german din_66003, german, x-ia5-german (code page: 20106) Most significant bit ignored.
x-ia5-norwegian norwegian, ns_4551-1, x-ia5-norwegian (code page: 20108) Most significant bit ignored. Actually decoded as NS 4551-2, not NS 4551-1.
x-ia5-swedish sen_850200_b, swedish, x-ia5-swedish (code page: 20107) Most significant bit ignored. Actually decoded as SEN 85 02 00 Annex C, not SEN 85 02 00 Annex B.
DOS cp866 cp866, ibm866 (code page: 866)
dos-720 dos-720 (code page: 720)
dos-862 cp862, dos-862, ibm862 (code page: 862)
ibm00858 ccsid00858, cp00858, cp858, ibm00858, pc-multilingual-850+euro (code page: 858)
ibm437 437, cp437, cspc8codepage437, ibm437 (code page: 437)
ibm737 ibm737 (code page: 737)
ibm775 ibm775 (code page: 775)
ibm850 cp850, ibm850 (code page: 850)
ibm852 cp852, ibm852 (code page: 852)
ibm855 cp855, ibm855 (code page: 855)
ibm857 cp857, ibm857 (code page: 857)
ibm860 cp860, ibm860 (code page: 860)
ibm861 cp861, ibm861 (code page: 861)
ibm863 cp863, ibm863 (code page: 863)
ibm864 cp864, ibm864 (code page: 864)
ibm865 cp865, ibm865 (code page: 865)
ibm869 cp869, ibm869 (code page: 869)
ISO iso-8859-1 cp819, csisolatin1, ibm819, iso-8859-1, iso-ir-100, iso8859-1, iso_8859-1, iso_8859-1:1987, l1, latin1 windows-1252 (code page: 28591)
iso-8859-2 csisolatin2, iso-8859-2, iso-ir-101, iso8859-2, iso_8859-2, iso_8859-2:1987, l2, latin2 (code page: 28592)
iso-8859-3 csisolatin3, iso-8859-3, iso-ir-109, iso_8859-3, iso_8859-3:1988, l3, latin3 (code page: 28593)
iso-8859-4 csisolatin4, iso-8859-4, iso-ir-110, iso_8859-4, iso_8859-4:1988, l4, latin4 (code page: 28594)
iso-8859-5 csisolatincyrillic, cyrillic, iso-8859-5, iso-ir-144, iso_8859-5, iso_8859-5:1988 (code page: 28595)
iso-8859-6 arabic, csisolatinarabic, ecma-114, iso-8859-6, iso-ir-127, iso_8859-6, iso_8859-6:1987 (code page: 28596)
iso-8859-7 csisolatingreek, ecma-118, elot_928, greek, greek8, iso-8859-7, iso-ir-126, iso_8859-7, iso_8859-7:1987 (code page: 28597)
iso-8859-8 csisolatinhebrew, hebrew, iso-8859-8, iso-8859-8 visual, iso-ir-138, iso_8859-8, iso_8859-8:1988, logical, visual windows-1254 (code page: 28598)
iso-8859-8-i iso-8859-8-i (code page: 38598)
iso-8859-9 csisolatin5, iso-8859-9, iso-ir-148, iso_8859-9, iso_8859-9:1989, l5, latin5 (code page: 28599)
iso-8859-13 iso-8859-13 (code page: 28603)
iso-8859-15 csisolatin9, iso-8859-15, iso_8859-15, l9, latin9 (code page: 28605)
Win windows-874 dos-874, iso-8859-11, tis-620, windows-874 (code page: 874)
windows-1250 windows-1250, x-cp1250 (code page: 1250)
windows-1251 windows-1251, x-cp1251 (code page: 1251)
windows-1252 windows-1252, x-ansi (code page: 1252)
windows-1253 windows-1253 (code page: 1253)
windows-1254 windows-1254 (code page: 1254)
windows-1255 windows-1255 (code page: 1255)
windows-1256 cp1256, windows-1256 (code page: 1256)
windows-1257 windows-1257 (code page: 1257)
windows-1258 windows-1258 (code page: 1258)
Mac macintosh macintosh (code page: 10000)
x-mac-arabic x-mac-arabic (code page: 10004)
x-mac-ce x-mac-ce (code page: 10029)
x-mac-croatian x-mac-croatian (code page: 10082)
x-mac-cyrillic x-mac-cyrillic (code page: 10007)
x-mac-greek x-mac-greek (code page: 10006)
x-mac-hebrew x-mac-hebrew (code page: 10005)
x-mac-icelandic x-mac-icelandic (code page: 10079)
x-mac-romanian x-mac-romanian (code page: 10010)
x-mac-thai x-mac-thai (code page: 10021)
x-mac-turkish x-mac-turkish (code page: 10081)
x-mac-ukrainian x-mac-ukrainian (code page: 10017)
Misc. asmo-708 asmo-708 (code page: 708)
koi8-r cskoi8r, koi, koi8, koi8-r, koi8r (code page: 20866)
koi8-u koi8-ru, koi8-u (code page: 21866)
x-cp20261 x-cp20261 (code page: 20261) T.61 / ISO/IEC_6937
x-cp20269 x-cp20269 (code page: 20269) T.61 / ISO/IEC_6937 (non-combining accents?)
x-europa x-europa (code page: 29001) What is this?
x-iscii-as x-iscii-as (code page: 57006)
x-iscii-be x-iscii-be (code page: 57003)
x-iscii-de x-iscii-de (code page: 57002)
x-iscii-gu x-iscii-gu (code page: 57010)
x-iscii-ka x-iscii-ka (code page: 57008)
x-iscii-ma x-iscii-ma (code page: 57009)
x-iscii-or x-iscii-or (code page: 57007)
x-iscii-pa x-iscii-pa (code page: 57011)
x-iscii-ta x-iscii-ta (code page: 57004)
x-iscii-te x-iscii-te (code page: 57005)


big5 big5, big5-hkscs, cn-big5, csbig5, x-x-big5 (code page: 950)
csiso2022jp csiso2022jp (code page: 50221)
euc-cn euc-cn, x-euc-cn (code page: 51936)
euc-jp cseucpkdfmtjapanese, euc-jp, extended_unix_code_packed_format_for_japanese, iso-2022-jpeuc, x-euc, x-euc-jp (code page: 51932)
euc-kr cseuckr, euc-kr, iso-2022-kr-8, iso-2022-kr-8bit (code page: 51949)
gb18030 gb18030 (code page: 54936)
gb2312 chinese, cn-gb, csgb2312, csgb231280, csiso58gb231280, gb2312, gb2312-80, gb231280, gbk, gb_2312-80, iso-ir-58 (code page: 936) GBK superset. The "gb2312-80" label does not seem to work in IE8.
hz-gb-2312 hz-gb-2312 (code page: 52936)
iso-2022-jp iso-2022-jp (code page: 50220)
iso-2022-kr csiso2022kr, iso-2022-kr, iso-2022-kr-7, iso-2022-kr-7bit (code page: 50225)
johab johab (code page: 1361)
ks_c_5601-1987 csksc56011987, iso-ir-149, korean, ks-c-5601, ks-c5601, ksc5601, ksc_5601, ks_c_5601, ks_c_5601-1987, ks_c_5601-1989, ks_c_5601_1987 (code page: 949) EUC-KR superset
shift_jis csshiftjis, cswindows31j, ms_kanji, shift-jis, shift_jis, sjis, windows-31j, x-ms-cp932, x-sjis (code page: 932)
x-chinese-cns x-chinese-cns (code page: 20000) EUC-TW
x-chinese-eten x-chinese-eten (code page: 20002) Other encoding of the Taiwanese CNS character set.
x-cp20001 x-cp20001 (code page: 20001) TW (not tested)
x-cp20003 x-cp20003 (code page: 20003) TW (not tested)
x-cp20004 x-cp20004 (code page: 20004) TW (not tested)
x-cp20005 x-cp20005 (code page: 20005) TW (not tested)
x-mac-chinesesimp x-mac-chinesesimp (code page: 10008) EUC-CN superset. Handled as plain EUC-CN?
x-mac-chinesetrad x-mac-chinesetrad (code page: 10002) Big5 superset.
x-mac-japanese x-mac-japanese (code page: 10001) Shift_JIS superset. Handled as Windows Shift_JIS?
x-mac-korean x-mac-korean (code page: 10003) EUC-KR superset. Handled as plain EUC-KR?
x-cp20936 x-cp20936 (code page: 20936) EUC-CN (not the GBK superset)
x-cp20949 x-cp20949 (code page: 20949) EUC-KR (not the Windows superset)
x-cp50227 x-cp50227 (code page: 50227) GBK, including at least some Windows extensions.
(???) x-cp50229 (code page: 50229) ISO-2022-CN subset? Includes GB 2312-80 (CN) and CNS 11643-1992 Plane 1 (TW).
unicodefffe unicodefffe, utf-16be (code page: 1201) UTF-16BE
utf-16 iso-10646-ucs-2, ucs-2, unicode, utf-16, utf-16le (code page: 1200) UTF-16LE. UCS-2 is not(!) taken to mean UTF-16.
utf-7 csunicode11utf7, unicode-1-1-utf-7, unicode-2-0-utf-7, utf-7, x-unicode-1-1-utf-7, x-unicode-2-0-utf-7 (code page: 65000)
utf-8 unicode-1-1-utf-8, unicode-2-0-utf-8, utf-8, x-unicode-1-1-utf-8, x-unicode-2-0-utf-8 (code page: 65001)
EBCDIC cp875 cp875 (code page: 875) EBCDIC Greece
cp1025 cp1025 (code page: 21025) EBCDIC Cyrillic Multilingual
ibm-thai csibmthai, ibm-thai (code page: 20838) EBCDIC Thailand
ibm00924 ccsid00924, cp00924, ebcdic-latin9--euro, ibm00924 (code page: 20924) EBCDIC Latin 9
ibm01047 ibm01047 (code page: 1047) EBCDIC Latin 1/Open Systems
ibm01140 ccsid01140, cp01140, ebcdic-us-37+euro, ibm01140 (code page: 1140) EBCDIC USA, Canada, etc. ECECP
ibm01141 ccsid01141, cp01141, ebcdic-de-273+euro, ibm01141 (code page: 1141) EBCDIC Austria, Germany ECECP
ibm01142 ccsid01142, cp01142, ebcdic-dk-277+euro, ebcdic-no-277+euro, ibm01142 (code page: 1142) EBCDIC Denmark, Norway ECECP
ibm01143 ccsid01143, cp01143, ebcdic-fi-278+euro, ebcdic-se-278+euro, ibm01143 (code page: 1143) EBCDIC Finland, Sweden ECECP
ibm01144 ccsid01144, cp01144, ebcdic-it-280+euro, ibm01144 (code page: 1144) EBCDIC Italy ECECP
ibm01145 ccsid01145, cp01145, ebcdic-es-284+euro, ibm01145 (code page: 1145) EBCDIC Spain, Latin America (Spanish)
ibm01146 ccsid01146, cp01146, ebcdic-gb-285+euro, ibm01146 (code page: 1146) EBCDIC UK ECECP
ibm01147 ccsid01147, cp01147, ebcdic-fr-297+euro, ibm01147 (code page: 1147) EBCDIC France ECECP
ibm01148 ccsid01148, cp01148, ebcdic-international-500+euro, ibm01148 (code page: 1148) EBCDIC International ECECP
ibm01149 ccsid01149, cp01149, ebcdic-is-871+euro, ibm01149 (code page: 1149) EBCDIC Iceland ECECP
ibm037 cp037, csibm037, ebcdic-cp-ca, ebcdic-cp-nl, ebcdic-cp-us, ebcdic-cp-wt, ibm037 (code page: 37) EBCDIC USA/Canada - CECP
ibm1026 cp1026, csibm1026, ibm1026 (code page: 1026) EBCDIC Latin #5 - Turkey
ibm273 cp273, csibm273, ibm273 (code page: 20273) EBCDIC Germany F.R./Austria - CECP
ibm277 csibm277, ebcdic-cp-dk, ebcdic-cp-no, ibm277 (code page: 20277) EBCDIC Denmark, Norway - CECP
ibm278 cp278, csibm278, ebcdic-cp-fi, ebcdic-cp-se, ibm278 (code page: 20278) EBCDIC Finland, Sweden - CECP
ibm280 cp280, csibm280, ebcdic-cp-it, ibm280 (code page: 20280) EBCDIC Italy - CECP
ibm284 cp284, csibm284, ebcdic-cp-es, ibm284 (code page: 20284) EBCDIC Spain/Latin America - CECP
ibm285 cp285, csibm285, ebcdic-cp-gb, ibm285 (code page: 20285) EBCDIC United Kingdom - CECP
ibm290 cp290, csibm290, ebcdic-jp-kana, ibm290 (code page: 20290) EBCDIC Japanese (Katakana) Extended. Katakana replace lowercase EBCDIC.
ibm297 cp297, csibm297, ebcdic-cp-fr, ibm297 (code page: 20297) EBCDIC France - CECP
ibm420 cp420, csibm420, ebcdic-cp-ar1, ibm420 (code page: 20420) EBCDIC Arabic Bilingual
ibm423 cp423, csibm423, ebcdic-cp-gr, ibm423 (code page: 20423) EBCDIC Greece - 183
ibm424 cp424, csibm424, ebcdic-cp-he, ibm424 (code page: 20424) EBCDIC Israel (Hebrew)
ibm500 cp500, csibm500, ebcdic-cp-be, ebcdic-cp-ch, ibm500 (code page: 500) EBCDIC International #5
ibm870 cp870, csibm870, ebcdic-cp-roece, ebcdic-cp-yu, ibm870 (code page: 870) EBCDIC Latin 2, Multilingual
ibm871 cp871, csibm871, ebcdic-cp-is, ibm871 (code page: 20871) EBCDIC Iceland
ibm880 cp880, csibm880, ebcdic-cyrillic, ibm880 (code page: 20880) EBCDIC Cyrillic, Multilingual
ibm905 cp905, csibm905, ebcdic-cp-tr, ibm905 (code page: 20905) EBCDIC Latin 3
x-ebcdic-koreanextended x-ebcdic-koreanextended (code page: 20833) EBCDIC Korean (some variant)
(???) x-cp21027 (code page: 21027) EBCDIC Japanese (some variant). Certain EBCDIC letters/digits decoded incorrectly.
¿···? (???) cp930 (code page: 50930) JAPAN MIX EBCDIC? Appears to be ASCII-compatible...
(???) cp933 (code page: 50933) KOREA MIX EBCDIC? Appears to be ASCII-compatible...
(???) cp935 (code page: 50935) S-CHINESE MIX EBCDIC? Appears to be ASCII-compatible...
(???) cp937 (code page: 50937) T-CHINESE MIX EBCDIC? Appears to be ASCII-compatible...
(???) cp939 (code page: 50939) JAPAN MIX EBCDIC? Appears to be ASCII-compatible...
(???) x-ebcdic-japaneseanduscanada (code page: 50931) EBCDIC? Appears to be ASCII-compatible...

All EBCDIC encodings contain the letters A–Z, a–z and the digits 0–9 in EBCDIC positions (unless there is a note in the table saying otherwise).
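
To make "EBCDIC positions" concrete, the following Python sketch prints where the letters and digits sit in a few EBCDIC code pages. It uses Python's own `cp037`, `cp500` and `cp1140` codecs as stand-ins, which is an assumption: they are not IE's converters, and the other code pages in the table are not checked here.

```python
import string

# EBCDIC codecs that ship with Python; used here only as stand-ins.
EBCDIC_CODECS = ("cp037", "cp500", "cp1140")
ALNUM = string.ascii_letters + string.digits


def ebcdic_positions(codec: str) -> dict:
    """Map each of A-Z, a-z, 0-9 to its single-byte value in the given codec."""
    return {ch: ch.encode(codec)[0] for ch in ALNUM}


reference = ebcdic_positions("cp037")
for codec in EBCDIC_CODECS:
    # Letters and digits occupy the same positions in all three code pages.
    assert ebcdic_positions(codec) == reference

for ch in "AZaz09":
    print(ch, hex(reference[ch]), "vs ASCII", hex(ord(ch)))
```

None of these positions coincide with ASCII, which is why an EBCDIC-encoded page cannot be sniffed with ASCII-compatible techniques such as scanning for a `meta` element.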

Data

Source for the encodings and labels data: http://lists.w3.org/Archives/Public/public-html-comments/2009Sep/att-0050/ie.encodings.txt

Labels and code pages .NET supports: http://blogs.msdn.com/shawnste/archive/2009/08/18/alternate-encoding-names-recognized-by-net-ie.aspx (there should be few, if any, differences with the above)

Different data for encodings and labels IE supports (might be more accurate): http://html5.org/temp/2009/ie-encodings.htm (original: http://web.archive.org/web/20080204211015/http://www.hitachi-to.co.jp/prod/prod_2/inter/emk/help/TextEncoder/CodePage.htm)

Safari

Matching

UTS22

Encodings

FIXME

Data

Based on discussion with Maciej (Apple) archived here: http://krijnhoetmer.nl/irc-logs/whatwg/20090909#l-110

Safari uses the system version of ICU on Mac (4.0 for Snow Leopard, 3.6 for Leopard and 3.2 for Tiger) and in addition supports TEC on Mac for encodings that are not in ICU. (Unclear how much of TEC is enabled.)

On Windows, Safari ships with ICU 4.0.


According to webkit/WebCore/platform/text/TextCodecICU.cpp, WebKit now uses ICU <http://site.icu-project.org/> with additional aliases (defined in that same file), additional encodings (webkit/WebCore/platform/text/mac/mac-encodings.txt), possibly implemented using TECM, at least on the Mac, a list of official IANA labels (webkit/WebCore/platform/text/mac/character-sets.txt), and probably a few more that I have not noticed.

ICU 4.2’s icu/source/data/mappings/convrtrs.txt or <http://demo.icu-project.org/icu-bin/convexp> lists encodings and labels not supported in Safari 4.0 on Leopard, and webkit/WebCore/platform/text/TextCodecICU.cpp mentions that Tiger included ICU 3.2.

See also: http://trac.webkit.org/browser/trunk/WebCore/platform/text

Chrome

Similar to Safari with some customizations in ICU alias tables. Chrome 3.0 has ICU 3.8 plus customizations for EUC-JP (to match IE/Firefox). For EUC-KR and GBK, we use different mapping tables than used by Safari (which just uses ICU's default tables for them). ISO-8859-16 is also added.

Chrome trunk uses ICU 4.2.

Thoughts

Anne

If it can be agreed upon that all non-UTF-8 and non-UTF-16 encodings are legacy encodings, I personally would not mind advocating that we drop support for US-ASCII and ISO-8859-1 completely in favor of Windows-1252 (and do the same for similar situations). That is, US-ASCII and ISO-8859-1 labels would simply map to Windows-1252. This should simplify code a little bit as well.
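
What the proposed mapping means in practice can be sketched in Python as follows. The label set is an illustrative subset, the helper `effective_codec` is hypothetical, and Python's `cp1252` and `latin-1` codecs stand in for browser decoders.

```python
# Labels that would all resolve to windows-1252 under this proposal
# (an illustrative subset, not a complete list).
MAPS_TO_WINDOWS_1252 = {"us-ascii", "ascii", "iso-8859-1", "latin1", "windows-1252"}


def effective_codec(label: str) -> str:
    """Return the Python codec name a label would be decoded with."""
    key = label.strip().lower()
    return "cp1252" if key in MAPS_TO_WINDOWS_1252 else key


# Bytes 0x93 and 0x94 are curly quotes in windows-1252 but C1 control
# characters in ISO-8859-1, and are invalid in US-ASCII.
data = b"\x93quoted\x94"
print(repr(data.decode(effective_codec("ISO-8859-1"))))
print(repr(data.decode("latin-1")))
```

Pure ASCII input decodes identically either way, so the mapping only changes the result for bytes in the 0x80 to 0x9F range, where pages labelled ISO-8859-1 typically contain Windows-1252 punctuation.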

I also think that we should ban UTF-7, UTF-32 and all EBCDIC encodings. This is already mostly done by HTML5.


I wonder if we can standardize (document to start with) the encoding detection algorithm. The list of encodings is fixed. The list of legacy pages is also fairly fixed. The detection algorithms in browsers should be fairly stable. Certainly looks possible.

E-mails

WHATWG got these e-mails that we should make sure to cover as part of this:

Spec notes

This is what the spec used to say about encodings:

  <p>In addition, user agents must support the aliases given in the
  following table for every character encoding they support, so that
  labels from the first column are treated as equivalent to the labels
  given in the corresponding cell from the second column on the same
  row.</p>
  
  <table>
   <caption>Additional character encoding aliases</caption>
   <thead>
    <tr> <th> Alias <th> Corresponding encoding <th> References
   <tbody>
    <tr> <td> x-sjis <td> windows-31J <td>
         <a href="#refsSHIFTJIS">[SHIFTJIS]</a>
         <a href="#refsWIN31J">[WIN31J]</a>
    <tr> <td> windows-932 <td> windows-31J <td>
         <a href="#refsWIN31J">[WIN31J]</a>
    <tr> <td> x-x-big5 <td> Big5 <td>
         <a href="#refsBIG5">[BIG5]</a>
   </tbody>
  </table>

ICU in Chrome and Safari

Giving a link to or actually including the info on what Safari and Chrome support would be nice. It seems like this would be at least a subset, but it sounds like more may have been added from the text above.

http://demo.icu-project.org/icu-bin/convexp