Web Encodings

Attempt at fixing the Web encoding problem.

Goals

Document existing practice by describing, for each browser:
- The list of supported encodings.
- The list of supported labels for those encodings.
- The matching algorithm for labels.
Converge the various algorithms in use by:
- Defining a list of encodings everyone has to support. Browsers must not support any encodings beyond that list.
- Defining a list of supported labels for those encodings. Browsers must not support any labels beyond that list.
- Defining the matching algorithm. (HTML5 has since been updated with a better one.)
Get the new rules implemented.

Documenting more exactly the encoding (Unicode stream + encoding -> byte stream) and decoding (byte stream + encoding -> Unicode stream) algorithms for each encoding and getting that implemented interoperably would also be great.
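
To make the two directions concrete, here is a minimal sketch that uses Python's built-in codecs as a stand-in. The function names are illustrative only, and Python's mapping tables are not guaranteed to match any browser's, so this shows the shape of the two operations rather than a reference behaviour.

```python
def encode(text: str, encoding: str) -> bytes:
    """Encoding: Unicode stream + encoding -> byte stream."""
    return text.encode(encoding)


def decode(data: bytes, encoding: str) -> str:
    """Decoding: byte stream + encoding -> Unicode stream."""
    return data.decode(encoding)


# U+20AC EURO SIGN occupies a single byte in windows-1252.
euro_bytes = encode("\u20ac", "windows-1252")
round_trip = decode(euro_bytes, "windows-1252")
```

Interoperability in the sense above would mean that every implementation produces the same bytes for `encode` and the same code points for `decode`, including for malformed or unmapped input, which this sketch does not cover.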

Current Implementations

Does this differ per platform? Opera might differ a bit on Mac.

Data

Integrate this awesome data somehow:

Opera

Matching

UTS22 matching, plus stripping of leading x characters. (The plan is to switch to stripping leading and trailing whitespace followed by ASCII case-insensitive matching.)
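
As a rough illustration, the loose matching rule from UTS #22 (section 1.4, charset alias matching) can be sketched as below. The `opera_key` helper is a guess at what "strips leading x characters" means, inferred from labels such as `macce` and `vps` in the table that follows; it is not taken from Opera's source.

```python
import re


def uts22_normalize(label: str) -> str:
    """Build a loose matching key following UTS #22 charset alias matching."""
    # 1. Delete everything except ASCII letters and digits.
    key = re.sub(r"[^A-Za-z0-9]", "", label)
    # 2. Lowercase (the string is pure ASCII at this point).
    key = key.lower()
    # 3. Delete each 0 that is not preceded by a digit (runs of such zeros go together).
    return re.sub(r"(?<![0-9])0+", "", key)


def opera_key(label: str) -> str:
    """Assumed Opera variant: UTS #22 key with leading 'x' characters removed."""
    return uts22_normalize(label).lstrip("x")
```

Under these assumptions `ISO_8859-1:1987` yields the key `iso885911987` and `x-mac-ce` yields `macce`, which is how both appear in the Labels column below.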

Encodings

Group	Encoding	Labels	Decoded As	Notes
7-bit	us-ascii	ansix341968, ansix341986, ascii, cp367, csascii, csinvariant, csiso646basic1983, ibm367, invariant, iso646basic1983, iso646irv1991, iso646us, isoir6, ref, us, usascii	windows-1252
DOS	ibm866	866, cp866, csibm866, ibm866
ISO	iso-8859-1	cp819, csisolatin1, ibm819, iso88591, iso885911987, isoir100, l1, latin1	windows-1252
	iso-8859-2	csisolatin2, iso88592, iso885921987, isoir101, l2, latin2
	iso-8859-3	csisolatin3, iso88593, iso885931988, isoir109, l3, latin3
	iso-8859-4	csisolatin4, iso88594, iso885941988, isoir110, l4, latin4
	iso-8859-5	csisolatincyrillic, cyrillic, iso88595, iso885951988, isoir144
	iso-8859-6	arabic, asmo708, csiso88596e, csisolatinarabic, ecma114, iso88596, iso885961987, iso88596e, isoir127
	iso-8859-6-i	csiso88596i, iso88596i
	iso-8859-7	csisolatingreek, ecma118, elot928, greek, greek8, iso88597, iso885971987, isoir126
	iso-8859-8	csiso88598e, csisolatinhebrew, hebrew, iso88598, iso885981988, iso88598e, isoir138, visual
	iso-8859-8-i	csiso88598i, iso88598i
	iso-8859-9	csisolatin5, iso88599, iso885991989, isoir148, l5, latin5
	iso-8859-10	csisolatin6, iso885910, iso8859101992, isoir157, l6, latin6
	iso-8859-13	iso885913
	iso-8859-14	iso885914, iso8859141998, isoceltic, isoir199, l8, latin8
	iso-8859-15	iso885915, latin9
	iso-8859-16	iso885916, iso8859162001, isoir226, l10, latin10
Win	iso-8859-11	iso885911, tis620, tis6202533, windows874		Actually implemented as windows-874
	windows-1250	cp1250, microsoftcp1250, windows1250
	windows-1251	cp1251, microsoftcp1251, windows1251
	windows-1252	cp1252, microsoftcp1252, windows1252
	windows-1253	cp1253, microsoftcp1253, windows1253
	windows-1254	cp1254, microsoftcp1254, windows1254
	windows-1255	cp1255, microsoftcp1255, windows1255
	windows-1256	cp1256, microsoftcp1256, windows1256
	windows-1257	cp1257, microsoftcp1257, windows1257
	windows-1258	cp1258, microsoftcp1258, windows1258
	windows-sami-2	samiws2, windowssami2, ws2
Mac	macintosh	csmacintosh, mac, macintosh, macroman		Likely disabled.
	x-mac-ce	macce		Likely disabled.
	x-mac-cyrillic	maccyrillic		Likely disabled.
	x-mac-greek	macgreek		Likely disabled.
	x-mac-turkish	macturkish		Likely disabled.
Misc.	koi8-r	cskoi8r, koi8r
	koi8-u	koi8u
	tcvn	tcvn, viettcvn
	viscii	csviscii, viscii
	x-vps	vps
CJK	big5	big5, cnbig5, csbig5
	big5-hkscs	big5hkscs
	euc-jp	cseucpkdfmtjapanese, eucjp, extendedunixcodepackedformatforjapanese
	euc-kr	cseuckr, csksc56011987, euckr, isoir149, korean, ksc5601, ksc56011987, ksc56011989, windows949
	euc-tw	euctw
	gb18030	gb18030
	gbk	chinese, cngb, cp936, csgb2312, csiso58gb231280, euccn, gb2312, gb231280, gbk, isoir58, ms936, windows936
	hz-gb-2312	hzgb2312
	iso-2022-cn	iso2022cn
	iso-2022-jp	csiso2022jp, iso2022jp
	iso-2022-jp-1	iso2022jp1
	iso-2022-kr	csiso2022kr, iso2022kr
	shift_jis	cp932, csshiftjis, cswindows31j, ms932, mskanji, shiftjis, sjis, windows31j
Unicode	utf-16	csunicode, csunicode11, csunicodeascii, iso10646j1, iso10646ucs2, iso10646ucsbasic, utf16
	utf-16be	utf16be
	utf-16le	utf16le
	utf-8	utf8

Firefox

Matching

ASCII lowercasing.
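
ASCII lowercasing differs from general Unicode lowercasing, which matters for label matching. The sketch below (the function name is illustrative, not Firefox's) maps only A to Z and leaves every other code point alone.

```python
def ascii_lower(label: str) -> str:
    """Lowercase only the ASCII letters A-Z; leave all other code points untouched."""
    return "".join(
        chr(ord(ch) + 0x20) if "A" <= ch <= "Z" else ch
        for ch in label
    )


plain = ascii_lower("ISO-8859-1")
# U+212A KELVIN SIGN followed by "OI8-R": Unicode lowercasing turns the Kelvin
# sign into an ordinary "k", ASCII lowercasing does not.
kelvin_label = "\u212aOI8-R"
unicode_lowered = kelvin_label.lower()
ascii_lowered = ascii_lower(kelvin_label)
```

With full Unicode lowercasing the Kelvin-sign label would collide with `koi8-r`; with ASCII lowercasing it stays a distinct, unrecognised label.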

Encodings

Group	Encoding	Labels	Decoded As	Notes
7-bit	us-ascii	646, ansi_x3.4-1968, ascii, us-ascii	windows-1252
DOS	IBM850	850, cp850, csIBM850, ibm850		csIBM850 not recognised.
	IBM852	852, cp852, csIBM852, ibm852		csIBM852 not recognised.
	IBM855	855, cp855, csIBM855, ibm855		csIBM855 not recognised.
	IBM857	857, cp857, csIBM857, ibm857		csIBM857 not recognised.
	IBM862	862, cp862, csIBM862, ibm862		csIBM862 not recognised.
	IBM864	864, cp864, csIBM864, ibm-864, ibm864		csIBM864 not recognised.
	IBM864i	864i, cp864i, csibm864i, ibm-864i, ibm864i
	IBM866	866, cp-866, cp866, csIBM866, ibm866		csIBM866 not recognised.
ISO	ISO-8859-1	cp819, csisolatin1, ibm819, iso-8859-1, iso-ir-100, iso8859-1, iso88591, iso_8859-1, l1, latin1	windows-1252
	ISO-8859-2	csisolatin2, iso-8859-2, iso-ir-101, iso8859-2, iso88592, iso_8859-2, l2, latin2
	ISO-8859-3	csisolatin3, iso-8859-3, iso-ir-109, iso8859-3, iso88593, iso_8859-3, l3, latin3
	ISO-8859-4	csisolatin4, iso-8859-4, iso-ir-110, iso8859-4, iso88594, iso_8859-4, l4, latin4
	ISO-8859-5	csisolatincyrillic, cyrillic, iso-8859-5, iso-ir-144, iso8859-5, iso88595, iso_8859-5
	ISO-8859-6	arabic, asmo-708, csisolatinarabic, ecma-114, iso-8859-6, iso-ir-127, iso8859-6, iso88596, iso_8859-6
	ISO-8859-6-E	csiso88596e, iso-8859-6-e
	ISO-8859-6-I	csiso88596i, iso-8859-6-i
	ISO-8859-7	csisolatingreek, ecma-118, elot_928, greek, greek8, iso-8859-7, iso-ir-126, iso8859-7, iso88597, iso_8859-7, sun_eu_greek
	ISO-8859-8	csisolatinhebrew, hebrew, iso-8859-8, iso-ir-138, iso8859-8, iso88598, iso_8859-8, visual
	ISO-8859-8-E	csiso88598e, iso-8859-8-e
	ISO-8859-8-I	csiso88598i, iso-8859-8-i, iso-8859-8i
	ISO-8859-9	csisolatin5, iso-8859-9, iso-ir-148, iso8859-9, iso88599, iso_8859-9, l5, latin5
	ISO-8859-10	csisolatin6, iso-8859-10, iso-ir-157, iso8859-10, iso885910, l6, latin6
	ISO-8859-11	iso-8859-11, iso8859-11, iso885911	windows-874
	ISO-8859-13	iso-8859-13, iso8859-13, iso885913
	ISO-8859-14	iso-8859-14, iso8859-14, iso885914
	ISO-8859-15	iso-8859-15, iso8859-15, iso885915, iso_8859-15
	ISO-8859-16	iso-8859-16
Win	windows-874	ibm874, windows-874
	windows-1250	cp1250, windows-1250, x-cp1250
	windows-1251	ansi-1251, cp1251, windows-1251, x-cp1251
	windows-1252	cp1252, windows-1252, x-cp1252
	windows-1253	cp1253, windows-1253, x-cp1253
	windows-1254	cp1254, windows-1254, x-cp1254
	windows-1255	cp1255, windows-1255, x-cp1255
	windows-1256	cp1256, windows-1256, x-cp1256
	windows-1257	cp1257, windows-1257, x-cp1257
	windows-1258	cp1258, windows-1258, x-cp1258
Mac	x-mac-arabic	x-mac-arabic
	x-mac-ce	x-mac-ce
	x-mac-croatian	x-mac-croatian
	x-mac-cyrillic	x-mac-cyrillic
	x-mac-devanagari	x-mac-devanagari
	x-mac-farsi	x-mac-farsi
	x-mac-greek	x-mac-greek
	x-mac-gujarati	x-mac-gujarati
	x-mac-gurmukhi	x-mac-gurmukhi
	x-mac-hebrew	x-mac-hebrew
	x-mac-icelandic	x-mac-icelandic
	x-mac-roman	csMacintosh, mac, macintosh, x-mac-roman		csMacintosh not recognised.
	x-mac-romanian	x-mac-romanian
	x-mac-turkish	x-mac-turkish
	x-mac-ukrainian	x-mac-ukrainian
Misc.	armscii-8	armscii-8
	GEOSTD8	geostd8		Does not seem to work.
	ISO-IR-111	csiso111ecmacyrillic, ecma-cyrillic, iso-ir-111
	KOI8-R	koi8-r
	KOI8-U	koi8-u
	T.61-8bit	csiso103t618bit, iso-ir-103, t.61, t.61-8bit
	TIS-620	tis-620, tis620	windows-874
	VISCII	csviscii, viscii
	x-user-defined	x-user-defined
	x-viet-tcvn5712	x-viet-tcvn5712
	x-viet-vps	x-viet-vps
CJK	Big5	big5, csbig5, x-x-big5, zh_tw-big5
	Big5-HKSCS	big5-hkscs
	EUC-JP	cseucjpkdfmtjapanese, euc-jp, x-euc-jp
	EUC-KR	5601, csksc56011987, csueckr, euc-kr, iso-ir-149, korean, ks_c_5601-1989, ksc5601, ksc_5601	x-windows-949	This converter is asymmetric. In the ToUnicode direction, it is generous and acts as Windows-949. It also supports 8-byte sequences for the 8,822 Hangul syllables not encoded as precomposed forms in KS X 1001. In the FromUnicode direction, it is strict and generates 8-byte sequences for those 8,822 Hangul syllables instead of the 2-byte sequences used in windows-949.
	gb18030	gb18030
	GB2312	chinese, csgb2312, csiso58gb231280, gb2312, gb_2312, gb_2312-80, iso-ir-58, zh_cn.euc	x-gbk
	HZ-GB-2312	hz-gb-2312
	ISO-2022-CN	iso-2022-cn, iso-2022-cn-ext
	ISO-2022-JP	csiso2022jp, csiso2022jp2, iso-2022-jp, iso-2022-jp-2
	ISO-2022-KR	csiso2022kr, iso-2022-kr
	Shift_JIS	csshiftjis, ms_kanji, shift-jis, shift_jis, windows-31j, x-sjis
	windows-936	windows-936
	x-euc-tw	cns11643, x-euc-tw, zh_tw-euc
	x-gbk	gbk, x-gbk
	x-johab	x-johab
	x-windows-949	ks_c_5601-1987, x-windows-949
Unicode	UTF-16	utf-16		Recognized as BE or LE by BOM or byte sniffing
	UTF-16BE	csunicode, csunicode11, csunicodeascii, csunicodelatin1, iso-10646, iso-10646-j-1, iso-10646-ucs-2, iso-10646-ucs-basic, iso-10646-unicode-latin1, utf-16be, x-iso-10646-ucs-2-be
	UTF-16LE	utf-16le, x-iso-10646-ucs-2-le
	UTF-32	utf-32		Recognized as BE or LE by BOM or byte sniffing
	UTF-32BE	iso-10646-ucs-4, utf-32be, x-iso-10646-ucs-4-be
	UTF-32LE	utf-32le, x-iso-10646-ucs-4-le
	UTF-7	csunicode11utf7, unicode-1-1-utf-7, unicode-2-0-utf-7, utf-7, x-unicode-2-0-utf-7
	UTF-8	unicode-1-1-utf-8, utf-8, utf8
	x-imap4-modified-utf7	x-imap4-modified-utf7

Table generated from <http://mxr.mozilla.org/mozilla1.9.1/source/intl/uconv/src/charsetalias.properties> (corresponds to Firefox 3.5.2).
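
The notes on the generic UTF-16 and UTF-32 rows say the byte order is picked by BOM or byte sniffing. A minimal sketch of the BOM half is shown here; the function name and the decision to return `None` when no BOM is present are assumptions, and Firefox's byte-sniffing fallback is not modelled.

```python
import codecs
from typing import Optional

# BOMs per generic label, paired with the concrete encoding each one selects.
_BOMS = {
    "utf-16": [(codecs.BOM_UTF16_BE, "utf-16be"), (codecs.BOM_UTF16_LE, "utf-16le")],
    "utf-32": [(codecs.BOM_UTF32_BE, "utf-32be"), (codecs.BOM_UTF32_LE, "utf-32le")],
}


def resolve_endianness(label: str, data: bytes) -> Optional[str]:
    """Return the concrete encoding implied by the BOM, or None if there is no BOM."""
    for bom, concrete in _BOMS[label]:
        if data.startswith(bom):
            return concrete
    return None  # a real implementation would fall back to byte sniffing here
```

For example, a `utf-16` stream beginning with bytes FF FE resolves to `utf-16le`, while one without a BOM is left undecided by this sketch.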

Aliases (used for parsing, apparently not for serialisation) are scattered around in a large number of files; cf. <http://mxr.mozilla.org/firefox/source/intl/uconv/ucvlatin/nsISO885911ToUnicode.cpp> for the mapping from ISO-8859-11 to windows-874.

8-bit encodings (excluding UTFs, CJK encodings and T.61) tested using <http://coq.no/X/charset5/tests8bit.html> (fail/pass should not be taken too seriously yet, especially not for more obscure encodings), Firefox version 3.5.1, OS X.

Bugs: Filed <https://bugzilla.mozilla.org/show_bug.cgi?id=512060> for the labels marked 'not recognised' in the table above since the lack of support for these is clearly accidental rather than deliberate (though it seems to suggest that these particular labels are not particularly widely used). In most other cases, research and deliberation will be needed to distinguish between bugs and features.

Internet Explorer

Matching

Strips leading and trailing whitespace and then does ASCII(?) case-insensitive matching. (Matches HTML5.)
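
A sketch of this style of matching follows. The whitespace set is an assumption (the HTML5 space characters), since the exact set Internet Explorer strips is not documented here, and `LABELS` is only a three-entry excerpt of the table below.

```python
import string

# Assumption: the HTML5 "space characters" (space, tab, LF, FF, CR).
HTML_SPACE = " \t\n\f\r"

# Translation table that lowercases A-Z only.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Small excerpt of the Internet Explorer table; not the complete label list.
LABELS = {
    "latin1": "iso-8859-1",
    "x-sjis": "shift_jis",
    "unicode-1-1-utf-8": "utf-8",
}


def lookup(label: str):
    """Strip surrounding whitespace, ASCII-lowercase, then do an exact table lookup."""
    key = label.strip(HTML_SPACE).translate(_ASCII_LOWER)
    return LABELS.get(key)  # None means the label is not recognised
```

Unlike UTS22 matching, interior punctuation and whitespace are significant here, so `latin 1` and `latin-1` would not match `latin1`.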

Encodings

Group	Encoding	Labels	Decoded As	Notes
7-bit	us-ascii	ansi_x3.4-1968, ansi_x3.4-1986, ascii, cp367, csascii, ibm367, iso-ir-6, iso646-us, iso_646.irv:1991, us, us-ascii		(code page: 20127)
	x-ia5	irv, x-ia5		(code page: 20105) Most significant bit ignored.
	x-ia5-german	din_66003, german, x-ia5-german		(code page: 20106) Most significant bit ignored.
	x-ia5-norwegian	norwegian, ns_4551-1, x-ia5-norwegian		(code page: 20108) Most significant bit ignored. Actually decoded as NS 4551-2, not NS 4551-1.
	x-ia5-swedish	sen_850200_b, swedish, x-ia5-swedish		(code page: 20107) Most significant bit ignored. Actually decoded as SEN 85 02 00 Annex C, not SEN 85 02 00 Annex B.
DOS	cp866	cp866, ibm866		(code page: 866)
	dos-720	dos-720		(code page: 720)
	dos-862	cp862, dos-862, ibm862		(code page: 862)
	ibm00858	ccsid00858, cp00858, cp858, ibm00858, pc-multilingual-850+euro		(code page: 858)
	ibm437	437, cp437, cspc8codepage437, ibm437		(code page: 437)
	ibm737	ibm737		(code page: 737)
	ibm775	ibm775		(code page: 775)
	ibm850	cp850, ibm850		(code page: 850)
	ibm852	cp852, ibm852		(code page: 852)
	ibm855	cp855, ibm855		(code page: 855)
	ibm857	cp857, ibm857		(code page: 857)
	ibm860	cp860, ibm860		(code page: 860)
	ibm861	cp861, ibm861		(code page: 861)
	ibm863	cp863, ibm863		(code page: 863)
	ibm864	cp864, ibm864		(code page: 864)
	ibm865	cp865, ibm865		(code page: 865)
	ibm869	cp869, ibm869		(code page: 869)
ISO	iso-8859-1	cp819, csisolatin1, ibm819, iso-8859-1, iso-ir-100, iso8859-1, iso_8859-1, iso_8859-1:1987, l1, latin1	windows-1252	(code page: 28591)
	iso-8859-2	csisolatin2, iso-8859-2, iso-ir-101, iso8859-2, iso_8859-2, iso_8859-2:1987, l2, latin2		(code page: 28592)
	iso-8859-3	csisolatin3, iso-8859-3, iso-ir-109, iso_8859-3, iso_8859-3:1988, l3, latin3		(code page: 28593)
	iso-8859-4	csisolatin4, iso-8859-4, iso-ir-110, iso_8859-4, iso_8859-4:1988, l4, latin4		(code page: 28594)
	iso-8859-5	csisolatincyrillic, cyrillic, iso-8859-5, iso-ir-144, iso_8859-5, iso_8859-5:1988		(code page: 28595)
	iso-8859-6	arabic, csisolatinarabic, ecma-114, iso-8859-6, iso-ir-127, iso_8859-6, iso_8859-6:1987		(code page: 28596)
	iso-8859-7	csisolatingreek, ecma-118, elot_928, greek, greek8, iso-8859-7, iso-ir-126, iso_8859-7, iso_8859-7:1987		(code page: 28597)
	iso-8859-8	csisolatinhebrew, hebrew, iso-8859-8, iso-8859-8 visual, iso-ir-138, iso_8859-8, iso_8859-8:1988, logical, visual	windows-1254	(code page: 28598)
	iso-8859-8-i	iso-8859-8-i		(code page: 38598)
	iso-8859-9	csisolatin5, iso-8859-9, iso-ir-148, iso_8859-9, iso_8859-9:1989, l5, latin5		(code page: 28599)
	iso-8859-13	iso-8859-13		(code page: 28603)
	iso-8859-15	csisolatin9, iso-8859-15, iso_8859-15, l9, latin9		(code page: 28605)
Win	windows-874	dos-874, iso-8859-11, tis-620, windows-874		(code page: 874)
	windows-1250	windows-1250, x-cp1250		(code page: 1250)
	windows-1251	windows-1251, x-cp1251		(code page: 1251)
	windows-1252	windows-1252, x-ansi		(code page: 1252)
	windows-1253	windows-1253		(code page: 1253)
	windows-1254	windows-1254		(code page: 1254)
	windows-1255	windows-1255		(code page: 1255)
	windows-1256	cp1256, windows-1256		(code page: 1256)
	windows-1257	windows-1257		(code page: 1257)
	windows-1258	windows-1258		(code page: 1258)
Mac	macintosh	macintosh		(code page: 10000)
	x-mac-arabic	x-mac-arabic		(code page: 10004)
	x-mac-ce	x-mac-ce		(code page: 10029)
	x-mac-croatian	x-mac-croatian		(code page: 10082)
	x-mac-cyrillic	x-mac-cyrillic		(code page: 10007)
	x-mac-greek	x-mac-greek		(code page: 10006)
	x-mac-hebrew	x-mac-hebrew		(code page: 10005)
	x-mac-icelandic	x-mac-icelandic		(code page: 10079)
	x-mac-romanian	x-mac-romanian		(code page: 10010)
	x-mac-thai	x-mac-thai		(code page: 10021)
	x-mac-turkish	x-mac-turkish		(code page: 10081)
	x-mac-ukrainian	x-mac-ukrainian		(code page: 10017)
Misc.	asmo-708	asmo-708		(code page: 708)
	koi8-r	cskoi8r, koi, koi8, koi8-r, koi8r		(code page: 20866)
	koi8-u	koi8-ru, koi8-u		(code page: 21866)
	x-cp20261	x-cp20261		(code page: 20261) T.61 / ISO/IEC_6937
	x-cp20269	x-cp20269		(code page: 20269) T.61 / ISO/IEC_6937 (non-combining accents?)
	x-europa	x-europa		(code page: 29001) What is this?
	x-iscii-as	x-iscii-as		(code page: 57006)
	x-iscii-be	x-iscii-be		(code page: 57003)
	x-iscii-de	x-iscii-de		(code page: 57002)
	x-iscii-gu	x-iscii-gu		(code page: 57010)
	x-iscii-ka	x-iscii-ka		(code page: 57008)
	x-iscii-ma	x-iscii-ma		(code page: 57009)
	x-iscii-or	x-iscii-or		(code page: 57007)
	x-iscii-pa	x-iscii-pa		(code page: 57011)
	x-iscii-ta	x-iscii-ta		(code page: 57004)
	x-iscii-te	x-iscii-te		(code page: 57005)
CJK	big5	big5, big5-hkscs, cn-big5, csbig5, x-x-big5		(code page: 950)
	csiso2022jp	csiso2022jp		(code page: 50221)
	euc-cn	euc-cn, x-euc-cn		(code page: 51936)
	euc-jp	cseucpkdfmtjapanese, euc-jp, extended_unix_code_packed_format_for_japanese, iso-2022-jpeuc, x-euc, x-euc-jp		(code page: 51932)
	euc-kr	cseuckr, euc-kr, iso-2022-kr-8, iso-2022-kr-8bit		(code page: 51949)
	gb18030	gb18030		(code page: 54936)
	gb2312	chinese, cn-gb, csgb2312, csgb231280, csiso58gb231280, gb2312, gb2312-80, gb231280, gbk, gb_2312-80, iso-ir-58		(code page: 936) GBK superset. The "gb2312-80" label does not seem to work in IE8.
	hz-gb-2312	hz-gb-2312		(code page: 52936)
	iso-2022-jp	iso-2022-jp		(code page: 50220)
	iso-2022-kr	csiso2022kr, iso-2022-kr, iso-2022-kr-7, iso-2022-kr-7bit		(code page: 50225)
	johab	johab		(code page: 1361)
	ks_c_5601-1987	csksc56011987, iso-ir-149, korean, ks-c-5601, ks-c5601, ksc5601, ksc_5601, ks_c_5601, ks_c_5601-1987, ks_c_5601-1989, ks_c_5601_1987		(code page: 949) EUC-KR superset
	shift_jis	csshiftjis, cswindows31j, ms_kanji, shift-jis, shift_jis, sjis, windows-31j, x-ms-cp932, x-sjis		(code page: 932)
	x-chinese-cns	x-chinese-cns		(code page: 20000) EUC-TW
	x-chinese-eten	x-chinese-eten		(code page: 20002) Other encoding of the Taiwanese CNS character set.
	x-cp20001	x-cp20001		(code page: 20001) TW (not tested)
	x-cp20003	x-cp20003		(code page: 20003) TW (not tested)
	x-cp20004	x-cp20004		(code page: 20004) TW (not tested)
	x-cp20005	x-cp20005		(code page: 20005) TW (not tested)
	x-mac-chinesesimp	x-mac-chinesesimp		(code page: 10008) EUC-CN superset. Handled as plain EUC-CN?
	x-mac-chinesetrad	x-mac-chinesetrad		(code page: 10002) Big5 superset.
	x-mac-japanese	x-mac-japanese		(code page: 10001) Shift_JIS superset. Handled as Windows Shift_JIS?
	x-mac-korean	x-mac-korean		(code page: 10003) EUC-KR superset. Handled as plain EUC-KR?
	x-cp20936	x-cp20936		(code page: 20936) EUC-CN (not the GBK superset)
	x-cp20949	x-cp20949		(code page: 20949) EUC-KR (not the Windows superset)
	x-cp50227	x-cp50227		(code page: 50227) GBK, including at least some Windows extensions.
	(???)	x-cp50229		(code page: 50229) ISO-2022-CN subset? Includes GB 2312-80 (CN) and CNS 11643-1992 Plane 1 (TW).
Unicode	unicodefffe	unicodefffe, utf-16be		(code page: 1201) UTF-16BE
	utf-16	iso-10646-ucs-2, ucs-2, unicode, utf-16, utf-16le		(code page: 1200) UTF-16LE. UCS-2 is not(!) taken to mean UTF-16.
	utf-7	csunicode11utf7, unicode-1-1-utf-7, unicode-2-0-utf-7, utf-7, x-unicode-1-1-utf-7, x-unicode-2-0-utf-7		(code page: 65000)
	utf-8	unicode-1-1-utf-8, unicode-2-0-utf-8, utf-8, x-unicode-1-1-utf-8, x-unicode-2-0-utf-8		(code page: 65001)
EBCDIC	cp875	cp875		(code page: 875) EBCDIC Greece
	cp1025	cp1025		(code page: 21025) EBCDIC Cyrillic Multilingual
	ibm-thai	csibmthai, ibm-thai		(code page: 20838) EBCDIC Thailand
	ibm00924	ccsid00924, cp00924, ebcdic-latin9--euro, ibm00924		(code page: 20924) EBCDIC Latin 9
	ibm01047	ibm01047		(code page: 1047) EBCDIC Latin 1/Open Systems
	ibm01140	ccsid01140, cp01140, ebcdic-us-37+euro, ibm01140		(code page: 1140) EBCDIC USA, Canada, etc. ECECP
	ibm01141	ccsid01141, cp01141, ebcdic-de-273+euro, ibm01141		(code page: 1141) EBCDIC Austria, Germany ECECP
	ibm01142	ccsid01142, cp01142, ebcdic-dk-277+euro, ebcdic-no-277+euro, ibm01142		(code page: 1142) EBCDIC Denmark, Norway ECECP
	ibm01143	ccsid01143, cp01143, ebcdic-fi-278+euro, ebcdic-se-278+euro, ibm01143		(code page: 1143) EBCDIC Finland, Sweden ECECP
	ibm01144	ccsid01144, cp01144, ebcdic-it-280+euro, ibm01144		(code page: 1144) EBCDIC Italy ECECP
	ibm01145	ccsid01145, cp01145, ebcdic-es-284+euro, ibm01145		(code page: 1145) EBCDIC Spain, Latin America (Spanish)
	ibm01146	ccsid01146, cp01146, ebcdic-gb-285+euro, ibm01146		(code page: 1146) EBCDIC UK ECECP
	ibm01147	ccsid01147, cp01147, ebcdic-fr-297+euro, ibm01147		(code page: 1147) EBCDIC France ECECP
	ibm01148	ccsid01148, cp01148, ebcdic-international-500+euro, ibm01148		(code page: 1148) EBCDIC International ECECP
	ibm01149	ccsid01149, cp01149, ebcdic-is-871+euro, ibm01149		(code page: 1149) EBCDIC Iceland ECECP
	ibm037	cp037, csibm037, ebcdic-cp-ca, ebcdic-cp-nl, ebcdic-cp-us, ebcdic-cp-wt, ibm037		(code page: 37) EBCDIC USA/Canada - CECP
	ibm1026	cp1026, csibm1026, ibm1026		(code page: 1026) EBCDIC Latin #5 - Turkey
	ibm273	cp273, csibm273, ibm273		(code page: 20273) EBCDIC Germany F.R./Austria - CECP
	ibm277	csibm277, ebcdic-cp-dk, ebcdic-cp-no, ibm277		(code page: 20277) EBCDIC Denmark, Norway - CECP
	ibm278	cp278, csibm278, ebcdic-cp-fi, ebcdic-cp-se, ibm278		(code page: 20278) EBCDIC Finland, Sweden - CECP
	ibm280	cp280, csibm280, ebcdic-cp-it, ibm280		(code page: 20280) EBCDIC Italy - CECP
	ibm284	cp284, csibm284, ebcdic-cp-es, ibm284		(code page: 20284) EBCDIC Spain/Latin America - CECP
	ibm285	cp285, csibm285, ebcdic-cp-gb, ibm285		(code page: 20285) EBCDIC United Kingdom - CECP
	ibm290	cp290, csibm290, ebcdic-jp-kana, ibm290		(code page: 20290) EBCDIC Japanese (Katakana) Extended. Katakana replace lowercase EBCDIC.
	ibm297	cp297, csibm297, ebcdic-cp-fr, ibm297		(code page: 20297) EBCDIC France - CECP
	ibm420	cp420, csibm420, ebcdic-cp-ar1, ibm420		(code page: 20420) EBCDIC Arabic Bilingual
	ibm423	cp423, csibm423, ebcdic-cp-gr, ibm423		(code page: 20423) EBCDIC Greece - 183
	ibm424	cp424, csibm424, ebcdic-cp-he, ibm424		(code page: 20424) EBCDIC Israel (Hebrew)
	ibm500	cp500, csibm500, ebcdic-cp-be, ebcdic-cp-ch, ibm500		(code page: 500) EBCDIC International #5
	ibm870	cp870, csibm870, ebcdic-cp-roece, ebcdic-cp-yu, ibm870		(code page: 870) EBCDIC Latin 2, Multilingual
	ibm871	cp871, csibm871, ebcdic-cp-is, ibm871		(code page: 20871) EBCDIC Iceland
	ibm880	cp880, csibm880, ebcdic-cyrillic, ibm880		(code page: 20880) EBCDIC Cyrillic, Multilingual
	ibm905	cp905, csibm905, ebcdic-cp-tr, ibm905		(code page: 20905) EBCDIC Latin 3
	x-ebcdic-koreanextended	x-ebcdic-koreanextended		(code page: 20833) EBCDIC Korean (some variant)
	(???)	x-cp21027		(code page: 21027) EBCDIC Japanese (some variant). Certain EBCDIC letters/digits decoded incorrectly.
¿···?	(???)	cp930		(code page: 50930) JAPAN MIX EBCDIC? Appears to be ASCII-compatible...
	(???)	cp933		(code page: 50933) KOREA MIX EBCDIC? Appears to be ASCII-compatible...
	(???)	cp935		(code page: 50935) S-CHINESE MIX EBCDIC? Appears to be ASCII-compatible...
	(???)	cp937		(code page: 50937) T-CHINESE MIX EBCDIC? Appears to be ASCII-compatible...
	(???)	cp939		(code page: 50939) JAPAN MIX EBCDIC? Appears to be ASCII-compatible...
	(???)	x-ebcdic-japaneseanduscanada		(code page: 50931) EBCDIC? Appears to be ASCII-compatible...

All EBCDIC encodings contain the letters A–Z, a–z and the digits 0–9 in EBCDIC positions (unless there is a note in the table saying otherwise).
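
One way to check the "EBCDIC positions" claim for a given code page is to compare it against a known EBCDIC codec. The sketch below uses Python's `cp037` as the reference; the helper name is made up, and Python's codecs may differ from Internet Explorer's code pages in other positions.

```python
import string

ALNUM = string.ascii_uppercase + string.ascii_lowercase + string.digits
# Reference layout: in cp037, "A" is 0xC1, "a" is 0x81 and "0" is 0xF0.
REFERENCE = ALNUM.encode("cp037")


def has_ebcdic_alnum(codec_name: str) -> bool:
    """True if A-Z, a-z and 0-9 encode to the same bytes as in cp037."""
    try:
        return ALNUM.encode(codec_name) == REFERENCE
    except (LookupError, UnicodeEncodeError):
        # Unknown codec, or a codec that cannot represent some of the characters.
        return False
```

This also shows why such encodings are not ASCII-compatible: the same check against `ascii` or `windows-1252` fails because the letters and digits sit at entirely different byte values.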

Data

Source for the encodings and labels data: http://lists.w3.org/Archives/Public/public-html-comments/2009Sep/att-0050/ie.encodings.txt

Labels and code pages .NET supports: http://blogs.msdn.com/shawnste/archive/2009/08/18/alternate-encoding-names-recognized-by-net-ie.aspx (there should be few, if any, differences with the above)

Different data for encodings and labels IE supports (might be more accurate): http://html5.org/temp/2009/ie-encodings.htm (original: http://web.archive.org/web/20080204211015/http://www.hitachi-to.co.jp/prod/prod_2/inter/emk/help/TextEncoder/CodePage.htm)

Safari

Matching

UTS22

Encodings

FIXME

Data

Based on discussion with Maciej (Apple) archived here: http://krijnhoetmer.nl/irc-logs/whatwg/20090909#l-110

Safari uses the system version of ICU on Mac (4.0 for Snow Leopard, 3.6 for Leopard and 3.2 for Tiger) and in addition supports TEC on Mac for encodings that are not in ICU. (Unclear how much of TEC is enabled.)

On Windows, Safari ships with ICU 4.0.

According to webkit/WebCore/platform/text/TextCodecICU.cpp, WebKit now uses ICU <http://site.icu-project.org/> with additional aliases (webkit/WebCore/platform/text/TextCodecICU.cpp), additional encodings (webkit/WebCore/platform/text/mac/mac-encodings.txt) possibly implemented using TECM at least on the Mac, a list of official IANA labels (webkit/WebCore/platform/text/mac/character-sets.txt) and probably a few more which I have not noticed.

ICU 4.2’s icu/source/data/mappings/convrtrs.txt or <http://demo.icu-project.org/icu-bin/convexp> lists encodings and labels not supported in Safari 4.0 on Leopard, and webkit/WebCore/platform/text/TextCodecICU.cpp mentions that Tiger included ICU 3.2.

Chrome

Similar to Safari with some customizations in ICU alias tables. Chrome 3.0 has ICU 3.8 plus customizations for EUC-JP (to match IE/Firefox). For EUC-KR and GBK, we use different mapping tables than used by Safari (which just uses ICU's default tables for them). ISO-8859-16 is also added.

Chrome trunk uses ICU 4.2.

Thoughts (Anne)

If we can agree that every encoding other than UTF-8 and UTF-16 is a legacy encoding, I would not mind advocating that we drop US-ASCII and ISO-8859-1 as separate encodings in favor of Windows-1252 (and do the same in similar situations). That is, the US-ASCII and ISO-8859-1 labels would simply map to Windows-1252. This should also simplify code a little.
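
To illustrate what the proposal changes in practice, here is a small sketch using Python's codecs as a stand-in. The `PROPOSED` table and `decode_legacy` are hypothetical names for this example, and Python's `cp1252` leaves a few bytes (such as 0x81) undefined, which browsers may handle differently.

```python
# Hypothetical label table for the proposal: legacy labels collapse onto windows-1252.
PROPOSED = {
    "us-ascii": "windows-1252",
    "iso-8859-1": "windows-1252",
    "windows-1252": "windows-1252",
}


def decode_legacy(data: bytes, label: str) -> str:
    """Decode using the encoding the label maps to under the proposal."""
    return data.decode(PROPOSED.get(label, label))


quoted = b"\x93hi\x94"
# Strict ISO-8859-1: bytes 0x93/0x94 become the C1 controls U+0093/U+0094.
strict_latin1 = quoted.decode("iso-8859-1")
# Under the proposal the same label yields curly quotes U+201C/U+201D.
as_proposed = decode_legacy(quoted, "iso-8859-1")
```

The two encodings only differ in the 0x80 to 0x9F range, so plain ASCII content decodes identically either way.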

I also think that we should ban UTF-7, UTF-32 and all EBCDIC encodings. This is already mostly done by HTML5.

I wonder if we can standardize (or, to start with, document) the encoding detection algorithm. The list of encodings is fixed. The list of legacy pages is also fairly fixed. The detection algorithms in browsers should be fairly stable. All of these suggest that such an effort is feasible.
