Web Encodings

From WHATWG Wiki
Revision as of 14:27, 10 September 2009 by Annevk (talk | contribs) (add IE labels)

This page is an attempt at fixing the Web encoding problem.

Goals

  • Document existing practices by describing for each browser
    • The list of supported encodings.
    • The list of supported labels for those encodings.
    • The matching algorithm for labels.
  • Converge the various algorithms in use by
    • Defining a list of encodings everyone has to support. Browsers may not support more encodings than on that list.
    • Defining a list of supported labels for those encodings. Browsers may not support more labels than on that list.
    • Defining the matching algorithm. (HTML5 has been updated with a better one now.)
  • Get the new rules implemented
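
The convergence step in the list above can be pictured as set arithmetic over per-browser label tables. The following is only a minimal sketch, not a proposed algorithm: the label sets are copied from the windows-1251 and koi8-r rows of the Firefox and Internet Explorer tables on this page, and the names `FIREFOX`, `IE` and `compare_labels` are made up for illustration.

```python
# Label sets copied from the Firefox and Internet Explorer tables on this page.
FIREFOX = {
    "windows-1251": {"ansi-1251", "cp1251", "windows-1251", "x-cp1251"},
    "koi8-r": {"koi8-r"},
}
IE = {
    "windows-1251": {"windows-1251", "x-cp1251"},
    "koi8-r": {"cskoi8r", "koi", "koi8", "koi8-r", "koi8r"},
}


def compare_labels(a, b):
    """For each encoding present in both tables, split the labels into
    those both sides accept and those only one side accepts."""
    report = {}
    for encoding in sorted(a.keys() & b.keys()):
        report[encoding] = {
            "shared": a[encoding] & b[encoding],
            "only_a": a[encoding] - b[encoding],
            "only_b": b[encoding] - a[encoding],
        }
    return report


for encoding, sets in compare_labels(FIREFOX, IE).items():
    print(encoding, {name: sorted(labels) for name, labels in sets.items()})
```

The "shared" sets are a natural starting point for a common label list; the one-sided sets are the labels that need research before they are kept or dropped.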

Documenting more exactly the encoding and decoding algorithms for each encoding and getting that implemented interoperably would also be great.

Current Implementations

Does this differ per platform? Opera might differ a bit on Mac.

Data

Integrate this awesome data somehow:

Opera

Matching

Uses UTS22 matching and additionally strips leading x characters. (The plan is to switch in the future to removing leading and trailing whitespace followed by ASCII case-insensitive matching.)
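
As a rough illustration of the matching described above, here is a sketch of the charset alias matching rules from UTS #22 (keep only ASCII letters and digits, lowercase, drop zeros that do not follow a digit). The function names are hypothetical, and applying the "leading x" stripping after normalisation is an assumption; the page does not say at which point Opera does it.

```python
import string

_ALNUM = set(string.ascii_letters + string.digits)
_DIGITS = set(string.digits)


def uts22_normalize(label):
    """Build a loose matching key following the UTS #22 alias matching rules."""
    out = []
    for ch in label:
        if ch not in _ALNUM:
            continue  # rule 1: delete everything except a-z, A-Z and 0-9
        if ch in string.ascii_uppercase:
            ch = ch.lower()  # rule 2: map A-Z to a-z
        if ch == "0" and (not out or out[-1] not in _DIGITS):
            continue  # rule 3: delete a 0 that is not preceded by a digit
        out.append(ch)
    return "".join(out)


def opera_key(label):
    """ASSUMPTION: leading x characters are stripped after normalisation."""
    return uts22_normalize(label).lstrip("x")


print(opera_key("ISO_8859-1:1987"))
print(opera_key("x-sjis"))
```

Under this reading, `x-mac-ce` reduces to `macce` and `x-sjis` to `sjis`, which is consistent with the normalised labels listed in the Opera table below. A side effect of the assumed stripping is that any label that genuinely starts with x would lose that letter too.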

Encodings

Encoding Labels Decoded As Notes
big5 big5, cnbig5, csbig5
big5-hkscs big5hkscs
euc-jp cseucpkdfmtjapanese, eucjp, extendedunixcodepackedformatforjapanese
euc-kr cseuckr, csksc56011987, euckr, isoir149, korean, ksc5601, ksc56011987, ksc56011989, windows949
euc-tw euctw
gb18030 gb18030
gbk chinese, cngb, cp936, csgb2312, csiso58gb231280, euccn, gb2312, gb231280, gbk, isoir58, ms936, windows936
hz-gb-2312 hzgb2312
ibm866 866, cp866, csibm866, ibm866
iso-2022-cn iso2022cn
iso-2022-jp csiso2022jp, iso2022jp
iso-2022-jp-1 iso2022jp1
iso-2022-kr csiso2022kr, iso2022kr
iso-8859-1 cp819, csisolatin1, ibm819, iso88591, iso885911987, isoir100, l1, latin1 windows-1252
iso-8859-2 csisolatin2, iso88592, iso885921987, isoir101, l2, latin2
iso-8859-3 csisolatin3, iso88593, iso885931988, isoir109, l3, latin3
iso-8859-4 csisolatin4, iso88594, iso885941988, isoir110, l4, latin4
iso-8859-5 csisolatincyrillic, cyrillic, iso88595, iso885951988, isoir144
iso-8859-6 arabic, asmo708, csiso88596e, csisolatinarabic, ecma114, iso88596, iso885961987, iso88596e, isoir127
iso-8859-6-i csiso88596i, iso88596i
iso-8859-7 csisolatingreek, ecma118, elot928, greek, greek8, iso88597, iso885971987, isoir126
iso-8859-8 csiso88598e, csisolatinhebrew, hebrew, iso88598, iso885981988, iso88598e, isoir138, visual
iso-8859-8-i csiso88598i, iso88598i
iso-8859-9 csisolatin5, iso88599, iso885991989, isoir148, l5, latin5
iso-8859-10 csisolatin6, iso885910, iso8859101992, isoir157, l6, latin6
iso-8859-11 iso885911, tis620, tis6202533, windows874 Actually implemented as windows-874
iso-8859-13 iso885913
iso-8859-14 iso885914, iso8859141998, isoceltic, isoir199, l8, latin8
iso-8859-15 iso885915, latin9
iso-8859-16 iso885916, iso8859162001, isoir226, l10, latin10
koi8-r cskoi8r, koi8r
koi8-u koi8u
macintosh csmacintosh, mac, macintosh, macroman Likely disabled.
shift_jis cp932, csshiftjis, cswindows31j, ms932, mskanji, shiftjis, sjis, windows31j
tcvn tcvn, viettcvn
us-ascii ansix341968, ansix341986, ascii, cp367, csascii, csinvariant, csiso646basic1983, ibm367, invariant, iso646basic1983, iso646irv1991, iso646us, isoir6, ref, us, usascii windows-1252
utf-16 csunicode, csunicode11, csunicodeascii, iso10646j1, iso10646ucs2, iso10646ucsbasic, utf16
utf-16be utf16be
utf-16le utf16le
utf-8 utf8
viscii csviscii, viscii
windows-1250 cp1250, microsoftcp1250, windows1250
windows-1251 cp1251, microsoftcp1251, windows1251
windows-1252 cp1252, microsoftcp1252, windows1252
windows-1253 cp1253, microsoftcp1253, windows1253
windows-1254 cp1254, microsoftcp1254, windows1254
windows-1255 cp1255, microsoftcp1255, windows1255
windows-1256 cp1256, microsoftcp1256, windows1256
windows-1257 cp1257, microsoftcp1257, windows1257
windows-1258 cp1258, microsoftcp1258, windows1258
windows-sami-2 samiws2, windowssami2, ws2
x-mac-ce macce Likely disabled.
x-mac-cyrillic maccyrillic Likely disabled.
x-mac-greek macgreek Likely disabled.
x-mac-turkish macturkish Likely disabled.
x-vps vps
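
The "Decoded As" column matters because the labelled encoding and the one actually used disagree on some bytes. The sketch below uses Python's built-in codecs to show the difference for iso-8859-1 decoded as windows-1252; `latin-1` and `cp1252` are Python's codec names, not labels from the table, and `decode_both` is a made-up helper.

```python
def decode_both(data):
    """Decode the same bytes as real ISO-8859-1 and as windows-1252."""
    return data.decode("latin-1"), data.decode("cp1252")


# 0x80 is a C1 control character in ISO-8859-1 but the euro sign in windows-1252.
as_labelled, as_decoded = decode_both(b"\x80")
print(repr(as_labelled), repr(as_decoded))

# Bytes in the range 0xA0-0xFF decode identically in both.
print(decode_both(b"\xe9"))
```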

Firefox

Matching

ASCII lowercasing.
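
A minimal sketch of what ASCII lowercasing means in practice, as opposed to full Unicode lowercasing: only A-Z are mapped, so non-ASCII look-alikes never collapse into ASCII labels. The helper name `ascii_lower` is hypothetical and this is not Firefox's code.

```python
# Translation table that touches only the 26 ASCII capital letters.
_ASCII_LOWER = {code: code + 32 for code in range(ord("A"), ord("Z") + 1)}


def ascii_lower(label):
    """Lowercase A-Z and leave every other character untouched."""
    return label.translate(_ASCII_LOWER)


print(ascii_lower("Shift_JIS"))

# U+212A KELVIN SIGN lowercases to "k" under Unicode rules, so str.lower()
# would turn this string into a valid-looking label; ascii_lower() does not.
kelvin_label = "\u212aOI8-R"
print(kelvin_label.lower() == "koi8-r", ascii_lower(kelvin_label) == "koi8-r")
```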

Encodings

Encoding Labels Decoded As Notes
armscii-8 armscii-8
Big5 big5, csbig5, x-x-big5, zh_tw-big5
Big5-HKSCS big5-hkscs
EUC-JP cseucjpkdfmtjapanese, euc-jp, x-euc-jp
EUC-KR 5601, csksc56011987, csueckr, euc-kr, iso-ir-149, korean, ks_c_5601-1989, ksc5601, ksc_5601 x-windows-949 This converter is asymmetric. In the ToUnicode direction, it is generous and acts as windows-949. It also supports 8-byte sequences for the 8,822 Hangul syllables not encoded as precomposed forms in KS X 1001. In the FromUnicode direction, it is strict and generates 8-byte sequences for those 8,822 Hangul syllables instead of the 2-byte sequences of windows-949.
gb18030 gb18030
GB2312 chinese, csgb2312, csiso58gb231280, gb2312, gb_2312, gb_2312-80, iso-ir-58, zh_cn.euc x-gbk
GEOSTD8 geostd8 Does not seem to work.
HZ-GB-2312 hz-gb-2312
IBM850 850, cp850, csIBM850, ibm850 csIBM850 not recognised.
IBM852 852, cp852, csIBM852, ibm852 csIBM852 not recognised.
IBM855 855, cp855, csIBM855, ibm855 csIBM855 not recognised.
IBM857 857, cp857, csIBM857, ibm857 csIBM857 not recognised.
IBM862 862, cp862, csIBM862, ibm862 csIBM862 not recognised.
IBM864 864, cp864, csIBM864, ibm-864, ibm864 csIBM864 not recognised.
IBM864i 864i, cp864i, csibm864i, ibm-864i, ibm864i
IBM866 866, cp-866, cp866, csIBM866, ibm866 csIBM866 not recognised.
ISO-2022-CN iso-2022-cn, iso-2022-cn-ext
ISO-2022-JP csiso2022jp, csiso2022jp2, iso-2022-jp, iso-2022-jp-2
ISO-2022-KR csiso2022kr, iso-2022-kr
ISO-8859-1 cp819, csisolatin1, ibm819, iso-8859-1, iso-ir-100, iso8859-1, iso88591, iso_8859-1, l1, latin1 windows-1252
ISO-8859-10 csisolatin6, iso-8859-10, iso-ir-157, iso8859-10, iso885910, l6, latin6
ISO-8859-11 iso-8859-11, iso8859-11, iso885911 windows-874
ISO-8859-13 iso-8859-13, iso8859-13, iso885913
ISO-8859-14 iso-8859-14, iso8859-14, iso885914
ISO-8859-15 iso-8859-15, iso8859-15, iso885915, iso_8859-15
ISO-8859-16 iso-8859-16
ISO-8859-2 csisolatin2, iso-8859-2, iso-ir-101, iso8859-2, iso88592, iso_8859-2, l2, latin2
ISO-8859-3 csisolatin3, iso-8859-3, iso-ir-109, iso8859-3, iso88593, iso_8859-3, l3, latin3
ISO-8859-4 csisolatin4, iso-8859-4, iso-ir-110, iso8859-4, iso88594, iso_8859-4, l4, latin4
ISO-8859-5 csisolatincyrillic, cyrillic, iso-8859-5, iso-ir-144, iso8859-5, iso88595, iso_8859-5
ISO-8859-6 arabic, asmo-708, csisolatinarabic, ecma-114, iso-8859-6, iso-ir-127, iso8859-6, iso88596, iso_8859-6
ISO-8859-6-E csiso88596e, iso-8859-6-e
ISO-8859-6-I csiso88596i, iso-8859-6-i
ISO-8859-7 csisolatingreek, ecma-118, elot_928, greek, greek8, iso-8859-7, iso-ir-126, iso8859-7, iso88597, iso_8859-7, sun_eu_greek
ISO-8859-8 csisolatinhebrew, hebrew, iso-8859-8, iso-ir-138, iso8859-8, iso88598, iso_8859-8, visual
ISO-8859-8-E csiso88598e, iso-8859-8-e
ISO-8859-8-I csiso88598i, iso-8859-8-i, iso-8859-8i
ISO-8859-9 csisolatin5, iso-8859-9, iso-ir-148, iso8859-9, iso88599, iso_8859-9, l5, latin5
ISO-IR-111 csiso111ecmacyrillic, ecma-cyrillic, iso-ir-111
KOI8-R koi8-r
KOI8-U koi8-u
Shift_JIS csshiftjis, ms_kanji, shift-jis, shift_jis, windows-31j, x-sjis
T.61-8bit csiso103t618bit, iso-ir-103, t.61, t.61-8bit
TIS-620 tis-620, tis620 windows-874
us-ascii 646, ansi_x3.4-1968, ascii, us-ascii windows-1252
UTF-16 utf-16 Recognized as BE or LE by BOM or byte sniffing
UTF-16BE csunicode, csunicode11, csunicodeascii, csunicodelatin1, iso-10646, iso-10646-j-1, iso-10646-ucs-2, iso-10646-ucs-basic, iso-10646-unicode-latin1, utf-16be, x-iso-10646-ucs-2-be
UTF-16LE utf-16le, x-iso-10646-ucs-2-le
UTF-32 utf-32 Recognized as BE or LE by BOM or byte sniffing
UTF-32BE iso-10646-ucs-4, utf-32be, x-iso-10646-ucs-4-be
UTF-32LE utf-32le, x-iso-10646-ucs-4-le
UTF-7 csunicode11utf7, unicode-1-1-utf-7, unicode-2-0-utf-7, utf-7, x-unicode-2-0-utf-7
UTF-8 unicode-1-1-utf-8, utf-8, utf8
VISCII csviscii, viscii
windows-1250 cp1250, windows-1250, x-cp1250
windows-1251 ansi-1251, cp1251, windows-1251, x-cp1251
windows-1252 cp1252, windows-1252, x-cp1252
windows-1253 cp1253, windows-1253, x-cp1253
windows-1254 cp1254, windows-1254, x-cp1254
windows-1255 cp1255, windows-1255, x-cp1255
windows-1256 cp1256, windows-1256, x-cp1256
windows-1257 cp1257, windows-1257, x-cp1257
windows-1258 cp1258, windows-1258, x-cp1258
windows-874 ibm874, windows-874
windows-936 windows-936
x-euc-tw cns11643, x-euc-tw, zh_tw-euc
x-gbk gbk, x-gbk
x-imap4-modified-utf7 x-imap4-modified-utf7
x-johab x-johab
x-mac-arabic x-mac-arabic
x-mac-ce x-mac-ce
x-mac-croatian x-mac-croatian
x-mac-cyrillic x-mac-cyrillic
x-mac-devanagari x-mac-devanagari
x-mac-farsi x-mac-farsi
x-mac-greek x-mac-greek
x-mac-gujarati x-mac-gujarati
x-mac-gurmukhi x-mac-gurmukhi
x-mac-hebrew x-mac-hebrew
x-mac-icelandic x-mac-icelandic
x-mac-roman csMacintosh, mac, macintosh, x-mac-roman csMacintosh not recognised.
x-mac-romanian x-mac-romanian
x-mac-turkish x-mac-turkish
x-mac-ukrainian x-mac-ukrainian
x-user-defined x-user-defined
x-viet-tcvn5712 x-viet-tcvn5712
x-viet-vps x-viet-vps
x-windows-949 ks_c_5601-1987, x-windows-949
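
The UTF-16 and UTF-32 rows above say the byte order is chosen "by BOM or byte sniffing". The BOM half of that can be sketched as follows; the byte-sniffing fallback is not shown, and the function name `resolve_endianness` and the family-keyed table are assumptions made for this example rather than a description of Firefox's implementation.

```python
import codecs

# Byte order marks per family, taken from Python's codecs module.
_BOMS = {
    "UTF-16": (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE),
    "UTF-32": (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE),
}


def resolve_endianness(family, data):
    """Return e.g. 'UTF-16LE' if `data` starts with a BOM of the given
    family ('UTF-16' or 'UTF-32'), or None if no BOM is present."""
    little, big = _BOMS[family]
    if data.startswith(little):
        return family + "LE"
    if data.startswith(big):
        return family + "BE"
    return None


print(resolve_endianness("UTF-16", b"\xff\xfeh\x00i\x00"))
print(resolve_endianness("UTF-16", b"h\x00i\x00"))
```

Keying the check on the family given by the label avoids an ambiguity: the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM, so a label-free sniffer would have to test UTF-32 first.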

Table generated from <http://mxr.mozilla.org/mozilla1.9.1/source/intl/uconv/src/charsetalias.properties> (corresponds to Firefox 3.5.2).
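
The note on the EUC-KR row in the table above turns on the difference between Hangul syllables that have a 2-byte code in KS X 1001 and those that only windows-949 encodes in 2 bytes. The sketch below probes that difference with Python's `euc_kr` and `cp949` codecs, which are stand-ins and not the Mozilla converters; `encoded_length` is a hypothetical helper. How a given `euc_kr` codec treats the second syllable (rejecting it or emitting a longer sequence) is implementation-specific, so the code reports the result without assuming one.

```python
def encoded_length(ch, codec):
    """Return how many bytes `codec` uses for `ch`, or None if it refuses."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None


IN_KS_X_1001 = "\uac00"      # a precomposed syllable present in KS X 1001
NOT_IN_KS_X_1001 = "\ub620"  # a syllable that KS X 1001 lacks

for syllable in (IN_KS_X_1001, NOT_IN_KS_X_1001):
    print(
        hex(ord(syllable)),
        "euc_kr:", encoded_length(syllable, "euc_kr"),
        "cp949:", encoded_length(syllable, "cp949"),
    )
```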

Aliases (used for parsing, apparently not for serialisation) are scattered around in a large number of files; cf. <http://mxr.mozilla.org/firefox/source/intl/uconv/ucvlatin/nsISO885911ToUnicode.cpp> for the mapping from ISO-8859-11 to windows-874.

8-bit encodings (excluding UTFs, CJK encodings and T.61) were tested using <http://coq.no/X/charset5/tests8bit.html> in Firefox 3.5.1 on OS X. The fail/pass results should not be taken too seriously yet, especially for the more obscure encodings.

Bugs: Filed <https://bugzilla.mozilla.org/show_bug.cgi?id=512060> for the labels marked 'not recognised' in the table above since the lack of support for these is clearly accidental rather than deliberate (though it seems to suggest that these particular labels are not particularly widely used). In most other cases, research and deliberation will be needed to distinguish between bugs and features.
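
One way such an accidental gap can arise, offered purely as a hypothetical illustration and not as a diagnosis of the bug above: if lookups lowercase the incoming label but the alias table stores a key in mixed case, that key can never be matched. The table excerpt and both function names are made up for the example.

```python
# Excerpt modelled on the IBM850 row; the mixed-case key is the interesting one.
ALIASES = {"csIBM850": "IBM850", "cp850": "IBM850", "ibm850": "IBM850"}


def lookup(label):
    """Lowercase the label (ASCII input assumed) and consult the table."""
    return ALIASES.get(label.lower())


def unreachable_keys(table):
    """List keys that a lowercasing lookup can never hit."""
    return sorted(key for key in table if key != key.lower())


print(lookup("CP850"))
print(lookup("csIBM850"))
print(unreachable_keys(ALIASES))
```

A check like `unreachable_keys` is cheap to run over a whole alias file and would flag every row of this kind at once.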

Internet Explorer

Matching

Strips leading and trailing whitespace and then does ASCII(?) case-insensitive matching. (Matches HTML5.)
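
The rule above can be sketched as follows. The whitespace set is the HTML5 list of space characters; the table is a four-label excerpt from the Internet Explorer rows below, and `match_label` is a hypothetical name. Whether IE's case folding is strictly ASCII is marked as uncertain in the text, so the ASCII-only folding here is an assumption.

```python
import string

# HTML5 space characters: space, tab, line feed, form feed, carriage return.
HTML_SPACE_CHARACTERS = " \t\n\f\r"

# ASSUMPTION: case folding covers A-Z only.
_ASCII_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Excerpt of label -> code page, copied from the table below.
LABELS = {
    "windows-1252": 1252,
    "x-ansi": 1252,
    "utf-8": 65001,
    "unicode-1-1-utf-8": 65001,
}


def match_label(raw):
    """Trim surrounding whitespace, fold ASCII case, then look the label up."""
    key = raw.strip(HTML_SPACE_CHARACTERS).translate(_ASCII_FOLD)
    return LABELS.get(key)


print(match_label("  UTF-8\n"))
print(match_label("utf8"))
```

Unlike UTS22 matching, this leaves punctuation significant, so `utf8` does not match `utf-8` unless it is listed as a label in its own right.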

Encodings

Encoding (code page) Labels Decoded As Notes
21866 koi8-ru, koi8-u
1142 ccsid01142, cp01142, ebcdic-dk-277+euro, ebcdic-no-277+euro, ibm01142
1140 ccsid01140, cp01140, ebcdic-us-37+euro, ibm01140
1141 ccsid01141, cp01141, ebcdic-de-273+euro, ibm01141
1146 ccsid01146, cp01146, ebcdic-gb-285+euro, ibm01146
1147 ccsid01147, cp01147, ebcdic-fr-297+euro, ibm01147
1144 ccsid01144, cp01144, ebcdic-it-280+euro, ibm01144
1143 ccsid01143, cp01143, ebcdic-fi-278+euro, ebcdic-se-278+euro, ibm01143
1148 ccsid01148, cp01148, ebcdic-international-500+euro, ibm01148
1149 ccsid01149, cp01149, ebcdic-is-871+euro, ibm01149
20273 cp273, csibm273, ibm273
870 cp870, csibm870, ebcdic-cp-roece, ebcdic-cp-yu, ibm870
20277 csibm277, ebcdic-cp-dk, ebcdic-cp-no, ibm277
20278 cp278, csibm278, ebcdic-cp-fi, ebcdic-cp-se, ibm278
1145 ccsid01145, cp01145, ebcdic-es-284+euro, ibm01145
932 csshiftjis, cswindows31j, ms_kanji, shift-jis, shift_jis, sjis, windows-31j, x-ms-cp932, x-sjis
936 chinese, cn-gb, csgb2312, csgb231280, csiso58gb231280, gb2312, gb2312-80, gb231280, gbk, gb_2312-80, iso-ir-58
20905 cp905, csibm905, ebcdic-cp-tr, ibm905
20285 cp285, csibm285, ebcdic-cp-gb, ibm285
20284 cp284, csibm284, ebcdic-cp-es, ibm284
20280 cp280, csibm280, ebcdic-cp-it, ibm280
28596 arabic, csisolatinarabic, ecma-114, iso-8859-6, iso-ir-127, iso_8859-6, iso_8859-6:1987
51949 cseuckr, euc-kr, iso-2022-kr-8, iso-2022-kr-8bit
708 asmo-708
20936 x-cp20936
20424 cp424, csibm424, ebcdic-cp-he, ibm424
20420 cp420, csibm420, ebcdic-cp-ar1, ibm420
20423 cp423, csibm423, ebcdic-cp-gr, ibm423
20297 cp297, csibm297, ebcdic-cp-fr, ibm297
20290 cp290, csibm290, ebcdic-jp-kana, ibm290
54936 gb18030
1256 cp1256, windows-1256
1257 windows-1257
1254 windows-1254
1255 windows-1255
1252 windows-1252, x-ansi
1253 windows-1253
1250 windows-1250, x-cp1250
1251 windows-1251, x-cp1251
65000 csunicode11utf7, unicode-1-1-utf-7, unicode-2-0-utf-7, utf-7, x-unicode-1-1-utf-7, x-unicode-2-0-utf-7
65001 unicode-1-1-utf-8, unicode-2-0-utf-8, utf-8, x-unicode-1-1-utf-8, x-unicode-2-0-utf-8
10021 x-mac-thai
1258 windows-1258
20924 ccsid00924, cp00924, ebcdic-latin9--euro, ibm00924
20108 norwegian, ns_4551-1, x-ia5-norwegian
20106 din_66003, german, x-ia5-german
20107 sen_850200_b, swedish, x-ia5-swedish
20105 irv, x-ia5
52936 hz-gb-2312
37 cp037, csibm037, ebcdic-cp-ca, ebcdic-cp-nl, ebcdic-cp-us, ebcdic-cp-wt, ibm037
437 437, cp437, cspc8codepage437, ibm437
20880 cp880, csibm880, ebcdic-cyrillic, ibm880
10029 x-mac-ce
10001 x-mac-japanese
855 cp855, ibm855
857 cp857, ibm857
850 cp850, ibm850
852 cp852, ibm852
10003 x-mac-korean
858 ccsid00858, cp00858, cp858, ibm00858, pc-multilingual-850+euro
20833 x-ebcdic-koreanextended
737 ibm737
20838 csibmthai, ibm-thai
500 cp500, csibm500, ebcdic-cp-be, ebcdic-cp-ch, ibm500
20949 x-cp20949
57009 x-iscii-ma
57008 x-iscii-ka
57005 x-iscii-te
57004 x-iscii-ta
57007 x-iscii-or
57006 x-iscii-as
38598 iso-8859-8-i
51936 euc-cn, x-euc-cn
57003 x-iscii-be
57002 x-iscii-de
20005 x-cp20005
20004 x-cp20004
20003 x-cp20003
20002 x-chinese-eten
20001 x-cp20001
20000 x-chinese-cns
10010 x-mac-romanian
10017 x-mac-ukrainian
869 cp869, ibm869
29001 x-europa
1026 cp1026, csibm1026, ibm1026
861 cp861, ibm861
860 cp860, ibm860
863 cp863, ibm863
862 cp862, dos-862, ibm862
865 cp865, ibm865
864 cp864, ibm864
866 cp866, ibm866
720 dos-720
10004 x-mac-arabic
10005 x-mac-hebrew
10006 x-mac-greek
10007 x-mac-cyrillic
10000 macintosh
21027 x-cp21027
10002 x-mac-chinesetrad
21025 cp1025
20866 cskoi8r, koi, koi8, koi8-r, koi8r
10008 x-mac-chinesesimp
1200 iso-10646-ucs-2, ucs-2, unicode, utf-16, utf-16le
57010 x-iscii-gu
57011 x-iscii-pa
50229 x-cp50229
20127 ansi_x3.4-1968, ansi_x3.4-1986, ascii, cp367, csascii, ibm367, iso-ir-6, iso646-us, iso_646.irv:1991, us, us-ascii
50220 iso-2022-jp
50221 csiso2022jp
50225 csiso2022kr, iso-2022-kr, iso-2022-kr-7, iso-2022-kr-7bit
50227 x-cp50227
28598 csisolatinhebrew, hebrew, iso-8859-8, iso-8859-8 visual, iso-ir-138, iso_8859-8, iso_8859-8:1988, logical, visual
950 big5, big5-hkscs, cn-big5, csbig5, x-x-big5
874 dos-874, iso-8859-11, tis-620, windows-874
875 cp875
10081 x-mac-turkish
10082 x-mac-croatian
51932 cseucpkdfmtjapanese, euc-jp, extended_unix_code_packed_format_for_japanese, iso-2022-jpeuc, x-euc, x-euc-jp
28603 iso-8859-13
28605 csisolatin9, iso-8859-15, iso_8859-15, l9, latin9
775 ibm775
10079 x-mac-icelandic
20871 cp871, csibm871, ebcdic-cp-is, ibm871
28591 cp819, csisolatin1, ibm819, iso-8859-1, iso-ir-100, iso8859-1, iso_8859-1, iso_8859-1:1987, l1, latin1
20261 x-cp20261
20269 x-cp20269
28592 csisolatin2, iso-8859-2, iso-ir-101, iso8859-2, iso_8859-2, iso_8859-2:1987, l2, latin2
1047 ibm01047
949 csksc56011987, iso-ir-149, korean, ks-c-5601, ks-c5601, ksc5601, ksc_5601, ks_c_5601, ks_c_5601-1987, ks_c_5601-1989, ks_c_5601_1987
1201 unicodefffe, utf-16be
28594 csisolatin4, iso-8859-4, iso-ir-110, iso_8859-4, iso_8859-4:1988, l4, latin4
28599 csisolatin5, iso-8859-9, iso-ir-148, iso_8859-9, iso_8859-9:1989, l5, latin5
50939 cp939
1361 johab
50930 cp930
50931 x-ebcdic-japaneseanduscanada
28593 csisolatin3, iso-8859-3, iso-ir-109, iso_8859-3, iso_8859-3:1988, l3, latin3
50933 cp933
28595 csisolatincyrillic, cyrillic, iso-8859-5, iso-ir-144, iso_8859-5, iso_8859-5:1988
50935 cp935
28597 csisolatingreek, ecma-118, elot_928, greek, greek8, iso-8859-7, iso-ir-126, iso_8859-7, iso_8859-7:1987
50937 cp937

Needs sorting out:

Safari

Matching

UTS22

Encodings

Apple’s reference <http://developer.apple.com/documentation/Carbon/Conceptual/ProgWithTECM/tecmgr_encnames/tecmgr_encnames.html#//apple_ref/doc/uid/TP40000932-CH204-TPXREF103> is incomplete or at least severely out of date even as a description of the Text Encoding Conversion Manager (TECM).

Querying TECM programmatically gives labels for more encodings, but the set of encodings does not match what Safari currently supports.

According to webkit/WebCore/platform/text/TextCodecICU.cpp, WebKit now uses ICU <http://site.icu-project.org/> with additional aliases (webkit/WebCore/platform/text/TextCodecICU.cpp), additional encodings (webkit/WebCore/platform/text/mac/mac-encodings.txt) possibly implemented using TECM at least on the Mac, a list of official IANA labels (webkit/WebCore/platform/text/mac/character-sets.txt) and probably a few more which I have not noticed.

ICU 4.2’s icu/source/data/mappings/convrtrs.txt or <http://demo.icu-project.org/icu-bin/convexp> lists encodings and labels not supported in Safari 4.0 on Leopard, and webkit/WebCore/platform/text/TextCodecICU.cpp mentions that Tiger included ICU 3.2.

Does encoding support in Safari depend on OS and OS version, and will Safari ultimately support everything in ICU 4.2? -- Windows Safari comes with ICU but does not support some of the encodings the Mac edition does.

The version of ICU used by Mac Safari depends on the OS version. For instance, 10.5 (Leopard) comes with ICU 3.6(?) and Snow Leopard probably comes with 4.0(?). Windows Safari currently ships with ICU 4.0.

Integrate notes from othermaciej in <http://krijnhoetmer.nl/irc-logs/whatwg/20090909>. Also: <http://trac.webkit.org/browser/trunk/WebCore/platform/text>.

Chrome

Similar to Safari, with some customizations in the ICU alias tables. Chrome 3.0 has ICU 3.8 plus customizations for EUC-JP (to match IE/Firefox). For EUC-KR and GBK, we use mapping tables different from those used by Safari (which just uses ICU's default tables for them). ISO-8859-16 is also added.