Web Encodings: Difference between revisions
(→Firefox: Added table (no mappings between encodings)) |
(→Firefox: Added aliases and comments for 8-bit encodings.) |
Revision as of 00:03, 22 August 2009
My scratchpad for encoding-related notes.
Goals
- Document existing practices for
  - Supported encodings
  - Supported aliases
  - Supported matching algorithm
- Converge the various algorithms in use
- Get the new rules implemented
Current Implementations
Does this differ per platform? Opera might differ a bit on Mac.
Data
Integrate this awesome data somehow:
- http://coq.no/character-tables/mime/en
- http://coq.no/character-tables/mime/iso-2022/en
- http://coq.no/character-tables/mime/euc/en
- http://coq.no/character-tables/mime/locale-specific/en
Opera
Matching
UTS22 and ASCII lowercasing.
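A minimal sketch of what UTS #22 loose matching with ASCII lowercasing could look like, assuming the three steps as I read them in UTS #22 (drop everything except ASCII letters and digits, lowercase, then drop each 0 that is not preceded by a digit). The function names are made up for illustration and this is not Opera's actual code.

```python
import re


def uts22_normalize(label: str) -> str:
    """Normalize a charset label for loose comparison (hypothetical helper)."""
    # Steps 1 and 2: keep only ASCII letters and digits, then lowercase them.
    kept = re.sub(r"[^A-Za-z0-9]", "", label).lower()

    # Step 3: from left to right, delete each "0" not preceded by a digit.
    out = []
    prev_is_digit = False
    for ch in kept:
        if ch == "0" and not prev_is_digit:
            continue  # dropped zero; the previous kept character is unchanged
        out.append(ch)
        prev_is_digit = ch.isdigit()
    return "".join(out)


def labels_match(a: str, b: str) -> bool:
    """Two labels match when their normalized forms are identical."""
    return uts22_normalize(a) == uts22_normalize(b)


if __name__ == "__main__":
    print(uts22_normalize("ISO_8859-1:1987"))  # iso885911987
    print(labels_match("UTF-8", "u.t.f-008"))   # True
    print(labels_match("UTF-8", "utf-80"))      # False
```

This would explain why the aliases in the Opera table below are listed without hyphens, underscores or colons: they are shown in their normalized form.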
Encodings
Encoding | Aliases | Decoded As | Notes |
---|---|---|---|
big5 | big5, cnbig5, csbig5 | ||
big5-hkscs | big5hkscs | ||
euc-jp | cseucpkdfmtjapanese, eucjp, extendedunixcodepackedformatforjapanese | ||
euc-kr | cseuckr, csksc56011987, euckr, isoir149, korean, ksc5601, ksc56011987, ksc56011989, windows949 | ||
euc-tw | euctw | ||
gb18030 | gb18030 | ||
gbk | chinese, cngb, cp936, csgb2312, csiso58gb231280, euccn, gb2312, gb231280, gbk, isoir58, ms936, windows936 | ||
hz-gb-2312 | hzgb2312 | ||
ibm866 | 866, cp866, csibm866, ibm866 | ||
iso-2022-cn | iso2022cn | ||
iso-2022-jp | csiso2022jp, iso2022jp | ||
iso-2022-jp-1 | iso2022jp1 | ||
iso-2022-kr | csiso2022kr, iso2022kr | ||
iso-8859-1 | cp819, csisolatin1, ibm819, iso88591, iso885911987, isoir100, l1, latin1 | windows-1252 | |
iso-8859-2 | csisolatin2, iso88592, iso885921987, isoir101, l2, latin2 | ||
iso-8859-3 | csisolatin3, iso88593, iso885931988, isoir109, l3, latin3 | ||
iso-8859-4 | csisolatin4, iso88594, iso885941988, isoir110, l4, latin4 | ||
iso-8859-5 | csisolatincyrillic, cyrillic, iso88595, iso885951988, isoir144 | ||
iso-8859-6 | arabic, asmo708, csiso88596e, csisolatinarabic, ecma114, iso88596, iso885961987, iso88596e, isoir127 | ||
iso-8859-6-i | csiso88596i, iso88596i | ||
iso-8859-7 | csisolatingreek, ecma118, elot928, greek, greek8, iso88597, iso885971987, isoir126 | ||
iso-8859-8 | csiso88598e, csisolatinhebrew, hebrew, iso88598, iso885981988, iso88598e, isoir138, visual | ||
iso-8859-8-i | csiso88598i, iso88598i | ||
iso-8859-9 | csisolatin5, iso88599, iso885991989, isoir148, l5, latin5 | ||
iso-8859-10 | csisolatin6, iso885910, iso8859101992, isoir157, l6, latin6 | ||
iso-8859-11 | iso885911, tis620, tis6202533, windows874 | ||
iso-8859-13 | iso885913 | ||
iso-8859-14 | iso885914, iso8859141998, isoceltic, isoir199, l8, latin8 | ||
iso-8859-15 | iso885915, latin9 | ||
iso-8859-16 | iso885916, iso8859162001, isoir226, l10, latin10 | ||
koi8-r | cskoi8r, koi8r | ||
koi8-u | koi8u | ||
macintosh | csmacintosh, mac, macintosh, macroman | | Likely disabled. |
shift_jis | cp932, csshiftjis, cswindows31j, ms932, mskanji, shiftjis, sjis, windows31j | ||
tcvn | tcvn, viettcvn | ||
us-ascii | ansix341968, ansix341986, ascii, cp367, csascii, csinvariant, csiso646basic1983, ibm367, invariant, iso646basic1983, iso646irv1991, iso646us, isoir6, ref, us, usascii | windows-1252 | |
utf-16 | csunicode, csunicode11, csunicodeascii, iso10646j1, iso10646ucs2, iso10646ucsbasic, utf16 | ||
utf-16be | utf16be | ||
utf-16le | utf16le | ||
utf-8 | utf8 | ||
viscii | csviscii, viscii | ||
windows-1250 | cp1250, microsoftcp1250, windows1250 | ||
windows-1251 | cp1251, microsoftcp1251, windows1251 | ||
windows-1252 | cp1252, microsoftcp1252, windows1252 | ||
windows-1253 | cp1253, microsoftcp1253, windows1253 | ||
windows-1254 | cp1254, microsoftcp1254, windows1254 | ||
windows-1255 | cp1255, microsoftcp1255, windows1255 | ||
windows-1256 | cp1256, microsoftcp1256, windows1256 | ||
windows-1257 | cp1257, microsoftcp1257, windows1257 | ||
windows-1258 | cp1258, microsoftcp1258, windows1258 | ||
windows-sami-2 | samiws2, windowssami2, ws2 | ||
x-mac-ce | macce | | Likely disabled. |
x-mac-cyrillic | maccyrillic | | Likely disabled. |
x-mac-greek | macgreek | | Likely disabled. |
x-mac-turkish | macturkish | | Likely disabled. |
x-vps | vps |
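To illustrate how the columns of the table above relate to each other, here is a rough sketch of a label lookup: the label is normalized, matched against the alias list, and the "Decoded As" column (when present) overrides the decoder that is actually used. Only a few rows are copied from the table; `ROWS`, `PYTHON_CODEC`, `resolve` and `decode` are hypothetical names, and the Python codec names are Python's own, not Opera's.

```python
import re
from typing import Optional, Tuple

# (encoding, normalized aliases, decoded-as or None); a small excerpt only.
ROWS = [
    ("iso-8859-1",
     ["cp819", "csisolatin1", "ibm819", "iso88591", "iso885911987",
      "isoir100", "l1", "latin1"],
     "windows-1252"),
    ("us-ascii", ["ascii", "iso646us", "usascii"], "windows-1252"),
    ("utf-8", ["utf8"], None),
    ("windows-1252", ["cp1252", "microsoftcp1252", "windows1252"], None),
]

# Assumed mapping from encoding names to Python codec names, for the demo.
PYTHON_CODEC = {"windows-1252": "cp1252", "utf-8": "utf-8"}


def normalize(label: str) -> str:
    """Loose normalization: ASCII alphanumerics only, lowercased,
    with every 0 that is not preceded by a digit removed."""
    kept = re.sub(r"[^A-Za-z0-9]", "", label).lower()
    return re.sub(r"(?<![0-9])0+", "", kept)


# Build the alias index once: normalized alias -> (encoding, decoder).
INDEX = {
    alias: (encoding, decoded_as or encoding)
    for encoding, aliases, decoded_as in ROWS
    for alias in aliases
}


def resolve(label: str) -> Optional[Tuple[str, str]]:
    """Return (encoding, encoding actually used for decoding), or None."""
    return INDEX.get(normalize(label))


def decode(data: bytes, label: str) -> str:
    """Decode bytes according to the label, honouring "Decoded As"."""
    resolved = resolve(label)
    if resolved is None:
        raise LookupError(f"unknown label: {label!r}")
    return data.decode(PYTHON_CODEC[resolved[1]])


if __name__ == "__main__":
    print(resolve("Latin-1"))         # ('iso-8859-1', 'windows-1252')
    print(decode(b"\x80", "latin1"))  # € (0x80 in windows-1252)
```

The point of the second tuple element is that a page labelled `latin1` reports one encoding but is decoded with another, which is what the "Decoded As" column records.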
Firefox
Encoding | Aliases | Decoded As | Notes |
---|---|---|---|
armscii-8 | armscii-8 | ||
Big5 | big5, csbig5, x-x-big5, zh_tw-big5 | ||
Big5-HKSCS | big5-hkscs | ||
EUC-JP | cseucjpkdfmtjapanese, euc-jp, x-euc-jp | ||
EUC-KR | 5601, csksc56011987, csueckr, euc-kr, iso-ir-149, korean, ks_c_5601-1989, ksc5601, ksc_5601 | ||
gb18030 | gb18030 | ||
GB2312 | chinese, csgb2312, csiso58gb231280, gb2312, gb_2312, gb_2312-80, iso-ir-58, zh_cn.euc | ||
GEOSTD8 | geostd8 | | Does not seem to work. |
HZ-GB-2312 | hz-gb-2312 | ||
IBM850 | 850, cp850, csIBM850, ibm850 | | csIBM850 not recognised. |
IBM852 | 852, cp852, csIBM852, ibm852 | | csIBM852 not recognised. |
IBM855 | 855, cp855, csIBM855, ibm855 | | csIBM855 not recognised. |
IBM857 | 857, cp857, csIBM857, ibm857 | | csIBM857 not recognised. |
IBM862 | 862, cp862, csIBM862, ibm862 | | csIBM862 not recognised. |
IBM864 | 864, cp864, csIBM864, ibm-864, ibm864 | | csIBM864 not recognised. |
IBM864i | 864i, cp864i, csibm864i, ibm-864i, ibm864i | ||
IBM866 | 866, cp-866, cp866, csIBM866, ibm866 | | csIBM866 not recognised. |
ISO-2022-CN | iso-2022-cn, iso-2022-cn-ext | ||
ISO-2022-JP | csiso2022jp, csiso2022jp2, iso-2022-jp, iso-2022-jp-2 | ||
ISO-2022-KR | csiso2022kr, iso-2022-kr | ||
ISO-8859-1 | cp819, csisolatin1, ibm819, iso-8859-1, iso-ir-100, iso8859-1, iso88591, iso_8859-1, l1, latin1 | windows-1252 | |
ISO-8859-10 | csisolatin6, iso-8859-10, iso-ir-157, iso8859-10, iso885910, l6, latin6 | ||
ISO-8859-11 | iso-8859-11, iso8859-11, iso885911 | windows-874 | |
ISO-8859-12 | iso885912 | | Does not exist. |
ISO-8859-13 | iso-8859-13, iso8859-13, iso885913 | ||
ISO-8859-14 | iso-8859-14, iso8859-14, iso885914 | ||
ISO-8859-15 | iso-8859-15, iso8859-15, iso885915, iso_8859-15 | ||
ISO-8859-16 | iso-8859-16 | ||
ISO-8859-2 | csisolatin2, iso-8859-2, iso-ir-101, iso8859-2, iso88592, iso_8859-2, l2, latin2 | ||
ISO-8859-3 | csisolatin3, iso-8859-3, iso-ir-109, iso8859-3, iso88593, iso_8859-3, l3, latin3 | ||
ISO-8859-4 | csisolatin4, iso-8859-4, iso-ir-110, iso8859-4, iso88594, iso_8859-4, l4, latin4 | ||
ISO-8859-5 | csisolatincyrillic, cyrillic, iso-8859-5, iso-ir-144, iso8859-5, iso88595, iso_8859-5 | ||
ISO-8859-6 | arabic, asmo-708, csisolatinarabic, ecma-114, iso-8859-6, iso-ir-127, iso8859-6, iso88596, iso_8859-6 | ||
ISO-8859-6-E | csiso88596e, iso-8859-6-e | ||
ISO-8859-6-I | csiso88596i, iso-8859-6-i | ||
ISO-8859-7 | csisolatingreek, ecma-118, elot_928, greek, greek8, iso-8859-7, iso-ir-126, iso8859-7, iso88597, iso_8859-7, sun_eu_greek | ||
ISO-8859-8 | csisolatinhebrew, hebrew, iso-8859-8, iso-ir-138, iso8859-8, iso88598, iso_8859-8, visual | ||
ISO-8859-8-E | csiso88598e, iso-8859-8-e | ||
ISO-8859-8-I | csiso88598i, iso-8859-8-i, iso-8859-8i | ||
ISO-8859-9 | csisolatin5, iso-8859-9, iso-ir-148, iso8859-9, iso88599, iso_8859-9, l5, latin5 | ||
ISO-IR-111 | csiso111ecmacyrillic, ecma-cyrillic, iso-ir-111 | ||
KOI8-R | koi8-r | ||
KOI8-U | koi8-u | ||
Shift_JIS | csshiftjis, ms_kanji, shift-jis, shift_jis, windows-31j, x-sjis | ||
T.61-8bit | csiso103t618bit, iso-ir-103, t.61, t.61-8bit | ||
TIS-620 | tis-620, tis620 | windows-874 | |
us-ascii | 646, ansi_x3.4-1968, ascii, us-ascii | windows-1252 | Only the label us-ascii mapped to windows-1252 on *nix, and different behaviour inside iframe and after reload? |
UTF-16 | utf-16 | ||
UTF-16BE | csunicode, csunicode11, csunicodeascii, csunicodelatin1, iso-10646, iso-10646-j-1, iso-10646-ucs-2, iso-10646-ucs-basic, iso-10646-unicode-latin1, utf-16be, x-iso-10646-ucs-2-be | ||
UTF-16LE | utf-16le, x-iso-10646-ucs-2-le | ||
UTF-32BE | iso-10646-ucs-4, utf-32be, x-iso-10646-ucs-4-be | ||
UTF-32LE | utf-32le, x-iso-10646-ucs-4-le | ||
UTF-7 | csunicode11utf7, unicode-1-1-utf-7, unicode-2-0-utf-7, utf-7, x-unicode-2-0-utf-7 | ||
UTF-8 | unicode-1-1-utf-8, utf-8, utf8 | ||
VIQR | csviqr | | Usually just an ASCII transcription of Vietnamese. Could potentially be used as an encoding, but does not seem to work. |
VISCII | csviscii, viscii | ||
windows-1250 | cp1250, windows-1250, x-cp1250 | ||
windows-1251 | ansi-1251, cp1251, windows-1251, x-cp1251 | ||
windows-1252 | cp1252, windows-1252, x-cp1252 | ||
windows-1253 | cp1253, windows-1253, x-cp1253 | ||
windows-1254 | cp1254, windows-1254, x-cp1254 | ||
windows-1255 | cp1255, windows-1255, x-cp1255 | ||
windows-1256 | cp1256, windows-1256, x-cp1256 | ||
windows-1257 | cp1257, windows-1257, x-cp1257 | ||
windows-1258 | cp1258, windows-1258, x-cp1258 | ||
windows-874 | ibm874, windows-874 | ||
windows-936 | windows-936 | ||
x-euc-tw | cns11643, x-euc-tw, zh_tw-euc | ||
x-gbk | gbk, x-gbk | ||
x-imap4-modified-utf7 | x-imap4-modified-utf7 | ||
x-johab | x-johab | ||
x-mac-arabic | x-mac-arabic | ||
x-mac-ce | x-mac-ce | ||
x-mac-croatian | x-mac-croatian | ||
x-mac-cyrillic | x-mac-cyrillic | ||
x-mac-devanagari | x-mac-devanagari | ||
x-mac-farsi | x-mac-farsi | ||
x-mac-greek | x-mac-greek | ||
x-mac-gujarati | x-mac-gujarati | ||
x-mac-gurmukhi | x-mac-gurmukhi | ||
x-mac-hebrew | x-mac-hebrew | ||
x-mac-icelandic | x-mac-icelandic | ||
x-mac-roman | csMacintosh, mac, macintosh, x-mac-roman | | csMacintosh not recognised. |
x-mac-romanian | x-mac-romanian | ||
x-mac-turkish | x-mac-turkish | ||
x-mac-ukrainian | x-mac-ukrainian | ||
x-obsoleted-EUC-JP | x-obsoleted-euc-jp | ||
x-obsoleted-ISO-2022-JP | x-obsoleted-iso-2022-jp | ||
x-obsoleted-Shift_JIS | x-obsoleted-shift_jis | ||
x-user-defined | x-user-defined | ||
x-viet-tcvn5712 | x-viet-tcvn5712 | ||
x-viet-vni | x-viet-vni | | Does not seem to work. |
x-viet-vps | x-viet-vps | ||
x-windows-949 | ks_c_5601-1987, x-windows-949 |
Table generated from <http://mxr.mozilla.org/firefox/source/intl/uconv/src/charsetalias.properties>.
Aliases (used for parsing, apparently not for serialisation) are scattered around in a large number of files; cf. <http://mxr.mozilla.org/firefox/source/intl/uconv/ucvlatin/nsISO885911ToUnicode.cpp> for the mapping from ISO-8859-11 to windows-874.
8-bit encodings (excluding UTFs, CJK encodings and T.61) were tested with <http://coq.no/X/charset5/tests8bit.html> in Firefox 3.5.1 on OS X. The pass/fail results should not be taken too seriously yet, especially for the more obscure encodings.
Missing: Mappings for CJK encodings. Information about the encodings that are listed in charsetalias.properties but apparently do not work as expected. Platform/version differences?
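To see why a "Decoded As" entry such as ISO-8859-11 to windows-874 is observable at all, one can list the bytes on which two decoders disagree. The sketch below uses Python's own `iso8859_11` and `cp874` codecs as stand-ins; they are an assumption for illustration and are not guaranteed to match Firefox's tables byte for byte. The helper names are hypothetical.

```python
from typing import List


def decode_byte(value: int, codec: str) -> str:
    """Decode a single byte; unmapped bytes come back as U+FFFD."""
    return bytes([value]).decode(codec, errors="replace")


def differing_bytes(codec_a: str, codec_b: str) -> List[int]:
    """Return the byte values that the two codecs decode differently."""
    return [
        value
        for value in range(256)
        if decode_byte(value, codec_a) != decode_byte(value, codec_b)
    ]


if __name__ == "__main__":
    diff = differing_bytes("iso8859_11", "cp874")
    # Print each differing byte with both decodings as code points.
    for value in diff:
        a = decode_byte(value, "iso8859_11")
        b = decode_byte(value, "cp874")
        print(f"0x{value:02X}: U+{ord(a):04X} vs U+{ord(b):04X}")
```

A test page for the 8-bit encodings could use bytes from such a list to tell the two decoders apart, since bytes on which both agree (for example the Thai letters) reveal nothing.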
Chrome
FIXME
Internet Explorer
Needs sorting out:
- http://blogs.msdn.com/shawnste/archive/2009/08/18/alternate-encoding-names-recognized-by-net-ie.aspx
- http://msdn.microsoft.com/en-us/library/system.text.encoding.getencodings.aspx
Safari
FIXME