Encoding
Latest revision as of 10:31, 7 August 2013
This page tracks notes related to the Encoding Standard (http://encoding.spec.whatwg.org/). See Web Encodings for historical data on encodings and their labels.
Implementations
- http://code.google.com/p/stringencoding/ implements the standard in JavaScript
- http://coq.no/darcsweb/?r=flex-encoding is an experimental (f)lex implementation of the decoding algorithms
Legacy implementations
- Gecko
  - http://mxr.mozilla.org/mozilla-central/source/intl/uconv/
- Chromium
  - http://src.chromium.org/svn/trunk/deps/third_party/icu46/README.chromium
  - http://src.chromium.org/svn/trunk/deps/third_party/icu46/source/data/mappings/convrtrs.txt
  - http://src.chromium.org/svn/trunk/deps/third_party/icu46/source/data/mappings/ucmlocal.mk
Japanese encodings
These links turned up after the standard was written:
- http://www8.plala.or.jp/tkubota1/unicode-symbols.html
- http://www8.plala.or.jp/tkubota1/unicode-symbols-map2.html
Sniffing
- http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
- http://mxr.mozilla.org/mozilla-central/source/extensions/universalchardet/src/base/ (Gecko; tests are a few directories up)
- http://code.google.com/p/juniversalchardet/ (Gecko's detector in Java)
- https://bugzilla.mozilla.org/show_bug.cgi?id=631751 (UTF-16 sniffing in Gecko)
- http://trac.webkit.org/browser/trunk/Source/WebCore/loader/TextResourceDecoder.cpp (WebKit; is this all?)
Gecko notes
- Thai detection was not enabled before June 2009 (https://bugzilla.mozilla.org/show_bug.cgi?id=488426)
- Encodings not in the spec: iso-2022-cn, euc-tw
- GB2312 is supported, but its supersets gbk and gb18030 are not
- UTF-16 sniffing is HTML-specific
Misc
- XSS vulnerabilities with unusual character encodings
- Test multiple BOMs
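A test for multiple BOMs could start from something like the following Python sketch. It is not taken from any implementation listed on this page. The names `sniff_bom` and `decode_stripping_one_bom` are made up for illustration, and the sketch assumes that a decoder consumes at most one leading BOM, so a second BOM should survive as U+FEFF in the output.

```python
# BOM signatures, checked in order: UTF-8 first, then the two UTF-16 variants.
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM leads the input."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_stripping_one_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Decode data, consuming at most one leading BOM (an assumption of this sketch)."""
    encoding, skip = sniff_bom(data)
    # Invalid sequences become U+FFFD instead of raising an exception.
    return data[skip:].decode(encoding or fallback, errors="replace")


# Hypothetical test inputs: the same BOM twice, in UTF-8 and in UTF-16LE.
double_utf8 = b"\xef\xbb\xbfabc".replace(b"abc", b"\xef\xbb\xbfabc")
double_utf16le = b"\xff\xfe\xff\xfea\x00"

for sample in (double_utf8, double_utf16le):
    # repr() makes the surviving U+FEFF visible in the printed result.
    print(sniff_bom(sample), repr(decode_stripping_one_bom(sample)))
```

Inputs that mix two different BOMs (for example a UTF-8 BOM followed by a UTF-16 one) are worth adding as well; what an implementation does with them is exactly what such a test would need to find out.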
Labels
Labels in Opera that are not in the spec: 'ansi_x3.4-1986', 'cn-gb', 'cp367', 'cp50220', 'cp51932', 'cp932', 'cp936', 'csascii', 'cscp50220', 'cscp51932', 'csinvariant', 'csiso646basic1983', 'csunicode', 'csunicode11', 'csunicode11utf7', 'csunicodeascii', 'csunicodejapanese', 'csunicodelatin1', 'csviscii', 'cswindows31j', 'euc-cn', 'euc-tw', 'extended_unix_code_packed_format_for_japanese', 'ibm367', 'invariant', 'iso-10646', 'iso-10646-j-1', 'iso-10646-ucs-2', 'iso-10646-ucs-basic', 'iso-10646-unicode-latin1', 'iso-2022-cn', 'iso-2022-jp-1', 'iso-celtic', 'iso-ir-199', 'iso-ir-226', 'iso-ir-6', 'iso646-us', 'iso8859-16', 'iso885916', 'iso_646.basic:1983', 'iso_646.irv:1991', 'iso_8859-10:1992', 'iso_8859-14', 'iso_8859-14:1998', 'iso_8859-16', 'iso_8859-16:2001', 'iso_8859-6-e', 'iso_8859-6-i', 'iso_8859-8-e', 'iso_8859-8-i', 'l10', 'l8', 'latin-9', 'latin10', 'latin8', 'microsoft-cp1250', 'microsoft-cp1251', 'microsoft-cp1252', 'microsoft-cp1253', 'microsoft-cp1254', 'microsoft-cp1255', 'microsoft-cp1256', 'microsoft-cp1257', 'microsoft-cp1258', 'ms932', 'ms936', 'ref', 'tis-620-2533', 'unicode-1-1', 'unicode-1-1-utf-7', 'us', 'utf-7', 'viscii', 'windows-936', 'x-mac-ce', 'x-mac-greek', 'x-mac-turkish', 'x-user-defined'
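When comparing label lists like the one above, it helps to normalize each label first. The Python sketch below assumes the matching rule is to strip leading and trailing ASCII whitespace and compare ASCII case-insensitively. The helper names and the `OPERA_ONLY_SAMPLE` set (a handful of labels copied from the list above) are illustrative only.

```python
# ASCII whitespace: TAB, LF, FF, CR, and SPACE.
ASCII_WHITESPACE = "\t\n\f\r "

# A small sample of the Opera-only labels listed above, for illustration.
OPERA_ONLY_SAMPLE = frozenset(
    {"cp932", "euc-tw", "iso-2022-cn", "latin-9", "utf-7", "viscii"}
)


def ascii_lower(text: str) -> str:
    """Lowercase only A-Z, leaving non-ASCII characters untouched."""
    return "".join(chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch for ch in text)


def normalize_label(label: str) -> str:
    """Strip surrounding ASCII whitespace, then ASCII-lowercase the label."""
    return ascii_lower(label.strip(ASCII_WHITESPACE))


def is_opera_only(label: str) -> bool:
    """True if the normalized label is in the illustrative Opera-only sample."""
    return normalize_label(label) in OPERA_ONLY_SAMPLE


for candidate in (" UTF-7\n", "\tLatin-9 ", "utf-8"):
    print(repr(candidate), is_opera_only(candidate))
```

`ascii_lower` is used instead of `str.lower()` because the latter also folds some non-ASCII characters (such as U+212A KELVIN SIGN) to ASCII letters, which would make more labels match than an ASCII-only comparison allows.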