StringEncoding: Difference between revisions
Revision as of 17:38, 27 June 2012
Proposed Text Encoding Web API for Typed Arrays
Editors
- Joshua Bell (Google, Inc)
Abstract
This specification defines an API for encoding strings to binary data, and decoding strings from binary data.
NOTE: This specification intentionally does not address the opposite scenario of encoding binary data as strings and decoding binary data from strings, for example using Base64 encoding.
Discussion on this topic initially took place on the Khronos WebGL public mailing list. See http://www.khronos.org/webgl/public-mailing-list/archives/1111/msg00017.html for the initial discussion thread.
Discussion has since moved to the WHATWG spec discussion mailing list. See http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-March/035038.html for the latest discussion thread.
Open Issues
- Encoding errors
  - Define options dict for TextEncoder() that allows selecting between throw vs. replacement character / replacement callback?
- The Encoding specification defines the byte order mark as more authoritative than an explicit encoding label when decoding content.
  - Is that desirable here?
  - What does it mean to e.g. write TextDecoder('iso-2022-kr').decode(str) if string turns out to be UTF-8?
  - What about streaming and other re-uses of the same decoder object?
  - At the very least, a matching BOM should be ignored (i.e. not returned as part of the output) and a mismatching BOM should signal an error.
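As an illustration of the "matching BOM is ignored" option, here is a minimal sketch. It assumes the TextDecoder found in current browsers and Node.js, which is not necessarily what this draft will end up specifying. The helper name decodeUtf8 is hypothetical.

```javascript
// Sketch only: relies on the TextDecoder available in current runtimes,
// whose BOM handling may differ from what this draft finally specifies.
function decodeUtf8(bytes) {
  return new TextDecoder("utf-8").decode(bytes);
}

// UTF-8 BOM (EF BB BF) followed by "hi", and the same text without a BOM.
var withBom = new Uint8Array([0xEF, 0xBB, 0xBF, 0x68, 0x69]);
var withoutBom = new Uint8Array([0x68, 0x69]);

var a = decodeUtf8(withBom);
var b = decodeUtf8(withoutBom);

// The matching BOM is consumed and not returned as part of the output.
console.log(a === b); // true
```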
Notes and TODOs
- Streaming decode/encode requires retaining partial buffers between calls.
  - Some encode/decode algorithms require adjusting the code point pointer or byte pointer by a negative amount. This could occur across "chunk" boundaries. This implies that when the internal streaming flag is set on an encoder/decoder, the last N elements of the stream are saved for the next call and used as a prefix for the stream. N is defined by the specific encoding algorithm.
  - This is not yet implemented in the ECMAScript shim.
- Move the algorithm to convert a sequence of Unicode characters to a DOMString into WebIDL.
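The chunk-boundary problem in the streaming note above can be seen with a multi-byte UTF-8 sequence split across two calls. This is a minimal sketch, assuming a TextDecoder that accepts the stream option as drafted here (the one in current browsers and Node.js does). The variable names are illustrative.

```javascript
// "€" is U+20AC, encoded in UTF-8 as the three bytes E2 82 AC.
var chunk1 = new Uint8Array([0xE2, 0x82]); // incomplete sequence
var chunk2 = new Uint8Array([0xAC]);       // final byte of the sequence

var decoder = new TextDecoder("utf-8");

// With stream: true the decoder retains the partial sequence instead of
// treating the end of the chunk as the end of the input.
var first = decoder.decode(chunk1, { stream: true });

// The last call omits the option, so the end of input is signalled.
var second = decoder.decode(chunk2);

console.log(JSON.stringify(first)); // ""
console.log(second === "\u20AC");   // true
```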
Resolved Issues
- Encode as many characters as possible into a fixed-size buffer for transmission, and repeat starting with next unencoded character
  - Resolution: not for "v1" - can be implemented using this API, and breaking the string down by "character" is unlikely to be as obvious as it sounds - surrogate pairs, combining sequences, etc
- Support two versions of encode; one which takes target buffer, and one which creates/returns a right-sized buffer
  - Resolution: See above, and wait for developer feedback.
- Allow arbitrary end byte sequences (e.g. 0xFF for UTF-8 strings)
  - Resolution: Add indexOf to ArrayBufferView
- Remove binary encoding? (a proposed 8-bit-clean encoding for interop with legacy binary data stored in ECMAScript strings)
  - The only real use is with atob()/btoa(); a better API would be Base64 directly in/out of Typed Arrays. Consensus on WHATWG is to add better APIs, e.g. partial interface ArrayBufferView { DOMString toBase64(); }; partial interface ArrayBuffer { static ArrayBuffer fromBase64(DOMString string); };
- Should this be a standalone API (as written), or live on e.g. DataView, or on e.g. String?
  - There seems to be pretty strong consensus that streaming, stateful coding is high priority and that an object-oriented API is cleanest, and shoe-horning those onto existing objects would be messy.
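The first resolved issue says that filling fixed-size buffers can be built on top of this API. One possible sketch follows. encodeInChunks is a hypothetical helper, not part of the proposal, and it assumes a UTF-8 TextEncoder as found in current runtimes. It splits on code point boundaries only, so surrogate pairs stay intact but combining sequences may still be separated.

```javascript
// Hypothetical helper: encode a string into UTF-8 chunks of at most
// maxBytes bytes each, never splitting a code point across chunks.
function encodeInChunks(string, maxBytes) {
  var encoder = new TextEncoder(); // UTF-8
  var chunks = [];
  var current = [];

  // Array.from iterates by code point, so a surrogate pair is one item.
  Array.from(string).forEach(function (ch) {
    var bytes = encoder.encode(ch);
    if (bytes.length > maxBytes) {
      throw new RangeError("maxBytes is too small for a single code point");
    }
    if (current.length + bytes.length > maxBytes) {
      chunks.push(new Uint8Array(current));
      current = [];
    }
    for (var i = 0; i < bytes.length; i += 1) {
      current.push(bytes[i]);
    }
  });

  if (current.length > 0) {
    chunks.push(new Uint8Array(current));
  }
  return chunks;
}

// "a" (1 byte), U+20AC (3 bytes), "b" (1 byte), U+1F600 (4 bytes).
var chunks = encodeInChunks("a\u20ACb\uD83D\uDE00", 4);
console.log(chunks.map(function (c) { return c.length; })); // [ 4, 1, 4 ]
```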
API
TextEncoder
WebIDL
dictionary TextEncodeOptions {
  boolean stream = false;
};

[Constructor,
 Constructor(DOMString encoding)]
interface TextEncoder {
  readonly attribute DOMString encoding;
  ArrayBufferView encode(DOMString? string, optional TextEncodeOptions options);
};
The constructor runs the following steps:
- If the constructor is called with no arguments, let label be "utf-8". Otherwise, let label be the value of the encoding argument.
- Run the steps to get an encoding from Encoding, with label as label.
  - If the steps result in failure, throw an "EncodingError" exception and terminate these steps.
  - Otherwise, set the encoder object's internal encoding property to the returned encoding.
- Initialize the internal streaming flag of the encoder object to false.
- Initialize the internal encoding algorithm state to the default values for the encoder object's encoding.
encoding of type DOMString, readonly
- Returns the Name of the encoder object's encoding, per Encoding.
- Note that this may differ from the name of the encoding specified during the call to the constructor. For example, if the constructor is called with encoding of "ascii" the encoding attribute of the encoder object would have the value "windows-1252" as "ascii" is a label for that encoding.
encode
The encode method runs these steps:
- If the internal streaming flag is not set, then reset the encoding algorithm state to the default values for encoding. Otherwise, the encoding algorithm state is re-used from the previous call to encode on this object.
- If the options parameter is specified and the stream option is true, then the internal streaming flag is set; otherwise the internal streaming flag is cleared.
- Run the steps of the encoding algorithm:
  - The input to the algorithm is a stream of code points. The stream is composed of the Unicode code points for the Unicode characters produced by following the steps to convert a DOMString to a sequence of Unicode characters in WebIDL with string as the input. If string is null, the stream of code points is empty.
  - If the options parameter is not specified or the stream option is false, then after the final code point is yielded by the stream, the EOF code point is yielded.
  - The output of the algorithm is a sequence of emitted bytes.
- Return a Uint8Array object wrapping an ArrayBuffer containing the sequence of bytes emitted by the encoding algorithm.
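A short usage sketch of encode follows. It assumes the TextEncoder found in current browsers and Node.js, which supports only UTF-8 and takes no options argument, so it covers just the default, non-streaming path of the steps above.

```javascript
// No constructor argument: the encoder uses UTF-8.
var encoder = new TextEncoder();

// "h\u00E9llo" is "héllo"; U+00E9 takes two bytes in UTF-8 (C3 A9).
var bytes = encoder.encode("h\u00E9llo");

console.log(bytes instanceof Uint8Array); // true
console.log(Array.from(bytes));           // [ 104, 195, 169, 108, 108, 111 ]
console.log(encoder.encoding);            // "utf-8"
```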
TextDecoder
WebIDL
dictionary TextDecoderOptions {
  boolean fatal = false;
};

dictionary TextDecodeOptions {
  boolean stream = false;
};

[Constructor,
 Constructor(optional DOMString encoding, optional TextDecoderOptions options)]
interface TextDecoder {
  readonly attribute DOMString encoding;
  DOMString decode(optional ArrayBufferView view, optional TextDecodeOptions options);
};
The constructor runs the following steps:
- The constructor creates a decoder object. It has the internal properties encoding, a fatal flag which is initially unset, and a streaming flag which is initially unset.
- If called without an encoding argument, let label be "utf-8". Otherwise, let label be the value of the encoding argument.
- Run the steps to get an encoding from Encoding, with label as label.
  - If the steps result in failure, throw an "EncodingError" exception and terminate these steps.
  - Otherwise, set the decoder object's internal encoding property to the returned encoding.
- If the constructor is called with an options argument, and the fatal property of the dictionary is set, set the internal fatal flag of the decoder object.
- Initialize the internal encoding algorithm state to the default values for the decoder object's encoding.
- Return the decoder object.
encoding of type DOMString, readonly
- Returns the Name of the decoder object's encoding, per Encoding.
- Note that this may differ from the name of the encoding specified during the call to the constructor. For example, if the constructor is called with encoding of "ascii" the encoding attribute of the decoder object would have the value "windows-1252" as "ascii" is a label for that encoding.
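The label-to-name resolution described above can be observed directly. This sketch assumes the TextDecoder in current browsers and Node.js. It uses only UTF-8 labels, because support for legacy encodings such as windows-1252 depends on how the runtime was built. In those runtimes an unknown label raises a RangeError rather than the "EncodingError" named in this draft, so the sketch only checks that something is thrown.

```javascript
// Labels are matched case-insensitively and resolved to the encoding's name.
var names = ["utf8", "UTF-8"].map(function (label) {
  return new TextDecoder(label).encoding;
});
console.log(names); // [ 'utf-8', 'utf-8' ]

// An unknown label makes the constructor throw.
var threw = false;
try {
  new TextDecoder("no-such-encoding");
} catch (e) {
  threw = true;
}
console.log(threw); // true
```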
decode
The decode method runs the following steps:
- If the internal streaming flag of the decoder object is not set, then reset the encoding algorithm state to the default values for the decoder object's encoding. Otherwise, the encoding algorithm state is re-used from the previous call to decode on this object.
- If the options parameter is specified and the stream option is true, then the internal streaming flag is set; otherwise the internal streaming flag is cleared.
- Run the decoder algorithm of the decoder object's encoding:
  - The input to the algorithm is a byte stream. The byte stream is provided by the bytes in view.buffer starting at offset view.byteOffset. A maximum of view.byteLength bytes are yielded by the stream from view.buffer. If view is not specified, the stream is empty.
  - If the options parameter is not specified or the stream option is false, then after view.byteLength bytes are yielded by the stream, the EOF byte is yielded.
  - If the internal fatal flag of the decoder object is set, then a decoder error causes a DOMException of type EncodingError to be thrown rather than emitting a fallback code point.
  - The output of the algorithm is a sequence of emitted code points.
- Return a DOMString by encoding the sequence of emitted code points following the steps to convert a sequence of Unicode characters to a DOMString.
- ISSUE: Need to handle BOMs. Encoding specifies this by using the caller-specified encoding as a suggestion, and consuming the BOM as a part of selecting the real encoder where BOM takes precedence. At the very least a matching BOM should be ignored. A mismatching BOM could throw, be a decoding error, or could actually switch the decoder for this stream (call or sequence of calls), possibly if-and-only-if the constructor was called without an encoding.
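The effect of the fatal flag on a decoder error can be sketched as follows, assuming the TextDecoder in current browsers and Node.js. The exception type thrown there may differ from the EncodingError named in this draft, so the sketch only checks that decoding throws.

```javascript
// "a", a byte that is never valid in UTF-8, then "b".
var invalid = new Uint8Array([0x61, 0xFF, 0x62]);

// Fatal flag unset (the default): the invalid byte becomes the
// fallback code point U+FFFD.
var lenient = new TextDecoder("utf-8").decode(invalid);
console.log(lenient === "a\uFFFDb"); // true

// Fatal flag set: the same input throws instead.
var threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(invalid);
} catch (e) {
  threw = true;
}
console.log(threw); // true
```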
Algorithms
Steps to convert a sequence of Unicode characters to a DOMString
- TODO: Move the following to WebIDL
The following algorithm defines a way to convert a sequence of Unicode characters to a DOMString:
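- Let U[0], ..., U[n-1] be the sequence of Unicode characters
- Initialize i to 0
- Initialize S to be an empty sequence of code units
- While i < n:
  - Let c be the code point of the Unicode character in U at index i
  - If c ≥ 2^16, then append to S the two code units of the UTF-16 surrogate pair for c:
    - Let d be c − 2^16
    - Append to S a code unit equal to 0xD800 + (d >> 10)
    - Append to S a code unit equal to 0xDC00 + (d & 0x3FF)
  - Otherwise, append to S a code unit equal to c.
  - Set i to i+1
- Return the IDL DOMString value that represents the sequence of code units S.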
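The conversion above can be sketched in ECMAScript. The function name codePointsToString is hypothetical, and the handling of code points at or above 2^16 assumes the standard UTF-16 surrogate pair computation.

```javascript
// Hypothetical helper: convert an array of Unicode code points into a
// string (a sequence of 16-bit code units).
function codePointsToString(codePoints) {
  var units = [];
  for (var i = 0; i < codePoints.length; i += 1) {
    var c = codePoints[i];
    if (c >= 0x10000) {
      // Outside the Basic Multilingual Plane: emit a surrogate pair.
      var d = c - 0x10000;
      units.push(0xD800 + (d >> 10));   // lead surrogate
      units.push(0xDC00 + (d & 0x3FF)); // trail surrogate
    } else {
      units.push(c);
    }
  }
  return String.fromCharCode.apply(null, units);
}

// "h" followed by U+1F600, which needs two code units.
var s = codePointsToString([0x68, 0x1F600]);
console.log(s.length);              // 3
console.log(s === "h\uD83D\uDE00"); // true
```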
Examples
Example #1 - encoding strings
The following example uses the API to encode an array of strings into an ArrayBuffer. The result is an ArrayBuffer containing the number of strings (as a Uint32), followed by the length of the first string (as a Uint32), the encoded string data, the length of the second string (as a Uint32), the string data, and so on.
function encodeArrayOfStrings(strings, encoding) {
  var encoder, encoded, len, i, bytes, view, offset;

  encoder = new TextEncoder(encoding);
  encoded = [];

  // First pass: encode each string and compute the total size.
  len = Uint32Array.BYTES_PER_ELEMENT; // string count
  for (i = 0; i < strings.length; i += 1) {
    len += Uint32Array.BYTES_PER_ELEMENT; // length prefix
    encoded[i] = encoder.encode(strings[i]);
    len += encoded[i].byteLength;
  }

  // Second pass: write the count, then each length-prefixed string.
  bytes = new Uint8Array(len);
  view = new DataView(bytes.buffer);
  offset = 0;

  view.setUint32(offset, strings.length);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < encoded.length; i += 1) {
    len = encoded[i].byteLength;
    view.setUint32(offset, len);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    bytes.set(encoded[i], offset);
    offset += len;
  }
  return bytes.buffer;
}
Example #2 - decoding strings
The following example decodes an ArrayBuffer containing data encoded in the format produced by the previous example back into an array of strings.
function decodeArrayOfStrings(buffer, encoding) {
  var decoder, view, offset, num_strings, strings, i, len;

  decoder = new TextDecoder(encoding);
  view = new DataView(buffer);
  offset = 0;
  strings = [];

  // Read the string count, then each length-prefixed string.
  num_strings = view.getUint32(offset);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < num_strings; i += 1) {
    len = view.getUint32(offset);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    // Decode only the len bytes that belong to this string.
    strings[i] = decoder.decode(new DataView(view.buffer, offset, len));
    offset += len;
  }
  return strings;
}
Encodings
Encodings are defined and implemented per Encoding. This implicitly includes the steps to get an encoding from a string, and the logic for label matching and case-insensitivity.
User agents MUST NOT support any encodings or labels other than those defined in Encoding, and MUST support all encodings and labels defined in that specification, with the additions defined below.
- NOTE: In Encoding, "ascii" is a label for windows-1252; there is no 7-bit-or-raise-exception encoding. Applications that are required to restrict the content of decoded strings should implement validation after decoding.
- NOTE: Unicode normalization forms are outside the scope of this specification. No normalization is done prior to encoding or after decoding.
- NOTE: Handling of encoding-specific issues, e.g. over-long UTF-8 encodings, byte order marks, unmatched surrogate pairs, and so on is defined by Encoding.
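The first note above recommends validating after decoding when an application needs strictly 7-bit content. A minimal sketch of that follows. isSevenBit and decodeSevenBit are hypothetical helpers, and the decoder passed in is assumed to be a TextDecoder as found in current runtimes.

```javascript
// Hypothetical helper: true only if every code unit is in the 7-bit range.
function isSevenBit(string) {
  for (var i = 0; i < string.length; i += 1) {
    if (string.charCodeAt(i) > 0x7F) {
      return false;
    }
  }
  return true;
}

// Hypothetical helper: decode, then reject anything outside 7-bit ASCII.
function decodeSevenBit(decoder, bytes) {
  var text = decoder.decode(bytes);
  if (!isSevenBit(text)) {
    throw new RangeError("decoded text contains non-ASCII characters");
  }
  return text;
}

var decoder = new TextDecoder(); // UTF-8 here; any decoder could be passed in

console.log(decodeSevenBit(decoder, new Uint8Array([0x6F, 0x6B]))); // "ok"

var rejected = false;
try {
  // 0xE9 does not decode to a 7-bit character, so validation fails.
  decodeSevenBit(decoder, new Uint8Array([0x6F, 0xE9]));
} catch (e) {
  rejected = true;
}
console.log(rejected); // true
```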
References
- WebIDL http://dev.w3.org/2006/webapi/WebIDL
- Encoding http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html
Acknowledgements
- Alan Chaney
- Ben Noordhuis
- Glenn Maynard
- John Tamplin
- Kenneth Russell (Google, Inc)
- Robert Mustacchi
- Ryan Dahl
- Anne van Kesteren
Appendix
A "shim" implementation in JavaScript (that may not fully match the current version of the spec) plus some initial unit tests can be found at: http://code.google.com/p/stringencoding/