Revision as of 22:21, 14 March 2012
Proposed String Encoding API for Typed Arrays
Editors
- Joshua Bell (Google, Inc)
Abstract
This specification defines an API for encoding strings to binary data, and decoding strings from binary data.
NOTE: This specification intentionally does not address the opposite scenario of encoding binary data as strings and decoding binary data from strings, for example using Base64 encoding.
Discussion on this topic initially took place on the Khronos public WebGL mailing list. See http://www.khronos.org/webgl/public-mailing-list/archives/1111/msg00017.html for the initial discussion thread.
Discussion has since moved to the WHATWG spec discussion mailing list. See http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-March/035038.html for the latest discussion thread.
Open Issues
General: Should this be a standalone API (as written), or live on e.g. DataView, or on e.g. String?
Scenarios
- Encode as many characters as possible into a fixed-size buffer for transmission, and repeat starting with next unencoded character
- Parse/emit legacy data formats that do not use UTF-8
Desired Features
- Allow arbitrary end byte sequences (e.g. 0xFF for UTF-8 strings)
- Conversion errors should either produce replacement characters (U+FFFD, etc) or the API should allow selection between replacement vs. throwing, or some other means of reporting errors (e.g. a count). i.e. do we set the "fatal flag" for
API cleanup
- Support two versions of encode; one which takes target buffer, and one which creates/returns a right-sized buffer
Spec Issues
- Resolve behavior when writing to a buffer that's too small - partial fill or don't change?
- ISSUE: Encoding defines the byte order mark as more authoritative than anything else. Is that desirable here?
API
Scripts in pages access the API through the top-level window.stringEncoding object, which holds methods for encoding and decoding strings. Worker scripts can similarly use the self.stringEncoding object. (Since window and self are the page and worker global objects, respectively, scripts can simply refer to stringEncoding without a prefix.)
WebIDL
partial interface Window {
  readonly attribute StringEncoding stringEncoding;
};

partial interface WorkerUtils {
  readonly attribute StringEncoding stringEncoding;
};

The stringEncoding object exposes static methods for encoding strings to, and decoding strings from, objects containing binary data as specified in the Typed Array specification.
WebIDL
interface StringEncoding {
  DOMString decode(ArrayBufferView view, optional DOMString encoding) raises(DOMException);
  long stringLength(ArrayBufferView view, optional DOMString encoding) raises(DOMException);
  unsigned long encode(DOMString value, ArrayBufferView view, optional DOMString encoding) raises(DOMException);
  unsigned long encodedLength(DOMString value, optional DOMString encoding) raises(DOMException);
};
For all methods that take an ArrayBufferView parameter, the view's byteOffset and byteLength attributes are used to offset and limit the encoding/decoding operation against the view's underlying ArrayBuffer (its buffer attribute). Per the Typed Array Specification, reading/writing beyond the limits of the view raises an exception.
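To illustrate the paragraph above, the following sketch shows which bytes of the underlying buffer a method would operate on for a given view. The helper name bytesOfView is hypothetical and is not part of this proposal; it uses only the standard Typed Array attributes.

```javascript
// Hypothetical helper (not part of the proposed API): returns exactly the
// bytes of the underlying ArrayBuffer that a view exposes.
function bytesOfView(view) {
  return new Uint8Array(view.buffer, view.byteOffset, view.byteLength);
}

var buffer = new ArrayBuffer(8);
new Uint8Array(buffer).set([0, 1, 2, 3, 4, 5, 6, 7]);

// A DataView covering 4 bytes of the buffer, starting at byte 2.
var view = new DataView(buffer, 2, 4);

// windowBytes aliases buffer bytes 2 through 5; nothing outside that
// range would be read or written by an operation limited to the view.
var windowBytes = bytesOfView(view);
```

Any encoding or decoding step described below is assumed to see only this window, never the rest of the buffer.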
decode
This method decodes a string from binary data, using a specified encoding.
The method performs the steps to decode a byte stream from Encoding, with the input stream provided by the byte data in view.buffer starting at offset view.byteOffset with length view.byteLength, and the input label from encoding if specified, "utf-8" otherwise.
The method returns a DOMString by encoding the stream of code points emitted by the steps as UTF-16 as per WebIDL.
The fatal flag defined in Encoding is not set.
If the decoding steps return failure (including if the specified encoding is not matched), an exception (TBD) is raised.
- NOTE: U+0000 characters have no special meaning and are returned as part of the string.
- ISSUE: Behavior if decoding stops inside a multi-byte sequence.
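The behavior described above can be approximated with the TextDecoder interface that later shipped in browsers and Node.js. This is only a sketch: TextDecoder is not part of this proposal, the name decodeSketch is hypothetical, and the exception type (RangeError from the constructor) stands in for the "TBD" exception.

```javascript
// Sketch only: approximates the proposed decode() with TextDecoder,
// which is an assumption and NOT the API defined by this document.
function decodeSketch(view, encoding) {
  // The label defaults to "utf-8". An unrecognized label makes the
  // TextDecoder constructor throw a RangeError.
  var label = encoding === undefined ? "utf-8" : encoding;
  // fatal: false mirrors "the fatal flag is not set": malformed input
  // yields U+FFFD instead of an exception.
  var decoder = new TextDecoder(label, { fatal: false });
  // TextDecoder honors the view's byteOffset and byteLength.
  return decoder.decode(view);
}

var bytes = new Uint8Array([0x41, 0x00, 0xFF, 0x42]);

// U+0000 is kept as an ordinary character, and the invalid byte 0xFF
// is replaced by U+FFFD.
var text = decodeSketch(bytes);

// Only the last byte of the same buffer is decoded here.
var tail = decodeSketch(new DataView(bytes.buffer, 3, 1));
```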
stringLength
This method determines the length of a "null-terminated" string encoded in binary data, using a specified encoding.
This method performs the steps to decode a byte stream from Encoding, with the input stream provided by the byte data in view.buffer starting at offset view.byteOffset with length view.byteLength, and the input label from encoding if specified, "utf-8" otherwise.
As soon as the steps emit the code point U+0000, decoding is terminated and the byte pointer within the byte stream is returned. If decoding completes and no U+0000 code point was emitted, -1 is returned.
The fatal flag defined in Encoding is not set.
If the decoding steps return failure (including if the specified encoding is not matched), an exception (TBD) is raised.
- NOTE: The byte sequence representing the terminator is encoding specific. For example, in UTF-16 encodings it would be the even-aligned two-octet sequence 0x00 0x00.
- NOTE: If the encoded string includes a BOM, that is considered part of the length. For example, stringLength would return a length of 8 for the UTF-16BE sequence 0xFE 0xFF 0x00 0x41 0x00 0x42 0x00 0x43 0x00 0x00.
- ISSUE: Add an optional unsigned short terminator member, defaults to 0?
- ISSUE: To allow terminators which aren't code points (e.g. 0xFF in UTF-8), make the optional terminator either a code point (default 0) or an Array of octets (e.g. [ 0xFF, 0xFF ])?
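The terminator search described in the notes above can be sketched as a plain byte scan. This is a simplified illustration under stated assumptions: the name stringLengthSketch is hypothetical, only the exact lowercase labels "utf-8", "utf-16be" and "utf-16le" are handled, and it scans bytes directly rather than running the full decoding steps from Encoding.

```javascript
// Sketch only: returns the byte offset of the first U+0000 terminator
// within the view, or -1 if none is found.
function stringLengthSketch(view, encoding) {
  var bytes = new Uint8Array(view.buffer, view.byteOffset, view.byteLength);
  var label = encoding === undefined ? "utf-8" : encoding;
  var i;
  if (label === "utf-8") {
    // In UTF-8 the byte 0x00 only ever represents U+0000.
    for (i = 0; i < bytes.length; i += 1) {
      if (bytes[i] === 0x00) { return i; }
    }
    return -1;
  }
  if (label === "utf-16be" || label === "utf-16le") {
    // The terminator is an even-aligned 0x00 0x00 pair, so byte order
    // does not matter for the scan. A leading BOM is simply counted.
    for (i = 0; i + 1 < bytes.length; i += 2) {
      if (bytes[i] === 0x00 && bytes[i + 1] === 0x00) { return i; }
    }
    return -1;
  }
  throw new Error("Encoding not covered by this sketch: " + label);
}

// The UTF-16BE sequence from the note above: BOM, "ABC", terminator.
var sample = new Uint8Array([0xFE, 0xFF, 0x00, 0x41, 0x00, 0x42, 0x00, 0x43, 0x00, 0x00]);
var length = stringLengthSketch(sample, "utf-16be");
```

Stepping two bytes at a time is what keeps an unaligned 0x00 0x00 (the low byte of one code unit followed by the high byte of the next) from being mistaken for a terminator.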
encode
This method encodes a string into binary data, using a specified encoding.
The method performs the steps to encode a code point stream from Encoding, with the input code point stream provided by the DOMString value, and the input string label from encoding if specified, "utf-8" otherwise.
This specification requires that the code units within the DOMString are interpreted as UTF-16 code units. That is, to produce the code point stream a UTF-16 decoding operation must be performed to handle surrogate pairs.
- ISSUE: Interpreting a DOMString as UTF-16 to yield a code point stream needs to be defined, including unpaired surrogates. WebIDL only defines the reverse.
If the encoding steps return failure (including if the specified encoding is not matched), an exception (TBD) is raised.
Otherwise, the output of the encoding steps is a stream of bytes. If the length of the stream is greater than view.byteLength, an exception (TBD) is raised.
If this method raises an exception for any reason, view.buffer MUST NOT be modified.
- ISSUE: Do we need to specify the case where encoding fails early due to length, but would have failed later due to invalid data?
If this method does not raise an exception, the stream of bytes produced by the encoding steps is written to view.buffer starting at view.byteOffset, and the length of the stream of bytes is returned.
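The all-or-nothing rule above (raise without touching the buffer, or write everything and return the byte count) can be sketched as follows. This is an illustration under assumptions: it uses the later TextEncoder interface, which is not part of this proposal and supports UTF-8 only; the name encodeSketch is hypothetical; and RangeError is a placeholder for the "TBD" exception.

```javascript
// Sketch only: UTF-8 encode into a view, never modifying the buffer
// when the encoded bytes do not fit.
function encodeSketch(value, view) {
  // Encode into a temporary array first so the length is known before
  // anything is written to the target.
  var encoded = new TextEncoder().encode(value);
  if (encoded.length > view.byteLength) {
    // Placeholder exception type; the proposal leaves it TBD.
    throw new RangeError("Encoded string does not fit in the view");
  }
  // Write starting at view.byteOffset within view.buffer.
  new Uint8Array(view.buffer, view.byteOffset, view.byteLength).set(encoded);
  return encoded.length;
}

var target = new Uint8Array(8);
// "h\u00E9llo" needs 6 bytes in UTF-8; the view leaves the first two
// bytes of the buffer outside its range.
var written = encodeSketch("h\u00E9llo", new DataView(target.buffer, 2));
```

Encoding to a temporary array is the simplest way to guarantee the buffer is untouched on failure, at the cost of one extra copy.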
- ISSUE: Would be nice to support "partial fill" and return an object with e.g. bytesWritten and charactersWritten properties.
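One possible shape for the "partial fill" idea is shown below, using TextEncoder.encodeInto, a later API that is not part of this proposal and handles UTF-8 only. Its result has read (UTF-16 code units consumed) and written (bytes stored) properties, which play the roles of charactersWritten and bytesWritten. The function name encodeInChunks is hypothetical.

```javascript
// Sketch only: encodes a string as a series of fixed-size chunks,
// resuming each time at the first code unit that did not fit.
function encodeInChunks(value, chunkSize) {
  var encoder = new TextEncoder();
  var chunks = [];
  var position = 0;
  while (position < value.length) {
    var buffer = new Uint8Array(chunkSize);
    // encodeInto stops at a character boundary once the buffer is full.
    var result = encoder.encodeInto(value.slice(position), buffer);
    if (result.read === 0) {
      // Not even one character fits; avoid looping forever.
      throw new RangeError("Buffer too small for the next character");
    }
    chunks.push(buffer.subarray(0, result.written));
    position += result.read;
  }
  return chunks;
}

var chunks = encodeInChunks("h\u00E9llo w\u00F6rld", 4);
```

Because a multi-byte character is never split across buffers, a chunk may be shorter than the buffer size.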
encodedLength
This method determines the byte length of an encoded string, using a specified encoding.
The method performs the steps to encode a code point stream from Encoding, with the input code point stream provided by the DOMString value, and the input string label from encoding if specified, "utf-8" otherwise.
This specification requires that the code units within the DOMString are interpreted as UTF-16 code units. That is, to produce the code point stream a UTF-16 decoding operation must be performed to handle surrogate pairs.
- ISSUE: Interpreting a DOMString as UTF-16 to yield a code point stream needs to be defined, including unpaired surrogates. WebIDL only defines the reverse.
If the encoding steps return failure (including if the specified encoding is not matched), an exception (TBD) is raised.
If this method does not raise an exception, the length of the stream of bytes produced by the encoding steps is returned. The stream of bytes itself is not used. Implementations MAY therefore optimize to not produce an actual stream, or determine the length using other means for certain encodings, if the results are indistinguishable from those of performing the steps.
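As an example of the optimization permitted above, the UTF-8 byte length can be computed directly from the code units without producing any bytes. This is a sketch for UTF-8 only, and the name utf8EncodedLength is hypothetical. It assumes that an unpaired surrogate is encoded as U+FFFD (3 bytes); that handling is an open issue in this document, so the choice here is only an assumption.

```javascript
// Sketch only: byte length of a string when encoded as UTF-8, computed
// without building the encoded byte stream.
function utf8EncodedLength(value) {
  var length = 0;
  for (var i = 0; i < value.length; i += 1) {
    var unit = value.charCodeAt(i);
    if (unit < 0x80) {
      length += 1;                       // U+0000..U+007F
    } else if (unit < 0x800) {
      length += 2;                       // U+0080..U+07FF
    } else if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < value.length) {
      var next = value.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        length += 4;                     // surrogate pair: U+10000 and above
        i += 1;                          // skip the trail surrogate
      } else {
        length += 3;                     // assumption: lone lead becomes U+FFFD
      }
    } else {
      length += 3;                       // rest of the BMP, or a lone surrogate (assumed U+FFFD)
    }
  }
  return length;
}

var asciiLength = utf8EncodedLength("abc");
var mixedLength = utf8EncodedLength("h\u00E9llo \u20AC \uD83D\uDE00");
```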
Examples
Example #1 - encoding strings
The following example uses the API to encode an array of strings into an ArrayBuffer. The result is an ArrayBuffer containing the number of strings (as a Uint32), followed by the length of the first string (as a Uint32), the UTF-8 encoded string data, the length of the second string (as a Uint32), the string data, and so on.
function encodeArrayOfStrings(strings) {
  var len, i, bytes, view, offset;

  // Total size: a Uint32 count, then a Uint32 length prefix plus the encoded bytes for each string.
  len = Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < strings.length; i += 1) {
    len += Uint32Array.BYTES_PER_ELEMENT;
    len += stringEncoding.encodedLength(strings[i], "utf-8");
  }

  bytes = new Uint8Array(len);
  view = new DataView(bytes.buffer);
  offset = 0;

  view.setUint32(offset, strings.length);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < strings.length; i += 1) {
    // Encode the string just past its length prefix, then fill in the prefix.
    len = stringEncoding.encode(strings[i], new DataView(bytes.buffer, offset + Uint32Array.BYTES_PER_ELEMENT), "utf-8");
    view.setUint32(offset, len);
    offset += Uint32Array.BYTES_PER_ELEMENT + len;
  }
  return bytes.buffer;
}
Example #2 - decoding strings
The following example decodes an ArrayBuffer containing data encoded in the format produced by the previous example back into an array of strings.
function decodeArrayOfStrings(buffer) {
  var view, offset, num_strings, strings, i, len;

  view = new DataView(buffer);
  offset = 0;
  strings = [];

  num_strings = view.getUint32(offset);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < num_strings; i += 1) {
    // Read the length prefix, then decode exactly that many bytes.
    len = view.getUint32(offset);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    strings[i] = stringEncoding.decode(new DataView(buffer, offset, len), "utf-8");
    offset += len;
  }
  return strings;
}
Encodings
Encodings are defined and implemented per Encoding. This implicitly includes the steps to get an encoding from a string, and the logic for label matching and case-insensitivity.
User agents MUST NOT support any other encodings or labels than those defined in Encoding, with the additions defined below.
- ISSUE: Is it a MUST to support all encodings from Encoding spec, or list a subset here?
- ISSUE: Should anything be said about Unicode normalization forms?
- ISSUE: In Encoding, "ascii" is a label for windows-1252; there is no 7-bit-or-raise-exception encoding.
- ISSUE: In Encoding, "iso-8859-1" is a label for windows-1252 for compatibility with existing web content. Is there any reason a new API like this might distinguish them?
- NOTE: Handling of encoding-specific issues, e.g. over-long UTF-8 encodings, byte order marks, unmatched surrogate pairs, and so on is defined by Encoding.
Additional Encodings
The following additional encodings are defined by this specification. They are specific to the methods defined herein.
Name | Labels |
---|---|
binary | "binary" |
binary
The binary encoding is a single-byte encoding where the input code point and output byte are identical for the range U+0000 to U+00FF.
- NOTE: This encoding is intended to allow interoperation with legacy code that encodes binary data in ECMAScript strings, for example the WindowBase64 methods atob() and btoa(). It is recommended that new Web applications use Typed Arrays for transmission and storage of binary data.
If binary is selected as the encoding, then step 3 of the steps to decode a byte stream is skipped.
- NOTE: If "binary" is specified, byte order marks must be ignored.
The binary decoder is:
- Let byte be byte pointer.
- If byte is the EOF byte, emit the EOF code point.
- Increase the byte pointer by one.
- If byte is in the range 0x00 to 0xFF, emit a code point whose value is byte.
- Return failure.
The binary encoder is:
- Let code point be the code point pointer.
- If code point is the EOF code point, emit the EOF byte.
- Increase the code point pointer by one.
- If code point is in the range U+0000 to U+00FF, emit a byte whose value is code point.
- Return failure.
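The binary decoder and encoder steps above can be sketched in plain ECMAScript. The function names binaryDecode and binaryEncode are hypothetical, the sketch processes a whole view or string at once rather than stepping through a pointer, and RangeError stands in for the exception raised when the steps return failure.

```javascript
// Sketch only: each byte 0x00..0xFF becomes the code point of the same value.
function binaryDecode(view) {
  var bytes = new Uint8Array(view.buffer, view.byteOffset, view.byteLength);
  var result = "";
  for (var i = 0; i < bytes.length; i += 1) {
    result += String.fromCharCode(bytes[i]);
  }
  return result;
}

// Sketch only: each code point U+0000..U+00FF becomes the byte of the same
// value; anything larger corresponds to "return failure".
function binaryEncode(value) {
  var bytes = new Uint8Array(value.length);
  for (var i = 0; i < value.length; i += 1) {
    var codeUnit = value.charCodeAt(i);
    if (codeUnit > 0xFF) {
      // Placeholder exception type for the failure case.
      throw new RangeError("Code point out of range for binary encoding");
    }
    bytes[i] = codeUnit;
  }
  return bytes;
}

var original = new Uint8Array([0x00, 0x7F, 0x80, 0xFF]);
var asString = binaryDecode(original);
var roundTripped = binaryEncode(asString);
```

Strings produced this way have the same shape as the "binary strings" that atob() returns and btoa() accepts.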
References
- WebIDL http://dev.w3.org/2006/webapi/WebIDL
- Encoding http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html
Acknowledgements
- Alan Chaney
- Ben Noordhuis
- Glenn Maynard
- John Tamplin
- Kenneth Russell (Google, Inc)
- Robert Mustacchi
- Ryan Dahl
Appendix
A "shim" implementation in JavaScript (that may not fully match the current version of the spec) plus some initial unit tests can be found at: