StringEncoding: Difference between revisions

From WHATWG Wiki
(Updated API from stringEncoding.<method> to TextEncoder/TextDecoder, per WHATWG mailing list)
Revision as of 21:19, 26 March 2012

Proposed String Encoding API for Typed Arrays

Editors

  • Joshua Bell (Google, Inc)

Abstract

This specification defines an API for encoding strings to binary data, and decoding strings from binary data.

NOTE: This specification intentionally does not address the opposite scenario of encoding binary data as strings and decoding binary data from strings, for example using Base64 encoding.

Discussion on this topic initially took place on the Khronos WebGL public mailing list. See http://www.khronos.org/webgl/public-mailing-list/archives/1111/msg00017.html for the initial discussion thread.

Discussion has since moved to the WHATWG spec discussion mailing list. See http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-March/035038.html for the latest discussion thread.

Open Issues

General: Should this be a standalone API (as written), or live on e.g. DataView, or on e.g. String?

Scenarios

  • Encode as many characters as possible into a fixed-size buffer for transmission, and repeat starting with next unencoded character
  • Parse/emit legacy data formats that do not use UTF-8
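
The first scenario can be sketched as follows. This is only an illustration, not part of the API defined below: it assumes a runtime whose TextEncoder also offers encodeInto(), a method that shipped later in browsers and Node.js and is not described in this draft. The helper name encodeInChunks is hypothetical.

```javascript
// Hypothetical helper: fill fixed-size buffers with as many whole
// characters as fit, then continue from the next unencoded character.
// Assumes TextEncoder.prototype.encodeInto, which this draft does not define.
function encodeInChunks(text, chunkSize) {
  var encoder = new TextEncoder();
  var chunks = [];
  var pos = 0;
  while (pos < text.length) {
    var buffer = new Uint8Array(chunkSize);
    // read counts UTF-16 code units consumed, written counts bytes produced.
    var result = encoder.encodeInto(text.slice(pos), buffer);
    if (result.read === 0) {
      throw new Error("chunkSize is too small for the next character");
    }
    chunks.push(buffer.subarray(0, result.written));
    pos += result.read;
  }
  return chunks;
}

// "a", the euro sign (three bytes in UTF-8), then "b", in 3-byte buffers.
var chunks = encodeInChunks("a\u20ACb", 3);
```

A character is never split across buffers, so a chunk may be shorter than the buffer it was written into.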

Desired Features

  • Allow arbitrary end byte sequences (e.g. 0xFF for UTF-8 strings)
  • Conversion errors should either produce replacement characters (U+FFFD, etc) or API should allow selection between replacement vs. throwing, or some other means of reporting errors (e.g. a count). i.e. do we set the "fatal flag" for


API cleanup

  • Support two versions of encode; one which takes target buffer, and one which creates/returns a right-sized buffer

Spec Issues

  • Resolve behavior when writing to a buffer that's too small - partial fill or don't change?
ISSUE: Encoding defines the byte order mark as more authoritative than anything else. Is that desirable here?

API

The default encoding is "utf-8".

TextEncoder

WebIDL

dictionary TextEncodeOptions {
  boolean stream = false;
};

[Constructor,
 Constructor(DOMString encoding)]
interface TextEncoder {
  readonly attribute DOMString encoding;
  ArrayBufferView encode(DOMString string, optional TextEncodeOptions options);
};

If the constructor is called with no arguments, let encoding be the default encoding.

The constructor follows the steps to get an encoding from Encoding, with encoding as label. If the steps result in failure, a DOMException of type EncodingError is thrown. Otherwise, set the encoder object's internal encoding property to the returned encoding. Initialize the internal streaming flag of the encoder object to false.

encoding of type DOMString, readonly
Returns the Name of the encoder object's encoding, per Encoding. Note that this may differ from the name of the encoding specified during the call to the constructor.
encode
The code units within DOMString string are interpreted as UTF-16 code units, to produce a code point stream.
ISSUE: Interpreting a DOMString as UTF-16 to yield a code point stream needs to be defined, including unpaired surrogates. WebIDL only defines the reverse.
This method runs the encoder algorithm of the object's encoding over the code point stream.
If the internal streaming flag is not set, then the encoder algorithm's state (flags, etc.) is reset prior to performing the steps. Otherwise, the encoder algorithm's state is re-used from the previous call to encode on this object. After the above inspection of the internal streaming flag, if the options parameter is specified and the stream option is true, then the internal streaming flag is set; otherwise the internal streaming flag is cleared.
If the options parameter is not specified or the stream option is false, then after the final code point is yielded by the stream, the EOF code point is yielded.
The method returns a Uint8Array object wrapping an ArrayBuffer containing the bytes emitted by the encoder algorithm.
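
As a rough illustration of the interface above, the sketch below assumes a runtime that exposes a global TextEncoder in the form that later shipped in browsers and Node.js. That shipped form encodes UTF-8 only and has no stream option, so only the no-argument constructor and a single encode call are shown.

```javascript
// Assumes a global TextEncoder (browsers, Node.js 11+).
var encoder = new TextEncoder();            // no argument: the default encoding
var bytes = encoder.encode("h\u00E9llo");   // "hello" with an e-acute

// encoder.encoding reports the name of the encoding in use.
var name = encoder.encoding;

// U+00E9 takes two bytes in UTF-8 (C3 A9), so five code points
// become six bytes in the returned Uint8Array.
var byteCount = bytes.length;
```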

TextDecoder

WebIDL

dictionary TextDecodeOptions {
  boolean fatal = false;
  boolean nullTerminator = false;
  boolean stream = false;
};

[Constructor,
 Constructor(DOMString encoding)]
interface TextDecoder {
  readonly attribute DOMString encoding;
  DOMString decode(ArrayBufferView view, optional TextDecodeOptions options);
};

If the constructor is called with no arguments, let encoding be the default encoding.

The constructor follows the steps to get an encoding from Encoding, with encoding as label. If the steps result in failure, a DOMException of type EncodingError is thrown. Otherwise, set the decoder object's internal encoding property to the returned encoding, and initialize the internal streaming flag of the decoder object to false.

encoding of type DOMString, readonly
Returns the Name of the decoder object's encoding, per Encoding. Note that this may differ from the name of the encoding specified during the call to the constructor.
decode
This method runs the decoder algorithm of the object's encoding over the byte stream from view.buffer starting at offset view.byteOffset. A maximum of view.byteLength bytes are yielded by the stream from view.buffer.
ISSUE: Need to handle BOMs. Encoding specifies this by using the caller-specified encoding as a suggestion, and consuming the BOM as a part of selecting the real encoder where BOM takes precedence. At the very least a matching BOM should be ignored. A mismatching BOM could throw, be a decoding error, or could actually switch the decoder for this stream (call or sequence of calls), possibly if-and-only-if the constructor was called without an encoding.
If the internal streaming flag is not set, then the decoder algorithm's state (flags, etc.) is reset prior to performing the steps. Otherwise, the decoder algorithm's state is re-used from the previous call to decode on this object. After the above inspection of the internal streaming flag, if the options parameter is specified and the stream option is true, then the internal streaming flag is set; otherwise the internal streaming flag is cleared.
If the options parameter is not specified or the stream option is false, then after view.byteLength bytes are yielded by the stream the EOF byte is yielded.
If the options parameter is specified and the fatal option is true, then a decoder error causes a DOMException of type EncodingError to be thrown rather than emitting a code point.
If the options parameter is specified and the nullTerminator option is true, then if the decoder algorithm emits a code point of U+0000, that code point is not included in the code point stream and the decoder algorithm is halted.
NOTE: The byte sequence representing the terminator is encoding specific. For example, in UTF-16 encodings it would be the even-aligned two-byte sequence 0x00 0x00.
Once the algorithm has no more bytes to process, the method returns a DOMString by encoding the stream of code points emitted by the steps as UTF-16 as per WebIDL.
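
The stream and fatal behavior described above can be tried out with the sketch below. It assumes a global TextDecoder in the form that later shipped in browsers and Node.js, which differs from this draft in two ways that matter here: fatal is passed to the constructor rather than to decode, and a decoder error surfaces as a TypeError rather than a DOMException of type EncodingError.

```javascript
// Assumes a global TextDecoder (browsers, Node.js 11+).
// U+20AC (euro sign) is E2 82 AC in UTF-8; split it across two chunks.
var chunk1 = new Uint8Array([0x41, 0xE2, 0x82]);
var chunk2 = new Uint8Array([0xAC, 0x42]);

var decoder = new TextDecoder("utf-8");
// stream: true keeps the decoder state for the next call.
var text = decoder.decode(chunk1, { stream: true });
// Omitting stream on the final call signals the end of the input.
text += decoder.decode(chunk2);

// Without fatal, the truncated sequence is replaced by U+FFFD.
var lenient = new TextDecoder("utf-8").decode(chunk1);

// With fatal, the same input throws instead of emitting a code point.
// Assumption: fatal is a constructor option here, unlike in this draft.
var strict = new TextDecoder("utf-8", { fatal: true });
var threw = false;
try {
  strict.decode(chunk1);
} catch (e) {
  threw = true;
}
```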
ISSUE: Rather than nullTerminator, allow specifying terminators, either by byte sequence (e.g. for terminators which are not valid in a given encoding, such as 0xFF in UTF-8) or by code point?
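
The effect of the nullTerminator option can be approximated on top of a decoder that lacks it. The sketch below is an assumption-laden emulation, not the algorithm specified above: decodeNullTerminated is a hypothetical helper, and it decodes the whole view before truncating instead of halting the decoder at the first U+0000.

```javascript
// Hypothetical helper, not defined by this draft: returns the decoded text
// up to, but not including, the first U+0000 code point.
function decodeNullTerminated(view, label) {
  var text = new TextDecoder(label || "utf-8").decode(view);
  var end = text.indexOf("\u0000");
  return end === -1 ? text : text.slice(0, end);
}

// "AB" in UTF-16LE, the two-byte terminator 00 00, then trailing data.
var utf16 = new Uint8Array([0x41, 0x00, 0x42, 0x00, 0x00, 0x00, 0x43, 0x00]);
var fromUtf16 = decodeNullTerminated(utf16, "utf-16le");

// The same text in UTF-8 uses a one-byte terminator.
var utf8 = new Uint8Array([0x41, 0x42, 0x00, 0x43]);
var fromUtf8 = decodeNullTerminated(utf8);
```

Searching for the terminator in the decoded code points, rather than in the bytes, is what makes the terminator's byte sequence depend on the encoding.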

Examples

Example #1 - encoding strings

The following example uses the API to encode an array of strings into an ArrayBuffer. The resulting buffer contains the number of strings (as a Uint32), followed by the length of the first string (as a Uint32), the UTF-8 encoded string data, the length of the second string (as a Uint32), the string data, and so on.

function encodeArrayOfStrings(strings) {
  var encoder, encoded, len, i, bytes, view, offset;

  // Encode every string first so the total buffer size is known.
  encoder = new TextEncoder("utf-8");
  encoded = [];
  len = Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < strings.length; i += 1) {
    encoded[i] = encoder.encode(strings[i]);
    len += Uint32Array.BYTES_PER_ELEMENT + encoded[i].byteLength;
  }

  bytes = new Uint8Array(len);
  view = new DataView(bytes.buffer);
  offset = 0;

  // Write the string count, then each length-prefixed string.
  view.setUint32(offset, strings.length);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < strings.length; i += 1) {
    view.setUint32(offset, encoded[i].byteLength);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    bytes.set(encoded[i], offset);
    offset += encoded[i].byteLength;
  }
  return bytes.buffer;
}

Example #2 - decoding strings

The following example decodes an ArrayBuffer containing data encoded in the format produced by the previous example back into an array of strings.

function decodeArrayOfStrings(buffer) {
  var decoder, view, offset, num_strings, strings, i, len;

  decoder = new TextDecoder("utf-8");
  view = new DataView(buffer);
  offset = 0;
  strings = [];

  // Read the string count, then each length-prefixed string.
  num_strings = view.getUint32(offset);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < num_strings; i += 1) {
    len = view.getUint32(offset);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    strings[i] = decoder.decode(new DataView(buffer, offset, len));
    offset += len;
  }
  return strings;
}

Encodings

Encodings are defined and implemented per Encoding. This implicitly includes the steps to get an encoding from a string, and the logic for label matching and case-insensitivity.

User agents MUST NOT support any other encodings or labels than those defined in Encoding, with the additions defined below.

ISSUE: Is it a MUST to support all encodings from Encoding spec, or list a subset here?
ISSUE: Should anything be said about Unicode normalization forms?
ISSUE: In Encoding, "ascii" is a label for windows-1252; there is no 7-bit-or-raise-exception encoding.
ISSUE: In Encoding, "iso-8859-1" is a label for windows-1252 for compatibility with existing web content. Is there any reason a new API like this might distinguish them?
NOTE: Handling of encoding-specific issues, e.g. over-long UTF-8 encodings, byte order marks, unmatched surrogate pairs, and so on is defined by Encoding.
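
Label matching can be observed through the encoding attribute, as in the sketch below. It assumes a global TextDecoder as later shipped in browsers and Node.js. In those implementations an unknown label raises a RangeError, whereas this draft specifies a DOMException of type EncodingError.

```javascript
// Labels are matched case-insensitively and mapped to the encoding's name.
var fromAlias = new TextDecoder("UTF8").encoding;     // "utf-8"
var fromUtf16 = new TextDecoder("utf-16").encoding;   // "utf-16le"

// A label that matches no encoding makes the constructor throw.
// Assumption: shipped implementations use RangeError for this case.
var rejected = false;
try {
  new TextDecoder("no-such-encoding");
} catch (e) {
  rejected = e instanceof RangeError;
}
```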

Additional Encodings

The following additional encodings are defined by this specification. They are specific to the methods defined herein.

Name      Labels
binary    "binary"

binary

The binary encoding is a single-byte encoding where the input code point and output byte are identical for the range U+0000 to U+00FF.

NOTE: This encoding is intended to allow interoperation with legacy code that encodes binary data in ECMAScript strings, for example the WindowBase64 methods atob() and btoa(). It is recommended that new Web applications use Typed Arrays for transmission and storage of binary data.

If binary is selected as the encoding, then step 3 of the steps to decode a byte stream is skipped.

NOTE: If "binary" is specified, byte order marks must be ignored.

The binary decoder is:

  1. Let byte be the byte pointer.
  2. If byte is the EOF byte, emit the EOF code point.
  3. Increase the byte pointer by one.
  4. If byte is in the range 0x00 to 0xFF, emit a code point whose value is byte.
  5. Return failure.

The binary encoder is:

  1. Let code point be the code point pointer.
  2. If code point is the EOF code point, emit the EOF byte.
  3. Increase the code point pointer by one.
  4. If code point is in the range U+0000 to U+00FF, emit a byte whose value is code point.
  5. Return failure.
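
The two algorithms above can be sketched in plain ECMAScript. The helper names binaryDecode and binaryEncode are hypothetical and stand in for selecting "binary" as the encoding; the sketch processes a whole input at once and reports failure by throwing, which is an assumption rather than something this specification states.

```javascript
// Hypothetical helper: each byte becomes the code point with the same value.
function binaryDecode(bytes) {
  var out = "";
  for (var i = 0; i < bytes.length; i += 1) {
    // A byte is always in the range 0x00 to 0xFF, so step 4 always applies.
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

// Hypothetical helper: each code point in U+0000..U+00FF becomes one byte.
function binaryEncode(string) {
  var bytes = new Uint8Array(string.length);
  for (var i = 0; i < string.length; i += 1) {
    var codeUnit = string.charCodeAt(i);
    if (codeUnit > 0xFF) {
      // Step 5: anything outside the range is a failure.
      // Surrogates are above 0xFF, so checking code units is sufficient.
      throw new Error("code point out of range for the binary encoding");
    }
    bytes[i] = codeUnit;
  }
  return bytes;
}

var roundTrip = binaryDecode(binaryEncode("caf\u00E9"));
```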

References


Acknowledgements

  • Alan Chaney
  • Ben Noordhuis
  • Glenn Maynard
  • John Tamplin
  • Kenneth Russell (Google, Inc)
  • Robert Mustacchi
  • Ryan Dahl

Appendix

A "shim" implementation in JavaScript (that may not fully match the current version of the spec) plus some initial unit tests can be found at:

http://code.google.com/p/stringencoding/