StringEncoding: Difference between revisions

From WHATWG Wiki
 
(82 intermediate revisions by 3 users not shown)
Line 1: Line 1:
Proposed String Encoding API for Typed Arrays
{{obsolete|spec=[https://encoding.spec.whatwg.org/#api Encoding Standard: API]}}
 
Proposed Text Encoding Web API for Typed Arrays


== Editors ==
== Editors ==
Line 17: Line 19:
== Open Issues ==
== Open Issues ==


General: Rewrite in terms of [http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html Encoding] including
* Should the <var>encoding</var> attribute return the <var>Name</var> of the encoding or the name that was passed in?
* Depending on API, setting "fatal flag" vs. using fallbacks
:''ISSUE: [http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html Encoding] defines the byte order mark as more authoritative than anything else. Is that desirable here?''
 
=== Scenarios ===
 
* Encode as many characters as possible into a fixed-size buffer for transmission, and repeat starting with next unencoded character
* Parse/emit legacy data formats that do not use UTF-8


=== Desired Features ===
== Notes to Implementers ==


* Allow arbitrary end byte sequences (e.g. <code>0xFF</code> for UTF-8 strings)
* Streaming decode/encode requires retaining partial buffers between calls.
* Conversion errors should either produce replacement characters (U+FFFD, etc) or API should allow selection between replacement vs. throwing, or some other means of reporting errors (e.g. a count)
** Some encode/decode algorithms require adjusting the <var>code point pointer</var> or <var>byte pointer</var> by a negative amount. This could occur across "chunk" boundaries. This implies that when the internal <var>streaming</var> flag is set on an encoder/decoder that the last N elements of the stream are saved for the next call and used as a prefix for the stream. N is defined by the specific encoding algorithm.
** This is not yet implemented in the [http://code.google.com/p/stringencoding/ ECMAScript shim]


=== API cleanup ===
== Resolved Issues ==


* Support two versions of encode; one which takes target buffer, and one which creates/returns a right-sized buffer
<dl>
 
<dt>Encode as many characters as possible into a fixed-size buffer for transmission, and repeat starting with next unencoded character
=== Spec Issues ===
<dd>Resolution: not for "v1" - can be implemented using this API, and breaking the string down by "character" is unlikely to be as obvious as it sounds - surrogate pairs, combining sequences, etc
 
* Resolve behavior when writing to a buffer that's too small - partial fill or don't change?
* Explicitly enumerate supported encodings
** Maximum list of encodings would be: http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html
 
== API ==


Scripts in pages access the API through the top-level <code>window.stringEncoding</code> object which holds methods for encoding/decoding strings. Worker scripts can similarly use the <code>self.stringEncoding</code> object. (Since <code>window</code> and <code>self</code> are the page and worker global object, respectively, scripts can simply refer to <code>stringEncoding</code> without a prefix.)
<dt>Support two versions of encode; one which takes target buffer, and one which creates/returns a right-sized buffer
<dd>Resolution: See above, and wait for developer feedback.


'''WebIDL'''
<dt>Allow arbitrary end byte sequences (e.g. <code>0xFF</code> for UTF-8 strings)
<dd>Resolution: Add <code>indexOf</code> to <code>ArrayBufferView</code>


<dt>Remove ''binary'' encoding? ''(a proposed 8-bit-clean encoding for interop with legacy binary data stored in ECMAScript strings)''
<dd>The only real use is with <code>atob()</code>/<code>btoa()</code>; a better API would be Base64 directly in/out of Typed Arrays. Consensus on WHATWG is to add better APIs e.g.
<pre>
<pre>
partial interface Window {
partial interface ArrayBufferView {
  readonly attribute StringEncoding stringEncoding;
    DOMString toBase64();
};
};


partial interface WorkerUtils {
partial interface ArrayBuffer {
  readonly attribute StringEncoding stringEncoding;
    static ArrayBuffer fromBase64(DOMString string);
};
};
</pre>
</pre>


The <code>stringEncoding</code> object exposes static methods for encoding and decoding strings from objects containing binary data as specified in the Typed Array specification.
<dt>Should this be a standalone API (as written), or live on e.g. DataView, or on e.g. String?
<dd>There seems to be pretty strong consensus that streaming, stateful coding is high priority and that an object-oriented API is cleanest, and shoe-horning those onto existing objects would be messy.


'''WebIDL'''
<dt>Should legacy encodings be supported?
<dd>Resolution: Consensus on the [http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-August/036825.html WHATWG mailing list] - support legacy encodings for decode, and only 'utf-8', 'utf-16' and 'utf-16be' for encode.


<pre>
<dt>What to do on encoding errors? If non-UTF encodings are supported then we may want to allow substitution (e.g. ASCII '?') or a script callback (for arbitrary escaping).
interface StringEncoding {
<dd>Resolution: not for "v1" (see above)


DOMString decode(ArrayBufferView view,
<dt>How are byte order marks handled?
                  optional DOMString encoding)
<dd>BOM is respected if and only if the requested coding is a case-insensitive match for "<code>utf-16</code>"
                        raises(DOMException);


DOMString stringLength(ArrayBufferView view,
</dl>
                        optional DOMString encoding)
                                  raises(DOMException);


unsigned long encode(DOMString value,
== API ==
                      ArrayBufferView view,
                      optional DOMString encoding)
                          raises(DOMException);


unsigned long encodedLength(DOMString value,
=== TextEncoder ===
                            optional DOMString encoding)
                                raises(DOMException);
}
</pre>


For all methods that take an <code>ArrayBufferView</code> parameter, the view's <code>byteOffset</code> and <code>byteLength</code> attributes are used to offset and limit the encoding/decoding operation against the view's underlying ArrayBuffer <code>buffer</code>. Per the Typed Array Specification, reading/writing beyond the limits of the view raises an exception.
'''WebIDL'''
<pre>
dictionary TextEncodeOptions {
  boolean stream = false;
};


=== <code>decode</code> ===
[Constructor(optional DOMString encoding)]
interface TextEncoder {
  readonly attribute DOMString encoding;
  Uint8Array encode(DOMString? string, optional TextEncodeOptions options);
};
</pre>


This method decodes a string from binary data, using a specified encoding.
The constructor runs the following steps:
# If the constructor is called with no arguments, let <var>label</var> be "<code>utf-8</code>". Otherwise, let <var>label</var> be the value of the <var>encoding</var> argument.
# Run the '''steps to get an encoding''' from [http://encoding.spec.whatwg.org/ Encoding], with <var>label</var> as <var>label</var>.
#* If the steps result in failure, throw an "<code>EncodingError</code>" exception and terminate these steps.
#* Otherwise, if the <var>Name</var> of the returned encoding is not one of "<code>utf-8</code>", "<code>utf-16</code>", or "<code>utf-16be</code>" throw an "<code>EncodingError</code>" exception and terminate these steps.
#* Otherwise, set the encoder object's internal <var>encoding</var> property to the returned encoding.
# Initialize the internal <var>streaming</var> flag of the encoder object to false.
# Initialize the internal <var>encoding algorithm state</var> to the default values for the encoding <var>encoding</var>.
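The label check in these constructor steps can be sketched in script. This is a minimal sketch, not part of the proposal: <code>resolveEncoderEncoding</code> and <code>ENCODER_ENCODINGS</code> are hypothetical names, and the lookup borrows the <code>TextDecoder</code> shipped in current runtimes to run the "get an encoding" steps. It also assumes the current Encoding Standard, in which the label "utf-16" resolves to the name "utf-16le", and it throws <code>RangeError</code> instead of the "<code>EncodingError</code>" exception named here.

```javascript
// Hypothetical helper, not part of the proposal: mirrors the TextEncoder
// constructor's label check using the label lookup built into TextDecoder.
const ENCODER_ENCODINGS = ["utf-8", "utf-16le", "utf-16be"];

function resolveEncoderEncoding(label = "utf-8") {
  let name;
  try {
    // Constructing a TextDecoder runs "get an encoding" and exposes the Name.
    name = new TextDecoder(label).encoding;
  } catch (err) {
    // Unknown label: the lookup failed.
    throw new RangeError("Unknown encoding label: " + label);
  }
  if (!ENCODER_ENCODINGS.includes(name)) {
    // Known encoding, but not one of the UTF encodings allowed for encode.
    throw new RangeError("Encoding not allowed for encode: " + name);
  }
  return name;
}
```

Labels are matched case-insensitively, so <code>resolveEncoderEncoding("UTF8")</code> returns the name <code>"utf-8"</code> rather than the label that was passed in.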


The method performs the <em>steps to decode a byte stream</em> from [http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html Encoding], with the input <var>stream</var> provided by the byte data in <var>view.buffer</var> starting at offset <var>view.byteOffset</var> with length <var>view.byteLength</var>, and the input <var>label</var> from <var>encoding</var> if specified, <code>"utf-8"</code> otherwise.  
<dl>
<dt><code>encoding</code> of type DOMString, readonly
<dd>Returns the <var>Name</var> of the encoder object's <var>encoding</var>, per [http://encoding.spec.whatwg.org/ Encoding].  


The method returns a DOMString by encoding the stream of code points emitted by the steps as UTF-16 as per [http://dev.w3.org/2006/webapi/WebIDL/#idl-DOMString WebIDL].
:Note that this may differ from the name of the encoding specified during the call to the constructor. For example, if the constructor is called with <var>encoding</var> of <code>"ascii"</code> the <var>encoding</var> attribute of the ''encoder object'' would have the value <code>"windows-1252"</code> as <code>"ascii"</code> is a label for that encoding.
</dl>


The <em>fatal flag</em> defined in [http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html Encoding] is not set.
<dl>
<dt><code>encode</code>
<dd>


If the decoding steps return failure (including if the specified <var>encoding</var> is not matched), an exception ''(TBD)'' is raised.
The <code>encode</code> method runs these steps:


:''NOTE: '''U+0000''' characters have no special meaning and are returned as part of the string.''
# If the internal <var>streaming</var> flag is not set, then reset the <var>encoding algorithm state</var> to the default values for <var>encoding</var>. Otherwise, the <var>encoding algorithm state</var> is re-used from the previous call to <code>encode</code> on this object.
# If the <var>options</var> parameter is specified and the <code>stream</code> option is '''true''', then the internal <var>streaming</var> flag is set; otherwise the internal <var>streaming</var> flag is cleared.
# Run the steps of the <var>encoding algorithm</var>:
#* The input to the algorithm is a <var>stream of code points</var>. The stream is composed of the Unicode code points for the Unicode characters produced by following the [http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode steps to convert a DOMString to a sequence of Unicode characters] in [http://dev.w3.org/2006/webapi/WebIDL/ WebIDL] with <var>string</var> as the input. If <var>string</var> is null, the <var>stream of code points</var> is empty.
#* If the <var>options</var> parameter is not specified or the <code>stream</code> option is false, then after the final code point is yielded by the stream, the '''EOF code point''' is yielded.
#* The output of the algorithm is a <var>sequence of emitted bytes</var>.
# Return a <code>Uint8Array</code> object wrapping an <code>ArrayBuffer</code> containing the <var>sequence of emitted bytes</var> produced by the encoding algorithm.


:''ISSUE: Behavior if decoding stops inside a multi-byte sequence.''
:''NOTE: Because only UTF encodings are supported, and because of the use of the [http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode steps to convert a DOMString to a sequence of Unicode characters] and thus Unicode code points, no input can cause the encoding process to emit an encoder error.''
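The note above can be illustrated with the <code>TextEncoder</code> available in current runtimes. This is a sketch under one assumption: the shipped API follows the later Encoding Standard, so the constructor takes no encoding argument and always produces UTF-8. The variable names are illustrative only.

```javascript
// Assumes the TextEncoder of the current Encoding Standard (UTF-8 only).
const encoder = new TextEncoder();

// A BMP character: U+20AC EURO SIGN encodes to three bytes.
const euro = encoder.encode("\u20AC");       // 0xE2 0x82 0xAC

// A surrogate pair is decoded to one code point (U+1D306) before encoding.
const pair = encoder.encode("\uD834\uDF06"); // 0xF0 0x9D 0x8C 0x86

// An unpaired surrogate does not raise an error: it is converted to
// U+FFFD REPLACEMENT CHARACTER before the encoder sees it.
const lone = encoder.encode("\uD800");       // 0xEF 0xBF 0xBD
```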


=== <code>stringLength</code> ===
</dl>


This method determines the length of a "null-terminated" string encoded in binary data, using a specified encoding.
=== TextDecoder ===


This method performs the <em>steps to decode a byte stream</em> from [http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html Encoding], with the input <var>stream</var> provided by the byte data in <var>view.buffer</var> starting at offset <var>view.byteOffset</var> with length <var>view.byteLength</var>, and the input <var>label</var> from <var>encoding</var> if specified, <code>"utf-8"</code> otherwise.
'''WebIDL'''
<pre>
dictionary TextDecoderOptions {
  boolean fatal = false;
};


As soon as the steps emit the code point '''U+0000''' decoding is terminated and the <em>byte pointer</em> within the byte stream is returned. If decoding completes and no '''U+0000''' code point was emitted, -1 is returned.
dictionary TextDecodeOptions {
  boolean stream = false;
};


The <em>fatal flag</em> defined in [http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html Encoding] is not set.
[Constructor(optional DOMString encoding, optional TextDecoderOptions options)]
interface TextDecoder {
  readonly attribute DOMString encoding;
  DOMString decode(optional ArrayBufferView view, optional TextDecodeOptions options);
};
</pre>


If the decoding steps return failure (including if the specified <var>encoding</var> is not matched), an exception ''(TBD)'' is raised.
The constructor runs the following steps:
# The constructor creates a ''decoder object''. It has the internal properties <var>encoding</var>, a <var>fatal</var> flag which is initially unset,  a <var>streaming</var> flag which is initially unset, an <var>offset</var> pointer which is initially <code>0</code>, and a <var>useBOM</var> flag which is initially unset.
# If called without an <var>encoding</var> argument, let <var>label</var> be "<code>utf-8</code>". Otherwise, let <var>label</var> be the value of the <var>encoding</var> argument.
# If <var>label</var> is a case-insensitive match for "<code>utf-16</code>" then set the internal <var>useBOM</var> flag.
# Run the '''steps to get an encoding''' from [http://encoding.spec.whatwg.org/ Encoding], with <var>label</var> as <var>label</var>.
## If the steps result in failure, throw an "<code>EncodingError</code>" exception and terminate these steps.
## Otherwise, set the ''decoder object's'' internal <var>encoding</var> property to the returned encoding.
# If the constructor is called with an <var>options</var> argument, and the <var>fatal</var> property of the dictionary is set, set the internal <var>fatal</var> flag of the ''decoder object''.
# Initialize the internal <var>encoding algorithm state</var> to the default values for the encoding <var>encoding</var>.
# Return the ''decoder object''.
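The effect of the <var>fatal</var> flag set by these constructor steps can be shown with the <code>TextDecoder</code> available in current runtimes. This is a sketch, not the proposal's exact behavior: current implementations follow the later Encoding Standard and throw a <code>TypeError</code> on a decoder error, not the <code>EncodingError</code> described here. The variable names are illustrative only.

```javascript
// 0xFF can never appear in well-formed UTF-8.
const invalid = new Uint8Array([0x41, 0xFF, 0x42]);

// Default: fatal is false, so the bad byte becomes U+FFFD.
const lenient = new TextDecoder("utf-8");
const replaced = lenient.decode(invalid); // "A\uFFFDB"

// With fatal set, the same input throws instead of emitting a fallback.
const strict = new TextDecoder("utf-8", { fatal: true });
let strictError = null;
try {
  strict.decode(invalid);
} catch (err) {
  strictError = err; // TypeError in current runtimes (assumption noted above)
}
```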


:''NOTE: The byte sequence representing terminator is encoding specific. For example, in UTF-16 encodings it would be the even-aligned two-octet sequence <code>0x00 0x00</code>''
<dl>
<dt><code>encoding</code> of type DOMString, readonly
<dd>
Returns the <var>Name</var> of the decoder object's <var>encoding</var>, per [http://encoding.spec.whatwg.org/ Encoding].


:''NOTE: If the encoded string includes a BOM, that is considered part of the length. For example, <code>stringLength</code> would return a length of <code>8</code> for the UTF-16BE sequence <code>0xFE 0xFF 0x00 0x41 0x00 0x42 0x00 0x43 0x00 0x00</code>.''
:Note that this may differ from the name of the encoding specified during the call to the constructor. For example, if the constructor is called with <var>encoding</var> of <code>"ascii"</code> the <var>encoding</var> attribute of the ''decoder object'' would have the value <code>"windows-1252"</code> as <code>"ascii"</code> is a label for that encoding.


:''ISSUE: Add an optional <code>unsigned short terminator</code> member, defaults to <code>0</code>?''
<dt><code>decode</code>
<dd>


:''ISSUE: To allow terminators which aren't code points (e.g. 0xFF in UTF-8), make the optional terminator either a code point (default 0) or an Array of octets (e.g. [ 0xFF, 0xFF ])?''
The <code>decode</code> method runs the following steps:


=== <code>encode</code> ===
# If <var>view</var> is not specified, let <var>view</var> be a <code>Uint8Array</code> of length <code>0</code>.
# If the internal <var>streaming</var> flag of the ''decoder object'' is not set, then reset the <var>encoding algorithm state</var> to the default values for encoding <var>encoding</var> and set <var>offset</var> to <code>0</code>. Otherwise, the <var>encoding algorithm state</var> is re-used from the previous call to <code>decode</code> on this object.
# If the <var>options</var> parameter is specified and the <var>stream</var> option is '''true''', then the internal <var>streaming</var> flag is set. Otherwise the internal <var>streaming</var> flag is cleared.
# Run or resume the decoder algorithm of the ''decoder object's'' encoding, with the following additions:
#* The input to the algorithm is a <var>byte stream</var>.
#** If <var>offset</var> is greater than <code>0</code>, the bytes in <var>byte stream</var> for positions less than <var>offset</var> are provided by buffer(s) passed in to previous calls to <code>decode()</code>.
#** The bytes in <var>byte stream</var> for positions <var>offset</var> through <code><var>offset</var> + <var>view.byteLength</var> - 1</code> are provided by the bytes in <var>view.buffer</var> starting at offset <var>view.byteOffset</var>.
#** When accessing the byte in <var>byte stream</var> at position <code><var>offset</var> + <var>view.byteLength</var></code>:
#*** If the internal <var>streaming</var> flag is not set, then yield the '''EOF byte'''.
#*** Otherwise, set <var>offset</var> to <code><var>offset</var> + <var>view.byteLength</var></code> and suspend the steps of the decoder algorithm until a subsequent call to <code>decode()</code>
#* If encoding is one of "<code>utf-8</code>", "<code>utf-16</code>" or "<code>utf-16be</code>" and <var>offset</var> is <code>0</code>, then prior to running the steps of the decoder algorithm:
#** If the internal <var>useBOM</var> flag is set, then:
#*** If less than two bytes are present in the stream and the internal <var>streaming</var> flag is true, suspend this algorithm and return the empty string.
#*** If the next two bytes of the stream are '''0xFF 0xFE''' then set <var>offset</var> to <code>2</code>, and clear the <var>utf-16be</var> flag of the decoder state.
#*** If the next two bytes of the stream are '''0xFE 0xFF''' then set <var>offset</var> to <code>2</code>, and set the <var>utf-16be</var> flag of the decoder state.
#** If encoding is "<code>utf-8</code>", then:
#*** If less than three bytes are present in the stream and the internal <var>streaming</var> flag is true, suspend this algorithm and return the empty string.
#*** If the next three bytes of the stream are '''0xEF 0xBB 0xBF''' then set <var>offset</var> to <code>3</code>.
#** If encoding is "<code>utf-16</code>", then:
#*** If less than two bytes are present in the stream and the internal <var>streaming</var> flag is true, suspend this algorithm and return the empty string.
#*** If the next two bytes of the stream are '''0xFF 0xFE''' then set <var>offset</var> to <code>2</code>.
#** If encoding is "<code>utf-16be</code>", then:
#*** If less than two bytes are present in the stream and the internal <var>streaming</var> flag is true, suspend this algorithm and return the empty string.
#*** If the next two bytes of the stream are '''0xFE 0xFF''' then set <var>offset</var> to <code>2</code>.
#* If the internal <var>fatal</var> flag of the ''decoder object'' is set, then a '''decoder error''' causes a <code>DOMException</code> of type <code>EncodingError</code> to be thrown rather than emitting a fallback code point.
#* The output of the algorithm is a <var>sequence of emitted code points</var>.
# Return an IDL <code>DOMString</code> value that represents the sequence of code units resulting from encoding the <var>sequence of emitted code points</var> as UTF-16.
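The streaming and byte order mark steps above can be observed with the <code>TextDecoder</code> available in current runtimes. This is a minimal sketch and assumes the shipped API, which follows the later Encoding Standard and matches these steps for UTF-8. The chunk boundaries and variable names are chosen for illustration.

```javascript
const decoder = new TextDecoder("utf-8");

// First chunk: a UTF-8 BOM, "A", then the first two bytes of U+20AC.
// The BOM is consumed, and the incomplete sequence is held for the next call.
const first = decoder.decode(
  new Uint8Array([0xEF, 0xBB, 0xBF, 0x41, 0xE2, 0x82]),
  { stream: true }
); // "A"

// Second chunk: the final byte of U+20AC completes the pending sequence.
const second = decoder.decode(new Uint8Array([0xAC]), { stream: true }); // "€"

// A call without the stream option ends the stream and flushes any state.
const tail = decoder.decode(); // ""
```

Without <code>stream: true</code> on the first call, the decoder would treat the end of the first chunk as the end of the input and could not resume the split sequence.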


This method encodes a string into binary data, using a specified encoding.
</dl>
 
The method performs the <em>steps to encode a code point stream</em> from [http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html Encoding], with the input code point <var>stream</var> provided by the <code>DOMString</code> <var>value</var>, and the input string <var>label</var> from <var>encoding</var> if specified, <code>"utf-8"</code> otherwise.
 
This specification requires that the code units within the DOMString are interpreted as UTF-16 code units. That is, to produce the code point stream a UTF-16 decoding operation must be performed to handle surrogate pairs.
 
:''ISSUE: Interpreting a DOMString as UTF-16 to yield a code unit stream needs to be defined, including unpaired surrogates. [http://dev.w3.org/2006/webapi/WebIDL/#idl-DOMString WebIDL] only defines the reverse.''
 
If the encoding steps return failure (including if the specified <var>encoding</var> is not matched), an exception ''(TBD)'' is raised.
 
Otherwise, the output of the encoding steps is a stream of bytes. If the length of the stream is greater than <var>view.byteLength</var> an exception ''(TBD)'' is raised
 
If this method raises an exception for any reason, <var>view.buffer</var> MUST NOT be modified.
 
:''ISSUE: Do we need to specify the case where encoding fails early due to length, but would have failed later due to invalid data?''
 
If this method does not raise an exception, the stream of bytes produced by the encoding steps is written to <var>view.buffer</var> starting at <var>view.byteOffset</var>, and the length of the stream of bytes is returned.
 
:''ISSUE: Would be nice to support "partial fill" and return an object with e.g. <code>bytesWritten</code> and <code>charactersWritten</code> properties.''
 
=== <code>encodedLength</code> ===
 
This method determines the byte length of an encoded string, using a specified encoding.
 
The method performs the <em>steps to encode a code point stream</em> from [http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html Encoding], with the input code point <var>stream</var> provided by the <code>DOMString</code> <var>value</var>, and the input string <var>label</var> from <var>encoding</var> if specified, <code>"utf-8"</code> otherwise.
 
This specification requires that the code units within the DOMString are interpreted as UTF-16 code units. That is, to produce the code point stream a UTF-16 decoding operation must be performed to handle surrogate pairs.
 
:''ISSUE: Interpreting a DOMString as UTF-16 to yield a code unit stream needs to be defined, including unpaired surrogates. [http://dev.w3.org/2006/webapi/WebIDL/#idl-DOMString WebIDL] only defines the reverse.''
 
If the encoding steps return failure (including if the specified <var>encoding</var> is not matched), an exception ''(TBD)'' is raised.
 
If this method does not raise an exception, the length of the stream of bytes produced by the encoding steps is returned. The stream of bytes itself is not used. Implementations MAY therefore optimize to not produce an actual stream, or determine the length using other means for certain encodings, if the results are indistinguishable from those of performing the steps.
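The "partial fill" idea raised above, and the scenario of filling fixed-size buffers, can be sketched with <code>TextEncoder.prototype.encodeInto</code>. That method comes from the later Encoding Standard, not from this proposal, and reports how many code units were read and how many bytes were written. The function name <code>encodeInChunks</code> is hypothetical, and the sketch assumes UTF-8 output.

```javascript
// Hypothetical helper: encodes text into UTF-8 chunks of at most chunkSize
// bytes, never splitting a code point across two chunks.
function encodeInChunks(text, chunkSize) {
  const encoder = new TextEncoder();
  const chunks = [];
  let rest = text;
  while (rest.length > 0) {
    const buffer = new Uint8Array(chunkSize);
    // read: UTF-16 code units consumed; written: bytes stored in buffer.
    const { read, written } = encoder.encodeInto(rest, buffer);
    if (read === 0) {
      // The next code point needs more bytes than one chunk can hold.
      throw new RangeError("chunkSize is too small for the next code point");
    }
    chunks.push(buffer.subarray(0, written));
    rest = rest.slice(read);
  }
  return chunks;
}
```

A chunk size of at least 4 bytes guarantees progress, because no code point needs more than four bytes in UTF-8. The byte length of a whole string is available as <code>new TextEncoder().encode(text).byteLength</code>, at the cost of producing the bytes.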


== Examples ==
== Examples ==
Line 164: Line 195:


<pre>
<pre>
function encodeArrayOfStrings(strings) {
function encodeArrayOfStrings(strings, encoding) {
   var len, i, bytes, view, offset;
   var encoder, encoded, len, i, bytes, view, offset;
 
  encoder = new TextEncoder(encoding);
  encoded = [];


   len = Uint32Array.BYTES_PER_ELEMENT;
   len = Uint32Array.BYTES_PER_ELEMENT;
   for (i = 0; i < strings.length; i += 1) {
   for (i = 0; i < strings.length; i += 1) {
     len += Uint32Array.BYTES_PER_ELEMENT;
     len += Uint32Array.BYTES_PER_ELEMENT;
     len += stringEncoding.encodedLength(strings[i], "utf-8");
     encoded[i] = encoder.encode(strings[i]);
    len += encoded[i].byteLength;
   }
   }


Line 179: Line 214:
   view.setUint32(offset, strings.length);
   view.setUint32(offset, strings.length);
   offset += Uint32Array.BYTES_PER_ELEMENT;
   offset += Uint32Array.BYTES_PER_ELEMENT;
   for (i = 0; i < strings.length; i += 1) {
   for (i = 0; i < encoded.length; i += 1) {
     len = stringEncoding.encode(strings[i],
     len = encoded[i].byteLength;
                                new DataView(bytes.buffer, offset + Uint32Array.BYTES_PER_ELEMENT),
                                "utf-8");
     view.setUint32(offset, len);
     view.setUint32(offset, len);
     offset += Uint32Array.BYTES_PER_ELEMENT + len;
     offset += Uint32Array.BYTES_PER_ELEMENT;
    bytes.set(encoded[i], offset);
    offset += len;
   }
   }
   return bytes.buffer;
   return bytes.buffer;
Line 195: Line 230:


<pre>
<pre>
function decodeArrayOfStrings(buffer) {
function decodeArrayOfStrings(buffer, encoding) {
   var view, offset, num_strings, strings, i, len;
   var decoder, view, offset, num_strings, strings, i, len;


  decoder = new TextDecoder(encoding);
   view = new DataView(buffer);
   view = new DataView(buffer);
   offset = 0;
   offset = 0;
Line 207: Line 243:
     len = view.getUint32(offset);
     len = view.getUint32(offset);
     offset += Uint32Array.BYTES_PER_ELEMENT;
     offset += Uint32Array.BYTES_PER_ELEMENT;
     strings[i] = stringEncoding.decode(new DataView(buffer, offset, len),
     strings[i] = decoder.decode(
                                      "utf-8");
      new DataView(view.buffer, offset, len));
     offset += len;
     offset += len;
   }
   }
Line 217: Line 253:
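Because the diff above shows only fragments of the two example functions, the following is a complete round trip in the same length-prefixed layout. It is a sketch, not the page's own code. It assumes the <code>TextEncoder</code> and <code>TextDecoder</code> shipped in current runtimes, so the encoder is UTF-8 only and takes no <var>encoding</var> argument, and it passes the <code>ignoreBOM</code> option from the later Encoding Standard so that a string beginning with U+FEFF survives decoding.

```javascript
// Layout: uint32 count, then for each string a uint32 byte length followed by
// its UTF-8 bytes. DataView reads and writes big-endian by default.
function encodeArrayOfStrings(strings) {
  const encoder = new TextEncoder();
  const encoded = strings.map((s) => encoder.encode(s));

  let len = Uint32Array.BYTES_PER_ELEMENT;
  for (const item of encoded) {
    len += Uint32Array.BYTES_PER_ELEMENT + item.byteLength;
  }

  const bytes = new Uint8Array(len);
  const view = new DataView(bytes.buffer);
  let offset = 0;
  view.setUint32(offset, encoded.length);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (const item of encoded) {
    view.setUint32(offset, item.byteLength);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    bytes.set(item, offset);
    offset += item.byteLength;
  }
  return bytes.buffer;
}

function decodeArrayOfStrings(buffer) {
  // ignoreBOM keeps a leading U+FEFF in each string instead of stripping it.
  const decoder = new TextDecoder("utf-8", { ignoreBOM: true });
  const view = new DataView(buffer);
  let offset = 0;
  const count = view.getUint32(offset);
  offset += Uint32Array.BYTES_PER_ELEMENT;

  const strings = [];
  for (let i = 0; i < count; i += 1) {
    const len = view.getUint32(offset);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    strings.push(decoder.decode(new Uint8Array(buffer, offset, len)));
    offset += len;
  }
  return strings;
}
```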
== Encodings ==
== Encodings ==


Encodings are defined and implemented per [http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html Encoding]. This implicitly includes the steps to <em>get an encoding</em> from a string, and the logic for label matching and case-insensitivity.
Encodings are defined and implemented per [http://encoding.spec.whatwg.org/ Encoding]. This implicitly includes the steps to <em>get an encoding</em> from a string, and the logic for label matching and case-insensitivity.
 
User agents MUST NOT support any other encodings or labels than those defined in [http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html Encoding], with the additions defined below.


:''ISSUE: Is it a MUST to support all encodings from Encoding spec, or list a subset here?''
User agents MUST NOT support any other encodings or labels than those defined in [http://encoding.spec.whatwg.org/ Encoding], and MUST support all encodings and labels defined in that specification, with the additions defined below.


:''ISSUE: Should anything be said about Unicode normalization forms?''
:''NOTE: In [http://encoding.spec.whatwg.org/ Encoding], "ascii" is a label for '''windows-1252'''; there is no 7-bit-or-raise-exception encoding. Applications that are required to restrict the content of decoded strings should implement validation after decoding.''


:''ISSUE: In [http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html Encoding], "ascii" is a label for '''windows-1252'''; there is no 7-bit-or-raise-exception encoding.''
:''NOTE: Unicode normalization forms are outside the scope of this specification. No normalization is done prior to encoding or after decoding.''


:''ISSUE: In [http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html Encoding], "iso-8859-1" is a label for '''windows-1252''' for compatibility with existing web content. Is there any reason a new API like this might distinguish them?''
:''NOTE: Handling of encoding-specific issues, e.g. over-long UTF-8 encodings, byte order marks, unmatched surrogate pairs, and so on is defined by [http://encoding.spec.whatwg.org/ Encoding].''


:''NOTE: Handling of encoding-specific issues, e.g. over-long UTF-8 encodings, byte order marks, unmatched surrogate pairs, and so on is defined by [http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html Encoding].''
== References ==


=== Additional Encodings ===
* WebIDL http://dev.w3.org/2006/webapi/WebIDL
* Encoding http://encoding.spec.whatwg.org/


The following additional encodings are defined by this specification. They are specific to the methods defined herein.
<table border=1 cellpadding=5>
<tr><th>Name<th>Labels
<tr><td>binary<td>"binary"
</table>
==== binary ====
The '''binary''' encoding is a single-byte encoding where the input code point and output byte are identical for the range '''U+0000''' to '''U+00FF'''.
:''NOTE: This encoding is intended to allow interoperation with legacy code that encodes binary data in ECMAScript strings, for example the WindowBase64 methods [http://dev.w3.org/html5/spec-author-view/webappapis.html#dom-windowbase64-atob atob()] and [http://dev.w3.org/html5/spec-author-view/webappapis.html#dom-windowbase64-btoa btoa()]. It is recommended that new Web applications use Typed Arrays for transmission and storage of binary data.''
If '''binary''' is selected as the encoding, then step 3 of the steps to ''decode a byte stream'' are skipped.
:''NOTE: If <code>"binary"</code> is specified, byte order marks must be ignored.''
The '''binary decoder''' is:
# Let <var>byte</var> be byte pointer.
# If <var>byte</var> is the EOF byte, emit the EOF code point.
# Increase the byte pointer by one.
# If <var>byte</var> is in the range 0x00 to 0xFF, emit a code point whose value is <var>byte</var>
# Return failure
The '''binary encoder''' is:
# Let <var>code point</var> be the code point pointer
# If <var>code point</var> is the EOF code point, emit the EOF byte.
# Increase the code point pointer by one.
# If <var>code point</var> is in the range U+0000 to U+00FF, emit a byte whose value is <var>code point</var>
# Return failure
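The two algorithms above amount to a one-to-one mapping between bytes and the code points U+0000 to U+00FF. Since the '''binary''' encoding was removed from the proposal and is not available in <code>TextEncoder</code> or <code>TextDecoder</code>, the following is only a sketch of that mapping in plain script. The names <code>binaryDecode</code> and <code>binaryEncode</code> are hypothetical, and throwing <code>RangeError</code> stands in for "return failure".

```javascript
// Hypothetical helpers sketching the removed "binary" encoding.

// Each byte 0x00..0xFF becomes the code point with the same value.
function binaryDecode(bytes) {
  let result = "";
  for (const byte of bytes) {
    result += String.fromCharCode(byte);
  }
  return result;
}

// Each code unit U+0000..U+00FF becomes the byte with the same value.
function binaryEncode(string) {
  const bytes = new Uint8Array(string.length);
  for (let i = 0; i < string.length; i += 1) {
    const codeUnit = string.charCodeAt(i);
    if (codeUnit > 0xFF) {
      // Stands in for "return failure" in the encoder steps.
      throw new RangeError("Code point out of range at index " + i);
    }
    bytes[i] = codeUnit;
  }
  return bytes;
}
```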


== Acknowledgements ==
== Acknowledgements ==
Line 273: Line 278:
* Robert Mustacchi
* Robert Mustacchi
* Ryan Dahl
* Ryan Dahl
 
* Anne van Kesteren
* Cameron McCormack


== Appendix ==
== Appendix ==


A "shim" implementation in JavaScript (that may not fully match the current version of the spec) plus some initial unit tests can be found at:
A "shim" implementation in ECMAScript (that may not fully match the current version of the spec) plus some initial unit tests can be found at:


:http://code.google.com/p/stringencoding/
:http://code.google.com/p/stringencoding/
[[Category:Proposals]]

Latest revision as of 15:16, 27 September 2014

This document is obsolete.

For the current specification, see: Encoding Standard: API


Proposed Text Encoding Web API for Typed Arrays

Editors

  • Joshua Bell (Google, Inc)

Abstract

This specification defines an API for encoding strings to binary data, and decoding strings from binary data.

NOTE: This specification intentionally does not address the opposite scenario of encoding binary data as strings and decoding binary data from strings, for example using Base64 encoding.

Discussion on this topic has so far taken place on the WebGL public mailing list. See http://www.khronos.org/webgl/public-mailing-list/archives/1111/msg00017.html for the initial discussion thread.

Discussion has since moved to the WHATWG spec discussion mailing list. See http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-March/035038.html for the latest discussion thread.

Open Issues

  • Should the encoding attribute return the Name of the encoding or the name that was passed in?

Notes to Implementers

  • Streaming decode/encode requires retaining partial buffers between calls.
    • Some encode/decode algorithms require adjusting the code point pointer or byte pointer by a negative amount. This could occur across "chunk" boundaries. This implies that when the internal streaming flag is set on an encoder/decoder that the last N elements of the stream are saved for the next call and used as a prefix for the stream. N is defined by the specific encoding algorithm.
    • This is not yet implemented in the ECMAScript shim

Resolved Issues

  • Encode as many characters as possible into a fixed-size buffer for transmission, and repeat starting with the next unencoded character
    • Resolution: not for "v1". This can be implemented using this API, and breaking the string down by "character" is unlikely to be as obvious as it sounds (surrogate pairs, combining sequences, etc.).
  • Support two versions of encode: one which takes a target buffer, and one which creates and returns a right-sized buffer
    • Resolution: See above, and wait for developer feedback.
  • Allow arbitrary end byte sequences (e.g. 0xFF for UTF-8 strings)
    • Resolution: Add indexOf to ArrayBufferView
  • Remove binary encoding? (a proposed 8-bit-clean encoding for interop with legacy binary data stored in ECMAScript strings)
    • The only real use is with atob()/btoa(); a better API would be Base64 directly in/out of Typed Arrays. Consensus on WHATWG is to add better APIs, e.g.:

partial interface ArrayBufferView {
    DOMString toBase64();
};

partial interface ArrayBuffer {
    static ArrayBuffer fromBase64(DOMString string);
};

  • Should this be a standalone API (as written), or live on e.g. DataView, or on e.g. String?
    • There seems to be pretty strong consensus that streaming, stateful coding is high priority and that an object-oriented API is cleanest, and shoe-horning those onto existing objects would be messy.
  • Should legacy encodings be supported?
    • Resolution: Consensus on the WHATWG mailing list is to support legacy encodings for decode, and only 'utf-8', 'utf-16' and 'utf-16be' for encode.
  • What to do on encoding errors? If non-UTF encodings are supported then we may want to allow substitution (e.g. ASCII '?') or a script callback (for arbitrary escaping).
    • Resolution: not for "v1" (see above)
  • How are byte order marks handled?
    • BOM is respected if and only if the requested encoding is a case-insensitive match for "utf-16"
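
The first resolved issue says that filling a fixed-size buffer can be built on top of this API. One possible sketch follows, assuming UTF-8 and treating "character" as a Unicode code point (so surrogate pairs stay together, while combining sequences may still be split). The helper name encodeChunk is hypothetical and not part of this proposal.

```javascript
// Hypothetical helper: copies as many whole code points of `string` as fit
// into `target` (a Uint8Array) and reports how far it got.
function encodeChunk(string, target) {
  const encoder = new TextEncoder(); // defaults to UTF-8
  let written = 0; // bytes stored in target
  let read = 0;    // UTF-16 code units consumed from string

  // for...of iterates a string by code point, keeping surrogate pairs whole.
  for (const ch of string) {
    const bytes = encoder.encode(ch);
    if (written + bytes.length > target.length) {
      break; // the next code point does not fit; leave it for the next chunk
    }
    target.set(bytes, written);
    written += bytes.length;
    read += ch.length;
  }
  return { read: read, written: written };
}

// Usage: transmit target.subarray(0, written), then continue from string.slice(read).
const chunk = new Uint8Array(4);
const progress = encodeChunk("a\u20ACb\u20AC", chunk);
```

Encoding one code point at a time is simple rather than fast; it is meant only to show that the scenario does not need a dedicated method in "v1".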

API

TextEncoder

WebIDL

dictionary TextEncodeOptions {
  boolean stream = false;
};

[Constructor(optional DOMString encoding)]
interface TextEncoder {
  readonly attribute DOMString encoding;
  Uint8Array encode(DOMString? string, optional TextEncodeOptions options);
};

The constructor runs the following steps:

  1. If the constructor is called with no arguments, let label be "utf-8". Otherwise, let label be the value of the encoding argument.
  2. Run the steps to get an encoding from Encoding, with label as label.
    • If the steps result in failure, throw an "EncodingError" exception and terminate these steps.
    • Otherwise, if the Name of the returned encoding is not one of "utf-8", "utf-16", or "utf-16be", throw an "EncodingError" exception and terminate these steps.
    • Otherwise, set the encoder object's internal encoding property to the returned encoding.
  3. Initialize the internal streaming flag of the encoder object to false.
  4. Initialize the internal encoding algorithm state to the default values for the encoding encoding.
encoding of type DOMString, readonly
Returns the Name of the encoder object's encoding, per Encoding.
Note that this may differ from the name of the encoding specified during the call to the constructor. For example, if the constructor is called with encoding of "utf8" the encoding attribute of the encoder object would have the value "utf-8" as "utf8" is a label for that encoding. (A label such as "ascii", which maps to "windows-1252", causes the constructor to throw, because only UTF encodings are supported for encoding.)
encode
The encode method runs these steps:
  1. If the internal streaming flag is not set, then reset the encoding algorithm state to the default values for encoding. Otherwise, the encoding algorithm state is re-used from the previous call to encode on this object.
  2. If the options parameter is specified and the stream option is true, then the internal streaming flag is set; otherwise the internal streaming flag is cleared.
  3. Run the steps of the encoding algorithm:
    • The input to the algorithm is a stream of code points. The stream is composed of the Unicode code points for the Unicode characters produced by following the steps to convert a DOMString to a sequence of Unicode characters in WebIDL with string as the input. If string is null, the stream of code points is empty.
    • If the options parameter is not specified or the stream option is false, then the EOF code point is yielded after the final code point is yielded by the stream.
    • The output of the algorithm is a sequence of emitted bytes.
  4. Return a Uint8Array object wrapping an ArrayBuffer containing the sequence of bytes emitted by the encoding algorithm.
NOTE: Because only UTF encodings are supported, and because of the use of the steps to convert a DOMString to a sequence of Unicode characters and thus Unicode code points, no input can cause the encoding process to emit an encoder error.
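
A short sketch of the behaviour described above, assuming a runtime that exposes a TextEncoder whose no-argument constructor and encode method match this section. The unpaired-surrogate case shows why no input produces an encoder error: the DOMString conversion replaces it with U+FFFD before encoding.

```javascript
// No argument: the label defaults to "utf-8".
const encoder = new TextEncoder();
const name = encoder.encoding;

// U+20AC EURO SIGN encodes to three UTF-8 bytes: 0xE2 0x82 0xAC.
const bytes = encoder.encode("\u20AC");

// An unpaired surrogate is converted to U+FFFD (0xEF 0xBF 0xBD) rather than failing.
const lone = encoder.encode("\uD800");
```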

TextDecoder

WebIDL

dictionary TextDecoderOptions {
  boolean fatal = false;
};

dictionary TextDecodeOptions {
  boolean stream = false;
};

[Constructor(optional DOMString encoding, optional TextDecoderOptions options)]
interface TextDecoder {
  readonly attribute DOMString encoding;
  DOMString decode(optional ArrayBufferView view, optional TextDecodeOptions options);
};

The constructor runs the following steps:

  1. The constructor creates a decoder object. It has the internal properties encoding, a fatal flag which is initially unset, a streaming flag which is initially unset, an offset pointer which is initially 0, and a useBOM flag which is initially unset.
  2. If called without an encoding argument, let label be "utf-8". Otherwise, let label be the value of the encoding argument.
  3. If label is a case-insensitive match for "utf-16" then set the internal useBOM flag.
  4. Run the steps to get an encoding from Encoding, with label as label.
    1. If the steps result in failure, throw an "EncodingError" exception and terminate these steps.
    2. Otherwise, set the decoder object's internal encoding property to the returned encoding.
  5. If the constructor is called with an options argument, and the fatal property of the dictionary is set, set the internal fatal flag of the decoder object.
  6. Initialize the internal encoding algorithm state to the default values for the encoding encoding.
  7. Return the decoder object.
encoding of type DOMString, readonly
Returns the Name of the decoder object's encoding, per Encoding.
Note that this may differ from the name of the encoding specified during the call to the constructor. For example, if the constructor is called with encoding of "ascii" the encoding attribute of the decoder object would have the value "windows-1252" as "ascii" is a label for that encoding.
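
The label handling above can be sketched as follows, assuming a runtime TextDecoder that follows these constructor steps. Which exception type is thrown for an unknown label is an assumption that varies: this draft specifies "EncodingError", so the sketch catches any exception rather than checking the type.

```javascript
// No argument: the label defaults to "utf-8".
const byDefault = new TextDecoder();

// Labels are matched case-insensitively; "UTF8" is a label for utf-8,
// so the encoding attribute reports the encoding's Name, not the label.
const byLabel = new TextDecoder("UTF8");

// An unknown label makes the constructor throw.
let rejected = false;
try {
  new TextDecoder("no-such-encoding");
} catch (e) {
  rejected = true;
}
```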
decode
The decode method runs the following steps:
  1. If view is not specified, let view be a Uint8Array of length 0.
  2. If the internal streaming flag of the decoder object is not set, then reset the encoding algorithm state to the default values for the encoding encoding and set offset to 0. Otherwise, the encoding algorithm state is re-used from the previous call to decode on this object.
  3. If the options parameter is specified and the stream option is true, then the internal streaming flag is set. Otherwise the internal streaming flag is cleared.
  4. Run or resume the decoder algorithm of the decoder object's encoding, with the following additions:
    • The input to the algorithm is a byte stream.
      • If offset is greater than 0, the bytes in byte stream for positions less than offset are provided by buffer(s) passed in to previous calls to decode().
      • The bytes in byte stream for positions offset through offset + view.byteLength - 1 are provided by the bytes in view.buffer starting at offset view.byteOffset.
      • When accessing the byte in byte stream at position offset + view.byteLength:
        • If the internal streaming flag is not set, then yield the EOF byte.
        • Otherwise, set offset to offset + view.byteLength and suspend the steps of the decoder algorithm until a subsequent call to decode()
    • If encoding is one of "utf-8", "utf-16" or "utf-16be" and offset is 0, then prior to running the steps of the decoder algorithm:
      • If the internal useBOM flag is set, then:
        • If fewer than two bytes are present in the stream and the internal streaming flag is true, suspend this algorithm and return the empty string.
        • If the next two bytes of the stream are 0xFF 0xFE then set offset to 2, and clear the utf-16be flag of the decoder state.
        • If the next two bytes of the stream are 0xFE 0xFF then set offset to 2, and set the utf-16be flag of the decoder state.
      • If encoding is "utf-8", then:
        • If fewer than three bytes are present in the stream and the internal streaming flag is true, suspend this algorithm and return the empty string.
        • If the next three bytes of the stream are 0xEF 0xBB 0xBF then set offset to 3.
      • If encoding is "utf-16", then:
        • If fewer than two bytes are present in the stream and the internal streaming flag is true, suspend this algorithm and return the empty string.
        • If the next two bytes of the stream are 0xFF 0xFE then set offset to 2.
      • If encoding is "utf-16be", then:
        • If fewer than two bytes are present in the stream and the internal streaming flag is true, suspend this algorithm and return the empty string.
        • If the next two bytes of the stream are 0xFE 0xFF then set offset to 2.
    • If the internal fatal flag of the decoder object is set, then a decoder error causes a DOMException of type EncodingError to be thrown rather than emitting a fallback code point.
    • The output of the algorithm is a sequence of emitted code points.
  5. Return an IDL DOMString value that represents the sequence of code units resulting from encoding the sequence of emitted code points as UTF-16.
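
Two of the decode behaviours above, byte order mark skipping for "utf-8" and the fatal flag, can be sketched as follows. This assumes a runtime TextDecoder matching this section. The exception type thrown in fatal mode is not checked, because it may differ from the EncodingError named in this draft.

```javascript
// A leading UTF-8 byte order mark (0xEF 0xBB 0xBF) is skipped, not emitted.
const withBom = new Uint8Array([0xEF, 0xBB, 0xBF, 0x68, 0x69]); // BOM + "hi"
const text = new TextDecoder("utf-8").decode(withBom);

// 0xFF can never appear in well-formed UTF-8.
const invalid = new Uint8Array([0x61, 0xFF, 0x62]); // "a", bad byte, "b"

// Without the fatal flag, the error becomes the fallback code point U+FFFD.
const lenient = new TextDecoder("utf-8").decode(invalid);

// With the fatal flag, the same input throws instead.
let threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(invalid);
} catch (e) {
  threw = true;
}
```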

Examples

Example #1 - encoding strings

The following example uses the API to encode an array of strings into an ArrayBuffer. The resulting buffer contains the number of strings (as a Uint32), followed by the length of the first string (as a Uint32), the encoded string data, the length of the second string (as a Uint32), the string data, and so on.

function encodeArrayOfStrings(strings, encoding) {
  var encoder, encoded, len, i, bytes, view, offset;

  encoder = new TextEncoder(encoding);
  encoded = [];

  // First pass: encode every string and total the bytes required.
  len = Uint32Array.BYTES_PER_ELEMENT; // leading string count
  for (i = 0; i < strings.length; i += 1) {
    len += Uint32Array.BYTES_PER_ELEMENT; // per-string length prefix
    encoded[i] = encoder.encode(strings[i]);
    len += encoded[i].byteLength;
  }

  bytes = new Uint8Array(len);
  view = new DataView(bytes.buffer);
  offset = 0;

  // Second pass: write the count, then each length-prefixed string.
  view.setUint32(offset, strings.length);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < encoded.length; i += 1) {
    len = encoded[i].byteLength;
    view.setUint32(offset, len);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    bytes.set(encoded[i], offset);
    offset += len;
  }
  return bytes.buffer;
}

Example #2 - decoding strings

The following example decodes an ArrayBuffer containing data encoded in the format produced by the previous example back into an array of strings.

function decodeArrayOfStrings(buffer, encoding) {
  var decoder, view, offset, num_strings, strings, i, len;

  decoder = new TextDecoder(encoding);
  view = new DataView(buffer);
  offset = 0;
  strings = [];

  num_strings = view.getUint32(offset);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < num_strings; i += 1) {
    len = view.getUint32(offset);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    strings[i] = decoder.decode(
      new DataView(view.buffer, offset, len));
    offset += len;
  }
  return strings;
}

Encodings

Encodings are defined and implemented per Encoding. This implicitly includes the steps to get an encoding from a string, and the logic for label matching and case-insensitivity.

User agents MUST NOT support any other encodings or labels than those defined in Encoding, and MUST support all encodings and labels defined in that specification, with the additions defined below.

NOTE: In Encoding, "ascii" is a label for windows-1252; there is no 7-bit-or-raise-exception encoding. Applications that are required to restrict the content of decoded strings should implement validation after decoding.
NOTE: Unicode normalization forms are outside the scope of this specification. No normalization is done prior to encoding or after decoding.
NOTE: Handling of encoding-specific issues, e.g. over-long UTF-8 encodings, byte order marks, unmatched surrogate pairs, and so on is defined by Encoding.
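
As one way to follow the first note, an application that needs strict 7-bit content could validate after decoding. The helper name decodeStrictAscii is hypothetical, and the sketch decodes as "utf-8" only for portability; the same check could follow a decode with any other label.

```javascript
// Hypothetical helper: decode, then reject anything outside the ASCII range.
function decodeStrictAscii(bytes) {
  const text = new TextDecoder("utf-8").decode(bytes);
  for (const ch of text) {
    // Invalid bytes decode to U+FFFD, which is also above 0x7F and rejected here.
    if (ch.codePointAt(0) > 0x7F) {
      throw new RangeError("non-ASCII content in decoded string");
    }
  }
  return text;
}

const greeting = decodeStrictAscii(new Uint8Array([0x68, 0x69])); // bytes for "hi"
```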

Acknowledgements

  • Alan Chaney
  • Ben Noordhuis
  • Glenn Maynard
  • John Tamplin
  • Kenneth Russell (Google, Inc)
  • Robert Mustacchi
  • Ryan Dahl
  • Anne van Kesteren
  • Cameron McCormack

Appendix

A "shim" implementation in ECMAScript (that may not fully match the current version of the spec) plus some initial unit tests can be found at:

http://code.google.com/p/stringencoding/