Difference between revisions of "StringEncoding"

From WHATWG Wiki
Revision as of 17:13, 24 May 2012

Proposed Text Encoding Web API for Typed Arrays

Editors

  • Joshua Bell (Google, Inc)

Abstract

This specification defines an API for encoding strings to binary data, and decoding strings from binary data.

NOTE: This specification intentionally does not address the opposite scenario of encoding binary data as strings and decoding binary data from strings, for example using Base64 encoding.

Discussion on this topic has so far taken place on the public_webgl@khronos.org mailing list. See http://www.khronos.org/webgl/public-mailing-list/archives/1111/msg00017.html for the initial discussion thread.

Discussion has since moved to the WHATWG spec discussion mailing list. See http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-March/035038.html for the latest discussion thread.

Open Issues

General: Should this be a standalone API (as written), or live on e.g. DataView, or on e.g. String?

Scenarios

  • Encode as many characters as possible into a fixed-size buffer for transmission, and repeat starting with next unencoded character
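A minimal sketch of this scenario follows. It is not written against the API drafted on this page: it assumes a runtime that provides the later `TextEncoder.prototype.encodeInto` method, which fills a caller-supplied buffer with UTF-8 and reports how many code units were consumed. The helper name `encodeInChunks` and the `chunkSize` parameter are hypothetical.

```javascript
// Hypothetical helper: splits a string into UTF-8 chunks of at most
// chunkSize bytes, never splitting a character across two chunks.
// Assumes a runtime that provides TextEncoder.prototype.encodeInto.
function encodeInChunks(string, chunkSize) {
  var encoder = new TextEncoder();
  var chunks = [];
  var position = 0;
  var buffer, result;

  while (position < string.length) {
    buffer = new Uint8Array(chunkSize);
    // result.read: UTF-16 code units consumed; result.written: bytes written.
    result = encoder.encodeInto(string.slice(position), buffer);
    if (result.read === 0) {
      throw new Error("chunkSize is too small for the next character");
    }
    chunks.push(buffer.subarray(0, result.written));
    // Resume with the next unencoded character.
    position += result.read;
  }
  return chunks;
}
```

Each returned chunk can be transmitted on its own; a chunk may be shorter than `chunkSize` when the next character would not fit.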

Desired Features

  • Allow arbitrary end byte sequences (e.g. 0xFF for UTF-8 strings)
    • Tentative Resolution: Add indexOf to ArrayBufferView
  • Encoding errors
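As an illustration of the terminator scenario, here is a sketch that assumes `indexOf` is available on `Uint8Array` (as the tentative resolution proposes) and that the runtime provides a `TextDecoder` global. The helper name `decodeTerminated` is hypothetical. 0xFF works as a terminator for UTF-8 because that byte never occurs in well-formed UTF-8.

```javascript
// Hypothetical helper: decodes the UTF-8 bytes that precede the first
// occurrence of the terminator byte, or all bytes if it is absent.
function decodeTerminated(bytes, terminator) {
  var end = bytes.indexOf(terminator);
  if (end === -1) {
    end = bytes.length;
  }
  // subarray creates a view over the same buffer; no bytes are copied.
  return new TextDecoder("utf-8").decode(bytes.subarray(0, end));
}
```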

API cleanup

  • Support two versions of encode: one which takes a target buffer, and one which creates and returns a right-sized buffer
    • Tentative Resolution: Wait for developer feedback.

Spec Issues

ISSUE: Encoding defines the byte order mark as more authoritative than anything else. Is that desirable here?
  • Remove binary encoding? The only real use is with atob()/btoa(); a better API would be Base64 directly in/out of Typed Arrays

API

The default encoding is "utf-8".

TextEncoder

WebIDL

dictionary TextEncodeOptions {
  boolean stream = false;
};

[Constructor,
 Constructor(DOMString encoding)]
interface TextEncoder {
  readonly attribute DOMString encoding;
  ArrayBufferView encode(DOMString? string, optional TextEncodeOptions options);
};

If the constructor is called with no arguments, let encoding be the default encoding.

The constructor follows the steps to get an encoding from Encoding, with encoding as label. If the steps result in failure, a DOMException of type EncodingError is thrown. Otherwise, set the encoder object's internal encoding property to the returned encoding. Initialize the internal streaming flag of the encoder object to false. Initialize the internal encoding algorithm state to the default values for the encoding encoding.

encoding of type DOMString, readonly
Returns the Name of the encoder object's encoding, per Encoding.
Note that this may differ from the name of the encoding specified during the call to the constructor. For example, if the constructor is called with an encoding of "ascii", the encoding attribute of the encoder object would have the value "windows-1252", as "ascii" is a label for that encoding.
encode
The encode method runs these steps:
  1. If the internal streaming flag is not set, then reset the encoding algorithm state to the default values for encoding. Otherwise, the encoding algorithm state is re-used from the previous call to encode on this object.
  2. If the options parameter is specified and the stream option is true, then the internal streaming flag is set; otherwise the internal streaming flag is cleared.
  3. Run the steps of the encoding algorithm:
    • The input to the algorithm is a stream of code points. The code units within the DOMString string are interpreted as UTF-16 code units, to produce a stream of code points; if string is null, the stream is empty.
      ISSUE: Interpreting a DOMString as UTF-16 to yield a code unit stream needs to be defined, including unpaired surrogates. WebIDL only defines the reverse.
    • If the options parameter is not specified or the stream option is false, then after the final code point is yielded by the stream, the EOF code point is yielded.
    • The output of the algorithm is a sequence of emitted bytes.
  4. Return a Uint8Array object wrapping an ArrayBuffer containing the sequence of bytes emitted by the encoder algorithm.
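The ISSUE above notes that turning the code units of a DOMString into a stream of code points is not yet defined. The sketch below shows one possible interpretation, not the draft's: surrogate pairs are combined, and each unpaired surrogate is replaced with U+FFFD. Both the helper name `toCodePoints` and the replacement behaviour are assumptions made for illustration.

```javascript
// Hypothetical helper: interprets a string's UTF-16 code units as a
// sequence of code points.
function toCodePoints(string) {
  var codePoints = [];
  var i, unit, next;

  for (i = 0; i < string.length; i += 1) {
    unit = string.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // Lead surrogate: combine with a following trail surrogate, if any.
      next = i + 1 < string.length ? string.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        codePoints.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
        i += 1; // the trail surrogate has been consumed
      } else {
        codePoints.push(0xFFFD); // assumption: unpaired lead becomes U+FFFD
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      codePoints.push(0xFFFD); // assumption: unpaired trail becomes U+FFFD
    } else {
      codePoints.push(unit);
    }
  }
  return codePoints;
}
```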

TextDecoder

WebIDL

dictionary TextDecoderOptions {
  boolean fatal = false;
};

dictionary TextDecodeOptions {
  boolean stream = false;
};

[Constructor,
 Constructor(optional DOMString encoding, optional TextDecoderOptions options)]
interface TextDecoder {
  readonly attribute DOMString encoding;
  DOMString decode(optional ArrayBufferView view, optional TextDecodeOptions options);
};

The constructor creates a decoder object. It has the internal properties encoding, a fatal flag which is initially unset, and a streaming flag which is initially unset.

If called without an encoding argument, let encoding be the default encoding.

The constructor follows the steps to get an encoding from Encoding, with encoding as label. If the steps result in failure, a DOMException of type EncodingError is thrown. Otherwise, set the decoder object's internal encoding property to the returned encoding.

If the constructor is called with an options argument, and the fatal property of the dictionary is set, the internal fatal flag of the decoder object is set, otherwise the internal fatal flag is cleared.

Initialize the internal encoding algorithm state to the default values for the encoding encoding.
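The constructor behaviour can be sketched as follows, assuming a runtime that provides a `TextDecoder` global accepting a label and a `fatal` option as drafted here. The wrapper name `createDecoder` is hypothetical. Because the type of the exception thrown for an unknown label may differ from the `EncodingError` named in this draft, the sketch catches any exception rather than testing for a specific type.

```javascript
// Hypothetical wrapper: returns a decoder for the given label, or null
// if the label does not name a supported encoding.
function createDecoder(label, fatal) {
  try {
    // An undefined label selects the default encoding.
    return new TextDecoder(label, { fatal: Boolean(fatal) });
  } catch (e) {
    return null;
  }
}

var defaultDecoder = createDecoder();       // uses the default encoding
var aliasDecoder = createDecoder("utf8");   // "utf8" is a label for utf-8
var unknown = createDecoder("no-such-encoding"); // null: label lookup failed
```

Here `aliasDecoder.encoding` reports the name of the encoding rather than the label passed in.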

encoding of type DOMString, readonly
Returns the Name of the decoder object's encoding, per Encoding.
Note that this may differ from the name of the encoding specified during the call to the constructor. For example, if the constructor is called with an encoding of "ascii", the encoding attribute of the decoder object would have the value "windows-1252", as "ascii" is a label for that encoding.
decode
The decode method runs the following steps:
  1. If the internal streaming flag of the decoder object is not set, then reset the encoding algorithm state to the default values for encoding encoding. Otherwise, the encoding algorithm state is re-used from the previous call to decode on this object.
  2. If the options parameter is specified and the stream option is true, then the internal streaming flag is set; otherwise the internal streaming flag is cleared.
  3. Run the decoder algorithm of the decoder object's encoding:
    • The input to the algorithm is a byte stream. The byte stream is provided by the bytes in view.buffer starting at offset view.byteOffset. A maximum of view.byteLength bytes are yielded by the stream from view.buffer. If view is not specified, the stream is empty.
    • If the options parameter is not specified or the stream option is false, then after view.byteLength bytes are yielded by the stream, the EOF byte is yielded.
    • If the internal fatal flag of the decoder object is set, then a decoder error causes a DOMException of type EncodingError to be thrown rather than emitting a fallback code point.
    • The output of the algorithm is a sequence of emitted code points.
  4. Return a DOMString by encoding the sequence of emitted code points as UTF-16 as per WebIDL.
ISSUE: Need to handle BOMs. Encoding specifies this by using the caller-specified encoding as a suggestion, and consuming the BOM as a part of selecting the real encoder where BOM takes precedence. At the very least a matching BOM should be ignored. A mismatching BOM could throw, be a decoding error, or could actually switch the decoder for this stream (call or sequence of calls), possibly if-and-only-if the constructor was called without an encoding.
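The effect of the stream option and the fatal flag can be sketched as follows, assuming a runtime that provides a `TextDecoder` global with the `decode(view, options)` signature drafted above. The helper names `decodeChunks` and `isWellFormed` are hypothetical. The type of the exception raised for a decoder error may differ from the `EncodingError` named in this draft, so the sketch catches any exception.

```javascript
// Hypothetical helper: decodes a sequence of chunks as one byte stream.
// Every chunk except the last is decoded with stream: true, so a character
// split across two chunks is carried over instead of being reported as an
// error. The last call omits streaming, which yields the EOF byte.
function decodeChunks(chunks, encoding) {
  var decoder = new TextDecoder(encoding || "utf-8");
  var result = "";
  var i;
  for (i = 0; i < chunks.length; i += 1) {
    result += decoder.decode(chunks[i], { stream: i < chunks.length - 1 });
  }
  return result;
}

// Hypothetical helper: with the fatal flag set, a decoder error throws
// rather than emitting a fallback code point.
function isWellFormed(bytes, encoding) {
  var decoder = new TextDecoder(encoding || "utf-8", { fatal: true });
  try {
    decoder.decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}
```

For example, U+20AC is encoded in UTF-8 as the three bytes 0xE2 0x82 0xAC. Passing the chunks [0xE2, 0x82] and [0xAC] to `decodeChunks` yields that single character, because the first call retains the incomplete sequence as algorithm state.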

Examples

Example #1 - encoding strings

The following example uses the API to encode an array of strings into an ArrayBuffer. The buffer contains the number of strings (as a Uint32), followed by the length in bytes of the first encoded string (as a Uint32), the encoded string data (UTF-8 by default), the length of the second string (as a Uint32), the string data, and so on.

function encodeArrayOfStrings(strings, encoding) {
  var encoder, encoded, len, i, bytes, view, offset;

  // A single encoder is created and reused for every string.
  encoder = new TextEncoder(encoding);
  encoded = [];

  // Compute the total size: a Uint32 string count, then for each string
  // a Uint32 byte length followed by the encoded bytes.
  len = Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < strings.length; i += 1) {
    len += Uint32Array.BYTES_PER_ELEMENT;
    encoded[i] = encoder.encode(strings[i]);
    len += encoded[i].byteLength;
  }

  bytes = new Uint8Array(len);
  view = new DataView(bytes.buffer);
  offset = 0;

  // DataView writes big-endian values unless told otherwise.
  view.setUint32(offset, strings.length);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < encoded.length; i += 1) {
    len = encoded[i].byteLength;
    view.setUint32(offset, len);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    bytes.set(encoded[i], offset);
    offset += len;
  }
  return bytes.buffer;
}

Example #2 - decoding strings

The following example decodes an ArrayBuffer containing data encoded in the format produced by the previous example back into an array of strings.

function decodeArrayOfStrings(buffer, encoding) {
  var decoder, view, offset, num_strings, strings, i, len;

  decoder = new TextDecoder(encoding);
  view = new DataView(buffer);
  offset = 0;
  strings = [];

  // The buffer starts with the number of strings as a Uint32.
  num_strings = view.getUint32(offset);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < num_strings; i += 1) {
    // Each string is stored as a Uint32 byte length followed by its bytes.
    len = view.getUint32(offset);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    strings[i] = decoder.decode(
      new DataView(view.buffer, offset, len));
    offset += len;
  }
  return strings;
}

Encodings

Encodings are defined and implemented per Encoding. This implicitly includes the steps to get an encoding from a string, and the logic for label matching and case-insensitivity.

User agents MUST NOT support any other encodings or labels than those defined in Encoding, and MUST support all encodings and labels defined in that specification, with the additions defined below.

NOTE: In Encoding, "ascii" is a label for windows-1252; there is no 7-bit-or-raise-exception encoding. Applications that are required to restrict the content of decoded strings should implement validation after decoding.
NOTE: Unicode normalization forms are outside the scope of this specification. No normalization is done prior to encoding or after decoding.
NOTE: Handling of encoding-specific issues, e.g. over-long UTF-8 encodings, byte order marks, unmatched surrogate pairs, and so on is defined by Encoding.
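As the first NOTE suggests, an application that needs strict 7-bit content can validate the string after decoding. A minimal sketch of such a check follows; the helper name `isSevenBitClean` is hypothetical and not part of this API.

```javascript
// Hypothetical helper: returns true if every code unit in the string is
// in the 7-bit range U+0000 to U+007F.
function isSevenBitClean(string) {
  var i;
  for (i = 0; i < string.length; i += 1) {
    if (string.charCodeAt(i) > 0x7F) {
      return false;
    }
  }
  return true;
}
```

An application would call this on the result of decode and reject the input itself if the check fails.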

Additional Encodings

The following additional encodings are defined by this specification. They are specific to the methods defined herein.

Name      Labels
binary    "binary"

binary

The binary encoding is a single-byte encoding where the input code point and output byte are identical for the range U+0000 to U+00FF.

NOTE: This encoding is intended to allow interoperation with legacy code that encodes binary data in ECMAScript strings, for example the WindowBase64 methods atob() and btoa(). It is recommended that new Web applications use Typed Arrays for transmission and storage of binary data.

If binary is selected as the encoding, then step 3 of the steps to decode a byte stream is skipped; no BOM detection is performed.

NOTE: If "binary" is specified, byte order marks must be ignored.

The binary decoder is:

  1. Let byte be byte pointer.
  2. If byte is the EOF byte, emit the EOF code point.
  3. Increase the byte pointer by one.
  4. Emit a code point whose value is byte.

The binary encoder is:

  1. Let code point be the code point pointer.
  2. If code point is the EOF code point, emit the EOF byte.
  3. Increase the code point pointer by one.
  4. If code point is in the range U+0000 to U+00FF, emit a byte whose value is code point.
  5. Otherwise, emit an encoder error.
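The two algorithms above can be sketched in plain ECMAScript as follows. The function names `binaryDecode` and `binaryEncode` are hypothetical, and reporting an encoder error by throwing a generic `Error` is an assumption; this section does not say how the error surfaces to the caller.

```javascript
// Hypothetical helper: each byte becomes the code point of the same value.
function binaryDecode(bytes) {
  var result = "";
  var i;
  for (i = 0; i < bytes.length; i += 1) {
    result += String.fromCharCode(bytes[i]);
  }
  return result;
}

// Hypothetical helper: each code point in the range U+0000 to U+00FF
// becomes the byte of the same value; anything else is an encoder error.
function binaryEncode(string) {
  var bytes = new Uint8Array(string.length);
  var i, codeUnit;
  for (i = 0; i < string.length; i += 1) {
    codeUnit = string.charCodeAt(i);
    if (codeUnit > 0xFF) {
      // Assumption: the encoder error is surfaced as an exception.
      throw new Error("encoder error at index " + i);
    }
    bytes[i] = codeUnit;
  }
  return bytes;
}
```

Strings of this shape are what atob() returns and what btoa() accepts.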

References


Acknowledgements

  • Alan Chaney
  • Ben Noordhuis
  • Glenn Maynard
  • John Tamplin
  • Kenneth Russell (Google, Inc)
  • Robert Mustacchi
  • Ryan Dahl
  • Anne van Kesteren

Appendix

A "shim" implementation in JavaScript (that may not fully match the current version of the spec) plus some initial unit tests can be found at:

http://code.google.com/p/stringencoding/