StringEncoding: Difference between revisions

Revision as of 21:26, 14 March 2012

Proposed String Encoding API for Typed Arrays

Editors

Joshua Bell (Google, Inc)

Abstract

This specification defines an API for encoding strings to binary data, and decoding strings from binary data.

NOTE: This specification intentionally does not address the opposite scenario of encoding binary data as strings and decoding binary data from strings, for example using Base64 encoding.

Discussion on this topic began on the Khronos WebGL public mailing list. See http://www.khronos.org/webgl/public-mailing-list/archives/1111/msg00017.html for the initial discussion thread.

Discussion has since moved to the WHATWG spec discussion mailing list. See http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-March/035038.html for the latest discussion thread.

Open Issues

General: Rewrite in terms of http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html including

default encoding (?)
specification of encoding as fallback (i.e. BOM, if present, wins)
list of supported encodings
selection of encoding
Depending on API, setting "fatal flag" vs. using fallbacks
behavior of replacement characters ("fallback code point")

NOTE: This precludes "binary"; may want to add it.

Scenarios

Encode as many characters as possible into a fixed-size buffer for transmission, and repeat starting with next unencoded character
Parse/emit legacy data formats that do not use UTF-8
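
The first scenario can be sketched with an interface that reports how much input was consumed. The sketch below does not use the API proposed here: it assumes a runtime that provides TextEncoder with an encodeInto method (a separate, UTF-8-only interface), and the helper name encodeInChunks is hypothetical.

```javascript
// Hypothetical helper: encode `text` as UTF-8 into successive chunks of at
// most `chunkSize` bytes, never splitting a character across two chunks.
// Assumes a global TextEncoder with encodeInto (not part of this proposal).
function encodeInChunks(text, chunkSize) {
  var encoder = new TextEncoder();
  var scratch = new Uint8Array(chunkSize);
  var chunks = [];
  var pos = 0;

  while (pos < text.length) {
    // `read` counts UTF-16 code units consumed; `written` counts bytes stored.
    var result = encoder.encodeInto(text.slice(pos), scratch);
    if (result.read === 0) {
      throw new RangeError("chunkSize too small for the next character");
    }
    // Copy the filled part of the scratch buffer; it is reused next round.
    chunks.push(scratch.slice(0, result.written));
    pos += result.read;
  }
  return chunks;
}
```

Each returned chunk could be transmitted on its own. Resuming at the next unencoded character is what the read count makes possible, which is the information a "partial fill" result would need to carry.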

Desired Features

Allow arbitrary end byte sequences (e.g. 0xFF for UTF-8 strings)
Conversion errors should either produce replacement characters (U+FFFD, etc.), or the API should allow selecting between replacement and throwing, or provide some other means of reporting errors (e.g. a count)

API cleanup

Support two versions of encode; one which takes target buffer, and one which creates/returns a right-sized buffer

Spec Issues

Resolve behavior when writing to a buffer that's too small - partial fill or don't change?
Explicitly enumerate supported encodings
- Maximum list of encodings would be: http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

API

Scripts in pages access the API through the top-level window.stringEncoding object which holds methods for encoding/decoding strings. Worker scripts can similarly use the self.stringEncoding object. (Since window and self are the page and worker global object, respectively, scripts can simply refer to stringEncoding without a prefix.)

WebIDL

partial interface Window {
   readonly attribute StringEncoding stringEncoding;
};

partial interface WorkerUtils {
   readonly attribute StringEncoding stringEncoding;
};

The stringEncoding object exposes static methods for encoding strings to, and decoding strings from, objects containing binary data as specified in the Typed Array specification.

WebIDL

interface StringEncoding {

 DOMString decode(ArrayBufferView view,
                  optional DOMString encoding)
                        raises(DOMException);

 long stringLength(ArrayBufferView view,
                   optional DOMString encoding)
                       raises(DOMException);

 unsigned long encode(DOMString value, 
                      ArrayBufferView view,
                      optional DOMString encoding)
                          raises(DOMException);

 unsigned long encodedLength(DOMString value,
                             optional DOMString encoding)
                                 raises(DOMException);
};

For all methods that take an ArrayBufferView parameter, the view's byteOffset and byteLength attributes are used to offset and limit the encoding/decoding operation against the view's underlying ArrayBuffer buffer. Per the Typed Array Specification, reading/writing beyond the limits of the view raises an exception.
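
As an illustration of how a view limits the region that is read or written, the following sketch uses only the Typed Array types themselves. The variable names are illustrative.

```javascript
// A 16-byte buffer with a view that exposes only bytes 4..11.
var buffer = new ArrayBuffer(16);
var view = new DataView(buffer, 4, 8);

// Offsets passed to the view are relative to view.byteOffset,
// so offset 0 in the view is byte 4 in the underlying buffer.
view.setUint8(0, 0xAA);
var firstByteInBuffer = new Uint8Array(buffer)[4];

// Accessing past view.byteLength throws, even though the buffer is larger.
var threwOutOfRange = false;
try {
  view.getUint8(8);
} catch (e) {
  threwOutOfRange = e instanceof RangeError;
}
```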

`decode`

This method decodes a string from binary data, using a specified encoding.

The method performs the steps to decode a byte stream from Encoding, with the input stream provided by the byte data in view.buffer starting at offset view.byteOffset with length view.byteLength, and the input label from encoding if specified, "utf-8" otherwise.

The method returns a DOMString by encoding the stream of code points emitted by the steps as UTF-16 as per WebIDL.

The fatal flag defined in Encoding is not set.

If the decoding steps return failure (including if the specified encoding is not matched), an exception (TBD) is raised.

NOTE: U+0000 characters have no special meaning and are returned as part of the string.

ISSUE: Behavior if decoding stops inside a multi-byte sequence.
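
The behavior described above can be approximated as follows. This is a sketch under assumptions, not the proposed implementation: it relies on a runtime-provided TextDecoder (a separate interface that is not part of this proposal), and the name decodeSketch is hypothetical. Note that TextDecoder reports an unknown label with a RangeError, whereas the exception type here is still TBD.

```javascript
// Hypothetical stand-in for the proposed decode(); relies on TextDecoder,
// which is an assumption of this sketch and not part of this proposal.
function decodeSketch(view, encoding) {
  var label = encoding === undefined ? "utf-8" : encoding;
  // Restrict decoding to the bytes that the view exposes.
  var bytes = new Uint8Array(view.buffer, view.byteOffset, view.byteLength);
  // fatal: false mirrors "the fatal flag is not set": malformed input
  // yields U+FFFD instead of an exception.
  var decoder = new TextDecoder(label, { fatal: false });
  return decoder.decode(bytes);
}
```

For example, decoding the three octets 0x61 0x00 0x62 gives a three-character string whose middle character is U+0000, as the NOTE above states.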

`stringLength`

This method determines the length of a "null-terminated" string encoded in binary data, using a specified encoding.

This method performs the steps to decode a byte stream from Encoding, with the input stream provided by the byte data in view.buffer starting at offset view.byteOffset with length view.byteLength, and the input label from encoding if specified, "utf-8" otherwise.

As soon as the steps emit the code point U+0000, decoding is terminated and the byte pointer within the byte stream is returned. If decoding completes and no U+0000 code point was emitted, -1 is returned.

The fatal flag defined in Encoding is not set.

If the decoding steps return failure (including if the specified encoding is not matched), an exception (TBD) is raised.

NOTE: The byte sequence representing the terminator is encoding specific. For example, in UTF-16 encodings it would be the even-aligned two-octet sequence 0x00 0x00.

NOTE: If the encoded string includes a BOM, that is considered part of the length. For example, stringLength would return a length of 8 for the UTF-16BE sequence 0xFE 0xFF 0x00 0x41 0x00 0x42 0x00 0x43 0x00 0x00.

ISSUE: Add an optional unsigned short terminator member, defaults to 0?

ISSUE: To allow terminators which aren't code points (e.g. 0xFF in UTF-8), make the optional terminator either a code point (default 0) or an Array of octets (e.g. [ 0xFF, 0xFF ])?
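
A minimal sketch of the terminator search, assuming the input is well formed so that scanning for zero octets is equivalent to decoding. The name stringLengthSketch is hypothetical. Labels other than the UTF-16 variants are treated as single-octet-terminated (which holds for UTF-8), and no validation of the label or the data is attempted.

```javascript
// Hypothetical helper: return the byte offset of the first U+0000
// terminator within the view, or -1 if there is none.
function stringLengthSketch(view, encoding) {
  var label = (encoding === undefined ? "utf-8" : encoding).toLowerCase();
  var bytes = new Uint8Array(view.buffer, view.byteOffset, view.byteLength);
  // UTF-16 terminators are two zero octets on an even boundary;
  // for the other encodings a single 0x00 octet is assumed.
  var isUtf16 = label === "utf-16" || label === "utf-16le" || label === "utf-16be";
  var unit = isUtf16 ? 2 : 1;
  var i, j, allZero;

  for (i = 0; i + unit <= bytes.length; i += unit) {
    allZero = true;
    for (j = 0; j < unit; j += 1) {
      if (bytes[i + j] !== 0) {
        allZero = false;
      }
    }
    if (allZero) {
      return i;
    }
  }
  return -1;
}
```

Applied to the UTF-16BE sequence in the NOTE above (BOM, "ABC", terminator), the sketch returns 8, so the BOM counts towards the length and the terminator does not.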

`encode`

This method encodes a string into binary data, using a specified encoding.

The method performs the steps to encode a code point stream from Encoding, with the input code point stream provided by the DOMString value, and the input string label from encoding if specified, "utf-8" otherwise.

This specification requires that the code units within the DOMString are interpreted as UTF-16 code units. That is, to produce the code point stream a UTF-16 decoding operation must be performed to handle surrogate pairs.

ISSUE: Interpreting a DOMString as UTF-16 to yield a code point stream needs to be defined, including unpaired surrogates. WebIDL only defines the reverse.

If the encoding steps return failure (including if the specified encoding is not matched), an exception (TBD) is raised.

Otherwise, the output of the encoding steps is a stream of bytes. If the stream is longer than view.byteLength bytes, an exception (TBD) is raised.

If this method raises an exception for any reason, view.buffer MUST NOT be modified.

ISSUE: Do we need to specify the case where encoding fails early due to length, but would have failed later due to invalid data?

If this method does not raise an exception, the stream of bytes produced by the encoding steps is written to view.buffer starting at view.byteOffset, and the length of the stream of bytes is returned.

ISSUE: Would be nice to support "partial fill" and return an object with e.g. bytesWritten and charactersWritten properties.
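
The "MUST NOT be modified" requirement suggests encoding into temporary storage before touching the target. The following sketch shows one way to do that for UTF-8 only. It assumes a runtime-provided TextEncoder (not part of this proposal); the name encodeSketch and the exception types are placeholders, since the exceptions are still TBD.

```javascript
// Hypothetical stand-in for the proposed encode(), UTF-8 only.
function encodeSketch(value, view, encoding) {
  var label = (encoding === undefined ? "utf-8" : encoding).toLowerCase();
  if (label !== "utf-8") {
    throw new Error("this sketch supports only utf-8");
  }
  // With the u flag the string is matched by code point, so a surrogate
  // code unit matches this class only when it is unpaired.
  if (/[\uD800-\uDFFF]/u.test(value)) {
    throw new Error("unpaired surrogate cannot be encoded as UTF-8");
  }
  // Encode into a temporary array first, so that any failure
  // leaves view.buffer untouched.
  var encoded = new TextEncoder().encode(value);
  if (encoded.length > view.byteLength) {
    throw new RangeError("encoded string does not fit in the view");
  }
  new Uint8Array(view.buffer, view.byteOffset, view.byteLength).set(encoded);
  return encoded.length;
}
```

Because both checks run before the copy, the order in which length and validity failures are detected does not affect the contents of the buffer, only which exception is reported.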

`encodedLength`

This method determines the byte length of an encoded string, using a specified encoding.

The method performs the steps to encode a code point stream from Encoding, with the input code point stream provided by the DOMString value, and the input string label from encoding if specified, "utf-8" otherwise.

This specification requires that the code units within the DOMString are interpreted as UTF-16 code units. That is, to produce the code point stream a UTF-16 decoding operation must be performed to handle surrogate pairs.

ISSUE: Interpreting a DOMString as UTF-16 to yield a code point stream needs to be defined, including unpaired surrogates. WebIDL only defines the reverse.

If the encoding steps return failure (including if the specified encoding is not matched), an exception (TBD) is raised.

If this method does not raise an exception, the length of the stream of bytes produced by the encoding steps is returned. The stream of bytes itself is not used. Implementations MAY therefore optimize to not produce an actual stream, or determine the length using other means for certain encodings, if the results are indistinguishable from those of performing the steps.
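
As an example of the optimization permitted above, the UTF-8 length of a string can be computed from its code points without producing any bytes. This is a sketch; the name utf8EncodedLengthSketch is hypothetical, and treating an unpaired surrogate as an error follows the UTF-8 encode rule in the Encodings section.

```javascript
// Hypothetical helper: byte length of `value` when encoded as UTF-8,
// computed without building the encoded byte stream.
function utf8EncodedLengthSketch(value) {
  var length = 0;
  // for...of iterates by code point, joining valid surrogate pairs.
  for (var ch of value) {
    var cp = ch.codePointAt(0);
    if (cp >= 0xD800 && cp <= 0xDFFF) {
      throw new Error("unpaired surrogate cannot be encoded as UTF-8");
    }
    // 1 byte up to U+007F, 2 up to U+07FF, 3 up to U+FFFF, otherwise 4.
    length += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
  }
  return length;
}
```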

Examples

Example #1 - encoding strings

The following example uses the API to encode an array of strings into an ArrayBuffer. The buffer contains the number of strings (as a Uint32), followed by the length of the first string (as a Uint32), the UTF-8 encoded string data, the length of the second string (as a Uint32), the string data, and so on.

function encodeArrayOfStrings(strings) {
  var len, i, bytes, view, offset;

  // First pass: compute the total size (the string count, then a length
  // prefix plus the UTF-8 bytes for each string).
  len = Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < strings.length; i += 1) {
    len += Uint32Array.BYTES_PER_ELEMENT;
    len += stringEncoding.encodedLength(strings[i], "utf-8");
  }

  bytes = new Uint8Array(len);
  view = new DataView(bytes.buffer);
  offset = 0;

  // Second pass: write the count, then each length-prefixed string.
  // DataView.setUint32 writes big-endian unless told otherwise.
  view.setUint32(offset, strings.length);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < strings.length; i += 1) {
    // Encode the string just past the slot reserved for its length prefix.
    len = stringEncoding.encode(strings[i],
                                new DataView(bytes.buffer, offset + Uint32Array.BYTES_PER_ELEMENT),
                                "utf-8");
    view.setUint32(offset, len);
    offset += Uint32Array.BYTES_PER_ELEMENT + len;
  }
  return bytes.buffer;
}

Example #2 - decoding strings

The following example decodes an ArrayBuffer containing data encoded in the format produced by the previous example back into an array of strings.

function decodeArrayOfStrings(buffer) {
  var view, offset, num_strings, strings, i, len;

  view = new DataView(buffer);
  offset = 0;
  strings = [];

  // Read the string count, then each length-prefixed UTF-8 string.
  num_strings = view.getUint32(offset);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < num_strings; i += 1) {
    len = view.getUint32(offset);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    // Limit the view to exactly this string's bytes.
    strings[i] = stringEncoding.decode(new DataView(buffer, offset, len),
                                       "utf-8");
    offset += len;
  }
  return strings;
}

Encodings

Encoding names are case-insensitive.

Standard Encodings

Implementations MUST support all of the following encodings:

ISSUE: Should anything be said about canonical forms in the Unicode encodings?

Implementations MUST NOT support encodings not listed in the specification, to avoid interoperability issues.
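
A sketch of how case-insensitive name matching and rejection of unlisted encodings might look. The table simply mirrors the encodings named in this section. The names SKETCH_ENCODINGS and resolveEncodingSketch, and the use of a plain Error, are assumptions for illustration.

```javascript
// Lower-cased names mapped to the encoding they select.
var SKETCH_ENCODINGS = {
  "ascii": "ascii",
  "iso-8859-1": "iso-8859-1",
  "binary": "binary",
  "utf-8": "utf-8",
  "utf-16": "utf-16le", // UTF-16 is an alias for UTF-16LE
  "utf-16le": "utf-16le",
  "utf-16be": "utf-16be"
};

// Hypothetical helper: resolve a caller-supplied name, defaulting to UTF-8.
function resolveEncodingSketch(label) {
  var key = String(label === undefined ? "utf-8" : label).toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(SKETCH_ENCODINGS, key)) {
    throw new Error("unsupported encoding: " + label);
  }
  return SKETCH_ENCODINGS[key];
}
```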

`ASCII`

decode: exception thrown if any octet in array is greater than 0x7F
encode: exception thrown if value string contains a character beyond U+007F

`ISO-8859-1`

decode: No encoding-specific exceptions thrown
encode: exception thrown if value string contains a character beyond U+00FF

`BINARY`

decode: No encoding-specific exceptions are thrown
encode: exception thrown if value string contains a character beyond U+00FF

NOTE: ISO-8859-1 and BINARY are functionally identical in this specification. Both are included so that callers can be more explicit about the type of data being handled. Storing binary data in ECMAScript strings, one byte per character, was a common approach before Typed Array support was available. This "BINARY" encoding allows for easy interoperation with this legacy style of binary storage.
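
The legacy "one byte per character" convention that BINARY targets can be sketched directly with Typed Arrays. The function names are hypothetical and RangeError stands in for the unspecified exception.

```javascript
// Hypothetical helper: pack a legacy binary string into bytes,
// rejecting any character beyond U+00FF.
function binaryStringToBytes(str) {
  var bytes = new Uint8Array(str.length);
  for (var i = 0; i < str.length; i += 1) {
    var code = str.charCodeAt(i);
    if (code > 0xFF) {
      throw new RangeError("character beyond U+00FF at index " + i);
    }
    bytes[i] = code;
  }
  return bytes;
}

// Hypothetical helper: the reverse mapping, which cannot fail.
function bytesToBinaryString(bytes) {
  var result = "";
  for (var i = 0; i < bytes.length; i += 1) {
    result += String.fromCharCode(bytes[i]);
  }
  return result;
}
```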

`UTF-8`

decode: BOM accepted (0xEF 0xBB 0xBF), exception thrown on invalid/truncated UTF-8 sequence; non-BMP characters in the UTF-8 encoded string yield UTF-16 surrogate pairs in the DOMString. BOM is not returned as part of the DOMString.
encode: BOM is not written. Exception (TBD) thrown when there is no valid UTF-8 encoding of the string (e.g. "abc\uD800def" which contains a UTF-16 "surrogate half")

ISSUE: Are over-long UTF-8 sequences allowed?
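
For comparison, the decode rules listed above resemble what a runtime-provided TextDecoder does in fatal mode. The sketch below assumes such a TextDecoder is available (it is not part of this proposal); tryStrictUtf8 is a hypothetical helper that returns null where an exception was thrown.

```javascript
// Hypothetical helper: strict UTF-8 decode, null on any decoding error.
function tryStrictUtf8(octets) {
  try {
    // A fresh decoder per call; fatal mode throws on malformed input.
    return new TextDecoder("utf-8", { fatal: true }).decode(new Uint8Array(octets));
  } catch (e) {
    return null;
  }
}

var bomStripped = tryStrictUtf8([0xEF, 0xBB, 0xBF, 0x41]);   // BOM is not returned
var truncatedSeq = tryStrictUtf8([0xE2, 0x82]);              // incomplete sequence
var overLongSeq = tryStrictUtf8([0xC0, 0x80]);               // over-long form of U+0000
var surrogatePair = tryStrictUtf8([0xF0, 0x9F, 0x98, 0x80]); // U+1F600
```

In that decoder, over-long sequences are rejected, which is one possible answer to the ISSUE above.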

`UTF-16`

UTF-16 is an alias for UTF-16LE.

`UTF-16LE`

decode: BOM not required, but accepted (0xFF 0xFE); throws if incorrect BOM found or overall length is odd number of bytes. BOM is not returned as part of the DOMString.
encode: does not write a BOM

ISSUE: throw if invalid surrogate pair encountered?

`UTF-16BE`

decode: BOM not required, but accepted (0xFE 0xFF); throws if incorrect BOM found or overall length is odd number of bytes. BOM is not returned as part of the DOMString.
encode: does not write a BOM

ISSUE: throw if invalid surrogate pair encountered?
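
The UTF-16LE and UTF-16BE decode rules can be sketched with a DataView. This is an illustration only: decodeUtf16Sketch is a hypothetical name, the exception types are placeholders, and surrogate pairs are passed through unchecked, which leaves the ISSUE above open.

```javascript
// Hypothetical helper: decode UTF-16 from a view in the given byte order.
function decodeUtf16Sketch(view, littleEndian) {
  if (view.byteLength % 2 !== 0) {
    throw new RangeError("UTF-16 data must be an even number of bytes");
  }
  var data = new DataView(view.buffer, view.byteOffset, view.byteLength);
  var result = "";
  for (var i = 0; i < data.byteLength; i += 2) {
    var unit = data.getUint16(i, littleEndian);
    if (i === 0 && unit === 0xFEFF) {
      continue; // BOM in the expected byte order: accepted, not returned
    }
    if (i === 0 && unit === 0xFFFE) {
      throw new Error("BOM indicates the opposite byte order");
    }
    result += String.fromCharCode(unit);
  }
  return result;
}
```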

Acknowledgements

Alan Chaney
Ben Noordhuis
Glenn Maynard
John Tamplin
Kenneth Russell (Google, Inc)
Robert Mustacchi
Ryan Dahl

Appendix

A "shim" implementation in JavaScript (that may not fully match the current version of the spec) plus some initial unit tests can be found at:

http://code.google.com/p/stringencoding/

