StringEncoding: Difference between revisions

Revision as of 17:36, 4 November 2011

Proposed String Encoding API for Typed Arrays

Editors

Joshua Bell (Google, Inc)

Abstract

This specification defines an API for encoding strings to binary data, and decoding strings from binary data.

NOTE: This specification intentionally does not address the opposite scenario of encoding binary data as strings and decoding binary data from strings, for example using Base64 encoding.

API

Scripts in pages access the API through the top-level window.stringEncoding object, which holds methods for encoding and decoding strings. Worker scripts can similarly use the self.stringEncoding object. (Since window and self are the page and worker global objects, respectively, scripts can simply refer to stringEncoding without a prefix.)

WebIDL

partial interface Window {
   readonly attribute StringEncoding stringEncoding;
};

partial interface WorkerUtils {
   readonly attribute StringEncoding stringEncoding;
};

The stringEncoding object exposes static methods for encoding strings to, and decoding strings from, objects containing binary data as specified in the Typed Array specification.

WebIDL

interface [
 OmitConstructor
] StringEncoding {

 static DOMString decode(in any array,
                        unsigned long byteOffset,
                        long byteLength,
                        DOMString encoding)
                            raises(DOMException);

 static DOMString detectEncoding(in any array,
                        unsigned long byteOffset,
                        unsigned long byteLength)
                            raises(DOMException);

 static unsigned long encode(in any array,
                            unsigned long byteOffset,
                            DOMString encoding,
                            DOMString value)
                                raises(DOMException);

 static unsigned long encodedLength(DOMString encoding,
                                    DOMString value)
                                        raises(DOMException);
};

decode

This method decodes a string at the given byteOffset and byteLength, using the specified encoding. The array parameter must be an ArrayBuffer or DataView, otherwise a TypeError exception is raised. If array is a DataView, the byteOffset is in addition to any offset between the DataView and the underlying ArrayBuffer.

Data is decoded starting at byteOffset. If byteLength is a negative number, decoding continues until a U+0000 character is decoded; the terminal U+0000 character MUST NOT be included in the returned string. If byteLength is a positive number, decoding continues until byteLength bytes have been processed; in this case U+0000 characters have no special meaning and are returned as part of the string.

NOTE: In the "null terminated" case, the terminator is encoding specific. For example, in UTF-16 encodings it would be the even-aligned two-octet sequence 0x00 0x00.

NOTE: If the encoded string includes a BOM, the BOM is considered part of the length. For example, to decode the UTF-16BE sequence 0xFE 0xFF 0x00 0x41 0x00 0x42 0x00 0x43 as the string ABC, a byteLength of 8 must be specified.

An exception (TBD) is raised if the method would read past the end of the underlying buffer. If the binary data or data range is not valid according to the specified encoding, an exception (TBD) is raised. If the specified encoding is not known, an exception (TBD) is raised.
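
The length and terminator rules can be illustrated for a single encoding. The following is a minimal sketch, assuming a hypothetical helper named decodeUtf16BESketch that is not part of this proposal. It accepts only a DataView and throws plain Error objects, since the exception types are still TBD.

```javascript
// Hypothetical helper, not part of the proposed API: illustrates the
// byteLength rules of decode() for UTF-16BE only.
function decodeUtf16BESketch(view, byteOffset, byteLength) {
  var pos = byteOffset, end, units = [], unit;

  if (byteLength < 0) {
    // "Null terminated" case: the end is discovered while scanning.
    end = view.byteLength;
  } else {
    end = byteOffset + byteLength;
    if (end > view.byteLength) {
      throw new Error("read past the end of the buffer");
    }
    if (byteLength % 2 !== 0) {
      throw new Error("odd number of bytes");
    }
  }

  // An optional BOM (0xFE 0xFF) counts toward byteLength but is not
  // returned as part of the string. A little-endian BOM is an error here.
  if (pos + 2 <= end) {
    unit = view.getUint16(pos); // DataView reads big-endian by default
    if (unit === 0xFEFF) {
      pos += 2;
    } else if (unit === 0xFFFE) {
      throw new Error("incorrect BOM for UTF-16BE");
    }
  }

  while (pos + 2 <= end) {
    unit = view.getUint16(pos);
    pos += 2;
    if (byteLength < 0 && unit === 0) {
      // Even-aligned 0x00 0x00 terminator: stop and drop it.
      return String.fromCharCode.apply(null, units);
    }
    units.push(unit);
  }
  if (byteLength < 0) {
    throw new Error("terminator not found");
  }
  return String.fromCharCode.apply(null, units);
}
```

With the eight-octet sequence from the note above followed by 0x00 0x00, both a byteLength of 8 and a negative byteLength should yield ABC under this sketch.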

detectEncoding

This method attempts to determine the encoding of the string that starts at byteOffset in array and is byteLength bytes long. The array parameter must be an ArrayBuffer or DataView, otherwise a TypeError exception is raised. If array is a DataView, the byteOffset is in addition to any offset between the DataView and the underlying ArrayBuffer. An exception (TBD) is raised if the method would read past the end of the underlying buffer.

TODO: outline subset of HTML5 "encoding sniffing algorithm" to use.

The return value is a DOMString containing an encoding name suitable for use with the other methods. If the encoding cannot be determined with a high level of confidence, the method must return the empty string.
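
Because the detection algorithm is still a TODO, the following is only a sketch of one possible subset: BOM sniffing. The function name detectEncodingSketch, the choice of returned names, and the use of a plain Error for the bounds check are assumptions made for illustration.

```javascript
// Hypothetical sketch: BOM-only detection. The proposal leaves the real
// algorithm unspecified, so this is an assumption, not specified behavior.
function detectEncodingSketch(array, byteOffset, byteLength) {
  var view, b0, b1, b2;

  if (array instanceof ArrayBuffer) {
    view = new DataView(array);
  } else if (array instanceof DataView) {
    view = array; // offsets below are relative to the view itself
  } else {
    throw new TypeError("array must be an ArrayBuffer or DataView");
  }
  if (byteOffset + byteLength > view.byteLength) {
    throw new Error("read past the end of the buffer");
  }

  // Read up to three leading bytes; -1 marks "not available".
  b0 = byteLength > 0 ? view.getUint8(byteOffset) : -1;
  b1 = byteLength > 1 ? view.getUint8(byteOffset + 1) : -1;
  b2 = byteLength > 2 ? view.getUint8(byteOffset + 2) : -1;

  if (b0 === 0xEF && b1 === 0xBB && b2 === 0xBF) { return "UTF-8"; }
  if (b0 === 0xFE && b1 === 0xFF) { return "UTF-16BE"; }
  if (b0 === 0xFF && b1 === 0xFE) { return "UTF-16LE"; }
  return ""; // encoding not determined with high confidence
}
```

Data without a BOM yields the empty string in this sketch, which matches the requirement to return the empty string when confidence is low.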

NOTE: A length must be specified for detectEncoding as null termination is dependent on the encoding.

encode

Encodes the string value into the specified array at the given byteOffset, using the specified encoding. The array parameter must be an ArrayBuffer or DataView, otherwise a TypeError exception is raised. If array is a DataView, the byteOffset is in addition to any offset between the DataView and its underlying ArrayBuffer.

The return value is the length of the encoded string, in bytes. No "null terminator" is added to the string, although a trailing \x00 character in the string will be encoded if present and if it can be expressed in the specified encoding.

An exception (TBD) is raised if the method would write past the end of the underlying buffer. If value cannot be encoded with the specified encoding, an exception (TBD) is raised. If the specified encoding is not known, an exception (TBD) is raised. If an exception is thrown by this method, the target buffer MUST NOT be changed.
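
One way to honor the requirement that a failed encode leaves the target untouched is to validate everything before the first write. The sketch below shows this for ASCII only; encodeAsciiSketch is a hypothetical helper, not part of the proposed API, and it throws plain Error objects because the exception types are TBD.

```javascript
// Hypothetical helper, not part of the proposed API: ASCII-only encode
// that validates everything before touching the target buffer.
function encodeAsciiSketch(view, byteOffset, value) {
  var i;

  // Check bounds first so a failure leaves the buffer unchanged.
  if (byteOffset + value.length > view.byteLength) {
    throw new Error("write past the end of the buffer");
  }
  // Check every character before writing any of them.
  for (i = 0; i < value.length; i += 1) {
    if (value.charCodeAt(i) > 0x7F) {
      throw new Error("character beyond U+007F");
    }
  }
  // Only now is the buffer modified.
  for (i = 0; i < value.length; i += 1) {
    view.setUint8(byteOffset + i, value.charCodeAt(i));
  }
  return value.length; // encoded length in bytes; no terminator is added
}
```

For variable-width encodings the same effect could be achieved by encoding into a temporary array and copying it into place only after all checks pass.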

ISSUE: Alternately, we could allow "partial fill" and return an object with bytesWritten and charactersWritten properties.

encodedLength

Computes and returns the length, in bytes, of the string value if it were to be encoded using the specified encoding. If value cannot be encoded with the specified encoding an exception (TBD) is raised. If the specified encoding is not known, an exception (TBD) is raised.

NOTE: If the encoding includes a BOM, the length of the BOM is included. For example, the string ABC may be encoded in UTF-16 as the octets 0xFE 0xFF 0x00 0x41 0x00 0x42 0x00 0x43 and have a length of 8.
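
For UTF-8, the length can be computed without producing any bytes. The following is a sketch under the assumption that unpaired surrogates are rejected, as described for the UTF-8 encoding in this document; utf8EncodedLengthSketch is a hypothetical name and the plain Error stands in for the TBD exception.

```javascript
// Hypothetical helper: byte length of `value` when encoded as UTF-8.
// No BOM is counted, since the UTF-8 encoder does not write one.
function utf8EncodedLengthSketch(value) {
  var len = 0, i, code, next;

  for (i = 0; i < value.length; i += 1) {
    code = value.charCodeAt(i);
    if (code < 0x80) {
      len += 1; // U+0000..U+007F
    } else if (code < 0x800) {
      len += 2; // U+0080..U+07FF
    } else if (code >= 0xD800 && code <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      next = i + 1 < value.length ? value.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) {
        throw new Error("unpaired surrogate");
      }
      len += 4; // the pair encodes one non-BMP code point
      i += 1;
    } else if (code >= 0xDC00 && code <= 0xDFFF) {
      throw new Error("unpaired surrogate"); // low surrogate with no lead
    } else {
      len += 3; // remaining BMP characters
    }
  }
  return len;
}
```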

Examples

Example #1 - encoding strings

The following example uses the API to encode an array of strings into an ArrayBuffer. The result contains the number of strings (as a Uint32), followed by the length in bytes of the first string (as a Uint32), the UTF-8 encoded string data, the length of the second string (as a Uint32), the string data, and so on. The Uint32 values are written through a DataView and are therefore big-endian.

function encodeArrayOfStrings(strings) {
  var len, i, bytes, view, offset;

  len = 4;
  for (i = 0; i < strings.length; i += 1) {
    len += 4;
    len += stringEncoding.encodedLength("UTF-8", strings[i]);
  }

  bytes = new Uint8Array(len);
  view = new DataView(bytes.buffer);
  offset = 0;

  view.setUint32(offset, strings.length);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < strings.length; i += 1) {
    len = stringEncoding.encode(view,
                                offset + Uint32Array.BYTES_PER_ELEMENT,
                                "UTF-8", strings[i]);
    view.setUint32(offset, len);
    offset += Uint32Array.BYTES_PER_ELEMENT + len;
  }
  return bytes.buffer;
}

Example #2 - decoding strings

The following example decodes an ArrayBuffer containing data encoded in the format produced by the previous example back into an array of strings.

function decodeArrayOfStrings(buffer) {
  var view, offset, num_strings, strings, i, len;

  view = new DataView(buffer);
  offset = 0;
  strings = [];

  num_strings = view.getUint32(offset);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < num_strings; i += 1) {
    len = view.getUint32(offset);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    strings[i] = stringEncoding.decode(view, offset, len,
                                             "UTF-8");
    offset += len;
  }
  return strings;
}
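
To experiment with the two examples in an environment that lacks stringEncoding, a stand-in along the following lines might be used. It is a sketch under several assumptions: it supports UTF-8 only, accepts only DataView targets, relies on TextEncoder and TextDecoder being available, and uses the made-up names requireUtf8 and stringEncodingStandIn.

```javascript
// Hypothetical stand-in for experimentation only. Assumptions: UTF-8 is the
// only encoding, targets are DataViews, and TextEncoder/TextDecoder exist.
function requireUtf8(encoding) {
  // Encoding names are case-insensitive.
  if (String(encoding).toUpperCase() !== "UTF-8") {
    throw new Error("unknown encoding: " + encoding);
  }
}

var stringEncodingStandIn = {
  encodedLength: function (encoding, value) {
    requireUtf8(encoding);
    // Unlike the proposal, TextEncoder replaces unpaired surrogates with
    // U+FFFD instead of throwing.
    return new TextEncoder().encode(value).length;
  },

  encode: function (view, byteOffset, encoding, value) {
    var bytes;
    requireUtf8(encoding);
    bytes = new TextEncoder().encode(value);
    if (byteOffset + bytes.length > view.byteLength) {
      throw new Error("write past the end of the buffer");
    }
    // Offsets are relative to the DataView, not the underlying buffer.
    new Uint8Array(view.buffer, view.byteOffset + byteOffset,
                   bytes.length).set(bytes);
    return bytes.length;
  },

  decode: function (view, byteOffset, byteLength, encoding) {
    requireUtf8(encoding);
    if (byteLength < 0) {
      throw new Error("null-terminated decoding is not sketched here");
    }
    if (byteOffset + byteLength > view.byteLength) {
      throw new Error("read past the end of the buffer");
    }
    // fatal: true makes invalid or truncated UTF-8 throw.
    return new TextDecoder("utf-8", { fatal: true }).decode(
      new Uint8Array(view.buffer, view.byteOffset + byteOffset, byteLength));
  }
};
```

Assigning this object to a variable named stringEncoding before calling encodeArrayOfStrings and decodeArrayOfStrings should be enough to exercise both examples.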

Encodings

Encoding names are case-insensitive.

Standard Encodings

Implementations MUST support all of the following encodings:

ASCII

decode: exception thrown if any octet in array is greater than 0x7F
encode: exception thrown if value string contains a character beyond U+007F

ISO-8859-1

decode: No encoding-specific exceptions thrown
encode: exception thrown if value string contains a character beyond U+00FF

BINARY

decode: No encoding-specific exceptions are thrown
encode: exception thrown if value string contains a character beyond U+00FF

NOTE: ISO-8859-1 and BINARY are functionally identical in this specification. Both are included so that callers can be more explicit about the type of data being handled. Storing binary data in ECMAScript strings, one byte per character, was a common approach before Typed Array support was available. This "BINARY" encoding allows for easy interoperation with this legacy style of binary storage.
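
The legacy convention mentioned in the note, one byte per character, can be sketched in a few lines. The helper names binaryStringToBytes and bytesToBinaryString are hypothetical and are not part of this proposal; they only illustrate the mapping that the BINARY encoding is meant to interoperate with.

```javascript
// Hypothetical helpers illustrating the legacy "binary string" convention:
// one byte per character, with code units limited to U+0000..U+00FF.
function binaryStringToBytes(value) {
  var bytes = new Uint8Array(value.length), i, code;
  for (i = 0; i < value.length; i += 1) {
    code = value.charCodeAt(i);
    if (code > 0xFF) {
      throw new Error("character beyond U+00FF");
    }
    bytes[i] = code;
  }
  return bytes;
}

function bytesToBinaryString(bytes) {
  var s = "", i;
  for (i = 0; i < bytes.length; i += 1) {
    s += String.fromCharCode(bytes[i]); // every octet maps to one character
  }
  return s;
}
```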

UTF-8

decode: BOM accepted (0xEF 0xBB 0xBF), exception thrown on invalid/truncated UTF-8 sequence; non-BMP characters in the UTF-8 encoded string yield UTF-16 surrogate pairs in the DOMString
encode: BOM is not written. Exception (TBD) thrown when there is no valid UTF-8 encoding of the string (e.g. "abc\uD800def" which contains a UTF-16 "surrogate half")
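
The decode rules above can be observed with TextDecoder, which is used here only as a stand-in with comparable UTF-8 behavior; it is an assumption for illustration and not the API proposed in this document. The function name decodeUtf8Sketch is likewise hypothetical.

```javascript
// TextDecoder is a stand-in with similar UTF-8 rules, not the proposed API.
function decodeUtf8Sketch(bytes) {
  // fatal: true makes invalid or truncated sequences throw.
  // A leading BOM (0xEF 0xBB 0xBF) is accepted and dropped by default.
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

// BOM followed by the four-octet encoding of U+1F600, a non-BMP character.
var bomAndEmoji = new Uint8Array([0xEF, 0xBB, 0xBF, 0xF0, 0x9F, 0x98, 0x80]);
var decodedEmoji = decodeUtf8Sketch(bomAndEmoji);
// decodedEmoji holds a surrogate pair: two UTF-16 code units, no BOM.
```

Dropping the final octet of the sequence makes the input truncated, and the same call then throws instead of returning a string.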

UTF-16

decode: exception thrown if BOM not present
encode: outputs a BOM prefix; can be either LE or BE. Implementations may choose to always use the same endianness, or may match the machine architecture for better performance. Callers should not make assumptions about the endianness, and should use the UTF-16BE or UTF-16LE encodings if a specific endianness is desired.
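
The role of the BOM can be sketched as follows. Both helpers are hypothetical: encodeUtf16WithBomSketch takes a littleEndian flag that stands in for the implementation's free choice of byte order, and utf16ByteOrderSketch shows how a decoder might use the BOM to pick the byte order, throwing a plain Error where the proposal's exception type is TBD.

```javascript
// Hypothetical helper: encode `value` as UTF-16 with a BOM prefix.
// `littleEndian` stands in for the implementation's choice of byte order.
function encodeUtf16WithBomSketch(value, littleEndian) {
  var bytes = new Uint8Array(2 + 2 * value.length),
      view = new DataView(bytes.buffer),
      i;

  view.setUint16(0, 0xFEFF, littleEndian); // BOM: FE FF (BE) or FF FE (LE)
  for (i = 0; i < value.length; i += 1) {
    view.setUint16(2 + 2 * i, value.charCodeAt(i), littleEndian);
  }
  return bytes;
}

// Hypothetical helper: pick the byte order from the BOM, as a decoder for
// the "UTF-16" encoding would have to do.
function utf16ByteOrderSketch(bytes) {
  if (bytes.length >= 2) {
    if (bytes[0] === 0xFE && bytes[1] === 0xFF) { return "UTF-16BE"; }
    if (bytes[0] === 0xFF && bytes[1] === 0xFE) { return "UTF-16LE"; }
  }
  throw new Error("BOM not present");
}
```

Encoding ABC this way produces eight octets in either byte order, since the two-octet BOM is part of the output.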

UTF-16LE

decode: BOM not required, but accepted (0xFF 0xFE); throws if incorrect BOM found or overall length is odd number of bytes
encode: does not write a BOM

ISSUE: throw if invalid surrogate pair encountered?

UTF-16BE

decode: BOM not required, but accepted (0xFE 0xFF); throws if incorrect BOM found or overall length is odd number of bytes
encode: does not write a BOM

ISSUE: throw if invalid surrogate pair encountered?

Other Encodings

Browsers MAY support additional encodings.

TODO: Suggest other encodings - the suggested default encodings table from http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#encoding-sniffing-algorithm may prove handy.

Acknowledgements

Alan Chaney
Ben Noordhuis
Kenneth Russell (Google, Inc)
Robert Mustacchi
Ryan Dahl

Appendix

A "shim" implementation of the API in JavaScript can be found at:

https://gist.github.com/1339793
