StringEncoding: Difference between revisions

From WHATWG Wiki

Revision as of 19:52, 14 March 2012

Proposed String Encoding API for Typed Arrays

Editors

  • Joshua Bell (Google, Inc)

Abstract

This specification defines an API for encoding strings to binary data, and decoding strings from binary data.

NOTE: This specification intentionally does not address the opposite scenario of encoding binary data as strings and decoding binary data from strings, for example using Base64 encoding.

Discussion on this topic has so far taken place on the [email protected] mailing list. See http://www.khronos.org/webgl/public-mailing-list/archives/1111/msg00017.html for the initial discussion thread.

Discussion has since moved to the WHATWG spec discussion mailing list. See http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-March/035038.html for the latest discussion thread.

Open Issues

General: Rewrite in terms of http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

Scenarios

  • Encode as many characters as possible into a fixed-size buffer for transmission, and repeat starting with next unencoded character
  • Parse/emit legacy data formats that do not use UTF-8
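
The first scenario can be sketched with facilities that exist today. The following is an illustration only: it uses TextEncoder.encodeInto, which is not part of this proposal, as a stand-in, and the name encodeInChunks is hypothetical.

```javascript
// Sketch only: TextEncoder.encodeInto stands in for the proposed API.
// Yields UTF-8 chunks of at most chunkSize bytes, never splitting a character.
function* encodeInChunks(text, chunkSize) {
  const encoder = new TextEncoder();
  const chunk = new Uint8Array(chunkSize);
  let position = 0; // index of the next unencoded UTF-16 code unit

  while (position < text.length) {
    // read = code units consumed, written = bytes placed in chunk
    const { read, written } = encoder.encodeInto(text.slice(position), chunk);
    if (read === 0) {
      throw new RangeError("chunkSize is too small for the next character");
    }
    yield chunk.slice(0, written); // copy, because chunk is reused
    position += read;
  }
}
```

Each yielded chunk could be transmitted before the next one is produced. Note that encodeInto replaces lone surrogates with U+FFFD rather than reporting an error.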

Desired Features

  • Allow arbitrary end byte sequences (e.g. 0xFF for UTF-8 strings)
  • Conversion errors should either produce replacement characters (U+FFFD, etc) or API should allow selection between replacement vs. throwing, or some other means of reporting errors (e.g. a count)

API cleanup

  • Support two versions of encode; one which takes target buffer, and one which creates/returns a right-sized buffer

Spec Issues

  • Resolve behavior when writing to a buffer that's too small - partial fill or don't change?
  • UTF-16 default endianness should be specified, not left to implementations
  • Do not include UTF-16 BOMs
  • Explicitly enumerate supported encodings

API

Scripts in pages access the API through the top-level window.stringEncoding object, which holds methods for encoding and decoding strings. Worker scripts can similarly use the self.stringEncoding object. (Since window and self are the page and worker global objects, respectively, scripts can simply refer to stringEncoding without a prefix.)

WebIDL

partial interface Window {
   readonly attribute StringEncoding stringEncoding;
};

partial interface WorkerUtils {
   readonly attribute StringEncoding stringEncoding;
};

The stringEncoding object exposes static methods for encoding strings into, and decoding strings from, objects containing binary data as specified in the Typed Array specification.

WebIDL

interface StringEncoding {

 DOMString decode(ArrayBufferView view,
                  optional DOMString encoding)
                        raises(DOMException);

 long stringLength(ArrayBufferView view,
                   optional DOMString encoding)
                       raises(DOMException);

 unsigned long encode(DOMString value, 
                      ArrayBufferView view,
                      optional DOMString encoding)
                          raises(DOMException);

 unsigned long encodedLength(DOMString value,
                             optional DOMString encoding)
                                 raises(DOMException);
};

decode

This method decodes a string from the ArrayBufferView view, using the specified encoding. The view's byteOffset and byteLength attributes are used to offset and limit the decoding operation against the view's underlying ArrayBuffer buffer. If the encoding parameter is omitted it defaults to "utf-8".

NOTE: U+0000 characters have no special meaning and are returned as part of the string.
NOTE: If the encoded string includes a BOM, the BOM counts toward the length. For example, to decode the UTF-16BE sequence 0xFE 0xFF 0x00 0x41 0x00 0x42 0x00 0x43 as the string ABC, a byteLength of 8 must be specified.

If the binary data or length is not valid according to the specified encoding an exception (TBD) is raised. If the specified encoding is not known, an exception (TBD) is raised.

ISSUE: If a BOM is present, is it returned as part of the string? FileReader.readAsText() strips the BOM.
ISSUE: Behavior if decoding stops inside a multi-byte sequence.
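
The role of byteOffset and byteLength, and the treatment of U+0000, can be illustrated with a minimal sketch. It assumes TextDecoder (not part of this proposal) as a stand-in for UTF-8 decoding; decodeUtf8 is a hypothetical name, and the exception type shown is TextDecoder's, not one chosen by this specification.

```javascript
// Hypothetical stand-in for stringEncoding.decode(view, "utf-8").
function decodeUtf8(view) {
  // fatal: true makes invalid input throw (a TypeError) instead of
  // producing U+FFFD replacement characters.
  return new TextDecoder("utf-8", { fatal: true }).decode(view);
}

const bytes = new Uint8Array([0xFF, 0x41, 0x00, 0x42, 0xFF]);

// Decode only the three bytes in the middle: byteOffset 1, byteLength 3.
// The 0x00 byte is returned as an ordinary U+0000 character.
const middle = new DataView(bytes.buffer, 1, 3);
const text = decodeUtf8(middle);
```

Decoding the whole of bytes instead would throw, because 0xFF never appears in valid UTF-8.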

stringLength

This method determines the length of a "null-terminated" string in the given view, using the specified encoding. If the encoding parameter is omitted it defaults to "utf-8".

Decoding proceeds until a U+0000 character is decoded or view.byteLength bytes have been processed. If a U+0000 character is decoded, the number of bytes processed, not including those representing the U+0000 character, is returned. If byteLength bytes are processed without a U+0000 character being decoded, -1 is returned.

NOTE: The byte sequence representing the terminator is encoding-specific. For example, in UTF-16 encodings it would be the even-aligned two-octet sequence 0x00 0x00.
NOTE: If the encoded string includes a BOM, the BOM counts toward the length. For example, stringLength would return a length of 8 for the UTF-16BE sequence 0xFE 0xFF 0x00 0x41 0x00 0x42 0x00 0x43 0x00 0x00.

If the specified encoding is not known, an exception (TBD) is raised. If the specified encoding does not support encoding U+0000 an exception (TBD) is raised.

ISSUE: Add an optional unsigned short terminator member, defaults to 0?
ISSUE: To allow terminators which aren't code points (e.g. 0xFF in UTF-8), make the optional terminator either a code point (default 0) or an Array of octets (e.g. [ 0xFF, 0xFF ])?
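
The terminator search described above can be sketched as a scan over aligned code units. This is an assumption-laden illustration: terminatedLength and its unitSize parameter are hypothetical, and the sketch does not validate the encoded data as a real implementation would.

```javascript
// Hypothetical helper. unitSize is 1 for UTF-8 and the single-byte
// encodings, and 2 for the UTF-16 encodings.
function terminatedLength(view, unitSize) {
  const bytes = new Uint8Array(view.buffer, view.byteOffset, view.byteLength);

  // Step by whole code units so that, for UTF-16, only an aligned
  // 0x00 0x00 pair is treated as U+0000.
  for (let i = 0; i + unitSize <= bytes.length; i += unitSize) {
    let allZero = true;
    for (let j = 0; j < unitSize; j += 1) {
      if (bytes[i + j] !== 0) {
        allZero = false;
      }
    }
    if (allZero) {
      return i; // number of bytes before the terminator
    }
  }
  return -1; // no terminator within view.byteLength bytes
}
```

For the UTF-16BE sequence in the note above, terminatedLength(view, 2) returns 8, since the BOM is counted.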

encode

Encodes the string value into the specified ArrayBufferView view, using the specified encoding. If the encoding parameter is omitted it defaults to "utf-8". The return value is the length of the encoded string, in bytes. No "null terminator" is added to the string, although any trailing \x00 character present in value would be encoded, if expressible in the encoding. An exception (TBD) is raised if the method would write more than view.byteLength bytes. If value cannot be encoded with the specified encoding an exception (TBD) is raised. If the specified encoding is not known, an exception (TBD) is raised.

If an exception is thrown by this method, the target buffer MUST NOT be changed.

ISSUE: Alternately, we could allow "partial fill" and return an object with bytesWritten and charactersWritten properties.
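
One way to satisfy the requirement that a failed call leaves the target buffer unchanged is to finish all validation before the first write. The sketch below assumes UTF-8 only and uses TextEncoder (not part of this proposal) to produce the bytes; encodeUtf8Into is a hypothetical name and the exception types are placeholders.

```javascript
// Hypothetical stand-in for stringEncoding.encode(value, view, "utf-8").
function encodeUtf8Into(value, view) {
  // A lone surrogate has no valid UTF-8 encoding. TextEncoder would silently
  // substitute U+FFFD, so check for it explicitly first.
  for (let i = 0; i < value.length; i += 1) {
    const unit = value.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      const next = value.charCodeAt(i + 1); // NaN at the end of the string
      if (!(next >= 0xDC00 && next <= 0xDFFF)) {
        throw new Error("lone surrogate cannot be encoded");
      }
      i += 1; // skip the low half of a valid pair
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      throw new Error("lone surrogate cannot be encoded");
    }
  }

  const encoded = new TextEncoder().encode(value);
  if (encoded.length > view.byteLength) {
    throw new RangeError("view is too small"); // nothing written yet
  }

  // All checks passed, so the write cannot fail part-way through.
  new Uint8Array(view.buffer, view.byteOffset, view.byteLength).set(encoded);
  return encoded.length;
}
```

The cost of this approach is a temporary copy of the encoded bytes; a "partial fill" design would avoid that copy at the price of a more complex return value.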

encodedLength

Computes and returns the length, in bytes, of the string value if it were to be encoded using the specified encoding. If the encoding parameter is omitted it defaults to "utf-8". If value cannot be encoded with the specified encoding an exception (TBD) is raised. If the specified encoding is not known, an exception (TBD) is raised.

NOTE: If the encoding includes a BOM, the length of the BOM is included. For example, the string ABC may be encoded in UTF-16 as the octets 0xFE 0xFF 0x00 0x41 0x00 0x42 0x00 0x43 and have a length of 8.
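
The length computation does not require producing the encoded bytes. A minimal sketch, with hypothetical helper names, for UTF-8 and UTF-16 follows; unlike the method specified here it does not raise an exception for unencodable input such as lone surrogates.

```javascript
// Hypothetical helper: byte length of value when encoded as UTF-8.
function utf8EncodedLength(value) {
  let length = 0;
  for (const ch of value) { // for...of iterates by code point
    const codePoint = ch.codePointAt(0);
    if (codePoint < 0x80) {
      length += 1;
    } else if (codePoint < 0x800) {
      length += 2;
    } else if (codePoint < 0x10000) {
      length += 3;
    } else {
      length += 4;
    }
  }
  return length;
}

// Hypothetical helper: byte length of value when encoded as UTF-16.
// A DOMString is already a sequence of 16-bit code units, so each one
// costs two bytes; the BOM, when written, adds two more.
function utf16EncodedLength(value, withBom) {
  return (withBom ? 2 : 0) + 2 * value.length;
}
```

For the string ABC, utf8EncodedLength returns 3 and utf16EncodedLength with a BOM returns 8, matching the note above.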

Examples

Example #1 - encoding strings

The following example uses the API to encode an array of strings into an ArrayBuffer. The result is an ArrayBuffer containing the number of strings (as a Uint32), followed by the length of the first string (as a Uint32), the UTF-8 encoded string data, the length of the second string (as a Uint32), the string data, and so on.

function encodeArrayOfStrings(strings) {
  var len, i, bytes, view, offset;

  // First pass: compute the total size (string count, then a length prefix
  // plus the encoded bytes for each string).
  len = Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < strings.length; i += 1) {
    len += Uint32Array.BYTES_PER_ELEMENT;
    len += stringEncoding.encodedLength(strings[i], "utf-8");
  }

  bytes = new Uint8Array(len);
  view = new DataView(bytes.buffer);
  offset = 0;

  // Second pass: write the count, then each string after its length prefix.
  // DataView writes are big-endian unless told otherwise.
  view.setUint32(offset, strings.length);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < strings.length; i += 1) {
    // Encode past the slot reserved for the prefix, then fill in the prefix.
    len = stringEncoding.encode(strings[i],
                                new DataView(bytes.buffer, offset + Uint32Array.BYTES_PER_ELEMENT),
                                "utf-8");
    view.setUint32(offset, len);
    offset += Uint32Array.BYTES_PER_ELEMENT + len;
  }
  return bytes.buffer;
}

Example #2 - decoding strings

The following example decodes an ArrayBuffer containing data encoded in the format produced by the previous example back into an array of strings.

function decodeArrayOfStrings(buffer) {
  var view, offset, num_strings, strings, i, len;

  view = new DataView(buffer);
  offset = 0;
  strings = [];

  // Read the string count (big-endian, the DataView default).
  num_strings = view.getUint32(offset);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < num_strings; i += 1) {
    // Each string is stored as a Uint32 byte length followed by UTF-8 data.
    len = view.getUint32(offset);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    strings[i] = stringEncoding.decode(new DataView(buffer, offset, len),
                                       "utf-8");
    offset += len;
  }
  return strings;
}
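
To experiment with code written against this API where stringEncoding is not available, a stand-in object can be sketched on top of TextEncoder and TextDecoder. Those two interfaces are not part of this proposal, and the stand-in below is an assumption-heavy approximation: it handles UTF-8 only, ignores the encoding argument, and replaces lone surrogates instead of raising an exception.

```javascript
// Approximate stand-in for the proposed stringEncoding object (UTF-8 only).
var stringEncoding = {
  encodedLength: function (value) {
    return new TextEncoder().encode(value).length;
  },

  encode: function (value, view) {
    var encoded = new TextEncoder().encode(value);
    if (encoded.length > view.byteLength) {
      throw new RangeError("view is too small");
    }
    // Write relative to the view's own offset within its buffer.
    new Uint8Array(view.buffer, view.byteOffset, view.byteLength).set(encoded);
    return encoded.length;
  },

  decode: function (view) {
    // fatal: true throws on invalid UTF-8 rather than substituting U+FFFD.
    return new TextDecoder("utf-8", { fatal: true }).decode(view);
  }
};
```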

Encodings

Encoding names are case-insensitive.
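
Case-insensitive matching can be sketched as a normalization step ahead of a lookup in the set of names listed in this section. The function name resolveEncoding and the exception type are hypothetical.

```javascript
// The encodings named in this section, in lower case.
const SUPPORTED_ENCODINGS = new Set([
  "ascii", "iso-8859-1", "binary",
  "utf-8", "utf-16", "utf-16le", "utf-16be"
]);

// Hypothetical helper: returns the normalized name, defaulting to "utf-8"
// when the argument is omitted, and throws for unknown encodings.
function resolveEncoding(name = "utf-8") {
  const normalized = String(name).toLowerCase();
  if (!SUPPORTED_ENCODINGS.has(normalized)) {
    throw new Error("unknown encoding: " + name);
  }
  return normalized;
}
```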

Standard Encodings

Implementations MUST support all of the following encodings:

ISSUE: Should anything be said about canonical forms in the Unicode encodings?

Implementations MUST NOT support encodings not listed in the specification, to avoid interoperability issues.

ASCII

  • decode: exception thrown if any octet in array is greater than 0x7F
  • encode: exception thrown if value string contains a character beyond U+007F

ISO-8859-1

  • decode: No encoding-specific exceptions are thrown
  • encode: exception thrown if value string contains a character beyond U+00FF

BINARY

  • decode: No encoding-specific exceptions are thrown
  • encode: exception thrown if value string contains a character beyond U+00FF
NOTE: ISO-8859-1 and BINARY are functionally identical in this specification. Both are included so that callers can be more explicit about the type of data being handled. Storing binary data in ECMAScript strings, one byte per character, was a common approach before Typed Array support was available. This "BINARY" encoding allows for easy interoperation with this legacy style of binary storage.
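
The three single-byte encodings above differ only in the largest permitted value, so they can be sketched with one pair of hypothetical helpers. The maxValue parameter and the exception type are assumptions of this sketch.

```javascript
// Hypothetical helpers. maxValue is 0x7F for ASCII and 0xFF for
// ISO-8859-1 and BINARY.
function encodeSingleByte(value, maxValue) {
  const bytes = new Uint8Array(value.length);
  for (let i = 0; i < value.length; i += 1) {
    const unit = value.charCodeAt(i);
    if (unit > maxValue) {
      throw new RangeError("character cannot be encoded");
    }
    bytes[i] = unit; // one byte per character
  }
  return bytes;
}

function decodeSingleByte(view, maxValue) {
  const bytes = new Uint8Array(view.buffer, view.byteOffset, view.byteLength);
  let result = "";
  for (const byte of bytes) {
    if (byte > maxValue) {
      throw new RangeError("octet out of range for this encoding");
    }
    result += String.fromCharCode(byte);
  }
  return result;
}
```

With maxValue set to 0xFF the decode check can never fail, which matches the statement that ISO-8859-1 and BINARY decoding throw no encoding-specific exceptions.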

UTF-8

  • decode: BOM accepted (0xEF 0xBB 0xBF), exception thrown on invalid/truncated UTF-8 sequence; non-BMP characters in the UTF-8 encoded string yield UTF-16 surrogate pairs in the DOMString. BOM is not returned as part of the DOMString.
  • encode: BOM is not written. Exception (TBD) thrown when there is no valid UTF-8 encoding of the string (e.g. "abc\uD800def" which contains a UTF-16 "surrogate half")
ISSUE: Are over-long UTF-8 sequences allowed?
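
The decode behaviors listed for UTF-8 can be observed with TextDecoder in fatal mode, used here purely as a stand-in since it is not part of this proposal. The helper names are hypothetical. TextDecoder happens to reject over-long sequences; that is shown for illustration and does not resolve the issue above.

```javascript
// Hypothetical stand-in for decoding UTF-8 with errors reported by exception.
function decodeStrict(octets) {
  // A fresh decoder per call keeps the calls independent of each other.
  return new TextDecoder("utf-8", { fatal: true }).decode(new Uint8Array(octets));
}

function isRejected(octets) {
  try {
    decodeStrict(octets);
    return false;
  } catch (e) {
    return true;
  }
}

// The BOM is accepted and is not returned as part of the string.
const withBom = decodeStrict([0xEF, 0xBB, 0xBF, 0x41]);

// U+1F600 is outside the BMP, so it arrives as a UTF-16 surrogate pair.
const nonBmp = decodeStrict([0xF0, 0x9F, 0x98, 0x80]);

const truncatedRejected = isRejected([0xE2, 0x82]); // first two bytes of U+20AC
const overlongRejected = isRejected([0xC0, 0x80]);  // over-long form of U+0000
```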

UTF-16

  • decode: exception thrown if BOM is not present. BOM is not returned as part of the DOMString.
  • encode: outputs a BOM prefix; can be either LE or BE. Implementations may choose to always use the same endianness, or may match the machine architecture for better performance. Callers should not make assumptions about the endianness, and should use the UTF-16BE or UTF-16LE encodings if a specific endianness is desired.
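
Decoding with a mandatory BOM can be sketched by reading the first two octets to choose the byte order. The function name decodeUtf16WithBom is hypothetical, the exception type is a placeholder, and rejecting an odd byte length is an assumption carried over from the UTF-16LE and UTF-16BE rules.

```javascript
// Hypothetical helper: decode UTF-16 whose byte order is given by its BOM.
function decodeUtf16WithBom(view) {
  if (view.byteLength < 2 || view.byteLength % 2 !== 0) {
    throw new Error("length must be a non-zero even number of bytes");
  }
  const data = new DataView(view.buffer, view.byteOffset, view.byteLength);

  // Read the first two octets as big-endian: 0xFE 0xFF gives 0xFEFF,
  // while 0xFF 0xFE (the little-endian BOM) gives 0xFFFE.
  const bom = data.getUint16(0, false);
  let littleEndian;
  if (bom === 0xFEFF) {
    littleEndian = false;
  } else if (bom === 0xFFFE) {
    littleEndian = true;
  } else {
    throw new Error("BOM not present");
  }

  let result = "";
  for (let offset = 2; offset < data.byteLength; offset += 2) {
    result += String.fromCharCode(data.getUint16(offset, littleEndian));
  }
  return result; // the BOM itself is not part of the result
}
```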

UTF-16LE

  • decode: BOM not required, but accepted (0xFF 0xFE); throws if incorrect BOM found or overall length is odd number of bytes. BOM is not returned as part of the DOMString.
  • encode: does not write a BOM
ISSUE: throw if invalid surrogate pair encountered?

UTF-16BE

  • decode: BOM not required, but accepted (0xFE 0xFF); throws if incorrect BOM found or overall length is odd number of bytes. BOM is not returned as part of the DOMString.
  • encode: does not write a BOM
ISSUE: throw if invalid surrogate pair encountered?
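
Encoding with an explicit byte order and no BOM is a direct copy of the string's 16-bit code units. A minimal sketch, with the hypothetical name encodeUtf16, follows; it performs no surrogate validation, which the issue above leaves open.

```javascript
// Hypothetical helper: littleEndian selects UTF-16LE (true) or UTF-16BE (false).
function encodeUtf16(value, littleEndian) {
  const bytes = new Uint8Array(value.length * 2);
  const data = new DataView(bytes.buffer);
  for (let i = 0; i < value.length; i += 1) {
    // DataView makes the byte order explicit and independent of the
    // machine architecture.
    data.setUint16(i * 2, value.charCodeAt(i), littleEndian);
  }
  return bytes; // no BOM is written
}
```

For the string ABC this produces 0x00 0x41 0x00 0x42 0x00 0x43 as UTF-16BE and 0x41 0x00 0x42 0x00 0x43 0x00 as UTF-16LE.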

Acknowledgements

  • Alan Chaney
  • Ben Noordhuis
  • Glenn Maynard
  • John Tamplin
  • Kenneth Russell (Google, Inc)
  • Robert Mustacchi
  • Ryan Dahl


Appendix

A "shim" implementation in JavaScript (that may not fully match the current version of the spec) plus some initial unit tests can be found at:

http://code.google.com/p/stringencoding/