Difference between revisions of "URL"

From WHATWG Wiki
(add IDNA notes)
(IDNA: update notes)
Line 56:
 
==IDNA==
 
  
- IDNA2003 below is IDNA2003 with updated Unicode (in theory IDNA2003 restricts Unicode to 3.2?)
+ Definitions:
+
+ * IDNA2003+: IDNA2003 with Unicode updated to the latest version. (So not NFKC from Unicode 3.2., although [http://docs.python.org/2/library/unicodedata.html#unicodedata.ucd_3_2_0 Python might do that]... ) Restrictions on display might be in place.
+ * IDNA2008+: IDNA2008 with [http://tools.ietf.org/html/rfc5895#section-2 RFC 5895 section 2] mapping and IDNA2003 domain label separators. Display is restricted to IDNA2008, lookup is unrestricted (everything gets Punycoded).
+
+ Implementations:
+
+ * IDNA2003+: Safari, Chrome, Firefox, Internet Explorer
+ * IDNA2008+: Opera
+
+ Tests:
  
* Opera: http://www.alvestrand.no/pipermail/idna-update/2012-November/007455.html (IDNA2008 + deviations); email does not mention domain label separators or fullwidth mapping, presumably http://tools.ietf.org/html/rfc5895#section-2 is implemented too (though not entirely, see label separators)
 
* Firefox: IDNA2003
 
** https://bugzilla.mozilla.org/show_bug.cgi?id=479520 (implement IDNA2008)
 
* Safari/Chrome: IDNA2003
 
* Internet Explorer: ?
 
 
* http://mathias.html5.org/tests/url/idna2003-separators/ IDNA2003 domain label separators are supported everywhere
 
* Browsers have no restrictions in the ASCII range either. E.g. labels with underscores work.
 
  
- What algorithms do we need. ToLabels(domain string) -> list of labels (trailing dot) or failure. ToASCII(label) -> ASCII-label. ToUnicode(label) -> Unicode label. ToLabels should do validation and such too. ToASCII and ToUnicode ideally never fail because ToLabels already ensured validity.
+ Required algorithms:
+
+ * ToLabels(domain string) -> ASCII-label list (empty label at the end signifies trailing dot) or failure.
+ * ToASCII(Unicode-label) -> ASCII-label.
+ * ToUnicode(ASCII-label) -> Unicode label.
+
+ (For convenience maybe ToASCII and ToUnicode should accept lists too.)
+
+ Notes:
+
+ * Input to DNS is a byte array. (This means that "_" and byte 0x03 can be valid input. Not sure whether "." works within a label. Higher than 0x7F cannot happen if IDNA is used.)
+ * DNS is of course not the only system in place, but browsers do not seem to care as far as mapping is concerned.
+ * http://www.unicode.org/mail-arch/unicode-ml/y2011-m07/0036.html http://www.unicode.org/mail-arch/unicode-ml/y2011-m07/0057.html
+ * http://tools.ietf.org/html/rfc6055 has historical deliberations
  
 
[[Category:Spec coordination]]
 

Revision as of 11:15, 18 November 2012

This page documents research and notes on URLs for the URL Standard.

Implementations

Tests

Variants of the following code (which runs in the Live DOM Viewer) are useful for testing which code points browsers escape in URLs:

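<!DOCTYPE html>
<script>
// w() is the logging helper provided by Live DOM Viewer.
var a = document.createElement("a");

// Length of the serialized URL when the tested character is left as-is
// (")" stands in for any single unescaped character).
var unescapedLength = "http://x)@x/".length;

// Try every code point below 0x100 in the userinfo part of the URL.
for (var i = 0; i < 0x100; i++) {
  a.href = "http://x" + String.fromCharCode(i) + "@x/";
  // A different length means the browser escaped or otherwise altered it.
  if (a.href.length != unescapedLength) {
    w(a.href);
  }
}
</script>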

Parsing

JavaScript libraries

For improving the API we might want to take inspiration from:

Schemes

Currently the parser does not separate out the query; this could be problematic for about and maybe mailto.

  • data
  • javascript
  • mailto
  • about (uselessly defined in RFC 6694)

IDNA

Definitions:

  • IDNA2003+: IDNA2003 with Unicode updated to the latest version. (So not NFKC from Unicode 3.2, although Python might do that...) Restrictions on display might be in place.
  • IDNA2008+: IDNA2008 with RFC 5895 section 2 mapping and IDNA2003 domain label separators. Display is restricted to IDNA2008, lookup is unrestricted (everything gets Punycoded).
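
To make "everything gets Punycoded" concrete, here is a minimal sketch of the Punycode encoder from RFC 3492 applied to a single label. The names punycodeEncode and labelToASCII are made up for this example. The sketch assumes the label has already been mapped and validated, so it skips Nameprep, the RFC 5895 section 2 mapping, display restrictions, and overflow checks; it is not what any browser actually runs.

```javascript
// Punycode parameters from RFC 3492 section 5.
const BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700;
const INITIAL_BIAS = 72, INITIAL_N = 128;

// Bias adaptation, RFC 3492 section 6.1.
function adapt(delta, numPoints, firstTime) {
  delta = firstTime ? Math.floor(delta / DAMP) : Math.floor(delta / 2);
  delta += Math.floor(delta / numPoints);
  let k = 0;
  while (delta > Math.floor(((BASE - TMIN) * TMAX) / 2)) {
    delta = Math.floor(delta / (BASE - TMIN));
    k += BASE;
  }
  return k + Math.floor(((BASE - TMIN + 1) * delta) / (delta + SKEW));
}

// Digit values 0..25 map to "a".."z", 26..35 map to "0".."9".
function digit(d) {
  return String.fromCharCode(d < 26 ? 97 + d : 22 + d);
}

// Hypothetical helper: Punycode-encodes one label (no "xn--" prefix).
function punycodeEncode(label) {
  const input = Array.from(label, (c) => c.codePointAt(0));
  // Copy the basic (ASCII) code points first.
  let output = input.filter((c) => c < 128)
    .map((c) => String.fromCharCode(c)).join("");
  const basicCount = output.length;
  let handled = basicCount;
  if (basicCount > 0) output += "-";

  let n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
  while (handled < input.length) {
    // Smallest code point not yet handled.
    const m = Math.min(...input.filter((c) => c >= n));
    delta += (m - n) * (handled + 1);
    n = m;
    for (const c of input) {
      if (c < n) delta++;
      if (c === n) {
        // Emit delta as a generalized variable-length integer.
        let q = delta;
        for (let k = BASE; ; k += BASE) {
          const t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
          if (q < t) break;
          output += digit(t + ((q - t) % (BASE - t)));
          q = Math.floor((q - t) / (BASE - t));
        }
        output += digit(q);
        bias = adapt(delta, handled + 1, handled === basicCount);
        delta = 0;
        handled++;
      }
    }
    delta++;
    n++;
  }
  return output;
}

// Hypothetical helper: ASCII labels pass through, others get "xn--".
function labelToASCII(label) {
  return /^[\x00-\x7F]*$/.test(label) ? label : "xn--" + punycodeEncode(label);
}

console.log(labelToASCII("b\u00FCcher")); // xn--bcher-kva
```

Because the mapping step is left out, this sketch turns "faß" into "xn--fa-hia", whereas IDNA2003 would first map "ß" to "ss".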

Implementations:

  • IDNA2003+: Safari, Chrome, Firefox, Internet Explorer
  • IDNA2008+: Opera

Tests:

Required algorithms:

  • ToLabels(domain string) -> ASCII-label list (empty label at the end signifies trailing dot) or failure.
  • ToASCII(Unicode-label) -> ASCII-label.
  • ToUnicode(ASCII-label) -> Unicode label.

(For convenience maybe ToASCII and ToUnicode should accept lists too.)
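
As a rough illustration of the first step ToLabels would need, the sketch below splits a domain string on the four IDNA2003 domain label separators (RFC 3490 section 3.1). The name splitLabels is hypothetical, null stands in for "failure", and treating an empty domain as a failure is an assumption of this example. It only splits: the labels come back unmapped and unvalidated, not as ASCII labels.

```javascript
// IDNA2003 domain label separators: U+002E, U+3002, U+FF0E, U+FF61.
const LABEL_SEPARATORS = /[\u002E\u3002\uFF0E\uFF61]/;

// Hypothetical helper: returns a list of labels, or null on failure.
// An empty label at the end of the list signifies a trailing dot.
function splitLabels(domain) {
  const labels = domain.split(LABEL_SEPARATORS);
  for (let i = 0; i < labels.length; i++) {
    const isTrailingDot = i === labels.length - 1 && labels.length > 1;
    // An empty label is only tolerated as the trailing-dot marker.
    if (labels[i] === "" && !isTrailingDot) return null;
  }
  return labels;
}

console.log(splitLabels("example\u3002com")); // [ 'example', 'com' ]
console.log(splitLabels("example.com."));     // [ 'example', 'com', '' ]
console.log(splitLabels("a..b"));             // null
```

No restriction is placed on the ASCII range here, so a label such as "a_b" passes through unchanged.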

Notes: