URL: Difference between revisions

(add some proposed terminology)
 
(33 intermediate revisions by 4 users not shown)
Line 1: Line 1:
This documents research and notes around the [http://dvcs.w3.org/hg/url/raw-file/tip/Overview.html URL specification].
{{CC0 page}}
 
This documents research and notes around URLs for the [http://url.spec.whatwg.org/ URL standard].


==Implementations==
Line 7: Line 9:
* http://trac.webkit.org/browser/trunk/Source/WebCore/platform/KURLGoogle.cpp
* http://trac.webkit.org/browser/trunk/Source/WebCore/platform/network/DataURL.cpp (data URLs)
* http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsStandardURL.cpp
* http://mxr.mozilla.org/mozilla-central/source/dom/src/jsurl/nsJSProtocolHandler.cpp (javascript URLs)
* http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/misc/prnetdb.c#1544 (IPv6)
* https://code.google.com/p/chromium/codesearch


==Tests==
Line 12: Line 18:
* https://github.com/cweb/iri-tests


Variants of the following code (runs in Live DOM Viewer) are useful to test which code points are URL escaped in browsers:
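<pre><!DOCTYPE html>
<script>
// w() is the logging helper provided by Live DOM Viewer.
var a = document.createElement("a")
// Length of the serialized URL when the inserted code point is left as-is.
var baseline = "http://x)@x/".length
var cp = 0x100

for (var i = 0; i < cp; i++) {
  a.href = "http://x" + String.fromCharCode(i) + "@x/"
  // A different length means the code point was escaped or otherwise altered.
  if (a.href.length != baseline) {
    w(a.href)
  }
}
</script></pre>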
 
==Parsing==
 
* https://github.com/annevk/url/blob/master/url.js
* http://lists.w3.org/Archives/Public/public-whatwg-archive/2012Sep/0305.html has notes on file URLs in Gecko.
 
==JavaScript libraries==
 
For improving the API we might want to take inspiration from:
 
* http://medialize.github.com/URI.js/
* https://github.com/joyent/node/blob/master/doc/api/url.markdown
* https://github.com/bestiejs/punycode.js (just Punycode)
 
==Schemes==
 
Apart from the scheme-types listed below, the URL Standard identifies "relative schemes", used for parsing a URL into a parsed URL.
 
=== Purpose-specific schemes ===
 
URL schemes are purpose-specific schemes if they only work in one context. These only work for WebSocket:
 
* ws
* wss
 
=== Fetch schemes ===
 
URL schemes are fetch schemes if fetching the URL results in either a network error or a resource with an associated MIME type (potentially sniffed).
 
; ftp
; http
; https : These all can be used by the corresponding protocol directly.
; file : Needs platform-specific interpretation and mapping to a resource on the local file system.
; data : Needs its resource and MIME type information retrieved from its scheme data/query.
; blob
; about : The resource is effectively the result of passing scheme data to a hash table (not sure whether the lookup is case-sensitive; definitely no percent decoding). Query and fragment can be used by the resource.
 
(The same-origin definition should maybe account for about/blob/data.)
 
=== Navigate schemes ===
 
* The "fetch schemes" -> use "fetch"
* javascript
* Not the "purpose-specific schemes" -> error
* All other schemes (including "external schemes")
 
=== External schemes ===
 
Depending on the context, schemes not listed above will either launch an external application or result in a network error. Examples:
 
* mailto
* skype
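
The scheme categories above can be sketched as a small lookup. This is only an illustration: the function names <code>classifyScheme</code> and <code>navigationOutcome</code> and the returned strings are hypothetical, not part of the URL Standard, and the outcome for "javascript" is left unspecified because the notes above do not state it.

```javascript
// Hypothetical helpers illustrating the scheme categories from the notes above.
const PURPOSE_SPECIFIC = new Set(["ws", "wss"]);
const FETCH = new Set(["ftp", "http", "https", "file", "data", "blob", "about"]);

// Returns the category name for a scheme (compared case-insensitively).
function classifyScheme(scheme) {
  const s = scheme.toLowerCase();
  if (PURPOSE_SPECIFIC.has(s)) return "purpose-specific";
  if (FETCH.has(s)) return "fetch";
  if (s === "javascript") return "javascript";
  return "external"; // anything not listed above, e.g. mailto or skype
}

// Maps a scheme to what navigation does with it, following the "Navigate schemes" list.
function navigationOutcome(scheme) {
  switch (classifyScheme(scheme)) {
    case "fetch":
      return "fetch";
    case "purpose-specific":
      return "error";
    case "javascript":
      return "javascript"; // outcome not spelled out in the notes
    default:
      return "external application or network error";
  }
}
```

For example, <code>classifyScheme("mailto")</code> yields "external", while <code>navigationOutcome("wss")</code> yields "error".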
 
==IDNA==
 
=== Definitions ===
 
* IDNA2003+: IDNA2003 with Unicode updated to the latest version. (So not NFKC from Unicode 3.2, although [http://docs.python.org/2/library/unicodedata.html#unicodedata.ucd_3_2_0 Python might do that].) Restrictions on display might be in place.
* IDNA2008+: IDNA2008 with [http://tools.ietf.org/html/rfc5895#section-2 RFC 5895 section 2] mapping and IDNA2003 domain label separators. Display is restricted to IDNA2008, lookup is unrestricted (everything gets Punycoded).
 
=== Implementations ===
 
* IDNA2003+: Safari, Chrome, Firefox
** Changing, see e.g. https://codereview.chromium.org/23642003#msg4
* IDNA2008+: Internet Explorer?
 
=== Tests ===
 
* http://mathias.html5.org/tests/url/idna2003-separators/ IDNA2003 domain label separators are supported everywhere
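
As a rough illustration of what "IDNA2003 domain label separators" means, the sketch below splits a domain string on the four separators listed in RFC 3490 section 3.1 (U+002E, U+3002, U+FF0E, U+FF61). The name <code>splitLabels</code> is hypothetical, and the function does no mapping, validation, or Punycode conversion.

```javascript
// IDNA2003 (RFC 3490, section 3.1) label separators:
// U+002E full stop, U+3002 ideographic full stop,
// U+FF0E fullwidth full stop, U+FF61 halfwidth ideographic full stop.
const LABEL_SEPARATORS = /[\u002E\u3002\uFF0E\uFF61]/;

// Hypothetical helper: splits a domain string into labels without further processing.
// A trailing separator produces an empty final label.
function splitLabels(domain) {
  return domain.split(LABEL_SEPARATORS);
}
```

With this sketch, <code>splitLabels("example\u3002com")</code> gives the labels "example" and "com", and a trailing dot shows up as an empty last label.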
 
=== Algorithms ===


* ToLabels(domain string) -> ASCII-label list (empty label at the end signifies trailing dot) or failure.
* ToASCII(Unicode-label) -> ASCII-label.
* ToUnicode(ASCII-label) -> Unicode label.

(For convenience maybe ToASCII and ToUnicode should accept lists too.)

=== UI ===

Note that this has potential security implications too, but does not matter for interoperability.

* http://www.chromium.org/developers/design-documents/idn-in-google-chrome (also includes summary for other browsers)
* https://wiki.mozilla.org/IDN_Display_Algorithm
* http://www.alvestrand.no/pipermail/idna-update/2011-December/date.html (has lots of background discussion)
* https://bugs.webkit.org/show_bug.cgi?id=126627

=== Notes ===

* Input to DNS is a byte array. (This means that "_" and byte 0x03 can be valid input. Not sure whether "." works within a label. Higher than 0x7F cannot happen if IDNA is used.)
* DNS is of course not the only system in place, but browsers do not seem to care as far as mapping is concerned.
* http://www.unicode.org/mail-arch/unicode-ml/y2011-m07/0036.html http://www.unicode.org/mail-arch/unicode-ml/y2011-m07/0057.html
* http://tools.ietf.org/html/rfc6055 has historical deliberations

== Terminology ==

;URL string: What you find in attribute values, property values, method parameters, etc.
;parse a URL string ''url'' using base URL ''base'': Turning a URL string into a URL by using a base URL.
;URL: An in-memory representation of a URL with various properties as elaborated on by model below.
;URL interface/object: JavaScript representation of a URL.

==Model==

URL (.href)
- invalid?
- scheme (.protocol)
- authority
  - username (proposed .username)
  - password (proposed .password)
  - ip/host (.hostname)
  - port (.port)
- path (.pathname)
- query (.search)
- fragment (.hash)

==Parsing==

parse (urlstr, optional baseURL)
  url = new URL
  tokenize(urlstr)
  SCHEME CHECK START
    if char is in ALPHA
      buffer += char
      -> SCHEME CHECK NEXT
    else
      unconsume char
      -> NO SCHEME
  SCHEME CHECK NEXT
    if char is in ALPHA / DIGIT / "+" / "-" / "."
      buffer += char
      -> continue
    elif char is ":"
      url.scheme = buffer.toASCIILowercase()
      buffer = ""
      -> SCHEME
    else
      input.reset()
      -> NO SCHEME
  SCHEME
    if url.scheme is not hierarchical (data:)
      -> NON-HIERARCHICAL
    elif baseURL and url.scheme is baseURL.scheme (http:?test)
      -> RELATIVE
    else  (https://test.com/)
      -> AUTHORITY START
  NO SCHEME
    if not baseURL or baseURL.scheme is not hierarchical
      url.invalid = true
      return url
    else
      -> RELATIVE
  NON-HIERARCHICAL (could merge with PATH)
    if char is "#"
      -> FRAGMENT
    else
      ...
  RELATIVE
    if char is EOI (end-of-input)
      url = baseURL
      url.fragment = null
      exit
    elif char is "/" or char is "\"
      if next char is "/" or next char is "\"
        url.scheme = baseURL.scheme
        -> AUTHORITY START
      else
        url.scheme = baseURL.scheme
        url.authority = baseURL.authority
        -> PATH
    elif char is "?"
      url.scheme = baseURL.scheme
      url.authority = baseURL.authority
      url.path = baseURL.path
      -> QUERY
    elif char is "#"
      url.scheme = baseURL.scheme
      url.authority = baseURL.authority
      url.path = baseURL.path
      url.query = baseURL.query
      -> FRAGMENT
    else
      url.scheme = baseURL.scheme
      url.authority = baseURL.authority
      prepend input by baseURL.path up to the last /
      -> PATH
  AUTHORITY START
    if char is "/" or char is "\"
      -> continue
    else
      -> AUTHORITY
  AUTHORITY
    ...
  PATH
    if char is "?"
      -> QUERY
    elif char is "#"
      -> FRAGMENT
    else
      buffer += char
  QUERY
    if char is "#"
      -> FRAGMENT
  FRAGMENT
    ...


[[Category:Spec coordination]]
== Requests and Issues ==
=== How to compare URLs ===
There is at least one specification (being implemented) that needs a reference for how to compare URLs for equivalence:
* http://microformats.org/wiki/representative-hcard-parsing
Hence this request: add a new section/feature to the URL spec, "How to compare URLs", with something like "parse them first and then compare the serialization".

See http://krijnhoetmer.nl/irc-logs/whatwg/20141006#l-843 for background and discussion.
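
The "parse them first and then compare the serialization" idea might look like this with the <code>URL</code> API available in browsers and Node.js. The function name <code>urlsEquivalent</code> and the choice to treat unparseable input as not equivalent are assumptions made for this sketch, not something the request above specifies.

```javascript
// Hypothetical sketch: two URL strings are considered equivalent
// when they parse successfully and serialize to the same string.
function urlsEquivalent(a, b, base) {
  let urlA, urlB;
  try {
    urlA = new URL(a, base);
    urlB = new URL(b, base);
  } catch (e) {
    return false; // assumption: input that fails to parse is never equivalent
  }
  return urlA.href === urlB.href;
}
```

Under this sketch, <code>urlsEquivalent("HTTP://EXAMPLE.com:80/a/../b", "http://example.com/b")</code> is true, because parsing lowercases the scheme and host, drops the default port, and resolves the ".." segment before the serializations are compared.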

Latest revision as of 15:31, 20 June 2015

The contents of this page, URL, and all edits made to this page in its history, are hereby released under the CC0 Public Domain Dedication, as described in WHATWG Wiki:Copyrights.
