Extensions

Ways to arbitrarily extend text/html for new vocabularies

Please put ideas for what it should look like here, each in their own section.

Each example should explain in detail (ideally with examples) how to handle:

Syntax errors at the tokeniser level, the tree construction level, and the schema level.
Existing content that happens to use elements or syntax for which you are proposing special processing rules.
Pages that contain any special syntax after that syntax was copied and pasted by an ignorant Web author from a valid page written by a competent Web author aware of the new syntax.

See also SVG-specific proposals in Diagrams in HTML.

Proposal 1: xmlns strawman

When you hit an element with an xmlns="" attribute, switch to an XML parser until that parser has parsed the matching end tag.

 bla bla text/html bla bla <foo xmlns="http://example.com/foo"><this><must/>
 be<valid>XML! </valid></this> must be.</foo> bla bla text/html

Errors cause the entire page to stop parsing.

Existing pages are not handled.

Pages that copy-and-paste this syntax then use it incorrectly are not handled.
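
The strawman's hand-off to an XML parser can be sketched roughly as follows. This is only an illustration, assuming Python's standard xml.etree.ElementTree as the XML parser; the function name split_xml_island and the regex-based search for the matching end tag are hypothetical and much cruder than a real tokeniser would be.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: the first start tag that carries a default xmlns attribute.
XMLNS_START = re.compile(r'<([A-Za-z][\w.-]*)(?:\s[^<>]*?)?\sxmlns="[^"]*"[^<>]*>')


def split_xml_island(source):
    """Return (html_before, xml_root, html_after).

    If no element has an xmlns attribute, returns (source, None, "").
    Raises xml.etree.ElementTree.ParseError when the island is not
    well-formed, mirroring "errors cause the entire page to stop parsing".
    """
    match = XMLNS_START.search(source)
    if match is None:
        return source, None, ""
    name = match.group(1)
    # Count nested same-named elements to locate the matching end tag.
    tag = re.compile(r'<(/?)%s(?=[\s/>])[^<>]*?(/?)>' % re.escape(name))
    depth = 0
    for m in tag.finditer(source, match.start()):
        if m.group(1):          # end tag
            depth -= 1
        elif not m.group(2):    # start tag that is not self-closing
            depth += 1
        if depth == 0:
            island = source[match.start():m.end()]
            return source[:match.start()], ET.fromstring(island), source[m.end():]
    raise ET.ParseError("no matching end tag for <%s>" % name)
```

Note that the sketch inherits the problems listed above: a legacy page that already carries an xmlns="" attribute on non-XML content would raise an error here too.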

Reasons why we can't do this

There are pages that already specify xmlns="" attributes that would break if the content were processed as XML. For example, http://www.live.com/.

Is there (much?) existing content with arbitrary namespaces, or could we enumerate a set of junk namespaces to ignore, and then do this for others?

Or could we at least do it for a whitelist of namespaces?
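
As a rough sketch of the whitelist idea: the parser would only switch modes when the declared namespace is on a fixed list. The list contents (SVG and MathML) and the name switches_to_xml are assumptions for illustration, not part of any proposal text.

```python
SVG_NS = "http://www.w3.org/2000/svg"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Assumed whitelist; which namespaces belong here is an open question.
XML_MODE_NAMESPACES = frozenset({SVG_NS, MATHML_NS})


def switches_to_xml(attributes):
    """Return True if a start tag's attributes should trigger XML parsing.

    `attributes` is a dict of attribute names to values for one start tag.
    Any xmlns value not on the whitelist is treated as legacy junk.
    """
    return attributes.get("xmlns") in XML_MODE_NAMESPACES
```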

The xmlns="" attribute, when used for HTML5 extensibility purposes, should probably be clearly marked as such, to disambiguate it from legacy uses. For example, it could be explicitly declared at the root of the document:

  <html xmlns:xmlns="urn:html5:xmlns:for-example">
     ...
     <foo xmlns="http://example.com/foo">
       <!-- the region of the "foo" extension -->
     </foo>
     ...
  </html>

Is it OK to switch parsing mode mid-document like this (and to effectively require an XML parser in every UA)?

Proposal 2: Extensibility Element

This is a proposed generic extensibility point, for SVG and possibly MathML or other XML content, with the temporary placeholder name of <ext> (the real element name would be some token that doesn't clash with existing content). Inside you can use XML or another format. Naturally, any content placed in an <ext> element would have to be understood by the UA in order to render correctly, and more complex rules may need to be developed for specific kinds of interaction between the root document and the inline content, such as with script, CSS, etc. For some discussion, see the IRC logs.

 <p>Hello world. 
    <ext>
       <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">
          <circle cx="5" cy="5" r="5" stroke="green"/>
       </svg>
    </ext> 
 </p>

We should define a content model for where the <ext> element can occur, and whether there are implications for different locations (such as inside a table, a paragraph, the head, etc). The simplest thing, at least for SVG (and probably MathML), would be for it to have the same restrictions as an <img> element. Also, there should be a default block model for <ext> in CSS.

Please define this in enough detail that I can construct a tokeniser and tree constructor from the description. Ideally, just provide the actual tokeniser and tree constructor that you are proposing. This is what it looks like in the HTML5 spec (sections 8.2.3 and 8.2.4). It's probably easiest to define it as a delta from what the spec has today.

Notes:

This is similar to IE's "XML islands" with the <xml> element. It's believed that there are some conflicts with the <xml> element itself, since it creates a separate document that is tied to the <xml> element in the DOM, but more research is needed. See also Using XML Data Islands in Mozilla.

The <ext> element could potentially be an implicit element, generated by the HTML5 parser on encountering a start tag of e.g. "<svg " or "<math ". That would save authors from having to type this extra element, but has the drawback that it doesn't provide fallback content for legacy UAs. -Ed

We could specify exactly what flavors of markup must be supported by a UA, and which may be supported by a UA. This would be rather restrictive, but could improve interoperability of UA features, and would ensure that the proper DOM interfaces are available. For example, SVG and MathML must be supported, and FooML may be supported (or something).

Error Handling

Pick one! Or split this into several proposals, one for each error-handling option, so that each can be evaluated. The proposals below are just brief notes, not detailed enough for me to know what you mean. -Hixie

Notes:
The main options seem to be:
# strict XML parsing (not favored by many)
# very permissive error handling (as in HTML5); this idea is controversial 
  and has many open issues, which should be detailed below 
# moderate error handling, as detailed in SVG Tiny 1.2 
# other ideas?

The chief risk with permissive error handling is that it would create 
content that is not compatible across different UAs, including mobile 
devices and authoring tools.

Proposal:

Tree construction recovers from errors by closing the <svg> element, and not rendering any content after the error.
Case folding is not supported within the main body of the <ext> element, though it would be within the <fallback> element.
The tree builder would assign the appropriate namespace URI to the element and attribute nodes it creates.
Unknown attributes and elements are ignored.
Unquoted attribute values will be ignored (should the element also not be rendered?).
If the "/>" is not found at the end of an element, all subsequent elements will be placed as child elements of that element (and thus not rendered) until a matching closing tag is found, or until a matching closing root tag is found, or until the "</ext>" tag is found.
- If a matching closing tag is found, the element is closed, and subsequent elements are rendered as normal.
- If a matching closing root tag is found, the element is closed, the root tag is closed, and any subsequent elements are ignored if outside a root tag.
- If a matching closing <ext> tag is found, the element is closed, the root tag is closed, the <ext> tag is closed, and HTML processing continues as normal.
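
The three closing rules above can be sketched as a small tree constructor working on pre-tokenised input. This is a guess at the intended behaviour, not a definition: the token tuples, the build_tree name, and the choice to ignore stray end tags are all assumptions that the proposal does not spell out.

```python
def build_tree(tokens, root_names=("svg", "math")):
    """Build a nested (name, children) tree for the content of one <ext>.

    tokens: list of ("start", name, self_closing) or ("end", name) tuples,
    beginning just after the <ext> start tag.
    Returns (ext_node, tokens_consumed); normal HTML processing would
    resume at tokens[tokens_consumed].
    """
    ext = ("ext", [])
    stack = [ext]
    consumed = 0
    for token in tokens:
        consumed += 1
        if token[0] == "start":
            _, name, self_closing = token
            if len(stack) == 1 and name not in root_names:
                continue  # outside a root element: ignored
            node = (name, [])
            stack[-1][1].append(node)
            if not self_closing:
                # Missing "/>": later elements become children of this one.
                stack.append(node)
        else:
            name = token[1]
            if name not in [open_node[0] for open_node in stack]:
                continue  # assumption: stray end tags are ignored
            # Close every open element up to and including the match.
            while stack.pop()[0] != name:
                pass
            if not stack:
                break  # </ext> reached: hand back to the HTML parser
    return ext, consumed
```

For example, a <circle> that is missing its "/>" swallows the following <rect/> as a child, and the later </svg> closes both the circle and the root.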

Please explain the manner in which you find this aspect incomplete or lacking, with links to relevant sections of the HTML5 spec for reference, and I will try to fix any problems. -Shepazu

One question is how to find the matching tag. Which parsing mode is active after 2-open/1-close:

Notes (not sure this fits in the above proposal):
* Tokenizer recovers from errors by ignoring subsequent content 
  between the error and the closing <ext> tag, closing the <svg> 
  element and <ext> element, and moving on with normal HTML 
  processing; for SVG, any element with errors is not rendered.
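
A minimal sketch of this recovery, assuming Python's xml.etree.ElementTree pull parser stands in for the tokenizer: elements that were completed before the first error are kept, everything from the error up to the closing </ext> tag is dropped, and the remaining text is handed back for normal HTML processing. The function name recover_ext and the handling of a missing </ext> are assumptions.

```python
import xml.etree.ElementTree as ET


def recover_ext(source):
    """Return (tags completed before any error, text after </ext>)."""
    open_tag, close_tag = "<ext>", "</ext>"
    start = source.index(open_tag) + len(open_tag)
    end = source.find(close_tag, start)
    if end == -1:
        end = len(source)  # assumption: an unterminated <ext> runs to end of input
    body = source[start:end].strip()

    parser = ET.XMLPullParser(events=("end",))
    parser.feed(body)
    completed = []
    try:
        for _event, element in parser.read_events():
            completed.append(element.tag)
    except ET.ParseError:
        pass  # ignore everything between the error and </ext>
    return completed, source[end + len(close_tag):]
```

With an unquoted attribute value in the second child, only the first child survives; the <svg> root itself never completes, which matches "any element with errors is not rendered".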

See the following emails by Henri Sivonen for comparison and contrast:

Embedded HTML

The case of content inside a <foreignObject> element could be subject to the parsing model of the root document. (Note that this is only a partial solution, and more thought and details are needed.)

For content outside <foreignObject>, it should follow the XML processing rules.

Fallback Behavior

This is an opportunity to get nice fallback behavior, as well.

Here's a possible suggestion, where the raster image would show in UAs that didn't support the <ext> syntax, and the SVG would show in those that did (and which support SVG). In UAs which support <ext> and not SVG, the fallback would also be the raster. The fallback content should be inside a wrapper element (<fallback>), so that you can have rich fallback options, such as an image map, a table, <canvas> and an accompanying <script> element, or whatever; in this case, I also include fallback CSS to hide textual content in title, desc, and text elements, but it may be desirable to leave this content as alternate text to the image, even including styling.

For MathML content, a conditional CSS override could allow for CSS styling of MathML elements for those that don't render MathML natively.

Note: as stated before, the names of the <ext> and <fallback> elements are subject to change based on existing element names in the wild.

<html lang="en">
<head>
	<title>HTML Extensibility Test</title>
</head>
<body>
	<h1 id="test_of_extensibility">Test of Extensibility</h1>
	<p>This is a test of an extensibility point in text/html, with a fallback mechanism.</p>
	<ext>
		<fallback>
			<img src="anIsland.png" alt="..."/>
			<style type='text/css'>
				svg > * { display: none; }
			</style>
		</fallback>
		<svg xmlns="http://www.w3.org/2000/svg"
		     xmlns:xlink="http://www.w3.org/1999/xlink"
		     width="100%" height="100%"
		     version="1.1">
			<title>My Title</title>
			<desc>schepers, 01-04-2008</desc>
			<circle id="circle_1" cx="75" cy="25" r="20" fill="lime" />
			<text id='text_1' x='10' y='25' font-size='18' fill='crimson'>This is some text.</text>
		</svg>
	</ext>
</body>
</html>

Reasons why we can't do this

It's not clear what the processing model being proposed actually is. However, there is already one problem:

The idea relies on not conflicting with legacy content. Unfortunately, whatever syntax we end up using, people will copy and paste it from documents that were written by competent authors that tested it against the new UAs, into documents written by authors who don't know about this, and who don't have the new UA, thus creating new "legacy documents" that use whatever syntax we come up with. Saying the risk is minimal doesn't mitigate this problem. It's a real problem, and we have to deal with it. Such content clearly wouldn't work in the legacy UAs, and so the mistake will have no reason to propagate (the novice author will have no incentive to copy such content if it doesn't work in their UA); further, this is an issue with any new HTML5 syntax. Please explain in more detail how this is a problem. -Shepazu

Note also that the fallback idea doesn't work. Elements like <script>, <style>, <title>, <input>, <textarea>, etc. get treated as HTML elements in legacy UAs. Please expand on this; the <fallback> proposal relies on the fact that any content in the <fallback> element is treated as HTML, so it's not clear what the objection is. -Shepazu

Relying on CSS for hiding the text content doesn't work either, because CSS is optional and might not be enabled (or supported). In this case, legacy browsers that don't support <fallback> and in which CSS (or optionally JS) is not enabled or supported would, unfortunately, print the text nodes; this still doesn't break the page, however, and any page which relies on JS or CSS will always fail in such UAs; this is true of almost every Webapp. -Shepazu

(It doesn't much matter, though, because fallback isn't one of the things we're trying to address with this.) This seems like an artificial constraint; any solution which solves another larger problem (even if that's not the problem being addressed) is not a bug, it's a feature. -Shepazu

Proposal 3: XML5

Microsoft has published a whitepaper on the subject of Improved Namespace Support. Salient features:

Windows Internet Explorer 8 Beta 1 for Developers offers Web developers the opportunity to write standards-compliant HTML-based Web pages that support features (such as SVG, XUL, and MathML) in namespaces, provided that the client has installed appropriate handlers for those namespaces via binary behaviors. (A binary behavior is a type of ActiveX control.) Tbroyer: note that those behaviors don't change the way the markup is parsed into a DOM; at least for elements whose name contains a colon (haven't tested this in IE8, but this is the way it is since IE5.5)
Internet Explorer 8 does not support the XHTML namespace definition. Thus, default namespace declarations of XHTML are ignored (xmlns="http://www.w3.org/1999/xhtml"). Tbroyer: this means that you cannot switch from a default namespace back to HTML. Actually, this is true in IE8 in a more general fashion: once you've set a default namespace (i.e. once you've left "HTML"), you cannot switch to another; the whitepaper describes this as "Nesting of multiple default namespaces is not allowed; in other words, a default namespace declaration inside of another default namespace declaration will be ignored."
Internet Explorer 8 does not support default namespace declarations on any known elements such as HTML, SCRIPT, DIV, or STYLE. If default namespace declarations are encountered on these elements, the declaration is ignored (for purposes of existing Web page compatibility).
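
Read together, the three rules quoted above amount to something like the following decision function. This is a paraphrase of the whitepaper excerpts as quoted here, not a description of what IE8 actually does; the function name and the list of known elements (taken from the examples in the quote) are assumptions.

```python
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Illustrative subset only: the quote names HTML, SCRIPT, DIV and STYLE as examples.
KNOWN_HTML_ELEMENTS = frozenset({"html", "script", "div", "style"})


def effective_default_namespace(element_name, declared_ns, inherited_ns=None):
    """Return the default namespace in effect for an element, or None for HTML.

    declared_ns:  value of the element's own xmlns attribute, or None.
    inherited_ns: default namespace already set by an ancestor, or None.
    """
    if inherited_ns is not None:
        return inherited_ns  # nested default declarations are ignored
    if declared_ns is None or declared_ns == XHTML_NS:
        return None          # XHTML declarations are ignored
    if element_name.lower() in KNOWN_HTML_ELEMENTS:
        return None          # declarations on known elements are ignored
    return declared_ns
```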

A few notes:

Microsoft's IE8 implementation as described by this whitepaper does not satisfy all of the requirements; the above list focuses on the parts that do.
While Microsoft's implementation is based on ActiveX (Tbroyer: see above, ActiveX gives you the behaviors associated with a given namespace URI, but doesn't change the parsing algorithm), the situation could very well end up being similar to XMLHttpRequest, whereby the functionality was first exposed via ActiveX, other browser vendors adopted an alternate object model interface to this same functionality, and that interface was later adopted and standardized.
While the white paper does not explicitly state this requirement, the approach works best if the simple name for the unknown (to HTML5) element which contains the default namespace declaration for which a binary behavior has been installed is not contained within the subtree. Both SVG and MathML have unique elements (svg and math, respectively) that satisfy this purpose. This gives proposal 3 some of the desirable characteristics of proposal 2 spelled out above.
In order to meet the Resistance to errors (e.g. not brittle in the face of syntax errors) requirement, something akin to Anne van Kesteren's XML5 would be required, an implementation of which can be seen on Google Code. Tbroyer: see also the namespaces-in-text-html branch of html5lib

Reasons why we can't do this

I've no idea what IE8 beta 1 is supposed to do. The whitepaper doesn't describe the processing model, the error handling, or how to handle legacy content, and IE8 Beta 1 doesn't seem to implement the whitepaper's syntax at all.
