Extensions

Ways to arbitrarily extend text/html for new vocabularies

Please put ideas for what it should look like here, each in their own section.

Each example should explain in detail (ideally with examples) how to handle:

Syntax errors at the tokeniser level, the tree construction level, and the schema level.
Existing content that happens to use elements or syntax for which you are proposing special processing rules.
Pages that contain any special syntax after that syntax was copied and pasted by an ignorant Web author from a valid page written by a competent Web author aware of the new syntax.

See also SVG-specific proposals in Diagrams in HTML.

Proposal 1: xmlns strawman

When you hit an element with an xmlns="" attribute, switch to an XML parser until that parser has parsed the matching end tag.

 bla bla text/html bla bla <foo xmlns="http://example.com/foo"><this><must/>
 be<valid>XML! </valid></this> must be.</foo> bla bla text/html

Errors cause the entire page to stop parsing.

Existing pages are not handled.

Pages that copy-and-paste this syntax then use it incorrectly are not handled.
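
As a rough illustration of the mode switch described above, the sketch below finds the span of a text/html source that would be handed to the XML parser. It is not part of the proposal: the function name findXmlIsland is hypothetical, it assumes double-quoted xmlns values and simple alphanumeric element names, and it ignores comments, CDATA sections and ">" inside attribute values.

```javascript
// Hypothetical helper, not part of any proposal text: finds the span that
// Proposal 1 would hand to an XML parser. Returns { start, end } offsets into
// `source`, or null when no element carries an xmlns="" attribute.
// Simplification: comments, CDATA and ">" inside attribute values are ignored.
function findXmlIsland(source) {
  const trigger =
    /<([A-Za-z][A-Za-z0-9]*)(?=\s)[^<>]*?\sxmlns="[^"]*"[^<>]*>/.exec(source);
  if (trigger === null) return null;

  const start = trigger.index;
  const end = start + trigger[0].length;
  // A self-closing trigger such as <foo xmlns="..."/> is the whole island.
  if (trigger[0].endsWith("/>")) return { start, end };

  // Count nested elements of the same name until the matching end tag.
  const sameName = new RegExp(
    "<(/?)" + trigger[1] + "(?=[\\s/>])[^<>]*?(/?)>", "g");
  sameName.lastIndex = end;
  let depth = 1;
  let tag;
  while ((tag = sameName.exec(source)) !== null) {
    if (tag[1] === "/") depth -= 1;
    else if (tag[2] !== "/") depth += 1;
    if (depth === 0) return { start, end: sameName.lastIndex };
  }
  // The proposal says errors stop the whole page; a missing end tag is one.
  throw new Error("unclosed <" + trigger[1] + "> element");
}

const page =
  'bla <foo xmlns="http://example.com/foo"><this><must/>be</this></foo> bla';
const island = findXmlIsland(page);
console.log(page.slice(island.start, island.end));
```

Everything between the two offsets would have to be well-formed XML; the text before and after stays with the HTML parser. Finding the end tag by counting same-named tags is only one possible reading of "the matching end tag".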

Reasons why we can't do this

There are pages that already specify xmlns="" attributes that would break if the content were processed as XML. For example, http://www.live.com/.

Is there (much?) existing content with arbitrary namespaces, or could we enumerate a set of junk namespaces to ignore, and then do this for others?

Or could we at least do it for a whitelist of namespaces?

Is it OK to switch parsing mode mid-document like this (and to effectively require an XML parser in every UA)?

Or should we develop a namespace-aware HTML parser instead?

Proposal 2: Extensibility Element

This is a proposed generic extensibility point, for SVG and possibly MathML or other XML content, with the temporary placeholder name of <ext> (the real element name would be some token that doesn't clash with existing content). Inside you can use XML or another format. Naturally, any content placed in an <ext> element would have to be understood by the UA in order to render correctly, and more complex rules may need to be developed for specific kinds of interaction between the root document and the inline content, such as with script, CSS, etc. For some discussion, see the IRC logs.

 <p>Hello world. 
    <ext type="image/svg+xml">
       <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">
          <circle cx="5" cy="5" r="5" stroke="green"/>
       </svg>
    </ext> 
 </p>

We should define a content model for where the <ext> element can occur, and if there are implications for different locations (such as inside a table, a paragraph, the head, etc). The simplest thing, at least for SVG (and probably MathML), would be that it would have the same restrictions as an <img> element. Also, there should be a default block model for <ext> in CSS.

Please define this in enough detail that I can construct a tokeniser and tree constructor from the description. Ideally, just provide the actual tokeniser and tree constructor that you are proposing. This is what it looks like in the HTML5 spec (sections 8.2.3 and 8.2.4). It's probably easiest to define it as a delta from what the spec has today.

Perhaps easier for you. :) Since you are the expert in the tokenizer and the tree construction algorithm, I ask you to extrapolate from what we've already contributed. If you require such precision of detail for all proposals, then the deadline should be expanded for submission, as 12 days from your initial announcement is insufficient time to make a complete proposal; I suggest that 2 months is more appropriate. -Shepazu

This first came up more than 2 years ago, and I've been asking for detailed proposals for most of that time, including repeatedly over the past few weeks; there's been plenty of time for proposals.

Not quite accurate. In the past, you have said that you did not see this in scope for HTML5, but rather for HTML5.1 or such. If you are moving up the timeframe, that's great, but I see no reason to create the rush for this at the present time; the spec is not expected to be stable for at least a year and a half, at the most optimistic. -Shepazu

Regarding extrapolation: I've already tried to extrapolate, and I can't find an extrapolation that works with existing Web content (for example, all the possibilities I've tried end up with the problem I've listed below under "Why it won't work"). That's why I'm asking for details.

Notes:

This is similar to IE's "XML islands" with the <xml> element. It's believed that there are some conflicts with the <xml> element itself, since it creates a separate document that is tied to the <xml> element in the DOM, but more research is needed. See also Using XML Data Islands in Mozilla.

The <ext> element could potentially be an implicit element, generated by the HTML5 parser on encountering a start tag of e.g. "<svg " or "<math ". That would save authors from having to type this extra element, but has a drawback in that it doesn't provide fallback content for legacy UAs. -Ed

We could specify exactly what flavors of markup must be supported by a UA, and which may be supported by a UA. This would be rather restrictive, but could improve interoperability of UA features, and would ensure that the proper DOM interfaces are available. For example, SVG and MathML must be supported, and FooML may be supported (or something).

Error Handling

Notes:
The main options seem to be:
# strict XML parsing (not favored by many)
# very permissive error handling (as in HTML5); this idea is controversial 
  and has many open issues, which should be detailed below 
# moderate error handling, as detailed in SVG Tiny 1.2 
# other ideas?

The chief risk with permissive error handling is that it would create 
content that is not compatible across different UAs, including mobile 
devices and authoring tools.

Proposal:

Tree construction recovers from errors by closing the <svg> element, and not rendering any content after the error. What are you considering an error in tree construction?
Case folding is not supported within the main body of the <ext> element, though it would be within the <fallback> element.
The tree builder would assign the appropriate namespace URI to the element and attribute nodes it creates.
Unknown attributes and elements are ignored. What elements/attributes are known?
Unquoted attribute values will be ignored (should the element also not be rendered?)
If the "/>" is not found at the end of an element, all subsequent elements will be placed as child elements of that element (and thus not rendered) until a matching closing tag is found, or until a matching closing root tag is found, or until the "</ext>" tag is found. Does this mean that <ext> introduces a new scope? E.g., what does the DOM look like for this:
<table><tr><td><ext></table></ext>A
- If a matching closing tag is found, the element is closed, and subsequent elements are rendered as normal.
- If a matching closing root tag is found, the element is closed, the root tag is closed, and any subsequent elements are ignored if outside a root tag.
- If a matching closing <ext> tag is found, the element is closed, the root tag is closed, the <ext> tag is closed, and HTML processing continues as normal.
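
One possible reading of the three recovery rules above, written out as a sketch. The token shape, the function names buildExtTree and outline, the choice of the first element opened inside <ext> as "the root", and the dropping of stray end tags are all assumptions made for illustration; none of them comes from the proposal text.

```javascript
// Hypothetical token shape: { kind: "open" | "empty" | "close", name: string }.
// Builds the tree for the content of one <ext> element under one reading of
// the recovery rules; the first element opened inside <ext> is taken as root.
function buildExtTree(tokens) {
  const ext = { name: "ext", children: [] };
  const stack = [ext]; // open elements, innermost last
  let rootClosed = false;

  for (const token of tokens) {
    if (token.kind === "close" && token.name === "ext") {
      break; // third rule: </ext> closes everything still open
    }
    if (rootClosed) continue; // content after the root tag is ignored

    if (token.kind === "close") {
      // Find the nearest open element with this name (index 0 is <ext>).
      let index = stack.length - 1;
      while (index > 0 && stack[index].name !== token.name) index -= 1;
      if (index === 0) continue; // assumption: stray end tags are dropped
      stack.length = index; // first and second rule: close it and its open children
      if (index === 1) rootClosed = true;
      continue;
    }

    const node = { name: token.name, children: [] };
    stack[stack.length - 1].children.push(node);
    if (token.kind === "open") stack.push(node); // stays open until closed
  }
  return ext;
}

// Compact rendering of the tree, e.g. "ext(svg(circle))".
function outline(node) {
  if (node.children.length === 0) return node.name;
  return node.name + "(" + node.children.map(outline).join(",") + ")";
}

// <svg><circle><rect/></svg><p/></ext>, where <circle> is never closed.
console.log(outline(buildExtTree([
  { kind: "open", name: "svg" },
  { kind: "open", name: "circle" },
  { kind: "empty", name: "rect" },
  { kind: "close", name: "svg" },
  { kind: "empty", name: "p" },
  { kind: "close", name: "ext" },
])));
```

This sketch only covers content that is already inside <ext>. It does not answer the scoping question about an HTML end tag such as </table> appearing inside <ext>.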

One question is how to find the matching tag. Which parsing mode is active after 2-open/1-close:

Notes (not sure this fits in the above proposal):
* Tokenizer recovers from errors by ignoring subsequent content 
  between the error and the closing <ext> tag, closing the <svg> 
  element and <ext> element, and moving on with normal HTML 
  processing; for SVG, any element with errors is not rendered.

See the following emails by Henri Sivonen for comparison and contrast:

Embedded HTML

The case of content inside a <foreignObject> element could be subject to the parsing model of the root document. (Note that this is only a partial solution, and more thought and details are needed.)

For content outside <foreignObject>, it should follow the XML processing rules.

Fallback Behavior

This is an opportunity to get nice fallback behavior, as well.

Here's a possible suggestion, where the raster image would show in UAs that didn't support the <ext> syntax, and the SVG would show in those that did (and which support SVG). In UAs which support <ext> and not SVG, the fallback would also be the raster. The fallback content should be inside a wrapper element (<fallback>), so that you can have rich fallback options, such as an image map, a table, <canvas> and an accompanying <script> element, or whatever; in this case, I also include fallback CSS to hide textual content in title, desc, and text elements, but it may be desirable to leave this content as alternate text to the image, even including styling.

For MathML content, a conditional CSS override could allow for CSS styling of MathML elements for those that don't render MathML natively.

Note: as stated before, the names of the <ext> and <fallback> elements are subject to change based on existing element names in the wild.

<html lang="en">
<head>
	<title>HTML Extensibility Test</title>
</head>
<body>
	<h1 id="test_of_extensibility">Test of Extensibility</h1>
	<p>This is a test of an extensibility point in text/html, with a fallback mechanism.</p>
	<ext type="image/svg+xml">
		<fallback>
			<img src="anIsland.png" alt="..."/>
			<style type='text/css'>
				svg > * { display: none; }
			</style>
		</fallback>
		<svg xmlns="http://www.w3.org/2000/svg"
		     xmlns:xlink="http://www.w3.org/1999/xlink"
		     width="100%" height="100%"
		     version="1.1">
			<title>My Title</title>
			<desc>schepers, 01-04-2008</desc>
			<circle id="circle_1" cx="75" cy="25" r="20" fill="lime" />
			<text id='text_1' x='10' y='25' font-size='18' fill='crimson'>This is some text.</text>
		</svg>
	</ext>
</body>
</html>

Reasons why we can't do this

It's not clear what the processing model being proposed actually is. However, there is already one problem:

The idea relies on not conflicting with legacy content. Unfortunately, whatever syntax we end up using, people will copy and paste it from documents that were written by competent authors that tested it against the new UAs, into documents written by authors who don't know about this, and who don't have the new UA, thus creating new "legacy documents" that use whatever syntax we come up with. Saying the risk is minimal doesn't mitigate this problem. It's a real problem, and we have to deal with it.

Such content clearly wouldn't work in the legacy UAs, and so the mistake will have no reason to propagate (the novice author will have no incentive to copy such content if it doesn't work in their UA); further, this is an issue with any new HTML5 syntax. Please explain in more detail how this is a problem. -Shepazu

Say someone writes:

 <p>foo <newsyntax> ... </newsyntax> bar </p>

...and that that is all good, and then someone copies just the "foo" part, accidentally including the <newsyntax> bit:

 <p>bla bla foo <newsyntax> bla bla </p>

For most features in HTML5, nothing drastically bad will happen. With <ext>, depending on exactly what your proposal is, the rest of the page would now be showing an error or be ignored.

What happens for those features in HTML5 where something bad does happen? Which features are those? This will give me some idea of the range of acceptable behavior. -Shepazu

Note also that the fallback idea doesn't work. Elements like <script>, <style>, <title>, <input>, <textarea> etc, get treated as HTML elements in legacy UAs.

Please expand on this; the <fallback> proposal relies on the fact that any content in the <fallback> element be treated as HTML, so it's not clear what the objection is. -Shepazu

The objection is relating to content in the <svg> part, not the <fallback> part.

Relying on CSS for hiding the text content doesn't work either, because CSS is optional and might not be enabled (or supported).

In this case, legacy browsers that don't support <fallback> and in which CSS (or optionally JS) is not enabled or supported would, unfortunately, print the text nodes; this still doesn't break the page, however, and any page which relies on JS or CSS will always fail in such UAs; this is true of almost every Webapp. -Shepazu

It means the fallback doesn't work -- showing both isn't fallback. And yes, pages relying on CSS are badly designed. HTML5 goes out of its way to make that unnecessary.

(It doesn't much matter, though, because fallback isn't one of the things we're trying to address with this.)

This seems like an artificial constraint; any solution which solves another larger problem (even if that's not the problem being addressed) is not a bug, it's a feature. -Shepazu

Proposal 3: XML5

Microsoft has published a whitepaper, “Improved Namespace Support”, purporting to describe features of Windows Internet Explorer 8 beta 1. Caveat lector: the whitepaper reflects views that held sway within Microsoft Corporation at the time of publication (2008-03). The whitepaper disclaims both the accuracy of the whitepaper after its publication and organizational commitment to the features therein described.

Salient features:

Windows Internet Explorer 8 Beta 1 for Developers offers Web developers the opportunity to write standards-compliant HTML-based Web pages that support features (such as SVG, XUL, and MathML) in namespaces, provided that the client has installed appropriate handlers for those namespaces via binary behaviors. (A binary behavior is a type of ActiveX control.) Tbroyer: note that those behaviors don't change the way the markup is parsed into a DOM; at least for elements whose name contains a colon (haven't tested this in IE8, but this is the way it is since IE5.5)
Internet Explorer 8 does not support the XHTML namespace definition. Thus, default namespace declarations of XHTML are ignored (xmlns="http://www.w3.org/1999/xhtml"). Tbroyer: this means that you cannot switch from a default namespace back to HTML (actually, this is true in IE8 in a more general fashion: once you've set a default namespace, i.e. once you've left "HTML", you cannot switch to another; the whitepaper describes this as "Nesting of multiple default namespaces is not allowed; in other words, a default namespace declaration inside of another default namespace declaration will be ignored.")
Internet Explorer 8 does not support default namespace declarations on any known elements such as HTML, SCRIPT, DIV, or STYLE. If default namespace declarations are encountered on these elements, the declaration is ignored (for purposes of existing Web page compatibility).
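
The three namespace rules listed above can be restated as a small sketch. This is only an illustration of the rules as summarised on this page, not a description of what IE8 actually does; the function name effectiveDefaultNamespace and the path representation are hypothetical, and the set of "known elements" contains only the four names given as examples.

```javascript
const XHTML_NS = "http://www.w3.org/1999/xhtml";
// Only the four names the text gives as examples; the real list is longer.
const KNOWN_HTML_ELEMENTS = new Set(["html", "script", "div", "style"]);

// `path` lists the elements from the document root down to the element of
// interest, each as { name, xmlns? }. Returns the default namespace in effect
// for the last element, or null for plain HTML.
function effectiveDefaultNamespace(path) {
  let active = null;
  for (const element of path) {
    if (element.xmlns === undefined) continue;
    // XHTML default namespace declarations are ignored.
    if (element.xmlns === XHTML_NS) continue;
    // Declarations on known HTML elements are ignored.
    if (KNOWN_HTML_ELEMENTS.has(element.name.toLowerCase())) continue;
    // Nested default namespaces are not allowed: the outer one wins.
    if (active !== null) continue;
    active = element.xmlns;
  }
  return active;
}

console.log(effectiveDefaultNamespace([
  { name: "html", xmlns: XHTML_NS },
  { name: "body" },
  { name: "svg", xmlns: "http://www.w3.org/2000/svg" },
  { name: "foreignObject" },
  { name: "div", xmlns: XHTML_NS },
]));
```

Under these rules the innermost div in the example stays in the SVG namespace, which is the "cannot switch back to HTML" consequence noted above.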

A few notes:

While Microsoft's IE8 implementation as described by this whitepaper does not satisfy all of the requirements, the above list focuses on the parts that do.
While Microsoft's implementation is based on ActiveX (Tbroyer: see above, ActiveX give you the behaviors associated with a given namespace URI, but doesn't change the parsing algorithm), the situation could very well end up being similar to XMLHttpRequest whereby the functionality was first exposed via ActiveX, other browser vendors adopted an alternate object model interface to this same functionality, and that interface was later adopted and standardized.
While the white paper does not explicitly state this requirement, the approach works best if the simple name for the unknown (to HTML5) element which contains the default namespace declaration for which a binary behavior has been installed is not contained within the subtree. Both SVG and MathML have unique elements (svg and math, respectively) that satisfy this purpose. This gives proposal 3 some of the desirable characteristics of proposal 2 spelled out above.
In order to meet the Resistance to errors (e.g. not brittle in the face of syntax errors) requirement, something akin to Anne van Kesteren's XML5 would be required, an implementation of which can be seen on Google Code. Tbroyer: see also the namespaces-in-text-html branch of html5lib

Reasons why we can't do this

I've no idea what IE8 beta 1 is supposed to do. The whitepaper doesn't describe the processing model, the error handling, or how to handle legacy content, and IE8 Beta 1 doesn't seem to implement the whitepaper's syntax at all.

Proposal 4: XML inside comments

Standardize and extend Microsoft's conditional comments syntax, where we introduce a new "operator" (accepts), that takes a MIME type string as operand. This check must also be possible to perform via a new method, acceptsMimeType(in DOMString mimeType), in the ClientInformation interface, to properly handle fallback and legacy UAs.

Only conforming UAs that understand the new syntax and have implemented native support (or have a corresponding plug-in installed) will parse the content inside such a comment
The conditional comment will act as a trigger for conforming UAs, allowing parsing of "XML-islands" with namespaces inside HTML
Tags with an xmlns attribute within a conditional comment should be parsed as XML
Since it's XML, it should be easy for a UA to support export/copy of a given MathML/SVG/etc. to the clipboard
An informative in-line error message should be displayed if non-valid XML is encountered (to help/force authors to behave). The rest of the page should be rendered and not be affected.
Legacy browsers will treat the conditional comment as a normal comment, and instead display the (optional) fallback content
Will not affect legacy content
HTML-validators will not report errors because of embedded XML
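
A minimal sketch of the trigger check described in the list above, assuming the UA already has the comment's data (the text between "<!--" and "-->") and a set of MIME types it accepts. The function name unwrapAcceptsComment and the exact pattern are hypothetical; the proposal does not define the grammar of the "accepts" operator beyond the examples on this page.

```javascript
// Hypothetical helper: given the data of a comment node and the MIME types the
// UA accepts, returns the markup to parse as an XML island, or null when the
// comment must be treated as an ordinary comment.
function unwrapAcceptsComment(commentData, acceptedTypes) {
  const match =
    /^\[if accepts ([^\]\s]+)\]>([\s\S]*)<!\[endif\]$/.exec(commentData);
  if (match === null) return null;                 // not an "accepts" comment
  if (!acceptedTypes.has(match[1])) return null;   // type not supported
  return match[2];
}

const data =
  '[if accepts text/mathml]><math xmlns="http://www.w3.org/1998/Math/MathML">' +
  "<mn>2</mn><mi>x</mi></math><![endif]";
console.log(unwrapAcceptsComment(data, new Set(["text/mathml"])));
console.log(unwrapAcceptsComment(data, new Set(["image/svg+xml"])));
```

A legacy UA never runs this check, so it keeps seeing a plain comment, which is what the proposal relies on for not affecting legacy content.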

Example 1. (Using new javascript method for fallback):

<div id="mathmlFallback">2x</div>
<div id="mathmlContent" style="display:none;">
  <!--[if accepts text/mathml]>
    <math xmlns="http://www.w3.org/1998/Math/MathML"> 
       <mn>2</mn>
       <mi>x</mi>
    </math>

    <script type="text/javascript">
      if (window.navigator.acceptsMimeType && window.navigator.acceptsMimeType('text/mathml')) { // Could we instead expand CSS media queries?
        document.getElementById('mathmlContent').style.display = 'block'; //show the surrounding container (and mathml)
        document.getElementById('mathmlFallback').style.display = 'none'; // hide the fallback content since UA accepts text/mathml
      }
    </script>
  <![endif]-->
</div>

Example 2. (Using new CSS media query type for fallback):

/* CSS rules using new "accepts" media query type, preferably in external stylesheet */
.mathmlContent {display: none;}    /* Default for legacy UAs */
@media accepts text/mathml {       /* Conforming UAs will apply rules */
  .mathmlFallback {display: none;} /* hide the fallback content since UA accepts text/mathml */
  .mathmlContent {display: block;} /* show the mathml content */
}
/* End of CSS */

<div class="mathmlFallback">2x</div>
<div class="mathmlContent">
  <!--[if accepts text/mathml]>
    <math xmlns="http://www.w3.org/1998/Math/MathML"> 
       <mn>2</mn>
       <mi>x</mi>
    </math>
  <![endif]-->
</div>

Example 3. (No fallback)

<!--[if accepts image/svg+xml]>
  <svg xmlns="http://www.w3.org/2000/svg" version="1.1">
    <circle r="100" fill="red" stroke="blue" />
  </svg>
<![endif]-->

A few notes:

The conditional comment should be wrapped in a container that is hidden in legacy UAs, so that comments inside the conditional comment don't "bork" legacy UAs
Since a comment inside a conditional comment can make the fallback JavaScript execute in legacy UAs, a new JavaScript method is required
- This will not be a problem if a new CSS media query type is introduced (also less klunky)
Conforming UAs should treat the conditional comment only as a trigger/filter, and not as a comment (a comment inside a comment is then no problem for conforming UAs)
It is another issue entirely whether the "browser sniffing part" of conditional comments should also be standardized

Reasons why we can't do this

"Klunky" (but then again, everything is relative)
Requires javascript for fallback/handling of legacy UAs
- Could we instead define a new "accepts" CSS media query type against supported MIME types? Something like: @media accepts text/mathml { .mathmlFallback {display:none;} }
Is the conditional comment syntax Microsoft IP? Is that possible since HTML is an open standard?
Is it "OK" to have real content inside a comment to avoid problems with legacy UAs, even if conforming UAs don't "see" it as a comment?
Is it OK to switch parsing mode mid-document like this (and to effectively require an XML parser in every UA)?
- Why not? If a UA implements this new syntax, and claims to support for example MathML or SVG, then of course it needs to be able to parse XML.
