Validator.nu Web Service Interface: Difference between revisions

From WHATWG Wiki
(Mention JSON)
(wikify and update a little)
Revision as of 11:17, 7 September 2007

This is an inline-commentable updated wiki copy of the original article.

Motivation

First, I assume there is some level of interest in doing RELAX NG / Schematron validation and HTML5 conformance checking. Next, it would be nice to enable applications that deal with documents to make these checks automatically in addition to having the functionality available for human operators as a Web app. For example, a content management system might check the input it is given.

Java apps could just integrate a private copy of the Free Software back end of the validation / conformance checking service. However, non-Java apps would benefit from having the validation / conformance checking service running out of process and having an interface for talking to the out-of-process Java service. The service instance could be hosted publicly or as a local copy. Even some Java developers would elect to use such a service instead of integrating the back end as part of their own app.

Input Modes

The schemas are expected to be relatively static. Therefore, I think preloading them into the service or letting the service retrieve them is sufficient. Identification by URI works in both cases.

What needs different input modes is the document that is checked.

I think the following modes would make sense:

  • Document URI as a GET parameter; the service retrieves the document by URI (already implemented).
  • Document in a data: URI as a GET parameter.
  • Document POSTed as the HTTP entity body (the preferred Web service mode; already implemented).
  • Document POSTed as an application/x-www-form-urlencoded form field value.
  • Document POSTed as a multipart/form-data file upload.

In the first three modes, additional parameters would be communicated in the URI query string. In the last two modes, additional parameters would be communicated the same way form fields are communicated in application/x-www-form-urlencoded and multipart/form-data, respectively.

I don’t particularly like the last two modes, but they are needed to address feature requests and for parity with other services. Also, unlike the first three modes, the last two modes need companion UI changes, which is not nice. As a further complication, the last two don’t come naturally with a Content-Type for dispatching to an HTML5 parser or to an XML parser.

All these input modes would share the same “service endpoint URI” (and the same servlet class). The different cases can be distinguished from the HTTP method and in the POST cases from the Content-Type request header.
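
The dispatch described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the servlet's actual logic: the function name, the returned mode labels and the query parameter name `doc` are all hypothetical.

```python
from urllib.parse import parse_qs


def classify_input_mode(method, content_type, query_string):
    """Pick one of the five input modes from the HTTP method and Content-Type.

    The mode labels and the "doc" parameter name are assumptions made for
    this sketch; the article does not define them.
    """
    method = method.upper()
    # Drop media type parameters such as "; charset=utf-8" or "; boundary=...".
    media_type = (content_type or "").split(";", 1)[0].strip().lower()

    if method == "GET":
        params = parse_qs(query_string)
        document = params.get("doc", [""])[0]
        if document.lower().startswith("data:"):
            return "get-data-uri"
        return "get-uri"

    if method == "POST":
        if media_type == "application/x-www-form-urlencoded":
            return "post-form-field"
        if media_type == "multipart/form-data":
            return "post-file-upload"
        # Any other Content-Type is taken to describe the document itself.
        return "post-entity-body"

    raise ValueError(f"unsupported method: {method}")
```

In the entity body case the Content-Type describes the document and can be used to choose between the HTML5 parser and an XML parser. In the two form-based cases it describes the form encoding, which is the complication mentioned above.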

Output Modes

A Web service probably calls for an XML output format for maximal tool chain integration even though the current HTML output format makes sense for browsers and can carry all the necessary data.

I think the following modes would make sense:

  • HTML with microformat-style class annotations (already implemented except the annotation granularity could be better).
  • XHTML with microformat-style class annotations (already implemented).
  • A custom XML format that is super-simple and uses element names for easier processing with tools that are biased towards keying on element name rather than on attribute value.
  • JSON
  • Human-readable plain text (already implemented)
  • Emacs-compatible formatted text with one item per line
  • Relax-compatible
  • Unicorn-compatible
  • W3C Validator-compatible SOAP
  • EARL

For the HTML and XHTML output formats, there could be an option for suppressing the input form. The output default should be HTML for the browser-targeted input formats. However, the custom XML format might be a reasonable default when the input document was POSTed as the entity body.

The XML Output Format (Draft)

The elements in this XML vocabulary are in the namespace “http://validotor.nu/messages/”.

  • Perhaps the namespace URI should be a data: URI. If the ns URI does not contain any domain name, it cannot contain a domain name that someone is uncomfortable with. hsivonen 14:24, 18 December 2006 (UTC)

The attributes in this XML vocabulary are not in a namespace. The attribute values defined for this XML vocabulary must not have preceding or trailing white space.


Note: The format has been designed to support streaming generation and consumption.
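
To illustrate what streaming consumption could look like, here is a minimal sketch that uses the pull parser from Python's standard library. The function name is hypothetical, and the namespace URI is copied verbatim from the draft text above.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied verbatim from the draft; adjust it if the draft changes.
NS = "{http://validotor.nu/messages/}"
MESSAGE_TAGS = {NS + "info", NS + "warning", NS + "error"}


def stream_messages(chunks):
    """Yield (kind, text) pairs as soon as each message element is complete.

    `chunks` is any iterable of string fragments of the response, for
    example pieces read from a socket.
    """
    parser = ET.XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        for _event, element in parser.read_events():
            if element.tag in MESSAGE_TAGS:
                yield element.tag[len(NS):], "".join(element.itertext())
    parser.close()
```

Because nothing before the messages depends on their total number, a client can act on the first error without waiting for the end of the document.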

Structure and Semantics

The format consists of an XML 1.0 document that has the element messages as the root element.

The root element may have zero or more child elements named info, warning and error. The element info means an informational message. The element warning signifies a potential problem that does not cause the validation/checking to fail. The element error signifies a problem that causes the validation/checking to fail. The character data content of these three elements may contain a human-readable message. (Entity-escaped HTML is not allowed. :-)

The elements info, warning and error have three optional attributes for indicating the context of the message: uri, line and column. The column attribute must not be present unless the line attribute is present as well.
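
As an illustration of the structure described so far, the following sketch builds a small document in this format. The helper name and the shape of its input tuples are assumptions made for the example, and the namespace URI is copied verbatim from the draft text above.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied verbatim from the draft.
NS = "http://validotor.nu/messages/"
MESSAGE_KINDS = ("info", "warning", "error")


def build_messages(items):
    """Serialize (kind, text, context) tuples as a messages document.

    `context` is a dict that may hold the optional uri, line and column
    attributes. The tuple layout is hypothetical.
    """
    # Emit the namespace as the default one (note: this changes a
    # module-wide setting of ElementTree).
    ET.register_namespace("", NS)
    root = ET.Element(f"{{{NS}}}messages")
    for kind, text, context in items:
        if kind not in MESSAGE_KINDS:
            raise ValueError(f"unknown message kind: {kind}")
        # The column attribute must not be present unless line is present.
        if "column" in context and "line" not in context:
            raise ValueError("column must not be present without line")
        attributes = {name: str(value) for name, value in context.items()}
        message = ET.SubElement(root, f"{{{NS}}}{kind}", attributes)
        message.text = text
    return ET.tostring(root, encoding="unicode")
```

For example, `build_messages([("error", "Unclosed element.", {"line": 3, "column": 7})])` returns a document whose root element has a single error child carrying the line and column attributes.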

The uri attribute, if present, must contain the URI (not IRI) of the HTTP resource with which the message is associated or the literal string “data:…” (the last character is U+2026) to signify that the message is associated with a data URI resource but the exact URI has been omitted. (If a client application wishes to show IRIs to human users, it is up to the client application to convert the URI into an IRI.)

The line attribute, if present, must contain a string consisting of characters in the range U+0030 DIGIT ZERO to U+0039 DIGIT NINE which when interpreted as a base-ten integer is a positive integer (not zero). This number means the approximate source text line number associated with the message. The first line is 1.

The column attribute, if present, must contain a string consisting of characters in the range U+0030 DIGIT ZERO to U+0039 DIGIT NINE which when interpreted as a base-ten integer is a positive integer (not zero). This number means the approximate source column number associated with the message on the line indicated by the line attribute. The first character on a line is in column 1.
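
The value rule shared by the line and column attributes could be checked along the following lines. This is a sketch with a hypothetical function name. It deliberately matches only U+0030 to U+0039, because a pattern such as `\d` would also accept other Unicode digits.

```python
import re

# Only U+0030 DIGIT ZERO to U+0039 DIGIT NINE are permitted.
_ASCII_DIGITS = re.compile(r"[0-9]+")


def parse_position(value):
    """Return the attribute value as a positive int, or None if it is not permissible."""
    if value is None or not _ASCII_DIGITS.fullmatch(value):
        return None
    number = int(value)
    # Zero is excluded: the first line is 1 and the first column is 1.
    return number if number > 0 else None
```

Values with surrounding white space, a sign, or a zero value all come back as `None`, so a consumer can simply treat them as if the attribute were absent.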

The source lines and columns are approximate. For example, if a message is related to an attribute, the line and column may point to the first character of the start tag, to the character after the start tag or to the attribute inside the tag, depending on the implementation. If a message is related to character data, the line and column may be inaccurate within a run of text e.g. due to buffering. Furthermore, implementations may count column numbers in terms of UTF-16 code units instead of characters.

The error element may have an attribute called type for indicating that an error is not a general error. Permissible values for the type attribute are: fatal (signifies a well-formedness violation or another error after which no more checking was performed), io (signifies an input/output error), schema (indicates that initializing a schema-based validator failed) and internal (indicates that the validator/checker found a bug in itself, ran out of memory, etc., but was still able to emit a message).

The validation/checking is considered to have failed if there are one or more error elements.

Perhaps io, schema and internal errors should have a different element and the occurrence of this element would be deemed to mean that the result is indeterminate, because the document did not have a chance to fail in its own right.

Processing Model

Clients that consume the message format are referred to as processors. They must use a conforming XML 1.0 processor to parse the format.

If the root element is not an element named messages, the document is deemed to be in an unknown format and not processable according to this processing model.

If a processor encounters an element that it doesn’t recognize, it must process the content of the element as if the start tag and the end tag of the element were not there. If the processor encounters character data as a child of the root element (after applying the rule stated in the previous sentence), it must act as if the character data were not there. If a processor encounters an attribute that it does not recognize, it must ignore the entire attribute. If a processor encounters an attribute that it does recognize but the value of the attribute is not permissible under the previous section, the processor must ignore the entire attribute. If an info, warning or error element does not have a line attribute with a permissible value, a column attribute on the element must be ignored if present.
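
One possible reading of these rules is sketched below. It is not a reference implementation: the function names and the dictionary layout are hypothetical, the namespace URI is copied verbatim from the draft, and the uri attribute is passed through without checking to keep the example short.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URI copied verbatim from the draft.
NS = "{http://validotor.nu/messages/}"
KINDS = {NS + "info": "info", NS + "warning": "warning", NS + "error": "error"}
ERROR_TYPES = {"fatal", "io", "schema", "internal"}
_DIGITS = re.compile(r"[0-9]+")


def _position(value):
    """Return a positive int, or None when the value is not permissible."""
    if value is None or not _DIGITS.fullmatch(value):
        return None
    number = int(value)
    return number if number > 0 else None


def _collect(parent, out):
    # Only element children are visited, so character data directly under
    # the root (or under an unknown wrapper) is skipped.
    for child in parent:
        kind = KINDS.get(child.tag)
        if kind is None:
            # Unknown element: behave as if its tags were not there.
            _collect(child, out)
            continue
        line = _position(child.get("line"))
        # A column without a permissible line is ignored.
        column = _position(child.get("column")) if line is not None else None
        error_type = child.get("type") if kind == "error" else None
        if error_type not in ERROR_TYPES:
            error_type = None
        out.append({
            "kind": kind,
            "uri": child.get("uri"),
            "line": line,
            "column": column,
            "type": error_type,
            # Unknown markup inside a message is flattened to its text.
            "text": "".join(child.itertext()),
        })


def read_messages(xml_text):
    root = ET.fromstring(xml_text)
    if root.tag != NS + "messages":
        raise ValueError("unknown format: root element is not messages")
    out = []
    _collect(root, out)
    return out
```

Unrecognized attributes are never looked at, which is the simplest way to ignore them.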

Note: These rules make it possible to add markup for source code dumps, document outlines and parse trees later without breaking clients. Also, they make it possible to introduce e.g. XHTML markup in the human-readable messages.

Processors must process elements in a way that is consistent with the semantics of the elements.

To determine whether the validation/checking succeeded, processors must determine whether the root element has no error element children. If there are no error children, the validation/checking succeeded. Otherwise, it failed.
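
A pass/fail check in this spirit might look like the sketch below. The function names are hypothetical and the namespace URI is copied verbatim from the draft. The sketch assumes that an error wrapped in an unrecognized element still counts as a child of the root, because the tags of unrecognized elements are treated as absent.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied verbatim from the draft.
NS = "{http://validotor.nu/messages/}"
KNOWN = {NS + name for name in ("info", "warning", "error")}


def _has_error(parent):
    for child in parent:
        if child.tag == NS + "error":
            return True
        # Look through unrecognized wrappers, but not into known elements.
        if child.tag not in KNOWN and _has_error(child):
            return True
    return False


def validation_succeeded(xml_text):
    root = ET.fromstring(xml_text)
    if root.tag != NS + "messages":
        raise ValueError("unknown format: root element is not messages")
    return not _has_error(root)
```

Note that an empty messages element counts as success, and warnings alone do not cause a failure.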

Prior Art

The W3C has defined three XML output formats for the W3C Validator: the SOAP format, the Unicorn format and EARL.

I think there are two problems with the SOAP and Unicorn formats: they are unnecessarily complex and they don’t support streaming output. For example, they require a redundant declaration of the number of errors before the errors themselves (which a client could count on its own if it wants to know the number).

The EARL format assumes that each testable condition has a well-known URI, which does not fit with grammar-based validation and not even with vanilla Schematron.

The W3C Validator also provides simple pass/fail information as HTTP headers, which is nice if you only care about a boolean pass/fail. However, this approach also has the problem that it precludes streaming, because the validation process has to finish before the HTTP headers can be written.

For these reasons, I am not particularly keen on reusing the output formats of the W3C Validator unless it turns out that there are significant network benefits to be reaped from plugging into an existing network of client software. It seems to me that there isn’t a significant network of existing client software.