Validator.nu XML Output

From WHATWG Wiki
Latest revision as of 04:37, 29 December 2016

This document is obsolete.

For the current specification, see: https://github.com/validator/validator/wiki/Output-»-XML

Goal

The native XML output format for Validator.nu for integration into content management systems, etc. This format should be able to expose everything there is to expose in Validator.nu results. (Other XML formats may not fit Validator.nu exactly.)

Note: The format has been designed to support streaming generation and consumption.

Media Type

The Internet media type for this format is application/xml.

Namespaces

The elements in this XML vocabulary are in the namespace “http://n.validator.nu/messages/”. This vocabulary reuses elements from the “http://www.w3.org/1999/xhtml” namespace for human-readable messages. The semantics for the elements in the “http://www.w3.org/1999/xhtml” namespace are defined in HTML 5.

  • Perhaps the namespace URI should be a data: URI. If the ns URI does not contain any domain name, it cannot contain a domain name that someone is uncomfortable with. hsivonen 14:24, 18 December 2006 (UTC)

The attributes in this XML vocabulary are not in a namespace. The attribute values defined for this XML vocabulary must not have preceding or trailing white space.

Structure and Semantics

The format consists of an XML 1.0 document that has the element messages as the root element.

The root element may contain zero or more message elements (info, error and non-document-error), optionally followed by one source element, optionally followed by one parse-tree element.

The root element may have an optional attribute url. The url attribute, if present, must contain the URI (not IRI) of the document being checked or the literal string “data:…” (the last character is U+2026) to signify that the message is associated with a data URI resource but the exact URI has been omitted. (If a client application wishes to show IRIs to human users, it is up to the client application to convert the URI into an IRI.)
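
As a rough illustration of the structure described above, the following sketch parses a small document and lists its message elements. The sample document, its URL and its message texts are made up for this example and are not real Validator.nu output; the function name summarize is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://n.validator.nu/messages/"

# Hypothetical sample document; the URL and message texts are invented.
SAMPLE = """<messages xmlns="http://n.validator.nu/messages/" url="http://example.org/doc.html">
  <info type="warning"><message>Illustrative warning text.</message></info>
  <error last-line="3" last-column="7"><message>Illustrative error text.</message></error>
  <source type="text/html" encoding="utf-8">&lt;!DOCTYPE html&gt;</source>
</messages>"""

MESSAGE_KINDS = {"info", "error", "non-document-error"}


def summarize(xml_text):
    """Return (document URL or None, local names of the message elements in order)."""
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}messages" % NS:
        raise ValueError("unknown format: root element is not 'messages'")
    names = []
    for child in root:
        local = child.tag.split("}", 1)[-1]
        if child.tag.startswith("{%s}" % NS) and local in MESSAGE_KINDS:
            names.append(local)
    return root.get("url"), names


print(summarize(SAMPLE))
```

ElementTree reads the whole document into memory, which keeps the sketch short; a client that wants to benefit from the streaming design would use an incremental parser instead.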

Message Elements

The element info means an informational message or warning that does not affect the validity of the document being checked. The element error signifies a problem that causes the validation/checking to fail. non-document-error signifies an error that causes the checking to end in an indeterminate state because the document being validated could not be examined to the end. Examples of such errors include broken schemas, bugs in the validator and IO errors. (Note that when a schema has parse errors, they are first reported as errors and then a catch-all non-document-error is also emitted.)

Locator Attributes

The elements info, error and non-document-error have five optional attributes for indicating the context of the message: url, first-line, last-line, first-column and last-column. The first-column attribute must not be present unless the first-line attribute is present as well. The last-column attribute must not be present unless the last-line attribute is present as well. The first-line attribute must not be present unless the last-line attribute is present as well.

The url attribute, if present, must contain the URI (not IRI) of the resource with which the message is associated or the literal string “data:…” (the last character is U+2026) to signify that the message is associated with a data URI resource but the exact URI has been omitted. (If a client application wishes to show IRIs to human users, it is up to the client application to convert the URI into an IRI.)

If the url attribute is absent on the message element but present on the root element, the message is considered to be associated with the resource designated by the attribute on the root element.
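
The fallback rule above can be sketched as a small helper. The function name message_url is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def message_url(root, message):
    """Return the url of the message element, falling back to the root element's url."""
    url = message.get("url")
    # Only fall back when the attribute is absent on the message element itself.
    return url if url is not None else root.get("url")
```

A caller would pass the parsed messages element as root and one of its info, error or non-document-error children as message. The result may be the literal string “data:…” or None if neither element carries the attribute.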

The first-line, last-line, first-column and last-column attributes, if present, must each contain a string consisting of characters in the range U+0030 DIGIT ZERO to U+0039 DIGIT NINE which, when interpreted as a base-ten integer, is a positive integer (not zero). The line and column numbers are one-based. The first line is line 1. The first column is column 1. Columns are counted by UTF-16 code units. A line break is considered to occupy the last column on the line it terminates.
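
Because columns are counted in UTF-16 code units, a client whose strings are indexed by code point needs a conversion step before it can highlight a column. The following is a minimal sketch of one way to do that in Python; the function name utf16_column_to_index is hypothetical.

```python
def utf16_column_to_index(line, column):
    """Map a one-based column counted in UTF-16 code units to a zero-based str index."""
    if column < 1:
        raise ValueError("columns are one-based")
    units = 0
    for index, ch in enumerate(line):
        # Characters outside the Basic Multilingual Plane take two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
        if units >= column:
            return index
    raise ValueError("column is past the end of the line")
```

If line includes its terminating line break, that line break is counted like any other character, which matches the rule that it occupies the last column of the line.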

The source lines and columns are approximate. For example, if a message is related to an attribute, the line and column may point to the first character of the start tag, the character after the start tag or to the attribute inside the tag depending on implementation. If a message is related to character data, the line and column may be inaccurate within a run of text e.g. due to buffering.

The last-line attribute indicates the last line (inclusive) onto which the source range associated with the message falls.

The first-line attribute indicates the first line onto which the source range associated with the message falls. If the attribute is missing, it is assumed to have the same value as last-line.

The last-column attribute indicates the last column (inclusive) onto which the source range associated with the message falls on the last line onto which it falls.

The first-column attribute indicates the first column onto which the source range associated with the message falls on the first line onto which it falls.
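
Taken together, the locator rules can be sketched as a helper that reads the four line and column attributes from a message element and applies the first-line default. The names positive_int and source_range are assumptions for this example. Treating impermissible values as absent follows the Processing Model section.

```python
import re
import xml.etree.ElementTree as ET

_DIGITS = re.compile(r"[0-9]+")


def positive_int(value):
    """Return the attribute value as an int, or None if it is absent or not permissible."""
    if value is None or not _DIGITS.fullmatch(value):
        return None
    number = int(value)
    return number if number > 0 else None


def source_range(element):
    """Return (first_line, first_column, last_line, last_column); unknown parts are None."""
    last_line = positive_int(element.get("last-line"))
    first_line = positive_int(element.get("first-line"))
    # A column is only meaningful when the corresponding line attribute is usable.
    last_column = positive_int(element.get("last-column")) if last_line else None
    first_column = positive_int(element.get("first-column")) if first_line else None
    # A missing first-line is assumed to have the same value as last-line.
    if first_line is None:
        first_line = last_line
    return first_line, first_column, last_line, last_column
```

Note that a first-column without a usable first-line is dropped even though first-line itself then defaults to last-line; this is one reading of the rules, not the only possible one.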

The type Attribute

The info, error and non-document-error elements may have an attribute called type for indicating the type of the message in more detail.

The permissible value on the info element is warning, which means that the message seeks to warn the user about a formally conforming but in some way questionable issue. Otherwise, the message is taken to be generally informative.

The permissible value on the error element is fatal, which means that the error is an XML well-formedness error or, in the case of HTML, a condition that the implementor has opted to treat analogously to XML well-formedness errors (e.g. due to usability or performance considerations). Further errors are suppressed after a fatal error. In the absence of the type attribute, the element means a spec violation in general.

Permissible values on the non-document-error element are: io (signifies an input/output error), schema (indicates that initializing a schema-based validator failed) and internal (indicates that the validator/checker found a bug in itself, ran out of memory, etc., but was still able to emit a message). In the absence of the type attribute, the element means a problem external to the document in general.
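
The permissible type values listed above can be captured in a lookup table. The sketch below is illustrative only; the names PERMISSIBLE_TYPES and effective_type are hypothetical. An impermissible value is treated as if the attribute were absent, in line with the Processing Model section.

```python
# Permissible values of the type attribute, keyed by the element's local name.
PERMISSIBLE_TYPES = {
    "info": {"warning"},
    "error": {"fatal"},
    "non-document-error": {"io", "schema", "internal"},
}


def effective_type(local_name, type_value):
    """Return type_value if it is permissible on the given element, otherwise None."""
    if type_value in PERMISSIBLE_TYPES.get(local_name, set()):
        return type_value
    return None
```

A result of None means the element carries its general meaning: a generally informative message, a spec violation in general, or a problem external to the document in general.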

Children of Message Elements

The info, error and non-document-error elements may contain the following optional elements (in this order): message, elaboration (only if message is present as well) and extract.

The message Element

The message element represents a paragraph of text that is the message stated succinctly in natural language. Permissible element content consists of an interleaving of zero or more text nodes, zero or more a elements in the “http://www.w3.org/1999/xhtml” namespace and zero or more code elements in the “http://www.w3.org/1999/xhtml” namespace. The code elements in the “http://www.w3.org/1999/xhtml” namespace may contain text. The a elements in the “http://www.w3.org/1999/xhtml” namespace may contain an interleaving of zero or more text nodes and zero or more code elements in the “http://www.w3.org/1999/xhtml” namespace. The a elements in the “http://www.w3.org/1999/xhtml” namespace must have the attribute href and may have the attribute title.
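
A client that only needs plain text can flatten the content of a message element and collect the link targets separately. The following sketch assumes the element has already been parsed with ElementTree; the function names and the sample markup are made up for this example.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Hypothetical message content; the wording and the link target are invented.
SAMPLE_MESSAGE = (
    '<message xmlns="http://n.validator.nu/messages/" xmlns:h="http://www.w3.org/1999/xhtml">'
    'Element <h:code>foo</h:code> not allowed; see '
    '<h:a href="http://example.org/help">the <h:code>bar</h:code> docs</h:a>.'
    '</message>'
)


def message_text(message):
    """Concatenate all text in the message, dropping the a and code markup."""
    return "".join(message.itertext())


def message_links(message):
    """Return the href values of the XHTML a elements in document order."""
    return [a.get("href") for a in message.iter("{%s}a" % XHTML) if a.get("href")]


element = ET.fromstring(SAMPLE_MESSAGE)
print(message_text(element))
print(message_links(element))
```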

The elaboration Element

The elaboration element provides additional human-readable guidance related to the message. The content model of this element is block level content (elements in the “http://www.w3.org/1999/xhtml” namespace) as defined by HTML 5.

The extract Element

The extract element represents an extract of the document source from around the point in source designated for the message by the locator attributes (first-line, last-line, first-column and last-column) on the message element. The extract element contains an interleaving of zero or more text nodes and exactly one m element. The m element represents a highlighted part of the extract that pinpoints the source position associated with the message. The m element contains the highlighted part of the text. White space is significant in the subtree rooted at extract.
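
One possible way to display an extract in a plain-text client is to surround the highlighted m content with marker strings. This is a sketch only; render_extract and its default markers are assumptions, and the sample extract is invented.

```python
import xml.etree.ElementTree as ET

NS = "http://n.validator.nu/messages/"


def render_extract(extract, open_mark=">>", close_mark="<<"):
    """Return the extract text with the m element's content wrapped in markers."""
    parts = [extract.text or ""]
    for child in extract:
        inner = "".join(child.itertext())
        if child.tag == "{%s}m" % NS:
            parts.append(open_mark + inner + close_mark)
        else:
            # Unrecognized child: keep its text without the surrounding tags.
            parts.append(inner)
        parts.append(child.tail or "")
    return "".join(parts)


sample = ET.fromstring(
    '<extract xmlns="http://n.validator.nu/messages/">&lt;p&gt;<m>&lt;foo&gt;</m>text</extract>'
)
print(render_extract(sample))
```

White space is never stripped here, since it is significant inside extract.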

The source Element

The source element represents the source of the checked document as decoded to Unicode with XML-unsafe characters replaced with U+FFFD REPLACEMENT CHARACTER and with line breaks replaced with U+000A LINE FEED. The element may contain text that is the source. White space is significant in the content.

The element has two optional attributes: type and encoding. The type attribute represents the media type of the input without parameters. The encoding attribute represents the charset media type parameter.

The parse-tree Element

The parse-tree element contains the information items of the parsed infoset that are the children of the document information item (recursively) encoded as follows:

Comment information items are not represented. Processing instruction information items are represented as element pi with the target in attribute target and data in content. Elements are represented as elements and attributes as attributes, but each namespace ns is substituted with a namespace http://n.validator.nu/?ns=escaped, where escaped is the URI (percent) escaped representation of the UTF-8 representation of ns.
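
The namespace substitution can be sketched with the standard library's percent-encoding functions. Exactly which characters get escaped is not stated above, so escaping everything except unreserved characters is an assumption of this example, as are the function names.

```python
from urllib.parse import quote, unquote

PREFIX = "http://n.validator.nu/?ns="


def substitute_namespace(ns):
    """Return the substituted namespace for ns (UTF-8, then percent-escaped)."""
    # Assumption: escape every character except letters, digits and "_.-~".
    return PREFIX + quote(ns, safe="", encoding="utf-8")


def original_namespace(substituted):
    """Recover the original namespace from a substituted one."""
    if not substituted.startswith(PREFIX):
        raise ValueError("not a substituted namespace")
    return unquote(substituted[len(PREFIX):], encoding="utf-8")


print(substitute_namespace("http://www.w3.org/1999/xhtml"))
```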

  • The content of this element should probably be in SDF instead as suggested by zcorpan. hsivonen 14:59, 11 September 2007 (UTC)

Processing Model

Clients that consume the message format are referred to as processors. They must use a conforming XML 1.0 processor to parse the format.

If the root element is not an element named messages, the document is deemed to be in an unknown format and not processable according to this processing model.

If a processor encounters an element that it doesn’t recognize, it must process the content of the element as if the start tag and the end tag of the element were not there. If the processor encounters character data as a child of the root element or of a message element (after applying the rule stated in the previous sentence), it must act as if the character data was not there. If a processor encounters an attribute that it does not recognize, it must ignore the entire attribute. If a processor encounters an attribute that it does recognize but the value of the attribute is not permissible under the previous section, the processor must ignore the entire attribute. If an info, error or non-document-error element does not have a last-line attribute with a permissible value, a last-column attribute on the element must be ignored if present. If an info, error or non-document-error element does not have a first-line attribute with a permissible value, a first-column attribute on the element must be ignored if present.
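
The rule about unrecognized elements amounts to unwrapping them. The sketch below shows one way to do that for the children of the root element with ElementTree; the names KNOWN_TAGS and effective_children are hypothetical, and the set of recognized elements is taken from the Structure and Semantics section.

```python
import xml.etree.ElementTree as ET

NS = "http://n.validator.nu/messages/"
KNOWN_TAGS = {
    "{%s}%s" % (NS, name)
    for name in ("info", "error", "non-document-error", "source", "parse-tree")
}


def effective_children(parent):
    """Yield recognized children as if unrecognized wrapper elements were not there."""
    for child in parent:
        if child.tag in KNOWN_TAGS:
            yield child
        else:
            # Unrecognized element: process its content in place of the element.
            yield from effective_children(child)


doc = ET.fromstring(
    '<messages xmlns="http://n.validator.nu/messages/">'
    '<future-wrapper><error/></future-wrapper>stray text<info/></messages>'
)
print([child.tag.split("}", 1)[-1] for child in effective_children(doc)])
```

Character data between elements is never yielded, so it is ignored as required.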

Processors must process elements in a way that is consistent with the semantics of the elements.

Determining Outcome

The outcome of the validation process may be success, failure or indeterminate.

  1. If there are one or more non-document-error elements, the outcome is indeterminate.
  2. Else if there are one or more error elements, the outcome is failure.
  3. Else the outcome is success.
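
The three steps above can be sketched as follows. The function name outcome is an assumption. For brevity the sketch only looks at direct children of the root element, so it does not apply the unwrapping rule for unrecognized elements.

```python
import xml.etree.ElementTree as ET

NS = "http://n.validator.nu/messages/"


def outcome(root):
    """Return 'indeterminate', 'failure' or 'success' for a parsed messages element."""
    tags = {child.tag for child in root}
    if "{%s}non-document-error" % NS in tags:
        return "indeterminate"
    if "{%s}error" % NS in tags:
        return "failure"
    return "success"
```

A streaming client cannot report success or failure until it has seen the end of the message elements, because a non-document-error may follow any number of error elements.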

Prior Art

The W3C has defined three XML output formats for the W3C Validator: the SOAP format, the Unicorn format and EARL. Relaxed has an XML format, but I’m not aware of a spec for it.

I think there are two problems with the SOAP and Unicorn formats: they are unnecessarily complex and they don’t support streaming output. For example, they require a redundant declaration of the number of errors before the errors themselves (which a client could count on its own if it wants to know the number).

The EARL format assumes that each testable condition has a well-known URI, which does not fit with grammar-based validation and not even with vanilla Schematron.

The W3C Validator also provides simple pass/fail information as HTTP headers, which is nice if you only care about a boolean pass/fail. However, this approach also has the problem that it precludes streaming, because the validation process has to finish before the HTTP headers can be written.

For these reasons, I am not particularly keen on reusing the output formats of the W3C Validator unless it turns out that there are significant network benefits to be reaped from plugging into an existing network of client software. It seems to me that there isn’t a significant network of existing client software.

See also