Validator.nu XML Output
Goal
The native XML output format for Validator.nu. This format should be able to expose everything there is to expose in Validator.nu results. (Other XML formats may not fit Validator.nu exactly.)
Namespaces
The elements in this XML vocabulary are in the namespace “http://n.validator.nu/messages/”. This vocabulary reuses elements from the “http://www.w3.org/1999/xhtml” namespace for human-readable messages. The semantics for the elements in the “http://www.w3.org/1999/xhtml” namespace are defined in HTML 5.
- Perhaps the namespace URI should be a data: URI. If the ns URI does not contain any domain name, it cannot contain a domain name that someone is uncomfortable with. hsivonen 14:24, 18 December 2006 (UTC)
The attributes in this XML vocabulary are not in a namespace. The attribute values defined for this XML vocabulary must not have preceding or trailing white space.
Note: The format has been designed to support streaming generation and consumption.
Structure and Semantics
The format consists of an XML 1.0 document that has the element messages as the root element.
The root element may contain zero or more message elements (info, error and non-document-error), followed by exactly one verdict element (success, failure or indeterminate), optionally followed by one source element, optionally followed by one parse-tree element.
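The child order described above can be checked with a short script. This is a minimal sketch: the helper names q and has_expected_shape are hypothetical, and the sample uses empty message elements because the children of message elements are not defined in this section.

```python
import xml.etree.ElementTree as ET

NS = "http://n.validator.nu/messages/"


def q(name):
    """Return the ElementTree-style qualified name for this vocabulary."""
    return "{" + NS + "}" + name


MESSAGES = {q(n) for n in ("info", "error", "non-document-error")}
VERDICTS = {q(n) for n in ("success", "failure", "indeterminate")}

# Illustrative document only; message elements are left empty on purpose.
SAMPLE = """<messages xmlns="http://n.validator.nu/messages/">
  <info/>
  <error/>
  <failure/>
</messages>"""


def has_expected_shape(xml_text):
    """Check the child order: messages*, one verdict, source?, parse-tree?."""
    root = ET.fromstring(xml_text)
    if root.tag != q("messages"):
        return False
    tags = [child.tag for child in root]
    i = 0
    # Zero or more message elements.
    while i < len(tags) and tags[i] in MESSAGES:
        i += 1
    # Exactly one verdict element.
    if i == len(tags) or tags[i] not in VERDICTS:
        return False
    i += 1
    # Optional source, then optional parse-tree.
    for optional in ("source", "parse-tree"):
        if i < len(tags) and tags[i] == q(optional):
            i += 1
    return i == len(tags)
```

This sketch is stricter than the processing model, which tells processors to see through unrecognized elements; it is meant only to illustrate the intended order.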
Message Elements
The element info means an informational message or warning that does not affect the validity of the document being checked. The element error signifies a problem that causes the validation/checking to fail. The element non-document-error signifies an error that causes the checking to end in an indeterminate state because the document being validated could not be examined to the end. Examples of such errors include broken schemas, bugs in the validator and IO errors. (Note that when a schema has parse errors, they are first reported as error elements and then a catch-all non-document-error is also emitted.)
Locator Attributes
The elements info, error and non-document-error have three optional attributes for indicating the context of the message: uri, line and column.
The column attribute must not be present unless the line attribute is present as well.
The uri attribute, if present, must contain the URI (not IRI) of the HTTP resource with which the message is associated or the literal string “data:…” (the last character is U+2026) to signify that the message is associated with a data URI resource but the exact URI has been omitted. (If a client application wishes to show IRIs to human users, it is up to the client application to convert the URI into an IRI.)
The line attribute, if present, must contain a string consisting of characters in the range U+0030 DIGIT ZERO to U+0039 DIGIT NINE which when interpreted as a base-ten integer is a positive integer (not zero). This number means the approximate source text line number associated with the message. The first line is 1.
The column attribute, if present, must contain a string consisting of characters in the range U+0030 DIGIT ZERO to U+0039 DIGIT NINE which when interpreted as a base-ten integer is a positive integer (not zero). This number means the approximate source column number associated with the message on the line indicated by the line attribute. The first character on a line is in column 1.
The source lines and columns are approximate. For example, if a message is related to an attribute, the line and column may point to the first character of the start tag, the character after the start tag or to the attribute inside the tag, depending on the implementation. If a message is related to character data, the line and column may be inaccurate within a run of text, e.g. due to buffering. Furthermore, implementations may count column numbers in terms of UTF-16 code units instead of characters.
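The constraints on line and column can be sketched as a small parsing helper. The function names parse_positive and locator are hypothetical. The uri value is passed through without checking, and a non-permissible value is treated as absent, following the processing model on this page.

```python
import re

# ASCII digits only (U+0030 to U+0039); "\d" would also accept other Unicode digits.
_DIGITS = re.compile(r"[0-9]+")


def parse_positive(value):
    """Return the integer for a permissible line/column value, else None."""
    if value is None or not _DIGITS.fullmatch(value):
        return None  # absent, or contains white space or non-digit characters
    number = int(value)
    return number if number > 0 else None  # zero is not permissible


def locator(attrs):
    """Extract uri, line and column from a mapping of attribute names to values."""
    line = parse_positive(attrs.get("line"))
    # A column is only meaningful when a permissible line is present.
    column = parse_positive(attrs.get("column")) if line is not None else None
    return {"uri": attrs.get("uri"), "line": line, "column": column}
```

For instance, locator({"line": "12", "column": "3"}) keeps both numbers, while a column without a line is dropped.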
The type Attribute
The info, error and non-document-error elements may have an attribute called type for indicating the type of the message in more detail.
The permissible value on the info element is warning, which means that the message seeks to warn the user about a formally conforming but in some way questionable issue. Otherwise, the message is taken to be generally informative.
The permissible value on the error element is fatal, which means that the error is an XML well-formedness error or, in the case of HTML, a condition that the implementor has opted to treat analogously to XML well-formedness errors (e.g. due to usability or performance considerations). Further errors are suppressed after a fatal error. In the absence of the type attribute, the element means a spec violation in general.
Permissible values on the non-document-error element are: io (signifies an input/output error), schema (indicates that initializing a schema-based validator failed) and internal (indicates that the validator/checker found a bug in itself, ran out of memory, etc., but was still able to emit a message). In the absence of the type attribute, the element means a problem external to the document in general.
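The permissible type values per element can be written down as a lookup table. This is only a sketch: the names PERMISSIBLE_TYPES and effective_type are hypothetical, and returning None for a non-permissible value mirrors the rule in the processing model that such an attribute is ignored.

```python
# Permissible values of the type attribute, keyed by element name.
PERMISSIBLE_TYPES = {
    "info": {"warning"},
    "error": {"fatal"},
    "non-document-error": {"io", "schema", "internal"},
}


def effective_type(element_name, type_value):
    """Return type_value if it is permissible on the element, else None.

    None stands for "no type attribute", i.e. the general meaning of the element.
    """
    if type_value in PERMISSIBLE_TYPES.get(element_name, set()):
        return type_value
    return None
```

With this table, effective_type("error", "warning") yields None because warning is only permissible on info.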
Children of Message Elements
Verdict Elements
The source Element
The parse-tree Element
Processing Model
Clients that consume the message format are referred to as processors. They must use a conforming XML 1.0 processor to parse the format.
If the root element is not an element named messages, the document is deemed to be in an unknown format and not processable according to this processing model.
If a processor encounters an element that it doesn’t recognize,
it must process the content of the element as if the start tag and
the end tag of the element were not there. If the processor encounters
character data as a child of the root element (after applying the
rule stated in the previous sentence), it must act as if the
character data was not there. If a processor encounters an attribute
that it does not recognize, it must ignore the entire attribute. If a
processor encounters an attribute that it does recognize but the
value of the attribute is not permissible under the previous section,
the processor must ignore the entire attribute. If an info, error or non-document-error element does not have a line attribute with a permissible value, a column attribute on the element must be ignored if present.
Note: These rules make it possible to add markup for source code dumps, document outlines and parse trees later without breaking clients. Also, they make it possible to introduce e.g. XHTML markup in the human-readable messages.
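The rule for unrecognized elements can be sketched as a generator that treats such elements as transparent. The names KNOWN and effective_children are hypothetical, and the set of recognized names is an assumption limited to the children of the root element listed on this page. Character data is skipped implicitly because only element children are visited.

```python
import xml.etree.ElementTree as ET

NS = "http://n.validator.nu/messages/"

# Assumed set of recognized root-level elements, as qualified names.
KNOWN = {
    "{" + NS + "}" + name
    for name in (
        "info", "error", "non-document-error",
        "success", "failure", "indeterminate",
        "source", "parse-tree",
    )
}


def effective_children(element):
    """Yield recognized children, looking through unrecognized wrapper elements."""
    for child in element:
        if child.tag in KNOWN:
            yield child
        else:
            # Act as if the start and end tags of the unknown element were absent.
            yield from effective_children(child)
```

A processor written this way keeps working if a later version of the format wraps messages in a grouping element it has never seen.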
Processors must process elements in a way that is consistent with the semantics of the elements.
To determine whether the validation/checking succeeded, processors must determine whether the root element has no error element children. If there are no error children, the validation/checking succeeded. Otherwise, it failed.
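The success test above can be sketched in a few lines. The function name validation_succeeded is hypothetical. For brevity the sketch inspects only direct children of the root and does not apply the rule about unrecognized elements.

```python
import xml.etree.ElementTree as ET

NS = "http://n.validator.nu/messages/"


def validation_succeeded(xml_text):
    """Return True if the root messages element has no error element children."""
    root = ET.fromstring(xml_text)
    if root.tag != "{" + NS + "}messages":
        # Unknown format: not processable under this processing model.
        raise ValueError("root element is not 'messages' in the expected namespace")
    # find() with a qualified tag name matches direct children only.
    return root.find("{" + NS + "}error") is None
```

Because the check depends only on the presence of an error child, a streaming consumer could report failure as soon as the first error element arrives.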
Prior Art
The W3C has defined three XML output formats for the W3C Validator: the SOAP format, the Unicorn format and EARL.
I think there are two problems with the SOAP and Unicorn formats: they are unnecessarily complex and they don’t support streaming output. For example, they require a redundant declaration of the number of errors before the errors themselves (which a client could count on its own if it wants to know the number).
The EARL format assumes that each testable condition has a well-known URI, which does not fit with grammar-based validation and not even with vanilla Schematron.
The W3C Validator also provides simple pass/fail information as HTTP headers, which is nice if you only care about a boolean pass/fail. However, this approach also has the problem that it precludes streaming, because the validation process has to finish before the HTTP headers can be written.
For these reasons, I am not particularly keen on reusing the output formats of the W3C Validator unless it turns out that there are significant network benefits to be reaped from plugging into an existing network of client software. It seems to me that there isn’t a significant network of existing client software.