Generic Metadata Mechanisms

From WHATWG Wiki
Latest revision as of 16:13, 10 November 2012

This document is obsolete.

For the current specification, see: HTML Standard: Microdata


There have been some requests for introducing generic metadata mechanisms into HTML5.

To help determine what we would need to add, and whether it is worth adding anything, we have to come to an understanding of what the goals and requirements are of such a proposal.

Please document arguments with links to supporting research or links to other wiki pages detailing the anecdotal evidence for or against particular aspects of the goals and requirements.

Goals

What is the problem we are trying to solve?

See Microdata Problem Descriptions.

Who faces this problem?

Currently a few groups face it. In the future, metadata may become necessary for the average "web consumer" (human or machine) to sort actual information from presentation and structural cruft. In other words, it would be a useful tool for determining the meaning, terms of use, quality and/or authority of a piece of data inside (X)HTML.

Requirements: If we assume that we are going to address this need, what do we need to provide?

Please demonstrate the reasoning behind each requirement, along with examples of how the requirements could be addressed.

Machine-readable

A machine-readable and standardized way to apply semantic properties (metadata) to DOM elements in HTML5 and XHTML5.

Disambiguation

Properties must be capable of being disambiguated when a property name has multiple definitions.

Finding or defining meaning

We should be able to find or define an "authoritative" meaning for an abstract concept like "title" (e.g. book title, job title, person's title, land deed, etc.).
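
One way to picture disambiguation is to qualify each short property name with the vocabulary that defines it. The following is a minimal sketch under that assumption; the vocabulary URLs, the definitions and the helper names `qualify` and `define` are hypothetical and are not taken from any proposal.

```python
# Hypothetical vocabularies: the URLs and definitions are illustrative only.
VOCABULARIES = {
    "http://example.org/vocab/book#": {"title": "The name of a published work"},
    "http://example.org/vocab/job#": {"title": "The name of a position held by a person"},
}


def qualify(vocabulary: str, name: str) -> str:
    """Build a globally unique property identifier from a vocabulary and a short name."""
    return vocabulary + name


def define(qualified_name: str) -> str:
    """Look up the meaning of a qualified property name in the known vocabularies."""
    for vocabulary, terms in VOCABULARIES.items():
        if qualified_name.startswith(vocabulary):
            short_name = qualified_name[len(vocabulary):]
            if short_name in terms:
                return terms[short_name]
    raise KeyError(qualified_name)


# The same short name yields two distinct identifiers with distinct meanings.
book_title = qualify("http://example.org/vocab/book#", "title")
job_title = qualify("http://example.org/vocab/job#", "title")
```

Here the two "title" properties no longer collide, and a tool can ask for the definition of either one.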

Machine-usability

The metadata could be read by UAs and other tools to perform actions that would not be possible without "knowing" what type of thing, quantity, unit or quality an element represents.

DOM consistency

The DOM has to be consistent between the HTML and XHTML representations of the HTML5 specification. If it isn't, then migrating between the two becomes non-trivial, especially for scripting.

Ease of deployment

The syntax has to be something that Web authors can easily deploy. If authors can't deploy this, then it won't get critical mass and won't matter.

One could argue that tools will be used to deploy this, that it'll mostly be used by big sites like Facebook, and that individual authors therefore don't matter. This kind of argument ("the tools will save us") has been repeatedly shown not to work, because in practice the tools have to be hand-authored too, and so the complexity is just moved to other people.

Inlinability

It has to have a way to include it inline, so that it is quicker for non-professional developers to use and adopt. Also, putting metadata in the same location as content could prevent errors in updates or copying.

Abstractability

It has to have a way to abstract it from the HTML, as is done with JS or CSS.

Sustainability

Where possible the proposal should be resistant to temporary or permanent unavailability of an authoritative source (i.e., a vocabulary provider). This could be achieved, for example, through a P2P or DNS-like mechanism, or by not relying on external sources (e.g. in the way that SSL certificates are checked).

Not doing this would lead to failures during temporary outages or overloading of an authoritative source of metadata definitions. Doing it may also make the mechanism more resistant to hostile takeover or shutdown of an authority.

Distributing an authoritative source need not make it less authoritative.
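
As an illustration of resistance to outages, a consumer could keep the last copy of a vocabulary it fetched and fall back to it when the source is unreachable. This is only a sketch of that one idea, not of a P2P or DNS-like mechanism; the class name, the injected `fetch` function and the example URL are assumptions.

```python
from typing import Callable, Dict, Optional

Definitions = Dict[str, str]


class CachingVocabularyResolver:
    """Resolve vocabulary definitions, falling back to a local copy on failure."""

    def __init__(self, fetch: Callable[[str], Definitions]) -> None:
        self._fetch = fetch  # hypothetical network fetcher, injected by the caller
        self._cache: Dict[str, Definitions] = {}

    def resolve(self, vocabulary_url: str) -> Optional[Definitions]:
        try:
            definitions = self._fetch(vocabulary_url)
        except OSError:
            # Source unavailable: use the last known copy, if there is one.
            return self._cache.get(vocabulary_url)
        self._cache[vocabulary_url] = definitions
        return definitions


def make_flaky_fetcher() -> Callable[[str], Definitions]:
    """Simulate a source that answers once and is unreachable afterwards."""
    calls = {"count": 0}

    def fetch(url: str) -> Definitions:
        calls["count"] += 1
        if calls["count"] > 1:
            raise ConnectionError("authoritative source unavailable")
        return {"title": "The name of a published work"}

    return fetch


resolver = CachingVocabularyResolver(make_flaky_fetcher())
first = resolver.resolve("http://example.org/vocab/book#")
second = resolver.resolve("http://example.org/vocab/book#")  # served from the cache
```

A vocabulary that was never fetched successfully still cannot be resolved during an outage, which is why the requirement also mentions not relying on external sources at all.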

Reuse

The proposal should allow metadata and authoritative sources to be reused across elements, pages and sites, because web developers are more likely to use something that does not require repetitively typing the same data.

Multilingual and Multicultural

Not all concepts can be expressed properly in English. A proposal should allow metadata for foreign languages and concepts.

Authority and Security

Since a potential use of metadata appears to be enabling future features of UAs and other tools, it follows that this opens the end-user to additional risks. For example, could a page author or hijacker feed a virus to a tool by falsely claiming it to be another type of data? Could harm be caused when a metadata authority is hijacked by a group to deliberately mislead or blackmail? Could metadata be used for unintended purposes such as spying on or annoying users?

With these risks in mind, should there be standard mechanisms for securing metadata and verifying its source (such as signing certificates, encryption or white/black lists)?
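
To make the question concrete, the sketch below combines two of the mechanisms mentioned: a whitelist of trusted hosts and an integrity check on the metadata payload. It uses a shared-secret HMAC from the Python standard library as a stand-in for certificate-based signing, which it is not; the host name, the key and the function names are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import urlparse

TRUSTED_HOSTS = {"vocab.example.org"}        # hypothetical whitelist
SHARED_KEY = b"demo-key-not-for-production"  # hypothetical shared secret


def sign(payload: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the metadata payload."""
    return hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()


def is_acceptable(source_url: str, payload: bytes, signature: str) -> bool:
    """Accept metadata only from a whitelisted host and with a valid tag."""
    host = urlparse(source_url).hostname
    if host not in TRUSTED_HOSTS:
        return False
    return hmac.compare_digest(sign(payload), signature)


payload = b'{"title": "An Example"}'
tag = sign(payload)
```

A tool would call `is_acceptable` before acting on the metadata, rejecting both unknown sources and payloads altered after signing.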

Related Proposals, Research and Discussions

  • WHATWG Discussions
  • W3C Semantic Web Interest Group (SWIG)
  • W3C SWIG Mailing List Archive
  • GRDDL (Transformations of XHTML to RDF)
  • RDFa vs. CRDF (Cascading RDF Proposal)
  • Embedded RDF Wiki
  • RDF in HTML (Embedded RDF Examples)
  • Wikipedia page on Semantic Web
  • What are Microformats? (microformats.org)
  • Friend of a Friend Project (FOAF)
  • Dublin Core Metadata Initiative (DCMI)

Pre-Existing Software Systems That Demonstrate A Need

  • Operator - A semantic web processor for extracting metadata embedded in all forms of HTML using Microformats and RDFa.
  • Fuzzbot - A semantic web processor for extracting triples from HTML4 and XHTML1.0, 1.1 and 2.0 data sources.
  • Longwell - Longwell is a web-based RDF-powered highly-configurable faceted browser.
  • Piggy Bank - Piggy Bank is a Firefox extension that turns your browser into a mashup platform, by allowing you to extract data from different web sites and mix them together. Piggy Bank also allows you to store this extracted information locally for you to search later and to exchange at need the collected information with others.
  • Solvent - Solvent is a Firefox extension that helps you write screen scrapers for Piggy Bank.
  • Semantic Bank - Semantic Bank is the server companion of Piggy Bank that lets you persist, share and publish data collected by individuals, groups or communities.
  • Crowbar - Crowbar is a web scraping environment based on the use of a server-side headless Mozilla-based browser. Its purpose is to allow running JavaScript scrapers against a DOM to automate web site scraping while avoiding all the syntax normalization issues.
  • Referee - Referee is a program that reads your web server logs, crawls your referrers (the links that point to your pages) and extracts metadata from those pages and from the text around the links that point to your pages.

Proposals

Inline (as multiple attributes)

Multiple new metadata attributes such as in RDFa.

  • Pro: Reasonably simple to add to spec.
  • Con: Dependent on changes to HTML spec for future changes to metadata spec.
  • Con: Would probably require a different syntax for block or external version of same metadata (makes it hard to move).
  • Con: Requires documentation and standardization in the HTML spec rather than through a separate document and standards body.
  • Con: More potential for attribute name collisions with future HTML attributes.
  • Con: Appears to make metadata reuse difficult.
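
The sketch below shows what a consumer of the multiple-attribute approach might look like. It is loosely modeled on RDFa's `property` attribute but is not a conforming RDFa processor: it only handles text placed directly inside the annotated element, and the class name and sample markup are assumptions.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class PropertyExtractor(HTMLParser):
    """Collect (property, text) pairs from elements carrying a `property` attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.pairs: List[Tuple[str, str]] = []
        self._current: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Remember the property named on the element just opened, if any.
        self._current = dict(attrs).get("property")

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self.pairs.append((self._current, data.strip()))
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


markup = (
    '<p>By <span property="dc:creator">Jane Doe</span>, '
    '<span property="dc:title">An Example</span></p>'
)
extractor = PropertyExtractor()
extractor.feed(markup)
extractor.close()
```

After parsing, `extractor.pairs` holds one pair per annotated element, in document order.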

Inline (in a single attribute)

One metadata attribute with complex content (such as the style attribute).

  • Pro: New properties can be added without changing the HTML spec.
  • Pro: Changing properties does not affect the DOM.
  • Pro: The properties are grouped together.
  • Pro: Requirements very similar to style="" and onclick="".
  • Con: Requires new metadata format to be created.
  • Con: Makes it harder to select individual property/value pairs through CSS or DOM scripting. (Might require dedicated APIs... Ugh.)
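
The "new metadata format" that this option requires could resemble the declaration syntax of the style attribute. The parser below is a minimal sketch assuming a hypothetical `name: value; name: value` syntax; it does not handle values that themselves contain semicolons.

```python
from typing import Dict


def parse_metadata_attribute(value: str) -> Dict[str, str]:
    """Parse 'name: value; name: value' declarations into a dictionary."""
    properties: Dict[str, str] = {}
    for declaration in value.split(";"):
        if not declaration.strip():
            continue  # tolerate a trailing semicolon
        name, separator, raw = declaration.partition(":")
        if not separator:
            raise ValueError(f"missing ':' in declaration {declaration!r}")
        properties[name.strip()] = raw.strip()
    return properties


parsed = parse_metadata_attribute("title: An Example; creator: Jane Doe;")
```

The need for a dedicated parsing step like this one is what makes individual property/value pairs harder to reach from CSS or DOM scripting.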

Block or external metadata

Metadata is defined separately from the element and targeted in the manner of CSS or JavaScript.

  • Pro: Does not clutter the HTML.
  • Pro: Gives it more room to develop, as style did once it was abstracted from HTML.
  • Pro: May be easier to import, export, reuse, sign and translate.
  • Pro: May be applied to elements that the author cannot change attributes on (e.g., dynamic, protected or generated content).
  • Pro: Faster where external metadata can be cached.
  • Con: Requires new metadata format to be created.
  • Con: CSS-like targeting or use of class or id to apply metadata adds complexity/indirection.
  • Con: Requires an extra HTTP request.
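
The targeting described for block or external metadata can be sketched as a sheet of rules that are matched against elements, much as a style sheet is. The example assumes a hypothetical sheet format and supports only `#id` and `.class` selectors, a small subset of CSS.

```python
from typing import Dict, List

# Hypothetical external metadata sheet: selector -> properties.
METADATA_SHEET: Dict[str, Dict[str, str]] = {
    ".book": {"type": "Book"},
    "#moby": {"title": "Moby-Dick"},
}


def matches(selector: str, element_id: str, classes: List[str]) -> bool:
    """Match '#id' and '.class' selectors only."""
    if selector.startswith("#"):
        return selector[1:] == element_id
    if selector.startswith("."):
        return selector[1:] in classes
    return False


def metadata_for(element_id: str, classes: List[str]) -> Dict[str, str]:
    """Merge the properties of every rule whose selector matches the element."""
    merged: Dict[str, str] = {}
    for selector, properties in METADATA_SHEET.items():
        if matches(selector, element_id, classes):
            merged.update(properties)
    return merged


result = metadata_for("moby", ["book", "novel"])
```

The element itself carries no metadata attributes here, only an id and classes, which is the indirection listed above as a con.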