Talk:MetaExtensions

From WHATWG Wiki

"description" meta name

I think the description name should be added to the HTML 5 specification. Yes, search engines have made the keywords name obsolete. However, search engines are not that good. It is still only the document author who can provide a reliable, short description of the document's contents. I think there should be constraints on the description's length, content, and structure. Short sentences and plain English descriptions would be best.

It looks like "description" is in the latest specification. Rfc2549 01:46, 10 October 2008 (UTC)
Keywords still work. Descriptions are a good idea. However, HTML5 should not constrain search engines by specifying short sentences or any particular structure. HTML5 and this MetaExtensions Wiki should be permissive, especially for websites that depend on specialized or foreign search engines that may have other rank-driving preferences for meta tags. Rather, page authors should consider short sentences and the plain language you suggest because when they appear in search results they attract visitors. However, a site meant for physicians might have a different idea on what language is plain for their readers, so HTML can't usefully define that. Advice on good drafting is the province of search engines and various websites that report or advise. Nick 08:45, 20 April 2009 (UTC)
The HTML "title" meta name is not validating in our department's webpages, even though it is included as a synonym under dcterms.title. We have no trouble with the HTML "description", which is also listed as a synonym under dcterms.description and validates properly. Does anyone know why our HTML "title" gets an error message? Thanks.
The HTML "title" referred to in the text is not an extension of the <meta> element, but instead the proper <title> HTML tag. According to the HTML specification, the title element «represents the document's title or name. Authors should use titles that identify their documents even when they are used out of context». That's why it serves the same purpose as <meta name="dcterms.title" /> and belongs to the metadata content category. (--Andy sky (talk) 14:59, 11 September 2013 (UTC))
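
As a rough illustration of the distinction drawn in the reply above, the sketch below places the title element next to a Dublin Core meta tag. The title text, the description text, and the schema.DCTERMS link (the usual Dublin Core way of declaring the prefix) are assumptions for illustration, not markup taken from the department's pages.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <!-- The document title is an element of its own, not a meta name -->
  <title>Department of Examples: Staff Directory</title>
  <!-- Optional Dublin Core counterpart; the link states what the DCTERMS prefix refers to -->
  <link rel="schema.DCTERMS" href="http://purl.org/dc/terms/">
  <meta name="dcterms.title" content="Department of Examples: Staff Directory">
  <!-- "description" is a meta name in its own right -->
  <meta name="description" content="Contact details for staff in the Department of Examples.">
</head>
<body>
  <p>Page content goes here.</p>
</body>
</html>
```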

keywords and description should not be unendorsed, should they?

Why are keywords and description unendorsed? Yahoo and Google do use them for searches, if not as much as when they began offering search services. And there are other search engines around the world, which might well support both of these. The unendorsing of cache is explained; what are the explanations for the other two? I see keywords, but not description, was unendorsed from the beginning, which suggests a misclassification by the original proposer. I think someone should reclassify both as proposals. Would it be okay if I did? Nick 08:55, 20 April 2009 (UTC)

On description: resolved. Ian Hickson reports that it's in HTML5, and it is, in section 4.2.5.1 (I had forgotten). On keywords: he didn't mention them in his reply but moved them within the Wiki from unendorsed to failed, and I've submitted a bug report for reconsideration of that decision: W3C Bug 6853. Nick 09:35, 29 April 2009 (UTC)

rights: why reversion

I'm reverting for these reasons:

The revision's added content was largely redundant with the link element's rel="license" that, per <http://wiki.whatwg.org/wiki/RelExtensions>, is already in HTML5 (see section 6.12.3.9 of the W3C Working Draft of 8-25-09). The link uses URLs instead of standardized strings, but those URLs can be used the same way as the proposed strings, i.e., a search engine or UA can recognize a URL as equivalent to a string without repeatedly fetching it. And the set of URLs is essentially extensible without requiring registration here, which simplifies the tag's use.

The proposed string "coprYYYY" is probably not legally sufficient notice even in the U.S., and around the world what that notice must state may vary. It's too much for us to list all the possibilities. Let the page author write such notice as they see fit and let the search engines store or refer to it as they see fit without abbreviating it.

I doubt the proposed code/content distinction can be legally defined in this way and the distinction recognized in a court of law applying copyright law. Most judges and lawyers probably don't know how Internet standards are promulgated. For example, arrangement is copyrightable and a judge might conclude that in copyright law code is one thing and arrangement of the code is another. I haven't researched case law on point but doing so is pointless when copyright law is in many nations that each have their own laws. My proposal is more flexible and thus better able to meet page authors' needs.

The legal needs require a way of writing arbitrary or free-form text (not arbitrary as lawyers use that term).

The revision is more complicated to implement.

The revision doesn't cover multiple licensing for one work. The HTML5 working draft carries an assertion of two licenses. So does Perl, the language. When two licenses are said to apply, a relationship has to be defined, perhaps that the user can choose which to apply, perhaps one applies for commercial use and the other for noncommercial use, or whatever. The revision apparently required multiple meta tags but didn't propose that. If a search engine might take multiple meta rights tags as conflicting, things can get confusing and legal rights may be lost.

The revision doesn't cover multiple licensing for multiple works on one page. Free-form text in the meta tag would allow that. For example, "The photograph may be available from the Permissions Department at . . .; the text is licensed under . . .; the arrangement is the property of . . .; and the music is licensed through BMI."

Thank you for the character entity edit.

Perhaps another meta keyword should be proposed, perhaps rights-standard, to use what was proposed? Would it serve a purpose that link wouldn't?

Thanks. Nick 07:58, 28 October 2009 (UTC)
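
For readers unfamiliar with the link element mentioned in the first reason above, a minimal sketch of rel="license" follows. The license URL is only an example; any URL a search engine or UA can recognize would serve the same role.

```html
<head>
  <title>Example page</title>
  <!-- Machine-readable license pointer: a crawler can match the URL itself
       without fetching it each time -->
  <link rel="license" href="https://creativecommons.org/licenses/by/4.0/">
</head>
```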

I see what you're saying -- my point of view is that it would be beneficial to have a machine-readable/parsable value for use in searching, cataloging, etc. Free-form text makes this very difficult, if not impossible, to apply to any degree. If the page author is providing a legal notice of copyright, then they must provide the country-appropriate copyright notice visible to the page viewer (in the content itself). Copyright Notice, Deposit, and Registration, 17 U.S.C. ch. 4, § 401 (2007)
My fear is that without case law to guide us on the enforceability or applicability of this new way of embedding copyright in the code, we are setting up a standard that will not last past the first court case. Am I confused about the rationale behind this tag? What does it accomplish that a notice in the page's viewable content area wouldn't?
>Perhaps another meta keyword should be proposed, perhaps rights-standard, to use what was proposed? I'll do just that -- thank you. BryanH 16:47, 28 October 2009 (UTC)
Notice calculated to be seen by most Web visitors will be visible in the UA's (viz., browser's) window or canvas, so it should be written to be visible there. But code or markup is not seen there. That's seen when a UA user either looks in the source code via the UA or receives it separately and doesn't expose it to the UA. The code will often include not the word "Copyright" or a "C" in a circle but a character entity that mostly only programmers will recognize when it is not interpreted. Without interpretation, "&copy; 2009 Lois Ng" does not meet U.S. legal requirements for a copyright notice.
Therefore, I argue that a copyright notice calculated to be seen by someone looking at source code without benefit of a UA must be written to be visible there and humanly understandable in raw form. I use comments for that purpose, in addition to what's to be visible in the browser's window. But comments have a drawback of being usually not parseable by search engines (except for scripts, which use specially formatted comments). Since search engines copy our content, we need a way of embedding a copyright notice they have some way of identifying as being a copyright notice. But because legal requirements vary around the world, over time, and according to what is potentially subject to copyright, not to mention that some of us feel compelled to be extra cautious (either to capture all sorts of rights or because overclaiming can jeopardize an entire claim), there's no way to settle on a single form or a finite list for a copyright notice. A few forms are far more common than others, but it would take a while to research the form of choice to be used in Malawi next year for a sound recording by Madonna on tour if she's donating her work to a local school she's building (e.g., which nation's law applies). Thus, the string must be free-form.
But that limits what search engines can do with the string. With the meta rights proposal, likely the only parsing a machine can do is to recognize that it is a rights statement because it is in a tag and with a keyword that says so in accordance with the spec (which in turn supports the MetaExtensions page) and to recognize the string's length so that it can choose to either copy and display the string or only refer to it but in neither case truncate or edit it.
If a search engine programmer wants to try more sophisticated textual analysis, they'll succeed some of the time, but this very limited parseability is enough to flag a page as having an assertion of a right, which a human might choose to read, and if the human does not read it and infringes the copyright it'll be harder to pretend they infringed innocently, which is relevant to the remedy a U.S. court may apply.
The U.S. law you linked to on form of notice does not control other nations. While it says "or elsewhere", the government's legal authority outside the U.S. is severely limited regardless of what Congress passes and other jurisdictions may enact their own requirements. There is also, within the U.S., state law on copyright, rarely invoked, generally common law, and generally on work that is not in a fixed form; for which I don't know what the notice requirement, if any, is.
Possibly a court will reject anything we design, but generally courts require us to conform to already-existing law. Federal U.S. courts do not give advisory opinions and declaratory relief is expensive, unusual, and difficult to get, so we need to use the best judgment we can marshal now. If we want the benefits of intellectual property law and when it requires that we as owners and licensees give notice, we need to design a means for doing so without waiting for a court ruling on the specific design. Often pages have no rights assertion inside a page and suitable for code readers, and that's simply a deficiency. Thus, this tag.
This tag also can apply to other intellectual property rights. Software is, in some places, subject to patents and patent rights may be granted by their holder and trademark rights may apply, too.
Thanks for creating a rights-standard proposal. You might find more licenses available, such as, perhaps, the MIT license, the GPL, the LGPL, the FreeBSD license, and who knows what else that might be applied to pages. While some licenses were explicitly designed for texts, others may be applicable to texts even if that wasn't in the license drafters' original intentions.
Nick 03:26, 29 October 2009 (UTC)
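
The three places a notice can live, as described in the comment above, might look like the sketch below. It assumes the proposed keyword is spelled "rights" and reuses the "Lois Ng" example; neither the wording nor the placement is offered as legally sufficient notice.

```html
<!DOCTYPE html>
<!-- Copyright 2009 Lois Ng. Readable by anyone viewing the raw source. -->
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Example page</title>
  <!-- Same notice in a form a crawler can identify as a rights statement;
       the string is free-form and meant to be stored or shown unedited -->
  <meta name="rights" content="Copyright 2009 Lois Ng">
</head>
<body>
  <!-- Renders as the circled C in the browser, but reads as "&copy;" in raw source -->
  <p>&copy; 2009 Lois Ng</p>
</body>
</html>
```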

Based upon the discussion, I realize that trying to be too encompassing is/was too complicated. I've narrowed the scope to only media (i.e., images, video and other objects) and realigned the purpose: to enable search engines/crawlers to know if a page's objects have special rights for cataloging purposes. — BryanH 17:54, 13 July 2011 (UTC)

Is this media rights stuff broadly used or did you make it up just now? If you just made it up, you are duplicating functionality from the work licensing microdata vocabulary. Could you, please, check if the microdata vocab solves your problem and not reinvent yet another syntax if it does? hsivonen 07:25, 14 July 2011 (UTC)

Meta versus content

Why is content being put into the meta tag that only search engines can view? Wouldn't it be more useful to have this information in regular HTML code that is viewable to the end user while at the same time easily findable to search engines?

I will need to think about the details, but here is a start.

There is an HTML tag called "base"

<BASE HREF="http://www.example.com/">

That is the foundation of the website or part thereof of the website. All other parts go to that extension. By creating standard pages that are defined on the page -- Title Page A goes with content page A -- it will make it easier for both search engines and the end reader to find the information.

<TITLEPAGE HREF="titlepage.html">

On the titlepage, that is associated with that specific web page, one would expect to find all of the information that one would traditionally find on a title of a print source similar to a book or magazine.

There could be standard XML notation for this info: <TITLE> <SUBTITLE> <AUTHOR> <EDITOR> <PUBLISHER> <ILLUSTRATOR> <PUBLISH-ADDRESS> <COPYRIGHT> <DDC> (Dewey Decimal Classification) <LCC> (Library of Congress Classification) <ISBN> - maybe start a standard where people could register a website, and that organization verifies that the website's content match the info on the titlepage.

Whatever other names are standard in the industry.

<TOC-COMPACT HREF="toc-compact.html">

It points to the TOC of the site. Similar to a traditional TOC.

<TOC HREF="toc.html">

A more detailed TOC.

<ABOUTAUTHOR HREF="about_author.html">

An about the author page.

<INTRODUCTION HREF="introduction.html">

Information generally found in an introduction. Who is the target audience of the website, what are the goals of the website.

<PREVIOUS HREF="previous.html">

If the website is intended to be read like a book, what is the previous page to read?

<NEXT HREF="next.html">

What is the next web page to read?

keywords and descriptions

Personally, I don't understand why the keywords and descriptions are in the header as opposed to the main content. It is content. All content should be viewable to the end reader without having to look at the source code. Hidden titles, descriptions, and keywords would be the same as subliminal messages on a TV. They are being shown, but the user does not know what they are.

Also, by bringing them out into the open part of a web page, there is less likelihood of abuse.

Maybe things should change so that instead of a search engine just reading the "head" part, they also read the new HTML5 tag "header" and "footer" as well.

Mnewman 17:08, 12 February 2010 (UTC)

> Why is content being put into the meta tag that only search engines can view?
> Wouldn't it be more useful to have this information in regular HTML code that
> is viewable to the end user while at the same time easily findable to search engines?
> . . . . .
> Personally, I don't understand why the keywords and descriptions are in the header
> as opposed to the main content. It is content.
Yes, in general, but there are legitimate exceptions. Google and probably most search engines recommend exactly what you suggest, even recommending that the primary subject be obvious in the lead paragraphs, in headlines marked with h1, h2, and similar elements, and in page titles. However, this also constrains writing to a style suited to search engine algorithmic extractions. Not all writers want to write that way and not all audiences need it; some prefer or demand other writing styles. Longer writings often begin with background that is not the central subject. Scholarship often distinguishes what will not be discussed and search engines can easily misunderstand the presence of a word as indicative of content rather than of noncontent. And the description meta tag serves the specific purpose of supplying the blurb that search engines display in results. Since a page author often knows what would better describe the page in two lines than would a search engine's extractive algorithm, the page author can provide the description meta tag and give searchers a better idea of what they'll find if they click the result for that page.
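
A minimal sketch of the description meta tag discussed above, assuming a made-up page about lichens. Whether a search engine displays the string as its result blurb is up to the search engine.

```html
<head>
  <title>Field Notes on Alpine Lichens</title>
  <!-- Author-written summary that a search engine may show under the result title -->
  <meta name="description"
        content="Identification notes and photographs for lichens found above the tree line, with a key to common genera.">
</head>
```
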
> Hidden titles, descriptions, and keywords would be the same as subliminal messages
> on a TV. They are being shown, but the user does not know what they are.
They're not subliminal; they're not visible at all. The head (I don't think you meant header) is not displayed unless the page author made a mistake or something's wrong with a browser or a computer. Only parts of the body should be visible.
As to subliminalities in the body, including in the header, Google and other search engines discourage the use of on-screen text that's too small or too low-contrast to be humanly readable under normal conditions, and such text does not meet accessibility standards. A website using that kind of styling is either an art site or abusive.
> Maybe things should change so that instead of a search engine
> just reading the "head" part, they also read the new HTML5 tag
> "header" and "footer" as well.
Search engines already read head and body and doubtless read header and footer as part of the body. They probably read everything in a page file, although they may discard parts for their permanent indexes.
> . . . . . On the titlepage, that is associated with that specific web page, one would
> expect to find all of the information that one would traditionally find on a title of
> a print source similar to a book or magazine.
Creating a page to bibliographically describe another page is already supported using rel="" and rev="" (see HTML5 and http://wiki.whatwg.org/wiki/RelExtensions) and, alternatively, the same bibliographic information can be put on the page with the content using various elements. Some of the meta tags proposed are reusing those previously in use under HTML 4.01. Perhaps the Dublin Core system would help, although I've had difficulty implementing it in (X)HTML and I don't know if search engines use DC.
Generally, there's a preference for reusing existing technology rather than inventing a new solution to an already-solved problem, so see if existing elements would already solve the problem you perceive or please identify a problem that has not been solved and then design a solution to fit that.
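
As a sketch of the rel="" approach mentioned above, the proposed ABOUTAUTHOR, PREVIOUS, and NEXT elements could be approximated with existing link types. The file names are borrowed from the proposal. Which rel values, if any, would cover the title page and the tables of contents is left open here.

```html
<head>
  <title>Chapter 2</title>
  <!-- Existing link types standing in for the proposed new elements -->
  <link rel="author" href="about_author.html">
  <link rel="prev" href="previous.html">
  <link rel="next" href="next.html">
</head>
```
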
> There could be standard XML notation for this info . . . .
You can be inspired by XML, but for HTML5 try to use HTML (in a way that's compatible with XHTML) before going outside to XML. XML is very good for intranets (including intranets that can receive and process external submissions) and for industries and endeavors that can maintain their own standards, such as for chemistry, math, and site maps, but I'm not sure XML elements should be reserved for a purpose without a clearly recognizable administrative body to maintain that particular subset of elements in a standard. Since there are various kinds of publishing, several noncomputer standards would likely have to be combined. I don't know if it's worth the work.
The Dewey Decimal system is not of much use since it is copyrighted and permission is required for any nonlibrary use, including website classification.
The Universal system may not have much use. At least, I asked a librarian at a major library if she knew of anyone using it and she didn't. One librarian is not a large sample, so maybe it should be investigated further.
The Library of Congress Subject Headings system is good but I don't know what other nations use.
> maybe start a standard where people could register a website,
> and that organization verifies that the website's content match
> the info on the titlepage.
Sounds expensive. How much are reviewers' salaries and who writes the checks?
Thanks for the thoughts.
Nick 03:14, 18 February 2010 (UTC)

Mnewman's Feb. 12, 2010, deleted comments

I prepared a reply, since I thought they might have been deleted by accident, but they haven't come back, so probably not. If there's interest in these, feel free to indicate while I have the reply. Thanks. Nick 03:30, 18 February 2010 (UTC)

Re: Proposed 'creator' MetaExtension

I really don't see the purpose behind this proposal. If anything, this should be marked as a synonym for the 'author' MetaExtension already defined in HTML if it is found that this particular extension is already seeing use. Otherwise, it should be marked as unendorsed.

--Codeguru413 18:27, 2 February 2011 (UTC)

They're different. The author is the author of the Web page. The creator is the creator of content that was independent of the Web until an author authored a Web page with the content. Thus, if you put Shakespeare's plays onto the Web you're the author (page author) and he was the creator. These are often different people and, when they are, they usually need separate identifications. Nick 03:37, 5 February 2011 (UTC)

Re: Proposed 'format-print' MetaExtension

I question the need for this as well. CSS already provides mechanisms by which to set the page size and such through the @media mechanism. However, it's an interesting proposal in that a value such as this (e.g. A4, Letter, etc.) is easily recognizable and has a meaning that provides values not only to OS's (or, more accurately, UAs), but also end-users. Perhaps this is more something that should be added to CSS as an alternative to explicitly setting the page size? Anyway, I'd be interested to hear more comments on this proposal.

--Codeguru413 18:36, 2 February 2011 (UTC)
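
For comparison, a sketch of the CSS route mentioned above. As far as I understand, the paper size itself is set with the size descriptor of the @page rule from CSS paged media, while @media print scopes ordinary rules to printed output. Support for size varies between browsers, and the margin and the nav selector are illustrative assumptions.

```html
<style>
  /* Request A4 paper for printed output */
  @page {
    size: A4;
    margin: 2cm;
  }

  /* Ordinary rules that apply only when printing */
  @media print {
    nav { display: none; }
  }
</style>
```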

Dublin Core metadata

Dublin Core (DC) metadata is in the list, but it needs work. Only two of the 15 DC elements are listed (although these are listed as DCTERMS, and these should be synonyms). The DC elements have been defined in IETF RFC 5013 [RFC5013], ANSI/NISO Standard Z39.85-2007 [NISOZ3985] and ISO Standard 15836:2009 [ISO15836], so I don't see how they can be left out.

Also, the Dublin Core Administrative Components (AC) are not listed; these are specified at http://biblstandard.dk/ac/. A couple of these AC elements, handling and action, have their own schemes, and I am not sure how to handle these. If these are enumerated then it makes for a lot more keywords; for example, AC.handling becomes AC.handling.harvest, AC.handling.public, AC.handling.manual, AC.handling.keep, and AC.handling.mail.

What do people think about this? Should the full sets of DC elements, DCTERMS, and AC elements be included? Ian Hickson has suggested to me that what is more important than whether there are standards that define them, is whether there is any software that consumes them in a useful manner. I agree, but do not have this information. Although I have used AC and DC elements for years, it does not mean that others have. Martin.leese 03:01, 15 July 2012 (UTC)

Later. For full sets of these, I count there would be 55 DCTERMS (currently 55), 15 DC elements (currently 2), and 31 AC elements (currently zero). The high numbers are because Dublin Core is a comprehensive system for metadata. Given this many keywords, would it be more convenient for them to be in a separate table (or tables)? Martin.leese 20:35, 16 July 2012 (UTC)

Even later. A list of projects that use Dublin Core metadata is maintained here. Note that Dublin Core encourages the use of DCTERMS elements over DC (here), so the DC elements could be omitted. Finally, there is currently no specification that specifies the AC elements as HTML meta keywords, so they will have to wait. Martin.leese 04:04, 20 July 2012 (UTC)

Structured Data proposal

It is worth noting that the Dublin Core initiative started with RDF and it now seems suitable to move towards a Structured Data RDFa implementation. I totally agree with this, because it is much more systematic and syntactically correct. However, it involves several changes in the way DC metadata are expressed. Below is a comparison of the two ways to express them.

Comparison of pure HTML syntax (<meta name="">) and HTML+RDFa syntax (<element property="">):

Namespace declaration
Pure HTML: a <link> sibling element with rel="schema.[prefix]" and href="[namespace URI]"
HTML+RDFa: the prefix attribute in the parent element, with this pattern: prefix="[prefix]: [namespace URI]"

Property declaration
Pure HTML: name attribute
HTML+RDFa: property attribute

Property name
Pure HTML: [prefix]. (dot)
HTML+RDFa: [prefix]: (colon)

Namespace-prefix association
Pure HTML: declared by normative specs. DCTERMS: http://purl.org/dc/terms/; DC: http://purl.org/dc/elements/1.1/
HTML+RDFa: conventional, suggested by common use. DCTERMS: http://purl.org/dc/terms/ (specific); DCE: http://purl.org/dc/elements/1.1/ (specific); DC: both (usually associated with http://purl.org/dc/terms/, the most recent standard)

Requires standardisation as enumerated
Pure HTML: Yes
HTML+RDFa: No, only the Dublin Core definition is required

Suitable elements
Pure HTML: <meta>; <link> elements were never standardised
HTML+RDFa: Any

Value type
Pure HTML: Formally, properties in the DC namespace identify resources, while DCTERMS subproperties allow literals or literal surrogates. In practice, only literals are allowed. Resources defined via URIs would require the use of a <link rel=""> element, but the property names would have to be standardised as rel (enumerated) values.
HTML+RDFa: Literals, literal/non-literal surrogates, resources, URIs, specific datatypes (language, datetime), formatted text (XML only). The RDFa syntax allows a broader variety of datatypes, each specific to the element on which the property is specified.
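
To make the comparison concrete, here is a sketch of the same statement in both syntaxes. The two head fragments are alternatives, not parts of one document, and the creator value is a placeholder. The prefix spellings follow common usage and are assumptions.

```html
<!-- Pure HTML syntax: prefix declared by a <link>, property named with a dot -->
<head>
  <title>Example page</title>
  <link rel="schema.DCTERMS" href="http://purl.org/dc/terms/">
  <meta name="DCTERMS.creator" content="A. N. Author">
</head>

<!-- HTML+RDFa syntax: prefix declared on the parent element, property named with a colon -->
<head prefix="dcterms: http://purl.org/dc/terms/">
  <title>Example page</title>
  <meta property="dcterms:creator" content="A. N. Author">
</head>
```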

Even if this page is dedicated to MetaExtensions, I think that authors should know which is the most modern and useful way to provide metadata. Notify me if there are errors, or if a further discussion is required. --Andy sky (talk) 13:12, 28 October 2013 (UTC)