Talk:MetaExtensions

From WHATWG Wiki
Revision as of 03:14, 18 February 2010 by Nick (talk | contribs) (→‎Meta versus content: Replied.)

"description" meta name

I think the description name should be added to the HTML 5 specification. Yes, search engines have made the keywords name obsolete. However, search engines are not that good. It is still only the document author who can provide a reliable, short description of the document's contents. I think there should be constraints on how long the description is, what it contains, and how it is structured. Short sentences and plain English descriptions would be best.

It looks like "description" is in the latest specification. Rfc2549 01:46, 10 October 2008 (UTC)
Keywords still work. Descriptions are a good idea. However, HTML5 should not constrain search engines by specifying short sentences or any particular structure. HTML5 and this MetaExtensions Wiki should be permissive, especially for websites that depend on specialized or foreign search engines that may have other rank-driving preferences for meta tags. Rather, page authors should consider short sentences and the plain language you suggest because when they appear in search results they attract visitors. However, a site meant for physicians might have a different idea on what language is plain for their readers, so HTML can't usefully define that. Advice on good drafting is the province of search engines and various websites that report or advise. Nick 08:45, 20 April 2009 (UTC)
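To make the point about author-supplied descriptions concrete, here is a minimal sketch of how a crawler might read the description value from a page. It uses only Python's standard html.parser module. The sample page, the class name DescriptionExtractor, and the function extract_description are all made up for illustration; real search engines do not publish how they do this.

```python
from html.parser import HTMLParser


class DescriptionExtractor(HTMLParser):
    """Collects the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # Only look at meta tags, and stop after the first description found.
        if tag != "meta" or self.description is not None:
            return
        attr_map = dict(attrs)
        name = (attr_map.get("name") or "").lower()
        if name == "description":
            self.description = attr_map.get("content")


def extract_description(html_text):
    """Returns the author's description string, or None if the page has none."""
    parser = DescriptionExtractor()
    parser.feed(html_text)
    return parser.description


# Hypothetical page used only for illustration.
SAMPLE_PAGE = """<!DOCTYPE html>
<html>
<head>
<title>Cardiology Notes</title>
<meta name="description" content="Case notes on atrial fibrillation for practicing physicians.">
</head>
<body><p>Background material comes first.</p></body>
</html>"""

print(extract_description(SAMPLE_PAGE))
```

Note that the sketch returns the string untouched. It places no constraint on length or wording, which matches the permissive position argued above.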

keywords and description should not be unendorsed, should they?

Why are keywords and description unendorsed? Yahoo and Google do use them for searches, if not as much as when they began offering search services. And there are other search engines around the world, which might well support both of these. The unendorsing of cache is explained; what are the explanations for the other two? I see keywords, but not description, was unendorsed from the beginning, which suggests a misclassification by the original proposer. I think someone should reclassify both as proposals. Would it be okay if I did? Nick 08:55, 20 April 2009 (UTC)

On description: resolved. Ian Hickson reports that it's in HTML5, and it is, in section 4.2.5.1 (I had forgotten). On keywords: he didn't mention them in his reply but moved them within the Wiki from unendorsed to failed, and I've submitted a bug report for reconsideration of that decision: W3C Bug 6853. Nick 09:35, 29 April 2009 (UTC)

rights: why reversion

I'm reverting for these reasons:

The revision's added content was largely redundant with the link element rel="license" that, per <http://wiki.whatwg.org/wiki/RelExtensions>, is already in HTML5 (see section 6.12.3.9 of the W3C Working Draft of 8-25-09). The link uses URLs instead of standardized strings, but those URLs can be used the same way as the proposed strings, i.e., a search engine or UA can recognize the URLs as equivalent to the strings without repeatedly fetching them. And the set of URLs is essentially extensible without requiring registration here, which simplifies the tag's use.

The proposed string "coprYYYY" is probably not legally sufficient notice even in the U.S., and around the world what that notice must state may vary. It's too much for us to list all the possibilities. Let the page author write such notice as they see fit and let the search engines store or refer to it as they see fit without abbreviating it.

I doubt the proposed code/content distinction can be legally defined in this way or that a court applying copyright law would recognize it. Most judges and lawyers probably don't know how Internet standards are promulgated. For example, arrangement is copyrightable and a judge might conclude that in copyright law code is one thing and arrangement of the code is another. I haven't researched case law on point, but doing so is pointless when copyright is governed by many nations that each have their own laws. My proposal is more flexible and thus better able to meet page authors' needs.

The legal needs require a way of writing arbitrary or free-form text (not arbitrary as lawyers use that term).

The revision is more complicated to implement.

The revision doesn't cover multiple licensing for one work. The HTML5 working draft carries an assertion of two licenses. So does Perl, the language. When two licenses are said to apply, a relationship has to be defined, perhaps that the user can choose which to apply, perhaps one applies for commercial use and the other for noncommercial use, or whatever. The revision apparently required multiple meta tags but didn't propose that. If a search engine might take multiple meta rights tags as conflicting, things can get confusing and legal rights may be lost.

The revision doesn't cover multiple licensing for multiple works on one page. Free-form text in the meta tag would allow that. For example, "The photograph may be available from the Permissions Department at . . .; the text is licensed under . . .; the arrangement is the property of . . .; and the music is licensed through BMI."

Thank you for the character entity edit.

Perhaps another meta keyword should be proposed, perhaps rights-standard, to use what was proposed? Would it serve a purpose that link wouldn't?

Thanks. Nick 07:58, 28 October 2009 (UTC)
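The point above about recognizing license URLs without repeatedly fetching them can be sketched as a simple lookup table. This is only an illustration under stated assumptions: the labels on the right-hand side are hypothetical, the list of URLs is a tiny sample, and the normalization (trimming whitespace and a trailing slash) is a simplification.

```python
# Illustrative table only. The URLs are well-known license pages;
# the short labels are hypothetical and not part of any standard.
KNOWN_LICENSE_URLS = {
    "http://creativecommons.org/licenses/by/3.0/": "cc-by-3.0",
    "http://creativecommons.org/licenses/by-sa/3.0/": "cc-by-sa-3.0",
    "http://www.gnu.org/licenses/gpl.html": "gpl",
}


def normalize(url):
    """Trims whitespace and a trailing slash so trivial variants compare equal."""
    return url.strip().rstrip("/")


# Precompute the normalized form once; no network access is involved.
_NORMALIZED = {normalize(u): label for u, label in KNOWN_LICENSE_URLS.items()}


def classify_license(href):
    """Returns a label for a recognized license URL, or None if unknown."""
    return _NORMALIZED.get(normalize(href))


print(classify_license("http://creativecommons.org/licenses/by/3.0"))
print(classify_license("http://example.com/my-own-terms"))
```

An unrecognized URL simply yields None, so a page author can point at any license page without registering it anywhere, which is the extensibility advantage described above.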

I see what you're saying -- my point of view would be that it would be beneficial to have a machine-readable/parsable value for use in searching, cataloging, etc. Free-form text makes this very difficult, if not impossible, to apply to any degree. If the page author is providing a legal notice of copyright, then they must provide the country-appropriate copyright notice visible to the page viewer (in the content itself). Copyright Notice, Deposit, and Registration, 17 U.S.C. § 401 (2007)
My fear is that without caselaw to guide us on the enforceability or applicability of this new way of embedding copyright in the code, we are setting up a standard that will not last past the first court case. Am I confused about the rationale behind this tag? What does it accomplish that a notice in the page's viewable content area wouldn't?
>Perhaps another meta keyword should be proposed, perhaps rights-standard, to use what was proposed? I'll do just that -- thank you. BryanH 16:47, 28 October 2009 (UTC)
Notice calculated to be seen by most Web visitors will be visible in the UA's (viz., browser's) window or canvas, so it should be written to be visible there. But code or markup is not seen there. That's seen when a UA user either looks in the source code via the UA or receives it separately and doesn't expose it to the UA. The code will often include not the word "Copyright" or a "C" in a circle but a character entity that mostly only programmers will recognize when it is not interpreted. Without interpretation, "&copy; 2009 Lois Ng" does not meet U.S. legal requirements for a copyright notice.
Therefore, I argue that a copyright notice calculated to be seen by someone looking at source code without benefit of a UA must be written to be visible there and humanly understandable in raw form. I use comments for that purpose, in addition to what's to be visible in the browser's window. But comments have a drawback of being usually not parseable by search engines (except for scripts, which use specially formatted comments). Since search engines copy our content, we need a way of embedding a copyright notice they have some way of identifying as being a copyright notice. But because legal requirements vary around the world, over time, and according to what is potentially subject to copyright, not to mention that some of us feel compelled to be extra cautious (either to capture all sorts of rights or because overclaiming can jeopardize an entire claim), there's no way to settle on a single form or a finite list for a copyright notice. A few forms are far more common than others, but it would take a while to research the form of choice to be used in Malawi next year for a sound recording by Madonna on tour if she's donating her work to a local school she's building (e.g., which nation's law applies). Thus, the string must be free-form.
But that limits what search engines can do with the string. With the meta rights proposal, likely the only parsing a machine can do is to recognize that it is a rights statement because it is in a tag and with a keyword that says so in accordance with the spec (which in turn supports the MetaExtensions page) and to recognize the string's length so that it can choose to either copy and display the string or only refer to it but in neither case truncate or edit it.
If a search engine programmer wants to try more sophisticated textual analysis, they'll succeed some of the time, but this very limited parseability is enough to flag a page as having an assertion of a right, which a human might choose to read, and if the human does not read it and infringes the copyright it'll be harder to pretend they infringed innocently, which is relevant to the remedy a U.S. court may apply.
The U.S. law you linked to on form of notice does not control other nations. While it says "or elsewhere", the government's legal authority outside the U.S. is severely limited regardless of what Congress passes and other jurisdictions may enact their own requirements. There is also, within the U.S., state law on copyright, rarely invoked, generally common law, and generally on work that is not in a fixed form; for which I don't know what the notice requirement, if any, is.
Possibly a court will reject anything we design, but generally courts require us to conform to already-existing law. Federal U.S. courts do not give advisory opinions and declaratory relief is expensive, unusual, and difficult to get, so we need to use the best judgment we can marshal now. If we want the benefits of intellectual property law and when it requires that we as owners and licensees give notice, we need to design a means for doing so without waiting for a court ruling on the specific design. Often pages have no rights assertion inside a page and suitable for code readers, and that's simply a deficiency. Thus, this tag.
This tag also can apply to other intellectual property rights. Software is, in some places, subject to patents and patent rights may be granted by their holder and trademark rights may apply, too.
Thanks for creating a rights-standard proposal. You might find more licenses available, such as, perhaps, the MIT license, the GPL, the LGPL, the FreeBSD license, and who knows what else that might be applied to pages. While some licenses were explicitly designed for texts, others may be applicable to texts even if that wasn't in the license drafters' original intentions.
Nick 03:26, 29 October 2009 (UTC)
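The limited parsing described above (recognize that a rights statement exists, note its length, then either display it whole or only refer to it, never truncating) might be sketched as follows. This assumes the proposed meta name "rights"; the display_limit threshold, the class and function names, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class RightsCollector(HTMLParser):
    """Gathers the content of every <meta name="rights"> tag, unedited."""

    def __init__(self):
        super().__init__()
        self.statements = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "rights":
            content = attr_map.get("content")
            if content:
                self.statements.append(content)


def summarize_rights(html_text, display_limit=200):
    """Flags each rights statement as 'display' or 'refer' by its length.

    display_limit is an arbitrary, hypothetical threshold. The text itself
    is never truncated or rewritten; only the handling decision changes.
    """
    collector = RightsCollector()
    collector.feed(html_text)
    results = []
    for text in collector.statements:
        action = "display" if len(text) <= display_limit else "refer"
        results.append({"text": text, "length": len(text), "action": action})
    return results


# Hypothetical page used only for illustration.
SAMPLE_PAGE = (
    '<html><head>'
    '<meta name="rights" content="Copyright 2009 Lois Ng. All rights reserved.">'
    '</head><body></body></html>'
)

for entry in summarize_rights(SAMPLE_PAGE):
    print(entry["action"], entry["length"], entry["text"])
```

Because each statement is kept separately and whole, a page carrying several rights tags is reported as several assertions rather than being merged or treated as a conflict. How to relate multiple assertions is left to the human reader.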

Meta versus content

Why is content being put into the meta tag that only search engines can view? Wouldn't it be more useful to have this information in regular HTML code that is viewable to the end user while at the same time easily findable to search engines?

I will need to think about the details, but here is a start.

There is an HTML tag called "base"

<BASE HREF="http://www.example.com/">

That is the foundation of the website, or of part of it. All other parts resolve relative to that base. Creating standard pages that are declared on the page itself -- Title Page A goes with content page A -- would make it easier for both search engines and the end reader to find the information.

<TITLEPAGE HREF="titlepage.html">

On the titlepage, that is associated with that specific web page, one would expect to find all of the information that one would traditionally find on a title of a print source similar to a book or magazine.

There could be standard XML notation for this info: <TITLE> <SUBTITLE> <AUTHOR> <EDITOR> <PUBLISHER> <ILLUSTRATOR> <PUBLISH-ADDRESS> <COPYRIGHT> <DDC> (Dewey Decimal Classification) <LLC> (Library of Congress Classification) <ISBN> - maybe start a standard where people could register a website, and that organization verifies that the website's content matches the info on the titlepage.

Whatever other names are standard in the industry.

<TOC-COMPACT HREF="toc-compact.html">

It points to the TOC of the site. Similar to a traditional TOC.

<TOC HREF="toc.html">

A more detailed TOC.

<ABOUTAUTHOR HREF="about_author.html">

An about the author page.

<INTRODUCTION HREF="introduction.html">

Information generally found in an introduction. Who is the target audience of the website, what are the goals of the website.

<PREVIOUS HREF="previous.html">

If the website is intended to be read like a book, what is the previous page to read?

<NEXT HREF="next.html">

What is the next web page to read?

keywords and descriptions

Personally, I don't understand why the keywords and descriptions are in the header as opposed to the main content. It is content. All content should be viewable to the end reader without having to look at the source code. Hidden titles, descriptions, and keywords would be the same as subliminal messages on a TV. They are being shown, but the user does not know what they are.

Also, by bringing them out into the open part of a web page, there is less likelihood of abuse.

Maybe things should change so that instead of a search engine just reading the "head" part, they also read the new HTML5 tag "header" and "footer" as well.

Mnewman 17:08, 12 February 2010 (UTC)

> Why is content being put into the meta tag that only search engines can view?
> Wouldn't it be more useful to have this information in regular HTML code that
> is viewable to the end user while at the same time easily findable to search engines?
> . . . . .
> Personally, I don't understand why the keywords and descriptions are in the header
> as opposed to the main content. It is content.
Yes, in general, but there are legitimate exceptions. Google and probably most search engines recommend exactly what you suggest, even recommending that the primary subject be obvious in the lead paragraphs, in headlines marked with h1, h2, and similar elements, and in page titles. However, this also constrains writing to a style suited to search engine algorithmic extractions. Not all writers want to write that way and not all audiences need it; some prefer or demand other writing styles. Longer writings often begin with background that is not the central subject. Scholarship often distinguishes what will not be discussed and search engines can easily misunderstand the presence of a word as indicative of content rather than of noncontent. And the description meta tag serves the specific purpose of supplying the blurb that search engines display in results. Since a page author often knows what would better describe the page in two lines than would a search engine's extractive algorithm, the page author can provide the description meta tag and give searchers a better idea of what they'll find if they click the result for that page.
> Hidden titles, descriptions, and keywords would be the same as subliminal messages
> on a TV. They are being shown, but the user does not know what they are.
They're not subliminal; they're not visible at all. The head (I don't think you meant header) is not displayed unless the page author made a mistake or something's wrong with a browser or a computer. Only parts of the body should be visible.
As to subliminalities in the body, including in the header, Google and other search engines discourage the use of on-screen text that's too small or too low-contrast to be humanly readable under normal conditions, and such text does not meet accessibility standards. A website using that kind of styling is either an art site or abusive.
> Maybe things should change so that instead of a search engine
> just reading the "head" part, they also read the new HTML5 tag
> "header" and "footer" as well.
Search engines already read head and body and doubtless read header and footer as part of the body. They probably read everything in a page file, although they may discard parts for their permanent indexes.
> . . . . . On the titlepage, that is associated with that specific web page, one would
> expect to find all of the information that one would traditionally find on a title of
> a print source similar to a book or magazine.
Creating a page to bibliographically describe another page is already supported using rel="" and rev="" (see HTML5 and http://wiki.whatwg.org/wiki/RelExtensions) and, alternatively, the same bibliographic information can be put on the page with the content using various elements. Some of the meta tags proposed are reusing those previously in use under HTML 4.01. Perhaps the Dublin Core system would help, although I've had difficulty implementing it in (X)HTML and I don't know if search engines use DC.
Generally, there's a preference for reusing existing technology rather than inventing a new solution to an already-solved problem, so see if existing elements would already solve the problem you perceive or please identify a problem that has not been solved and then design a solution to fit that.
> There could be standard XML notation for this info . . . .
You can be inspired by XML, but for HTML5 try to use HTML (in a way that's compatible with XHTML) before going outside to XML. XML is very good for intranets (including intranets that can receive and process external submissions) and for industries and endeavors that can maintain their own standards, such as for chemistry, math, and site maps, but I'm not sure XML elements should be reserved for a purpose without a clearly recognizable administrative body to maintain that particular subset of elements in a standard. Since there are various kinds of publishing, several noncomputer standards would likely have to be combined. I don't know if it's worth the work.
The Dewey Decimal system is not of much use since it is copyrighted and permission is required for any nonlibrary use, including website classification.
The Universal system may not have much use. At least, I asked a librarian at a major library if she knew of anyone using it and she didn't. One librarian is not a large sample, so maybe it should be investigated further.
The Library of Congress Subject Headings system is good but I don't know what other nations use.
> maybe start a standard where people could register a website,
> and that organization verifies that the website's content match
> the info on the titlepage.
Sounds expensive. How much are reviewers' salaries and who writes the checks?
Thanks for the thoughts.
Nick 03:14, 18 February 2010 (UTC)
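As a rough illustration of the reply's point that rel="" already covers the page-to-page relationships proposed in this section, the sketch below collects link elements by their rel values. The rel values next, prev, and author are existing HTML link types; the file names are reused from the proposal above, and the class name, function name, and sample page are hypothetical.

```python
from html.parser import HTMLParser


class LinkRelCollector(HTMLParser):
    """Groups the href of each <link> element under each of its rel values."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # rel may hold several space-separated values.
        for rel in (attr_map.get("rel") or "").lower().split():
            self.links.setdefault(rel, []).append(href)


def collect_links(html_text):
    """Returns a dict mapping each rel value to the list of hrefs using it."""
    collector = LinkRelCollector()
    collector.feed(html_text)
    return collector.links


# Hypothetical page used only for illustration.
SAMPLE_PAGE = """<html>
<head>
<link rel="prev" href="previous.html">
<link rel="next" href="next.html">
<link rel="author" href="about_author.html">
</head>
<body></body>
</html>"""

for rel, hrefs in sorted(collect_links(SAMPLE_PAGE).items()):
    print(rel, hrefs)
```

The same loop works for any rel value, so a newly registered relationship would need no new element name and no change to the parser.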