Microdata Problem Descriptions: Difference between revisions

From WHATWG Wiki
* Web browsers should be able to help users find information related to the items that the page they are looking at discusses.
* Search engines should be able to determine the contents of pages with more accuracy than today.

== Exposing known data types in a reusable way ==


* Exposing calendar events so that users can add those events to their calendaring systems.

USE CASE: Exposing calendar events so that users can add those events to their calendaring systems.


* Exposing music samples on a page so that a user can listen to all the samples.

SCENARIOS:
* A user visits the Avenue Q site and wants to make a note of when tickets go on sale for the tour's stop in his home town. The site says "October 3rd", so the user clicks this and selects "add to calendar", which causes an entry to be added to his calendar.
* A student is making a timeline of important events in Apple's history. As he reads Wikipedia entries on the topic, he clicks on dates and selects "add to timeline", which causes an entry to be added to his timeline.
* TV guide listings - browsers should be able to expose to the user's tools (e.g. calendar, DVR, TV tuner) the times that a TV show is on.
* Paul sometimes gives talks on various topics, and announces them on his blog. He would like to mark up these announcements with proper scheduling information, so that his readers' software can automatically obtain the scheduling information and add it to their calendar. Importantly, some of the rendered data might be more informal than the machine-readable data required to produce a calendar event. ''Also of importance: Paul may want to annotate his event with a combination of existing vocabularies and a new vocabulary of his own design. (why?)''
* David can use the data in a web page to generate a custom browser UI for adding an event to our calendaring software without using brittle screen-scraping.
* http://livebrum.co.uk/: the author would like people to be able to grab events and event listings from his site and put them on their site with as much information as possible retained. "The fantasy would be that I could provide code that could be cut and pasted into someone else's HTML so the average blogger could re-use and re-share my data."
* User should be able to subscribe to http://livebrum.co.uk/ then sort by date and see the items sorted by event date, not publication date.


* Getting data out of poorly written Web pages, so that the user can find more information about the page's contents.

REQUIREMENTS:
* Should be discoverable.
* Should be compatible with existing calendar systems.
* Should be unlikely to get out of sync with prose on the page.
* Shouldn't require the consumer to write XSLT or server-side code to read the calendar information.
* Machine-readable event data shouldn't be on a separate page than human-readable dates.
* The information should be convertible into a dedicated form (RDF, JSON, XML, iCalendar) in a consistent manner, so that tools that use this information separate from the pages on which it is found have a standard way of conveying the information.
* Should be possible for different parts of an event to be given in different parts of the page. For example, a page with calendar events in columns (with each row giving the time, date, place, etc) should still have unambiguous calendar events parseable from it.
* Should be possible for authors to find out if people are reusing the information on their site.
* Code should not be ugly (e.g. should not be mixed in with markup used mostly for styling).
* There should be "obvious parsing tools for people to actually do anything with the data (other than add an event to a calendar)".
* Solution should not feel "disconnected" from the Web the way that calendar file downloads do.
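
The requirement above that event data be convertible into a dedicated form such as iCalendar could be sketched roughly as follows. This is only an illustration: the property names (<code>summary</code>, <code>start</code>, <code>end</code>, <code>location</code>), the function name and the product id are assumptions, not part of any proposal on this page, and the output omits the UID and DTSTAMP properties and the text escaping that a complete iCalendar producer would need.

```python
from datetime import datetime


def to_icalendar(event: dict) -> str:
    """Serialize a flat dict of event properties as a minimal iCalendar document."""

    def stamp(moment: datetime) -> str:
        # Floating local time in iCalendar's basic format, e.g. 20091003T090000.
        return moment.strftime("%Y%m%dT%H%M%S")

    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//microdata-sketch//EN",  # hypothetical product id
        "BEGIN:VEVENT",
        f"SUMMARY:{event['summary']}",
        f"DTSTART:{stamp(event['start'])}",
    ]
    # Optional properties are emitted only when the page supplied them.
    if "end" in event:
        lines.append(f"DTEND:{stamp(event['end'])}")
    if "location" in event:
        lines.append(f"LOCATION:{event['location']}")
    lines += ["END:VEVENT", "END:VCALENDAR"]
    # iCalendar content lines are terminated by CRLF.
    return "\r\n".join(lines) + "\r\n"
```

Because the conversion is a fixed mapping from property names to iCalendar properties, any tool that extracts the same name-value pairs from a page would produce the same calendar entry.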


* Finding more information about a movie when looking at a page about the movie, when the page contains detailed data about the movie.

----


* Pages should be able to expose nested lists of name-value pairs on a page-by-page basis.

USE CASE: Exposing contact details so that users can add people to their address books or social networking sites.


* It should be possible to define globally-unique names, but the syntax should be optimised for a set of predefined vocabularies.

SCENARIOS:
* Instead of giving a colleague a business card, someone gives their colleague a URL, and that colleague's user agent extracts basic profile information such as the person's name along with references to other people that person knows and adds the information into an address book.
* A scholar and teacher wants other scholars (and potentially students) to be able to easily extract information about who he is to add it to their contact databases.
* Fred copies the name of one of his Facebook friends and pastes it into his OS address book; the contact information is imported automatically.
* Fred copies the name of one of his Facebook friends and pastes it into his Webmail's address book feature; the contact information is imported automatically.
* David can use the data in a web page to generate a custom browser UI for including a person in our address book without using brittle screen-scraping.


* Adding this data to a page should be easy.
REQUIREMENTS:
* A user joining a new social network should be able to identify himself to the new social network in a way that enables the new social network to bootstrap his account from existing published data (e.g. from another social network) rather than having to re-enter it, without the new site having to coordinate with (or know about) the pre-existing site, without the user having to give either site's credentials to the other, and without the new site finding out about relationships that the user has intentionally kept secret. (http://w2spconf.com/2008/papers/s3p2.pdf)
* Data should not need to be duplicated between machine-readable and human-readable forms (i.e. the human-readable form should be machine-readable).
* Shouldn't require the consumer to write XSLT or server-side code to read the contact information.
* Machine-readable contact information shouldn't be on a separate page than human-readable contact information.
* The information should be convertible into a dedicated form (RDF, JSON, XML, vCard) in a consistent manner, so that tools that use this information separate from the pages on which it is found have a standard way of conveying the information.
* Should be possible for different parts of a contact to be given in different parts of the page. For example, a page with contact details for people in columns (with each row giving the name, telephone number, etc) should still have unambiguous grouped contact details parseable from it.
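
As an illustration of the last requirement, here is a minimal sketch of regrouping contact details when a page lays people out in columns and properties in rows. The input shape (a list of property rows) and the function name are assumptions made for this example only; how the rows would be extracted from the markup is left open.

```python
def contacts_from_columns(rows):
    """Regroup per-property rows into one dict per person.

    rows is a list of (property_name, values) pairs, where values holds one
    entry per person column; an empty string means the cell is blank.
    """
    count = len(rows[0][1]) if rows else 0
    contacts = [{} for _ in range(count)]
    for name, values in rows:
        for contact, value in zip(contacts, values):
            if value:
                contact[name] = value
    return contacts
```

Each person's details end up in a single record regardless of how far apart the cells were on the page.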


* The syntax for adding this data should encourage the data to remain accurate when the page is changed.

----


* The syntax should be resilient to intentional copy-and-paste authoring: people copying data into the page from a page that already has data should not have to know about any declarations far from the data.

USE CASE: Allow users to maintain bibliographies or otherwise keep track of sources of quotes or references.


* The syntax should be resilient to unintentional copy-and-paste authoring: people copying markup from the page who do not know about these features should not inadvertently mark up their page with inapplicable data.

SCENARIOS:
* Frank copies a sentence from Wikipedia and pastes it in some word processor: it would be great if the word processor offered to automatically create a bibliographic entry.
* Patrick keeps a list of his scientific publications on his web site. He would like to provide structure within this publications page so that Frank can automatically extract this information and use it to cite Patrick's papers without having to transcribe the bibliographic information.
* A scholar and teacher wants other scholars (and potentially students) to be able to easily extract information about what he has published to add it to their bibliographic applications.
* A scholar and teacher wants to publish scholarly documents or content that includes extensive citations that readers can then automatically extract so that they can find them in their local university library. These citations may be for a wide range of different sources: an interview posted on YouTube, a legal opinion posted on the Supreme Court web site, a press release from the White House.
* A blog, say ''htmlfive.net'', copies content wholesale from another, say ''blog.whatwg.org'' (as permitted and encouraged by the license). The author of the original content would like the reader of the reproduced content to know the provenance of the content. The reader would like to find the original blog post so he can leave comments for the original author.
* Chaals could improve the Opera intranet if he had a mechanism for identifying the original source of various parts of a page, as that would let him contact the original author quickly to report problems or request changes.


* Generic syntax and parsing mechanism for Microformats
REQUIREMENTS:
* Machine-readable bibliographic information shouldn't be on a separate page than human-readable bibliographic information.
* The information should be convertible into a dedicated form (RDF, JSON, XML, BibTeX) in a consistent manner, so that tools that use this information separate from the pages on which it is found have a standard way of conveying the information.


* A standard way to include arbitrary data in a web page and extract it for machine processing, without having to pre-coordinate their data models.

----


USE CASE: Help people searching for content to find content covered by licenses that suit their needs.


SCENARIOS:
* If a user is looking for recipes of pies to reproduce on his blog, he might want to exclude from his results any recipes that are not available under a license allowing non-commercial reproduction.
* Lucy wants to publish her papers online. She includes an abstract of each one in a page, but because they are under different copyright rules, she needs to clarify what the rules are. A harvester such as the Open Access project can actually collect and index some of them with no problem, but may not be allowed to index others. Meanwhile, a human finds it more useful to see the abstracts on a page than have to guess from a bunch of titles whether to look at each abstract.
* There are mapping organisations and data producers and people who take photos, and each may place different policies. Being able to keep that policy information helps people with further mashups avoiding violating a policy. For example, if GreatMaps.com has a public domain policy on their maps, CoolFotos.org has a policy that you can use data other than images for non-commercial purposes, and Johan Ichikawa has a photo there of my brother's café, which he has licensed as "must pay money", then it would be reasonable for me to copy the map and put it in a brochure for the café, but not to copy the data and photo from CoolFotos. On the other hand, if I am producing a non-commercial guide to cafés in Melbourne, I can add the map and the location of the cafe photo, but not the photo itself.
* Tara runs a video sharing web site for people who want licensing information to be included with their videos. When Paul wants to blog about a video, he can paste a fragment of HTML provided by Tara directly into his blog. The video is then available inline in his blog, along with any licensing information about the video.
* Fred's browser can tell him what license a particular video on a site he is reading has been released under, and advise him on what the associated permissions and restrictions are (can he redistribute this work for commercial purposes, can he distribute a modified version of this work, how should he assign credit to the original author, what jurisdiction the license assumes, whether the license allows the work to be embedded into a work that uses content under various other licenses, etc).
* Flickr has images that are CC-licensed, but the pages themselves are not.
* Blogs may wish to reuse CC-licensed images without licensing the whole blog as CC, but while still including attribution and license information (which may be required by the licenses in question).

> Site owners want a way to provide enhanced search results to the
> engines, so that an entry in the search results page is more than just
> a bare link and snippet of text, and provides additional resources for
> users straight on the search page without them having to click into
> the page and discover those resources themselves.
>
> For example (taken directly from the SearchMonkey docs), yelp.com may
> want to provide additional information on restaurants they have
> reviews for, pushing info on price, rating, and phone number directly
> into the search results, along with links straight to their reviews or
> photos of the restaurant.
>
> Different sites will have vastly different needs and requirements in
> this regard, preventing natural discovery by crawlers from being
> effective.
>
> (SearchMonkey itself relies on the user registering an add-in on their
> Yahoo account, so spammers can't exploit this - the user has to
> proactively decide they want additional information from a site to
> show up in their results, then they click a link and the rest is
> automagical.)


REQUIREMENTS:
* Content on a page might be covered by a different license than other content on the same page.
* When licensing a subpart of the page, existing implementations must not just assume that the license applies to the whole page rather than just part of it.
* License proliferation should be discouraged.
* License information should be able to survive from one site to another as the data is transferred.
* Expressing copyright licensing terms should be easy for content creators, publishers, and redistributors to provide.
* It should be more convenient for the users (and tools) to find and evaluate copyright statements and licenses than it is today.
* Shouldn't require the consumer to write XSLT or server-side code to process the license information.
* Machine-readable licensing information shouldn't be on a separate page than human-readable licensing information.
* There should not be ambiguous legal implications.
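
The first two requirements above (a license may cover only part of a page, and must not be assumed to cover all of it) can be illustrated with a small scoping sketch. The rule used here, "the nearest enclosing license wins", is an assumption chosen for illustration and is not prescribed by this page; the <code>Node</code> type and the example license string are likewise hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """A part of a page (the page itself, a section, an image, ...)."""
    name: str
    license: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        # Attach a child part and remember its parent for upward lookups.
        child.parent = self
        self.children.append(child)
        return child


def effective_license(node: Optional[Node]) -> Optional[str]:
    """Return the license of the nearest enclosing part, or None if unlicensed."""
    while node is not None:
        if node.license is not None:
            return node.license
        node = node.parent
    return None
```

Under this rule a license stated on a recipe applies to a photo inside the recipe, but says nothing about a sibling sidebar or about the page as a whole, which matches the Flickr scenario of licensed images on unlicensed pages.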


Using SearchMonkey, developers and site owners can use structured data to make Yahoo! Search results more useful and visually appealing, and drive more relevant traffic to their sites.

== Annotations ==


USE CASE: Annotate structured data that HTML has no semantics for, and which nobody has annotated before, and may never again, for private use or use in a small self-contained community.


SCENARIOS:
* A group of users want to mark up their iguana collections so that they can write a script that collates all their collections and presents them in a uniform fashion.
* A scholar and teacher wants other scholars (and potentially students) to be able to easily extract information about what he teaches to add it to their custom applications.
* The list of specifications produced by W3C, for example, and various lists of translations, are produced by scraping source pages and outputting the result. This is brittle. It would be easier if the data was unambiguously obtainable from the source pages. This is a custom set of properties, specific to this community.
* Chaals wants to make a list of the people who have translated W3C specifications or other documents, and then use this to search for people who are familiar with a given technology at least at some level, and happen to speak one or more languages of interest.
* Chaals wants to have a reputation manager that can determine which of the many emails sent to the WHATWG list might be "more than usually valuable", and would like to seed this reputation manager from information gathered from the same source as the scraper that generates the W3C's TR/ page.
* A user wants to write a script that finds the price of a book from an Amazon page.
* Todd sells an HTML-based content management system, where all documents are processed and edited as HTML, sent from one editor to another, and eventually published and indexed. He would like to build up the editorial metadata used by the system within the HTML documents themselves, so that it is easier to manage and less likely to be lost.
* Tim wants to make a knowledge base seeded from statements made in Spanish and English, e.g. from people writing down their thoughts about George W. Bush and George H.W. Bush, and has either convinced the people making the statements that they should use a common language-neutral machine-readable vocabulary to describe their thoughts, or has convinced some other people to come in after them and process the thoughts manually to get them into a computer-readable form.


REQUIREMENTS:
* Vocabularies can be developed in a manner that won't clash with future more widely-used vocabularies, so that those future vocabularies can later be used in a page making use of private vocabularies without making the earlier annotations ambiguous.
* Using the data should not involve learning a plethora of new APIs, formats, or vocabularies (today it is possible, e.g., to get the price of an Amazon product, but it requires learning a new API; similarly it's possible to get information from sites consistently using 'class' values in a documented way, but doing so requires learning a new vocabulary).
* Shouldn't require the consumer to write XSLT or server-side code to process the annotated data.
* Machine-readable annotations shouldn't be on a separate page than human-readable annotations.
* The information should be convertible into a dedicated form (RDF, JSON, XML) in a consistent manner, so that tools that use this information separate from the pages on which it is found have a standard way of conveying the information.
* Should be possible for different parts of an item's data to be given in different parts of the page, for example two items described in the same paragraph. ("The two lamps are A and B. The first is $20, the second $30. The first is 5W, the second 7W.")
* It should be possible to define globally-unique names, but the syntax should be optimised for a set of predefined vocabularies.
* Adding this data to a page should be easy.
* The syntax for adding this data should encourage the data to remain accurate when the page is changed.
* The syntax should be resilient to intentional copy-and-paste authoring: people copying data into the page from a page that already has data should not have to know about any declarations far from the data.
* The syntax should be resilient to unintentional copy-and-paste authoring: people copying markup from the page who do not know about these features should not inadvertently mark up their page with inapplicable data.
* Any additional markup or data used to allow the machine to understand the actual information shouldn't be redundantly repeated (e.g. on each cell of a table, when setting it on the column is possible).
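
The "two lamps" requirement and the requirement that data be convertible into a form such as JSON can be sketched together. The sketch assumes, purely for illustration, that extraction has already yielded (item, property, value) statements in page order; the property names and the function name are hypothetical.

```python
import json
from collections import defaultdict


def group_statements(statements):
    """Collect scattered (item, property, value) statements into one dict per item."""
    items = defaultdict(dict)
    for item_id, prop, value in statements:
        items[item_id][prop] = value
    return dict(items)


# Statements in the order they appear in the example paragraph about lamps A and B.
lamp_statements = [
    ("A", "type", "lamp"), ("B", "type", "lamp"),
    ("A", "price", "$20"), ("B", "price", "$30"),
    ("A", "power", "5W"), ("B", "power", "7W"),
]
lamps = group_statements(lamp_statements)
print(json.dumps(lamps, indent=2, sort_keys=True))
```

The point of the sketch is that grouping depends only on which item each statement refers to, not on where in the page the statement appears.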


----


Validation of Expressed Micro-data

> For example, if a person's name and contact details are marked up on a web
> page using hCard, the user-agent can offer to, say, add the person to your
> address book, or add them as a friend on a social networking site, or add a
> reminder about that person's birthday to your calendar.
>
> If an event is marked up on a web page using hCalendar, then the user-agent
> could offer to add it to a calendar, or provide the user with a map of its
> location, or add it to a timeline that the user is building for their school
> history project.
>
> Providing rich semantics for the information on a web page allows the
> user-agent to know what's on a page, and step in and perform helpful tasks for
> the user.
>
> a single unified parsing algorithm
>
> Separate parsers need to be created for hCalendar, hReview, hCard, etc,
> as each Microformat has its own unique parsing quirks. For example, hCard has
> N-optimisation and ORG-optimisation which aren't found in hCalendar. With
> RDFa, a single algorithm is used to parse everything: contacts, events,
> places, cars, songs, whatever.
>
> decentralised development is possible.
>
> If I want a way of marking up my iguana collection semantically, I can
> develop that vocabulary without having to go through a central
> authority. Because URIs are used to identify vocabulary terms, I can be
> sure that my vocabulary won't clash with other people's vocabularies. It
> can be argued that going through a community to develop vocabularies is
> beneficial, as it allows the vocabulary to be built by "many minds" -
> RDFa does not prevent this, it just gives people alternatives to
> community development.
>
> Lastly, there are a lot of parsing ambiguities for many Microformats.
>
> One area which is especially fraught is that of scoping. The editors of
> many current draft Microformats[1] would like to allow page authors to
> embed licensing data - e.g. to say that a particular recipe for a pie is
> licensed under a Creative Commons licence. However, it has been noted
> that the current rel=license Microformat can not be re-used within these
> drafts, because virtually all existing rel=license implementations will
> just assume that the license applies to the whole page rather than just
> part of it. RDFa has strong and unambiguous rules for scoping - a
> license, for example, could apply to a section of the page, or one
> particular image.
 
> As a trivial example, it
> would be useful to me in working to improve the Web content we produce at
> Opera to have a nice mechanism for identifying the original source of various
> parts of a page.
 
> Another use case is noting the source of data in mashups. This enables
> information to be carried about the licensing, the date at which the data was
> mashed (or smushed, to use the older terminology from the Semantic Web), and
> so on.
 
> Provide an easy mechanism to encode new data in a way that can be
> machine-extracted without requiring any explanation of the data model.
 
> Another example is that certain W3C pages (the list of specifications produced
> by W3C, for example, and various lists of translations) are produced from RDF
> data that is scraped from each page through a customised and thus fragile
> scraping mechanism. Being able to use RDFa would free authors of the draconian
> constraints on the source-code formatting of specifications, and merely
> require them to use the right attributes, in order to maintain this data.
 
> An example of how this data can be re-used is that it is possible to determine
> many of the people who have translated W3C specifications or other documents -
> and thus to search for people who are familiar with a given technology at
> least at some level, and happen to speak one or more languages of interest.
 
> Alternatively I could use the same information to seed a reputation
> manager, so I can determine which of the many emails I have no time to
> read in WHAT-WG might be more than usually valuable.
 
> Sure. In which case the problem becomes "doing mashups where data needs to
> have different metadata associated is impossible", so the requirement is 
> "enable mashups to carry different metadata about bits of the content that are
> from different sources."
 
 
> There are mapping organisations and data producers and people who take photos,
> and each may place different policies. Being able to keep that policy
> information helps people with further mashups avoiding violating a policy.
>
> For example, if GreatMaps.com has a public domain policy on their maps,
> CoolFotos.org has a policy that you can use data other than images for
> non-commercial purposes, and Johan Ichikawa has a photo there of my brother's
> café, which he has licensed as "must pay money", then it would be reasonable
> for me to copy the map and put it in a brochure for the café, but not to copy
> the data and photo from CoolFotos. On the other hand, if I am producing a
> non-commercial guide to cafés in Melbourne, I can add the map and the
> location of the cafe photo, but not the photo itself.
>
> Another use case:
> My wife wants to publish her papers online. She includes an abstract of each
> one in a page, but because they are under different copyright rules, she needs
> to clarify what the rules are. A harvester such as the Open Access project can
> actually collect and index some of them with no problem, but may not be
> allowed to index others. Meanwhile, a human finds it more useful to see the
> abstracts on a page than have to guess from a bunch of titles whether to look
> at each abstract.
 
 
> I) User agents must allow users to see that there are "semantic-links"
> (connections to semantically structured information) in an HTML
> document/application. Consequently user agents must allow users to
> "follow" the semantic-link (access/interact with the linked data,
> embedded or external), and this involves primarily the ability to:
> a) view the information
> b) select the information
> c) copy the information to the clipboard
> d) drag and drop the information
> e) send that information to another web application (or to OS
> applications) selected by the user.
>
> II) User agents must allow users to "semantically annotate" an existing
> HTML document (insert a semantic link and linked data) and this involves
> primarily the ability to:
> a) edit the document to insert semantically structured information
> (starting from the existing text or from information already structured
> in the edited portion of the page)
> b) send the result of the editing to another web application (or to OS
> applications) selected by the user.
 
> - Allow authors to embed annotations in HTML documents such that RDF
> triples can be unambiguously extracted from human-readable data without
> duplicating the data, and thus ensuring that the machine-readable data
> and the human-readable data remain in sync.


USE CASE: It should be possible to write generalized validators and authoring tools for the annotations described in the previous use case.

> One problem this can solve is that an agent can, given a URL that
> represents a person, extract some basic profile information such as the
> person's name along with references to other people that person knows.
> This can further be applied to allow a user who provides his own URL
> (for example, by signing in via OpenID) to bootstrap his account from
> existing published data rather than having to re-enter it.
>
> - Allow software agents to extract profile information for a person as
> often exposed on social networking sites from a page that "represents"
> that person.
>
>  There are a number of existing solutions for this:
>    * FOAF in RDF serialized as XML, Turtle, RDFa, eRDF, etc
>    * The vCard format
>    * The hCard microformat
>    * The PortableContacts protocol[3]
>    * Natural Language Processing of HTML documents
>
> - Allow software agents to determine who a person lists as their friends
> given a page that "represents" that person.
>
>  Again, there are competing solutions:
>    * FOAF in RDF serialized as XML, Turtle, RDFa, eRDF, etc
>    * The XFN microformat[4]
>    * The PortableContacts protocol[3]
>    * Natural Language Processing of HTML documents
>
> - Allow the above to be encoded without duplicating the data in both
> machine-readable and human-readable forms.


SCENARIOS:
* Mary would like to write a generalized software tool to help page authors express micro-data. One of the features that she would like to include is one that displays authoring information, such as vocabulary term description, type information, range information, and other vocabulary term attributes in-line so that authors have a better understanding of the vocabularies that they're using.
* John would like to ensure that his indexing software only stores type-valid data. Part of the mechanism that he uses to check the incoming micro-data stream is type information that is embedded in the vocabularies that he uses.
* Steve would like to provide warnings to the authors that use his vocabulary that certain vocabulary terms are experimental and may never become stable.
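
John's and Steve's scenarios can be sketched as a small validator. The vocabulary format below (a dict giving each term a Python type and a stability status) and every term name in it are assumptions invented for this example; the page does not define how vocabularies would carry type or status information.

```python
# Hypothetical vocabulary description: each term has a type and a status.
VOCABULARY = {
    "name": {"type": str, "status": "stable"},
    "price": {"type": float, "status": "stable"},
    "mood": {"type": str, "status": "experimental"},
}


def validate(item, vocabulary):
    """Check an item's properties against a vocabulary.

    Returns (errors, warnings): errors for unknown terms or wrongly typed
    values, warnings for terms the vocabulary marks as experimental.
    """
    errors, warnings = [], []
    for prop, value in item.items():
        term = vocabulary.get(prop)
        if term is None:
            errors.append(f"unknown term: {prop}")
            continue
        if not isinstance(value, term["type"]):
            errors.append(
                f"{prop}: expected {term['type'].__name__}, got {type(value).__name__}"
            )
        if term["status"] == "experimental":
            warnings.append(f"{prop}: experimental term, may never become stable")
    return errors, warnings
```

An indexer in John's position would store only items with no errors, while an authoring tool in Mary's position could show both lists inline as the author types.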


REQUIREMENTS:
* There should be a definitive location for vocabularies.
* It should be possible for vocabularies to describe other vocabularies.
* Originating vocabulary documents should be discoverable.
* Machine-readable vocabulary information shouldn't be on a separate page than the human-readable explanation.
* There must not be restrictions on the possible ways vocabularies can be expressed (e.g. the way DTDs restricted possible grammars in SGML).

> I'm a scholar and teacher. I want to be able to add structured data to
> my web site to denote who I am, what I have published, and what I
> teach in such a way that other scholars (and potentially students) can
> easily extract that information to add it to their contact databases,
> or to their bibliographic applications, or whatever. This involves
> contact data, for sure, but also other, domain specific, data, as
> well, and so presumes a flexible and extensible model and syntax.
>
> I might also want to publish scholarly documents or content that
> includes extensive citations that readers can then automatically
> extract or otherwise process (say, have a little extension in a
> browser that helps me find them in my local university library). These
> citations may be for a wide range of different sources: an interview
> posted on YouTube, a legal opinion posted on the Supreme Court web
> site, a press release from the White House.
>
> Again, the data may well involve some basic properties of the sort
> that you might see in Dublin Core (title, date, creator, etc.), but it
> also will include more domain-specific data (information about court
> reporters, case numbers, etc.).
>
> The use cases for RDF and RDFa are really that basic. Structured data
> is going to be increasingly important to the practical work that
> happens around the web, and an extensible system is essential to
> realizing that real world potential.


----


USE CASE: Allow authors to annotate their documents to highlight the key parts, e.g. as when a student highlights parts of a printed page, but in a hypertext-aware fashion.

> Service and product providers can't include the meaning of the things
> they publish in HTML (how do you find out where the price of a book is
> located in, say, an Amazon page?) - people that wanna use this data
> are forced to perform screen scraping (that is, the need for
> publisher-intended rather than consumer-guessed semantics).


> People doing (data) mash-ups need to learn a plethora of
> APIs/formats while all they might want is one data format and a
> bunch of vocabularies

SCENARIOS:
* Fred writes a page about Napoleon. He can highlight the word Napoleon in a way that indicates to the reader that that is a person. Fred can also annotate the page to indicate that Napoleon and France are related concepts.


> expressing machine-readable copyright licensing terms and related
> information; in a way that is both easy for content creators and
> publishers to provide, and more convenient for the users (and tools)
> to consume, extend, and redistribute

== Search ==


USE CASE: Site owners want a way to provide enhanced search results to the engines, so that an entry in the search results page is more than just a bare link and snippet of text, and provides additional resources for users straight on the search page without them having to click into the page and discover those resources themselves.


* I want to express structured data (who-am-i, who-do-i-know,
SCENARIOS:
  how-do-you-contact-me, this-page describes_a Molecule & Molecule's
* For example, in response to a query for a restaurant, a search engine might want to have the result from yelp.com provide additional information, e.g. info on price, rating, and phone number, along with links to reviews or photos of the restaurant.
  name_is "Carbon")


* I want to provide a human readable interpretation of my data
REQUIREMENTS:
* Information for the search engine should be on the same page as information that would be shown to the user if the user visited the page.


* I want to provide a machine readable interpretation of my data
----


* I do not want to write XSLT, or server side code to transform my
USE CASE: Search engines and other site categorisation and aggregation engines should be able to determine the contents of pages with more accuracy than today.
  data if I don't have to


* I do not want to have two urls with the exact same information, one
SCENARIOS:
  for humans and one for robots
* Students and teachers should be able to discover each other -- both within an institution and across institutions -- via their blogging.
* A blogger wishes to categorise his posts such that he can see them in the context of other posts on the same topic, including posts by unrelated authors (i.e. not via a pre-agreed tag or identifier, not via a single dedicated and preconfigured aggregator).
* A user whose grandfather is called "Napoleon" wishes to ask Google the question "Who is Napoleon", and get as his answer a page describing his grandfather.
* A user wants to ask about "Napoleon" but, instead of getting an answer, wants the search engine to ask him ''which'' Napoleon he wants to know about.


* Wikipedia - why should Wikipedia and DBPedia have to exist? Why
REQUIREMENTS:
  aren't they the same thing?
* Should not disadvantage pages that are more useful to the user but that have not made any effort to help the search engine.
* Should not be more susceptible to spamming than today's markup.


* Any government site, scientific journal, or document which describes
----
  data which is more complicated than a TABLE tag can express - how
  come I'm living in a cut and paste generation, and cannot collect
  this information for later use? How come a search engine can't do it
  for me?


* TV guide listings - how come my browser is too stupid to collect
USE CASE: Web browsers should be able to help users find information related to the items discussed by the page that they are looking at.
  facts about when a TV show is on from a webpage when I'm staring at
  it? Why would I have to subscribe to an RSS or icalendar feed all
  with different odds and ends and formatting; for something as simple
  as that; if the people providing TV guides actually provide such a
  thing? More importantly, if the tv guide provider does not render a
  link to IMDB; how come I can't have a nice extension to my browser
  which recognises TV shows and gives me implicit links which works on
  all sites who buy into the tv-show vocabulary?


SCENARIOS:
* Finding more information about a movie when looking at a page about the movie, when the page contains detailed data about the movie.
** For example, where the movie is playing locally.
** For example, what your friends thought of it.
* Exposing music samples on a page so that a user can listen to all the samples.
* Students and teachers should be able to discover each other -- both within an institution and across institutions -- via their blogging.
* David can use the data in a web page to generate a custom browser UI for calling a phone number using his cellphone without using brittle screen-scraping.


> Problem: I need to assert that student blog X is part of course
REQUIREMENTS:
> Y. (Current workaround: ugly tags/categories on posts like
* Should be discoverable, because otherwise users will not use it, and thus users won't be helped.
> "engl101spr09sect05" (Notice the failure as soon as you look at more
* Should be consistently available, because if it only works on some pages, users will not use it (see, for instance, the rel=next story).
> than one university.) Some people have started calling these
* Should be bootstrappable (rel=next failed because UAs didn't expose it because authors didn't use it because UAs didn't expose it).
> "functional tags" -- a signal that people are collectively looking for
> some solution.) Compare also the tidier machine tags in flickr and
> elsewhere, or for: tags in delicious.
>
> At University of Mary Washington, many faculty encourage students to
> blog about their studies to encourage more discussion using our
> instance of WordPress MultiUser. And so a student might have a blog,
> and be writing posts relevant to more than one class. The professor
> then aggregates the relevant posts into one blog with a plugin like
> FeedWordpress, based on an agreed-upon tag like what I described. I
> think this might shed more light on your points (b) and (c), as well
> as the bigger question from Ian of why someone would do this.


At a recent unconference on using WordPress in a teaching environment,
----
one of the big issues that the group of about 60 students and teachers
identified was a need for ways to help students and teachers discover
each other -- both within an institution and across institutions --
via their blogging [...]


Problem: I want to organize my blog posts into topics (Current
USE CASE: Finding distributed comments on audio and video media.
workaround: Tags and categories. Sometimes works great within a blog,
but the inherent ambiguity makes it impossible to work reliably across
blogs. (DBpedia and MOAT are our friends here))


SCENARIOS:
* Sam has posted a video tutorial on how to grow tomatoes on his video blog. Jane uses the tutorial and would like to leave feedback to others that view the video regarding certain parts of the video she found most helpful. Since Sam has comments disabled on his blog, his users cannot comment on the particular sections of the video other than linking to it from their blog and entering the information there. Jane uses a video player that aggregates all the comments about the video found on the Web, and displays them as subtitles while she watches the video.


Problem: I need to know the provenance of this blog post. This comes
REQUIREMENTS:
up in the increasing frequency of reposting blog content from their
* It shouldn't be possible for Jane to be exposed to spam comments.
RSS feeds through things like the FeedAPI module for Drupal or the
* The comment-aggregating video player shouldn't need to crawl the entire Web for each user independently.
FeedWordpress and WP-O-Matic plugins for WP. I think that the ATOM
spec calls for identifiers for content, but that that info doesn't get
reproduced in the repostings. If a URI were part of the content, and
that content can reliably be reposted -- specs are good here --
problem solved (well, not quite, but closer!).


----


>    "I have an account on social networking site A. I go to a new social
USE CASE: Allow users to price-check digital media (music, TV shows, etc) and purchase such content without having to go through a special website or application to acquire it, and without particular retailers being selected by the content's producer or publisher.
>    networking site B. I want to be able to automatically add all my
>    friends from site A to site B."
>
> There are presumably other requirements, e.g. "site B must not ask the
> user for the user's credentials for site A" (since that would train
> people to be susceptible to phishing attacks). Also, "site A must not
> publish the data in a manner that allows unrelated users to obtain
> privacy-sensitive data about the user", for example we don't want to let
> other users determine relationships that the user has intentionally kept
> secret [1].
>
> [1] http://w2spconf.com/2008/papers/s3p2.pdf


> For example, if I copy a sentence from Wikipedia and paste it in some
SCENARIOS:
> word processor, it would be great if the word processor offered to  
* Joe wants to sell his music, but he doesn't want to sell it through a specific retailer; he wants to allow the user to pick a retailer. So he forgoes the chance of an affiliate fee, negotiates to have his music available in all retail stores that his users might prefer, and then puts a generic link on his page that identifies the product but doesn't identify a retailer. Kyle, a fan, visits his page, clicks the link, and Amazon charges his credit card and puts the music into his Amazon album downloader. Leo instead clicks on the link and is automatically charged by Apple, and finds later that the music is in his iTunes library.
> automatically create a bibliographic entry.
* Manu wants to go to Joe's website but check the price of the offered music against the various retailers that sell it, without going to those retailers' sites, so that he can pick the cheapest retailer.
* David can use the data in a web page to generate a custom browser UI for buying a song from his favorite online music store without using brittle screen-scraping.


> If I copy the name of one of my Facebook "friends" and paste it into
REQUIREMENTS:
> my OS address book, it would be cool if the contact information was
* Should not be easily prone to clickjacking (sites shouldn't be able to charge the user without the user's consent).
> imported automatically. Or maybe I pasted it in my webmail's address
* Should not make transactions harder when the user hasn't yet picked a favourite retailer.
> book feature, and the same import operation happened..


> If I select an E-mail in my webmail and copy it, it would be awesome
----
> if my desktop mail client would just import the full E-mail with
> complete headers and different parts if I just switch to the mail
> client app and paste.


> 1. Service and product provider can't include the meaning of the things they
USE CASE: Allow the user to perform vertical searches across multiple sites even when the sites don't include the information the user wants.
> publish in HTML. For example, how do you find out where the price of a book
> is located in, say, a page from Amazon? Now, people that want to use this 
> data are forced to perform *screen scraping*, that is, there is a need for 
> publisher-push rather than consumer-pull semantics.


> 2. People doing data mash-ups need to learn a plethora of APIs/formats while
SCENARIOS:
> all they would likely want is *one data model* and a bunch of vocabularies
* Kjetil is searching for new hardware for his desktop. He does not care much about most of the specs, but he has decided that he wants a 45 nm CPU with at least a 1333 MHz FSB and at least 2800 MHz clock frequency, and a thermal energy of at most 65 W. The motherboard needs to have at least 2 PCI ports, unless it has an onboard Wi-Fi card, and it needs to accommodate at least 12 GB of DDR3 RAM, which needs to match the FSB frequency. Furthermore, all components should be well supported by Linux and the RAID controller should have at least RAID acceleration. None of the manufacturer sites have information about the RAID controllers; that information is only available from various forums.
> covering the domain.
* Fred is going to buy a property. The property needs to be close to the forest, yet close to a train station that will take him to town in less than half an hour. It needs to have a stable snow-fall in the winter, and access to tracks that are regularly prepared for XC skating. The property should be of a certain size and in proximity to a kindergarten and schools. It needs to have been regulated for residential use and have roads and the usual infrastructure. Furthermore, it needs to be on soil that is suitable for geothermal heating yet have a low abundance of uranium. It should have a good view of the fjord to the southeast.


> When writing HTML (by hand or indirectly via a program) I want to
REQUIREMENTS:
> isolate and describe what the content is about in terms of people,
* Performing such searches should be feasible and cheap.
> places, and other real-world things. I want to isolate "Napoleon" from a
* It should be possible to perform such searches without relying on a third-party to seek out the information.
> paragraph or heading, and state that the aforementioned entity is
* The tool that collects information must not require the information to be marked up in some special way, since manufacturers don't include all the information, and users on forums (where the information can sometimes be found) are unlikely to mark it up in some particularly machine-readable way.
> of type "Person" and he is associated with another entity "France".
>
> The use-case above is like taking a highlighter and making notes while
> reading about "Napoleon". This is what we all do when studying, but when
> we were kids, we never actually shared that part of our endeavors since
> it was typically the route to competitive advantage i.e., being top
> student in the class.


> Simple programs should thus be able to answer questions like:
== Cross-site communication ==
>  * Under what license has a copyright holder released her
>    work, and what are the associated permissions and
>    restrictions?
>   
>  * Can I redistribute this work for commercial purposes?
>  * Can I distribute a modified version of this work?
>  * How should I assign credit to the original author?   


Paul maintains a blog and wishes to "mark up" his existing page with
USE CASE: Copy-and-paste should work between Web apps and native apps and between Web apps and other Web apps.
structure so that tools can pick up his blog post tags, authors,
titles, and his blogroll, and so that he does not need to maintain a
parallel version of his data in "structured format." His HTML blog
should be usable as its own structured feed.


Paul sometimes gives talks on various topics, and announces them on
SCENARIOS:
his blog. He would like to mark up these announcements with proper
* Fred copies an e-mail from Apple Mail into GMail, and the e-mail survives intact, including headers, attachments, and multipart/related parts.
scheduling information, so that his readers' software can
* Fred copies an e-mail from GMail into Hotmail, and the e-mail survives intact, including headers, attachments, and multipart/related parts.
automatically obtain the scheduling information and add it to their
calendar. Importantly, some of the rendered data might be more
informal than the machine-readable data required to produce a calendar
event. Also of importance: Paul may want to annotate his event with a
combination of existing vocabularies and a new vocabulary of his own
design.


Tod sells an HTML-based content management system, where all documents
----
are processed and edited as HTML, sent from one editor to another, and
eventually published and indexed. He would like to build up the
editorial metadata within the HTML document itself, so that it is
easier to manage and less likely to be lost.


Tara runs a video sharing web site. When Paul wants to blog about a
USE CASE: Allow users to share data between sites (e.g. between an online store and a price comparison site).
video, he can paste a fragment of HTML provided by Tara directly into
his blog. The video is then available inline, in his blog, along with
any licensing information (Creative Commons?) about the video.


Lucy is looking for a new apartment and some items with which to
SCENARIOS:
furnish it. She browses various web pages, including apartment
* Lucy is looking for a new apartment and some items with which to furnish it. She browses various web pages, including apartment listings, furniture stores, kitchen appliances, etc. Every time she finds an item she likes, she points to it and transfers its details to her apartment-hunting page, where her picks can be organized, sorted, and categorized.
listings, furniture stores, kitchen appliances, etc. Every time she
* Lucy uses a website called TheBigMove.com to organize all aspects of her move, including items that she is tracking for the move. She goes to her "To Do" list and adds some of the items she collected during her visits to various Web sites, so that TheBigMove.com can handle the purchasing and delivery for her.
finds an item she likes, she can point to it, extract the
locally-relevant structured data, and transfer it to her
apartment-hunting page, where it can be organized, sorted, and
categorized. Extracting relevant information from web pages is still
a very manual process. Unless a particular site allows you to add
items to a shopping cart, or a "favorites list", it is very difficult
to store relevant details for later use. The use of a web browser to
remember items from multiple sites is even more daunting, usually
resulting in dropping web tools in favor of desktop tools such as a
text editor. There is no reason why copying concepts to a web-based
clipboard should be so difficult - the idea has failed to gain
traction until now because there has not been an easy-to-implement
data model and mark-up mechanism allowing people to right-click and
store items into a semantic clipboard. Lucy could then use a website
called TheBigMove.com to organize all aspects of her move, including
items that she is tracking for the move. She would go to her "To Do"
list and add the semantic objects she had cut from other places. To
ensure that sites don't try and steal any of her web clipboard
objects, she would be required to click a browser-activated button
labeled "Upload Web Objects", which would ask her which web objects
she would like to share with the web page.


Patrick keeps a list of his scientific publications on his web
REQUIREMENTS:
site. Using the BibTeX vocabulary, he would like to provide structure
* Should be discoverable, because otherwise users will not use it, and thus users won't be helped.
within this publications page so that Ulrich, who browses the web with
* Should be consistently available, because if it only works on some pages, users will not use it (see, for instance, the rel=next story).
an RDFa-aware client, can automatically extract this information and
* Should be bootstrappable (rel=next failed because UAs didn't expose it because authors didn't use it because UAs didn't expose it).
use it to cite Patrick's papers without having to transcribe the
* The information should be convertible into a dedicated form (RDF, JSON, XML) in a consistent manner, so that tools that use this information separate from the pages on which it is found have a standard way of conveying the information.
bibliographic information.


A mechanism to mark up music, video and other digital content in a
blog or website. The Bitmunk Firefox plug-in would then detect the
purchase information required from the embedded meta-data in the same
web page that the browser is viewing. For example, while browsing the
Scissorkick website, it would be nice to be able to purchase the music
directly from one's favorite online music store without leaving the
page. Marking up the music information in a way that works across
websites would hopefully help drive a universal set of tools to enable
this use case.


It's difficult to price-check music and purchase it without having to
== Blogging ==
go through a special website or application to acquire music.


A mechanism to annotate which proteins and genes one is referencing in
USE CASE: Remove the need for feeds to restate the content of HTML pages (i.e. replace Atom with HTML).
a blog entry, so that colleagues can determine if they are talking
about the same thing without having to read long series of numbers (or
whatnot).


Paul wants to publish a large vocabulary in RDFS and/or OWL.  Paul
SCENARIOS:
also wants to provide a clear, human readable description of the same
* Paul maintains a blog and wishes to write his blog in such a way that tools can pick up his blog post tags, authors, titles, and his blogroll directly from his blog, so that he does not need to maintain a parallel version of his data in a "structured format." In other words, his HTML blog should be usable as its own structured feed.  
vocabulary, that mixes the terms with descriptive text in HTML.


As a browser interface developer, I find it really annoying that I
----
have to keep creating new screen scrapers for websites in order to
build UIs that work with page data differently than the page developer
intended. The data is all there on the page, but it takes a great
amount of effort to extract it into a usable form. Even worse, the
screen scraper breaks whenever a major update is made to the page,
requiring me to solve the scraping problem yet again. Microformats
were a step in the right direction, but I keep having to create a new
parser and special rules for every new Microformat that is
created. Every time I develop a new parser it takes precious time away
from making the browser actually useful. Can we create a world where
we don't have to worry about the data model anymore and instead focus
on the UI? Browser UIs for working with web page data suck. We can add
an RSS feed and bookmark a page, but many other more complex tasks
force us to tediously cut and paste text instead of working with the
information on the page directly. It would increase productivity and
reduce frustration for many people if we could use the data in a web
page to generate a custom browser UI for calling a phone number using
our cellphone, adding an event to our calendaring software, including
a person in our address book, or buying a song from our favorite
online music store.


How do you merge statements made in multiple languages about a single
USE CASE: Allow users to compare subjects of blog entries when the subjects are hard to tersely identify relative to other subjects in the same general area.
subject or topic into a particular knowledge base? If there are a
number of thoughts about George W. Bush made by people that speak
Spanish and there are a number of statements made by people that speak
English, how do you coalesce those statements into one knowledge base?
How do you differentiate those statements from statements made about
George H.W. Bush? One approach would be to use a similar underlying
vocabulary to describe each person and specify the person using a
universally unique string. This would allow the underlying language to
change, but ensure that the semantics of what is being expressed stays
the same.


When Google answers a search like "Who is Napoleon" you get an answer,
SCENARIOS:
but where is the disambiguation? How does it determine the context for
* Paul blogs about proteins and genes. His colleagues also blog about proteins and genes. Proteins and genes are identified by long hard-to-compare strings, but Paul and his colleagues can determine if they are talking about the same things by having their user agent compare some sort of flags embedded in the blogs.
the search? There are many dimensions to "Napoleon" and Google
* Rob wants to publish a large vocabulary in RDFS and/or OWL. Rob also wants to provide a clear, human readable description of the same vocabulary, that mixes the terms with descriptive text in HTML.
statistically guessed one based on link density of its subjectively
assembled index and page rank algorithm. How do you as writer or
reader efficiently navigate the many aspects/facets associated with
the pattern: "Napoleon"? What if the answer you are looking for is in
the statistically insignificant links and not the major links?


Sam has posted a video tutorial on how to grow tomatoes on his video
----
blog. Jane uses the tutorial and would like to leave feedback to
others that view the video regarding certain parts of the video she
found most helpful. Since Sam has comments disabled on his blog, Jane
cannot comment on the particular sections of the video other than
linking to it from her blog and entering the information there. This
is not useful to most people viewing the video as they would have to
go to every blogger's site to read each comment. Luckily, Jane has a
video player that is capable of finding comments distributed around
blogs on the net. The video player shows the comments as a video is
being watched (shown as sub-titles). How can Jane specify her comments
on parts of a video in a distributed manner?


USE CASE: Allow blogs to be aggregated along subject lines.


== reqs ==
SCENARIOS:
* At the University of Mary Washington, many faculty encourage students to blog about their studies, using an instance of WordPress MultiUser, to encourage more discussion. A student with a blog might be writing posts relevant to more than one class. Professors would like to aggregate the relevant posts into one blog.


* Arbitrarily extensible by authors
== Data extraction from sites without explicit cooperation from those sources ==


* Mapping to RDF
USE CASE: Getting data out of poorly written Web pages, so that the user can find more information about the page's contents.


* Ability to create groups of name-value pairs (i.e. triples with a
SCENARIOS:
  common subject) without requiring that the name-value pairs be given
* Alfred merges data from various sources in a static manner, generating a new set of data. Bob later uses this static data in conjunction with other data sets to generate yet another set of static data. Julie then visits Bob's page later, and wants to know where and when the various sources of data Bob used come from, so that she can evaluate their quality. (In this instance, Alfred and Bob are assumed to be uncooperative, since creating a static mashup would be an example of a poorly-written page.)
  on elements with a common parent
* TV guide listings - If the TV guide provider does not render a link to IMDB, the browser should recognise TV shows and give implicit links. (In this instance, it is assumed that the TV guide provider is uncooperative, since it isn't providing the links the user wants.)
* Students and teachers should be able to discover each other -- both within an institution and across institutions -- via their blogging. (In this instance, it is assumed that the teachers and students aren't cooperative, since they would otherwise be able to find each other by listing their blogs in a common directory.)
* Tim wants to make a knowledge base seeded from statements made in Spanish and English, e.g. from people writing down their thoughts about George W. Bush and George H.W. Bush. (In this instance, it is assumed that the people writing the statements aren't cooperative, since if they were they could just add the data straight into the knowledge base.)


* Ability to have name-value pairs with values that are arbitrary
REQUIREMENTS:
  strings, dates and times, URIs, and further groups of name-value
* Does not need cooperation of the author (if the page author was cooperative, the page would be well-written).
  pairs
* Shouldn't require the consumer to write XSLT or server-side code to derive this information from the page.


* Encoding of machine-readable equivalents for times, lengths,
----
  durations, telephone numbers, languages, etc


* API
USE CASE: Remove the need for RDF users to restate information in online encyclopedias (i.e. replace DBpedia).


* Discourage data duplication (e.g. discourage people from saying
SCENARIOS:
  <title>...</title> <meta name="dc.title" content="..."> <h1
* A user wants to have information in RDF form. The user visits Wikipedia, and his user agent can obtain the information without relying on DBpedia's interpretation of the page.
  property="http://...dc...title">...</h1>)


* The Microformats community has been struggling with the abbr design
REQUIREMENTS:
  pattern when attempting to specify certain machine-readable object
* All the data exposed by DBpedia should be derivable from Wikipedia without using DBpedia.
  attributes that differ from the human-readable content. For example,
  when specifying times, dates, weights, countries and other data in
  French, Japanese, or Urdu, it is helpful to use the ISO format to
  express the data to a machine and associate it with an object
  property, but to specify the human-readable value in the speaker's
  natural language.

Latest revision as of 21:38, 4 May 2009

Exposing known data types in a reusable way

USE CASE: Exposing calendar events so that users can add those events to their calendaring systems.

SCENARIOS:

  • A user visits the Avenue Q site and wants to make a note of when tickets go on sale for the tour's stop in his home town. The site says "October 3rd", so the user clicks this and selects "add to calendar", which causes an entry to be added to his calendar.
  • A student is making a timeline of important events in Apple's history. As he reads Wikipedia entries on the topic, he clicks on dates and selects "add to timeline", which causes an entry to be added to his timeline.
  • TV guide listings - browsers should be able to expose to the user's tools (e.g. calendar, DVR, TV tuner) the times that a TV show is on.
  • Paul sometimes gives talks on various topics, and announces them on his blog. He would like to mark up these announcements with proper scheduling information, so that his readers' software can automatically obtain the scheduling information and add it to their calendar. Importantly, some of the rendered data might be more informal than the machine-readable data required to produce a calendar event. Also of importance: Paul may want to annotate his event with a combination of existing vocabularies and a new vocabulary of his own design. (why?)
  • David can use the data in a web page to generate a custom browser UI for adding an event to his calendaring software without using brittle screen-scraping.
  • http://livebrum.co.uk/: the author would like people to be able to grab events and event listings from his site and put them on their site with as much information as possible retained. "The fantasy would be that I could provide code that could be cut and pasted into someone else's HTML so the average blogger could re-use and re-share my data."
  • A user should be able to subscribe to http://livebrum.co.uk/ then sort by date and see the items sorted by event date, not publication date.

REQUIREMENTS:

  • Should be discoverable.
  • Should be compatible with existing calendar systems.
  • Should be unlikely to get out of sync with prose on the page.
  • Shouldn't require the consumer to write XSLT or server-side code to read the calendar information.
  • Machine-readable event data shouldn't be on a separate page from human-readable dates.
  • The information should be convertible into a dedicated form (RDF, JSON, XML, iCalendar) in a consistent manner, so that tools that use this information separate from the pages on which it is found have a standard way of conveying the information.
  • Should be possible for different parts of an event to be given in different parts of the page. For example, a page with calendar events in columns (with each row giving the time, date, place, etc) should still have unambiguous calendar events parseable from it.
  • Should be possible for authors to find out if people are reusing the information on their site.
  • Code should not be ugly (e.g. should not be mixed in with markup used mostly for styling).
  • There should be "obvious parsing tools for people to actually do anything with the data (other than add an event to a calendar)".
  • Solution should not feel "disconnected" from the Web the way that calendar file downloads do.
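
As a rough illustration of the "convertible into a dedicated form" requirement above, the following Python sketch turns one group of name-value pairs into iCalendar text. It is only a sketch under stated assumptions: the keys uid, summary and start, the function names, the PRODID string and the sample values are all hypothetical and are not taken from any proposal on this page. Line folding for long lines is left out.

```python
from datetime import datetime, timezone


def escape_text(value: str) -> str:
    """Escape a TEXT value the way iCalendar (RFC 5545) expects."""
    return (
        value.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )


def event_to_icalendar(event: dict, stamp: datetime) -> str:
    """Serialise one extracted event as a minimal VCALENDAR.

    `event` is a hypothetical group of name-value pairs with the keys
    "uid", "summary" and "start"; both datetimes must be timezone-aware.
    """
    fmt = "%Y%m%dT%H%M%SZ"  # UTC date-time form
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//event-extraction-sketch//EN",  # placeholder id
        "BEGIN:VEVENT",
        "UID:" + escape_text(event["uid"]),
        "DTSTAMP:" + stamp.astimezone(timezone.utc).strftime(fmt),
        "DTSTART:" + event["start"].astimezone(timezone.utc).strftime(fmt),
        "SUMMARY:" + escape_text(event["summary"]),
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    # iCalendar content lines are terminated by CRLF.
    return "\r\n".join(lines) + "\r\n"


# Illustrative values only; the year and time are invented for the example.
sample = {
    "uid": "tickets-on-sale@example.org",
    "summary": "Tickets on sale; Avenue Q",
    "start": datetime(2009, 10, 3, 9, 0, tzinfo=timezone.utc),
}
ics = event_to_icalendar(sample, datetime(2009, 5, 4, 21, 38, tzinfo=timezone.utc))
```

The point of the sketch is that once a page exposes an event as unambiguous name-value pairs, the conversion step is mechanical and does not depend on the page it came from.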

USE CASE: Exposing contact details so that users can add people to their address books or social networking sites.

SCENARIOS:

  • Instead of giving a colleague a business card, someone gives their colleague a URL, and that colleague's user agent extracts basic profile information such as the person's name along with references to other people that person knows and adds the information into an address book.
  • A scholar and teacher wants other scholars (and potentially students) to be able to easily extract information about who he is to add it to their contact databases.
  • Fred copies the name of one of his Facebook friends and pastes it into his OS address book; the contact information is imported automatically.
  • Fred copies the name of one of his Facebook friends and pastes it into his Webmail's address book feature; the contact information is imported automatically.
  • David can use the data in a web page to generate a custom browser UI for including a person in his address book without using brittle screen-scraping.

REQUIREMENTS:

  • A user joining a new social network should be able to identify himself to the new social network in a way that enables the new social network to bootstrap his account from existing published data (e.g. from another social network) rather than having to re-enter it, without the new site having to coordinate with (or know about) the pre-existing site, without the user having to give either site's credentials to the other, and without the new site finding out about relationships that the user has intentionally kept secret. (http://w2spconf.com/2008/papers/s3p2.pdf)
  • Data should not need to be duplicated between machine-readable and human-readable forms (i.e. the human-readable form should be machine-readable).
  • Shouldn't require the consumer to write XSLT or server-side code to read the contact information.
  • Machine-readable contact information shouldn't be on a separate page from human-readable contact information.
  • The information should be convertible into a dedicated form (RDF, JSON, XML, vCard) in a consistent manner, so that tools that use this information separate from the pages on which it is found have a standard way of conveying the information.
  • Should be possible for different parts of a contact to be given in different parts of the page. For example, a page with contact details for people in columns (with each row giving the name, telephone number, etc) should still have unambiguous grouped contact details parseable from it.
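
The column layout mentioned in the last requirement can be made concrete with a small sketch. It assumes the page's data has already been read into rows of (field name, one value per person); the function name and the sample contacts are invented for illustration.

```python
def contacts_from_columns(rows):
    """Turn rows of (field, values-per-person) into one dict per person.

    Each column of the page describes one person, so the values found at
    the same position in every row belong to the same contact.
    """
    fields = [field for field, _ in rows]
    columns = zip(*(values for _, values in rows))
    return [dict(zip(fields, column)) for column in columns]


# Hypothetical page content: one row per field, one column per person.
rows = [
    ("name", ["Alice Example", "Bob Example"]),
    ("tel", ["+1 555 0100", "+1 555 0101"]),
]
contacts = contacts_from_columns(rows)
```

The grouping here relies purely on position; the requirement is that a real solution makes that grouping unambiguous even when the parts of one contact are far apart in the markup.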

USE CASE: Allow users to maintain bibliographies or otherwise keep track of sources of quotes or references.

SCENARIOS:

  • Frank copies a sentence from Wikipedia and pastes it in some word processor: it would be great if the word processor offered to automatically create a bibliographic entry.
  • Patrick keeps a list of his scientific publications on his web site. He would like to provide structure within this publications page so that Frank can automatically extract this information and use it to cite Patrick's papers without having to transcribe the bibliographic information.
  • A scholar and teacher wants other scholars (and potentially students) to be able to easily extract information about what he has published to add it to their bibliographic applications.
  • A scholar and teacher wants to publish scholarly documents or content that includes extensive citations that readers can then automatically extract so that they can find them in their local university library. These citations may be for a wide range of different sources: an interview posted on YouTube, a legal opinion posted on the Supreme Court web site, a press release from the White House.
  • A blog, say htmlfive.net, copies content wholesale from another, say blog.whatwg.org (as permitted and encouraged by the license). The author of the original content would like the reader of the reproduced content to know the provenance of the content. The reader would like to find the original blog post so he can leave comments for the original author.
  • Chaals could improve the Opera intranet if he had a mechanism for identifying the original source of various parts of a page, as that would let him contact the original author quickly to report problems or request changes.

REQUIREMENTS:

  • Machine-readable bibliographic information shouldn't be on a separate page from human-readable bibliographic information.
  • The information should be convertible into a dedicated form (RDF, JSON, XML, BibTeX) in a consistent manner, so that tools that use this information separate from the pages on which it is found have a standard way of conveying the information.

USE CASE: Help people searching for content to find content covered by licenses that suit their needs.

SCENARIOS:

  • If a user is looking for recipes of pies to reproduce on his blog, he might want to exclude from his results any recipes that are not available under a license allowing non-commercial reproduction.
  • Lucy wants to publish her papers online. She includes an abstract of each one in a page, but because they are under different copyright rules, she needs to clarify what the rules are. A harvester such as the Open Access project can actually collect and index some of them with no problem, but may not be allowed to index others. Meanwhile, a human finds it more useful to see the abstracts on a page than have to guess from a bunch of titles whether to look at each abstract.
  • There are mapping organisations and data producers and people who take photos, and each may set different policies. Being able to keep that policy information helps people making further mashups avoid violating a policy. For example, if GreatMaps.com has a public domain policy on their maps, CoolFotos.org has a policy that you can use data other than images for non-commercial purposes, and Johan Ichikawa has a photo there of my brother's café, which he has licensed as "must pay money", then it would be reasonable for me to copy the map and put it in a brochure for the café, but not to copy the data and photo from CoolFotos. On the other hand, if I am producing a non-commercial guide to cafés in Melbourne, I can add the map and the location of the café photo, but not the photo itself.
  • Tara runs a video sharing web site for people who want licensing information to be included with their videos. When Paul wants to blog about a video, he can paste a fragment of HTML provided by Tara directly into his blog. The video is then available inline in his blog, along with any licensing information about the video.
  • Fred's browser can tell him what license a particular video on a site he is reading has been released under, and advise him on what the associated permissions and restrictions are (can he redistribute this work for commercial purposes, can he distribute a modified version of this work, how should he assign credit to the original author, what jurisdiction the license assumes, whether the license allows the work to be embedded into a work that uses content under various other licenses, etc).
  • Flickr has images that are CC-licensed, but the pages themselves are not.
  • Blogs may wish to reuse CC-licensed images without licensing the whole blog as CC, but while still including attribution and license information (which may be required by the licenses in question).

REQUIREMENTS:

  • Content on a page might be covered by a different license than other content on the same page.
  • When licensing a subpart of the page, existing implementations must not just assume that the license applies to the whole page rather than just part of it.
  • License proliferation should be discouraged.
  • License information should be able to survive from one site to another as the data is transferred.
  • Expressing copyright licensing terms should be easy for content creators, publishers, and redistributors to provide.
  • It should be more convenient for the users (and tools) to find and evaluate copyright statements and licenses than it is today.
  • Shouldn't require the consumer to write XSLT or server-side code to process the license information.
  • Machine-readable licensing information shouldn't be on a separate page from human-readable licensing information.
  • There should not be ambiguous legal implications.
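
One way to picture the first two requirements (a license that covers only part of a page) is a lookup that walks from a piece of content up through its containers and stops at the nearest license declaration. This is only a sketch of that scoping idea under assumed names (`Node`, `effective_license`); it is not a description of any existing implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    license: Optional[str] = None    # license declared on this node, if any
    parent: Optional["Node"] = None  # containing node, None for the page


def effective_license(node: Node) -> Optional[str]:
    """Return the license of the nearest ancestor-or-self that declares one."""
    current: Optional[Node] = node
    while current is not None:
        if current.license is not None:
            return current.license
        current = current.parent
    return None


# Hypothetical page: only the photo carries a license, the page does not.
page = Node("page")
photo = Node(
    "photo",
    license="https://creativecommons.org/licenses/by/3.0/",
    parent=page,
)
caption = Node("caption", parent=photo)
sidebar = Node("sidebar", parent=page)
```

With this scoping, the caption inherits the photo's license while the sidebar and the page itself remain unlicensed, which matches the Flickr scenario where images are CC-licensed but the pages are not.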

Annotations

USE CASE: Annotate structured data that HTML has no semantics for, and which nobody has annotated before, and may never again, for private use or use in a small self-contained community.

SCENARIOS:

  • A group of users want to mark up their iguana collections so that they can write a script that collates all their collections and presents them in a uniform fashion.
  • A scholar and teacher wants other scholars (and potentially students) to be able to easily extract information about what he teaches to add it to their custom applications.
  • The list of specifications produced by W3C, for example, and various lists of translations, are produced by scraping source pages and outputting the result. This is brittle. It would be easier if the data was unambiguously obtainable from the source pages. This is a custom set of properties, specific to this community.
  • Chaals wants to make a list of the people who have translated W3C specifications or other documents, and then use this to search for people who are familiar with a given technology at least at some level, and happen to speak one or more languages of interest.
  • Chaals wants to have a reputation manager that can determine which of the many emails sent to the WHATWG list might be "more than usually valuable", and would like to seed this reputation manager from information gathered from the same source as the scraper that generates the W3C's TR/ page.
  • A user wants to write a script that finds the price of a book from an Amazon page.
  • Todd sells an HTML-based content management system, where all documents are processed and edited as HTML, sent from one editor to another, and eventually published and indexed. He would like to build up the editorial metadata used by the system within the HTML documents themselves, so that it is easier to manage and less likely to be lost.
  • Tim wants to make a knowledge base seeded from statements made in Spanish and English, e.g. from people writing down their thoughts about George W. Bush and George H.W. Bush, and has either convinced the people making the statements that they should use a common language-neutral machine-readable vocabulary to describe their thoughts, or has convinced some other people to come in after them and process the thoughts manually to get them into a computer-readable form.

REQUIREMENTS:

  • Vocabularies can be developed in a manner that won't clash with future more widely-used vocabularies, so that those future vocabularies can later be used in a page making use of private vocabularies without making the earlier annotations ambiguous.
  • Using the data should not involve learning a plethora of new APIs, formats, or vocabularies (today it is possible, e.g., to get the price of an Amazon product, but it requires learning a new API; similarly it's possible to get information from sites consistently using 'class' values in a documented way, but doing so requires learning a new vocabulary).
  • Shouldn't require the consumer to write XSLT or server-side code to process the annotated data.
  • Machine-readable annotations shouldn't be on a separate page from human-readable annotations.
  • The information should be convertible into a dedicated form (RDF, JSON, XML) in a consistent manner, so that tools that use this information separate from the pages on which it is found have a standard way of conveying the information.
  • Should be possible for different parts of an item's data to be given in different parts of the page, for example two items described in the same paragraph. ("The two lamps are A and B. The first is $20, the second $30. The first is 5W, the second 7W.")
  • It should be possible to define globally-unique names, but the syntax should be optimised for a set of predefined vocabularies.
  • Adding this data to a page should be easy.
  • The syntax for adding this data should encourage the data to remain accurate when the page is changed.
  • The syntax should be resilient to intentional copy-and-paste authoring: people copying data into the page from a page that already has data should not have to know about any declarations far from the data.
  • The syntax should be resilient to unintentional copy-and-paste authoring: people copying markup from the page who do not know about these features should not inadvertently mark up their page with inapplicable data.
  • Any additional markup or data used to allow the machine to understand the actual information shouldn't be redundantly repeated (e.g. on each cell of a table, when setting it on the column is possible).
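
The lamp example in the requirements above interleaves the properties of two items within one paragraph. A minimal sketch of what a consumer would need to end up with, assuming the annotations have already been read off the page as (item, property, value) triples in document order (the triple format and the property names are assumptions made for this example):

```python
import json

# (item id, property, value) in the order the values appear in the prose.
annotations = [
    ("A", "price", "$20"),
    ("B", "price", "$30"),
    ("A", "power", "5W"),
    ("B", "power", "7W"),
]


def group_items(triples):
    """Collect interleaved property values into one record per item."""
    items = {}
    for item_id, prop, value in triples:
        items.setdefault(item_id, {})[prop] = value
    return items


lamps = group_items(annotations)
# A consistent dedicated form, here JSON, for tools that work off-page.
as_json = json.dumps(lamps, sort_keys=True)
```

The hard part the requirement points at is the step this sketch skips: making the markup say which lamp each price and wattage belongs to without repeating that information on every value.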

Validation of Expressed Micro-data

USE CASE: It should be possible to write generalized validators and authoring tools for the annotations described in the previous use case.

SCENARIOS:

  • Mary would like to write a generalized software tool to help page authors express micro-data. One of the features that she would like to include is one that displays authoring information, such as vocabulary term description, type information, range information, and other vocabulary term attributes in-line so that authors have a better understanding of the vocabularies that they're using.
  • John would like to ensure that his indexing software only stores type-valid data. Part of the mechanism that he uses to check the incoming micro-data stream is type information that is embedded in the vocabularies that he uses.
  • Steve would like to provide warnings to the authors that use his vocabulary that certain vocabulary terms are experimental and may never become stable.

REQUIREMENTS:

  • There should be a definitive location for vocabularies.
  • It should be possible for vocabularies to describe other vocabularies.
  • Originating vocabulary documents should be discoverable.
  • Machine-readable vocabulary information shouldn't be on a separate page from the human-readable explanation.
  • There must not be restrictions on the possible ways vocabularies can be expressed (e.g. the way DTDs restricted possible grammars in SGML).

USE CASE: Allow authors to annotate their documents to highlight the key parts, e.g. as when a student highlights parts of a printed page, but in a hypertext-aware fashion.

SCENARIOS:

  • Fred writes a page about Napoleon. He can highlight the word Napoleon in a way that indicates to the reader that that is a person. Fred can also annotate the page to indicate that Napoleon and France are related concepts.

Search

USE CASE: Site owners want a way to provide enhanced search results to the engines, so that an entry in the search results page is more than just a bare link and snippet of text, and provides additional resources for users straight on the search page without them having to click into the page and discover those resources themselves.

SCENARIOS:

  • For example, in response to a query for a restaurant, a search engine might want to have the result from yelp.com provide additional information, e.g. info on price, rating, and phone number, along with links to reviews or photos of the restaurant.

REQUIREMENTS:

  • Information for the search engine should be on the same page as information that would be shown to the user if the user visited the page.

USE CASE: Search engines and other site categorisation and aggregation engines should be able to determine the contents of pages with more accuracy than today.

SCENARIOS:

  • Students and teachers should be able to discover each other -- both within an institution and across institutions -- via their blogging.
  • A blogger wishes to categorise his posts such that he can see them in the context of other posts on the same topic, including posts by unrelated authors (i.e. not via a pre-agreed tag or identifier, not via a single dedicated and preconfigured aggregator).
  • A user whose grandfather is called "Napoleon" wishes to ask Google the question "Who is Napoleon", and get as his answer a page describing his grandfather.
  • A user wants to ask about "Napoleon" but, instead of getting an answer, wants the search engine to ask him which Napoleon he wants to know about.

REQUIREMENTS:

  • Should not disadvantage pages that are more useful to the user but that have not made any effort to help the search engine.
  • Should not be more susceptible to spamming than today's markup.

USE CASE: Web browsers should be able to help users find information related to the items discussed by the page that they are looking at.

SCENARIOS:

  • Finding more information about a movie when looking at a page about the movie, when the page contains detailed data about the movie.
    • For example, where the movie is playing locally.
    • For example, what your friends thought of it.
  • Exposing music samples on a page so that a user can listen to all the samples.
  • Students and teachers should be able to discover each other -- both within an institution and across institutions -- via their blogging.
  • David can use the data in a web page to generate a custom browser UI for calling a phone number using his cellphone without using brittle screen-scraping.

REQUIREMENTS:

  • Should be discoverable, because otherwise users will not use it, and thus users won't be helped.
  • Should be consistently available, because if it only works on some pages, users will not use it (see, for instance, the rel=next story).
  • Should be bootstrappable (rel=next failed because UAs didn't expose it because authors didn't use it because UAs didn't expose it).

USE CASE: Finding distributed comments on audio and video media.

SCENARIOS:

  • Sam has posted a video tutorial on how to grow tomatoes on his video blog. Jane uses the tutorial and would like to leave feedback to others that view the video regarding certain parts of the video she found most helpful. Since Sam has comments disabled on his blog, his users cannot comment on the particular sections of the video other than linking to it from their blog and entering the information there. Jane uses a video player that aggregates all the comments about the video found on the Web, and displays them as subtitles while she watches the video.

REQUIREMENTS:

  • It shouldn't be possible for Jane to be exposed to spam comments.
  • The comment-aggregating video player shouldn't need to crawl the entire Web for each user independently.

USE CASE: Allow users to price-check digital media (music, TV shows, etc) and purchase such content without having to go through a special website or application to acquire it, and without particular retailers being selected by the content's producer or publisher.

SCENARIOS:

  • Joe wants to sell his music, but he doesn't want to sell it through a specific retailer; he wants to allow the user to pick a retailer. So he forgoes the chance of an affiliate fee, negotiates to have his music available in all retail stores that his users might prefer, and then puts a generic link on his page that identifies the product but doesn't identify a retailer. Kyle, a fan, visits his page, clicks the link, and Amazon charges his credit card and puts the music into his Amazon album downloader. Leo instead clicks on the link and is automatically charged by Apple, and finds later that the music is in his iTunes library.
  • Manu wants to go to Joe's website but check the price of the offered music against the various retailers that sell it, without going to those retailers' sites, so that he can pick the cheapest retailer.
  • David can use the data in a web page to generate a custom browser UI for buying a song from his favorite online music store without using brittle screen-scraping.

REQUIREMENTS:

  • Should not be easily prone to clickjacking (sites shouldn't be able to charge the user without the user's consent).
  • Should not make transactions harder when the user hasn't yet picked a favourite retailer.

USE CASE: Allow the user to perform vertical searches across multiple sites even when the sites don't include the information the user wants.

SCENARIOS:

  • Kjetil is searching for new hardware for his desktop and most of the specs he does not care about too much, but he's decided that he wants a 45 nm CPU with at least a 1333 MHz FSB and at least 2800 MHz clock frequency, and a thermal energy of at most 65 W. The motherboard needs to have at least 2 PCI ports, unless it has an onboard Wifi card, and it needs to accommodate at least 12 GB of DDR3 RAM, which needs to match the FSB frequency. Furthermore, all components should be well supported by Linux and the RAID controller should have at least RAID acceleration. None of the manufacturer sites have information about the RAID controllers; that information is only available from various forums.
  • Fred is going to buy a property. The property needs to be close to the forest, yet close to a train station that will take him to town in less than half an hour. It needs to have a stable snow-fall in the winter, and access to tracks that are regularly prepared for XC skating. The property should be of a certain size, and in proximity to kindergartens and schools. It needs to have been regulated for residential use and have roads and the usual infrastructure. Furthermore, it needs to be on soil that is suitable for geothermal heating yet have a low abundance of uranium. It should have a good view of the fjord to the southeast.

REQUIREMENTS:

  • Performing such searches should be feasible and cheap.
  • It should be possible to perform such searches without relying on a third-party to seek out the information.
  • The tool that collects information must not require the information to be marked up in some special way, since manufacturers don't include all the information, and users on forums (where the information can sometimes be found) are unlikely to mark it up in some particularly machine-readable way.

Cross-site communication

USE CASE: Copy-and-paste should work between Web apps and native apps and between Web apps and other Web apps.

SCENARIOS:

  • Fred copies an e-mail from Apple Mail into GMail, and the e-mail survives intact, including headers, attachments, and multipart/related parts.
  • Fred copies an e-mail from GMail into Hotmail, and the e-mail survives intact, including headers, attachments, and multipart/related parts.

USE CASE: Allow users to share data between sites (e.g. between an online store and a price comparison site).

SCENARIOS:

  • Lucy is looking for a new apartment and some items with which to furnish it. She browses various web pages, including apartment listings, furniture stores, kitchen appliances, etc. Every time she finds an item she likes, she points to it and transfers its details to her apartment-hunting page, where her picks can be organized, sorted, and categorized.
  • Lucy uses a website called TheBigMove.com to organize all aspects of her move, including items that she is tracking for the move. She goes to her "To Do" list and adds some of the items she collected during her visits to various Web sites, so that TheBigMove.com can handle the purchasing and delivery for her.

REQUIREMENTS:

  • Should be discoverable, because otherwise users will not use it, and thus users won't be helped.
  • Should be consistently available, because if it only works on some pages, users will not use it (see, for instance, the rel=next story).
  • Should be bootstrappable (rel=next failed because UAs didn't expose it because authors didn't use it because UAs didn't expose it).
  • The information should be convertible into a dedicated form (RDF, JSON, XML) in a consistent manner, so that tools that use this information separate from the pages on which it is found have a standard way of conveying the information.


Blogging

USE CASE: Remove the need for feeds to restate the content of HTML pages (i.e. replace Atom with HTML).

SCENARIOS:

  • Paul maintains a blog and wishes to write his blog in such a way that tools can pick up his blog post tags, authors, titles, and his blogroll directly from his blog, so that he does not need to maintain a parallel version of his data in a "structured format." In other words, his HTML blog should be usable as its own structured feed.
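
To make "usable as its own structured feed" more tangible, here is a small sketch that reads feed-like entries straight out of blog HTML using Python's standard `html.parser`. The markup convention it relies on (each post in an `article` element, with the title in an `h2` and the date in a `time` element's `datetime` attribute) is an assumption chosen for the example, as are the class and function names.

```python
from html.parser import HTMLParser


class FeedExtractor(HTMLParser):
    """Collect one {title, published} record per <article> element."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._current = None    # record for the article being parsed
        self._in_title = False  # True while inside the article's <h2>

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._current = {"title": "", "published": None}
        elif self._current is not None and tag == "h2":
            self._in_title = True
        elif self._current is not None and tag == "time":
            self._current["published"] = dict(attrs).get("datetime")

    def handle_data(self, data):
        if self._in_title and self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False
        elif tag == "article" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.entries.append(self._current)
            self._current = None


def extract_entries(html: str):
    """Return the feed-like entries found in a blog page."""
    parser = FeedExtractor()
    parser.feed(html)
    return parser.entries
```

A parser like this only works because of the assumed convention; the use case asks for a convention that is standard enough that Paul does not have to publish a parallel feed at all.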

USE CASE: Allow users to compare subjects of blog entries when the subjects are hard to tersely identify relative to other subjects in the same general area.

SCENARIOS:

  • Paul blogs about proteins and genes. His colleagues also blog about proteins and genes. Proteins and genes are identified by long hard-to-compare strings, but Paul and his colleagues can determine if they are talking about the same things by having their user agent compare some sort of flags embedded in the blogs.
  • Rob wants to publish a large vocabulary in RDFS and/or OWL. Rob also wants to provide a clear, human readable description of the same vocabulary, that mixes the terms with descriptive text in HTML.

USE CASE: Allow blogs to be aggregated along subject lines.

SCENARIOS:

  • At University of Mary Washington, many faculty encourage students to blog about their studies to encourage more discussion using an instance of WordPress MultiUser. A student with a blog might be writing posts relevant to more than one class. Professors would like to then aggregate relevant posts into one blog.

Data extraction from sites without explicit cooperation from those sources

USE CASE: Getting data out of poorly written Web pages, so that the user can find more information about the page's contents.

SCENARIOS:

  • Alfred merges data from various sources in a static manner, generating a new set of data. Bob later uses this static data in conjunction with other data sets to generate yet another set of static data. Julie then visits Bob's page later, and wants to know where and when the various sources of data Bob used come from, so that she can evaluate its quality. (In this instance, Alfred and Bob are assumed to be uncooperative, since creating a static mashup would be an example of a poorly-written page.)
  • TV guide listings - If the TV guide provider does not render a link to IMDB, the browser should recognise TV shows and give implicit links. (In this instance, it is assumed that the TV guide provider is uncooperative, since it isn't providing the links the user wants.)
  • Students and teachers should be able to discover each other -- both within an institution and across institutions -- via their blogging. (In this instance, it is assumed that the teachers and students aren't cooperative, since they would otherwise be able to find each other by listing their blogs in a common directory.)
  • Tim wants to make a knowledge base seeded from statements made in Spanish and English, e.g. from people writing down their thoughts about George W. Bush and George H.W. Bush. (In this instance, it is assumed that the people writing the statements aren't cooperative, since if they were they could just add the data straight into the knowledge base.)

REQUIREMENTS:

  • Does not need cooperation of the author (if the page author was cooperative, the page would be well-written).
  • Shouldn't require the consumer to write XSLT or server-side code to derive this information from the page.

USE CASE: Remove the need for RDF users to restate information in online encyclopedias (i.e. replace DBpedia).

SCENARIOS:

  • A user wants to have information in RDF form. The user visits Wikipedia, and his user agent can obtain the information without relying on DBpedia's interpretation of the page.

REQUIREMENTS:

  • All the data exposed by DBpedia should be derivable from Wikipedia without using DBpedia.