
Microdata Problem Descriptions

From WHATWG Wiki
Revision as of 19:47, 22 April 2009 by Hixie (talk | contribs)

Improving the browsing experience

USE CASE: Web browsers should be able to help users find information related to the items discussed by the page that they are looking at.

  • Should be discoverable, because otherwise users will not use it, and thus users won't be helped
  • Should be consistently available, because if it only works on some pages, users will not use it (see, for instance, the rel=next story)
  • Should be bootstrappable (rel=next failed because UAs didn't expose it because authors didn't use it because UAs didn't expose it)


  • Finding more information about a movie when looking at a page about the movie, when the page contains detailed data about the movie.
  • Exposing music samples on a page so that a user can listen to all the samples.
  • Exposing calendar events so that users can add those events to their calendaring systems.
    • Should be compatible with existing calendar systems
    • Should be unlikely to get out of sync with prose on the page

USE CASE: Getting data out of poorly written Web pages, so that the user can find more information about the page's contents.


  • Does not need cooperation of the author (if the page author was cooperative, the page would be well-written).

Improving the search experience

USE CASE: Search engines should be able to determine the contents of pages with more accuracy than today.



  • A standard way to include arbitrary data in a web page and extract it for machine processing, without producers and consumers having to pre-coordinate their data models.

Site owners want a way to provide enhanced search results to the engines, so that an entry in the search results page is more than just a bare link and snippet of text, and provides additional resources for users straight on the search page without them having to click into the page and discover those resources themselves. For example (taken directly from the SearchMonkey docs), yelp.com may want to provide additional information on restaurants they have reviews for, pushing info on price, rating, and phone number directly into the search results, along with links straight to their reviews or photos of the restaurant. Different sites will have vastly different needs and requirements in this regard, preventing natural discovery by crawlers from being effective. (SearchMonkey itself relies on the user registering an add-in on their Yahoo account, so spammers can't exploit this - the user has to proactively decide they want additional information from a site to show up in their results, then they click a link and the rest is automagical.)

Using SearchMonkey, developers and site owners can use structured data to make Yahoo! Search results more useful and visually appealing, and drive more relevant traffic to their sites.

For example, if a person's name and contact details are marked up on a web page using hCard, the user-agent can offer to, say, add the person to your address book, or add them as a friend on a social networking site, or add a reminder about that person's birthday to your calendar. If an event is marked up on a web page using hCalendar, then the user-agent could offer to add it to a calendar, or provide the user with a map of its location, or add it to a timeline that the user is building for their school history project. Providing rich semantics for the information on a web page allows the user-agent to know what's on a page, and step in and perform helpful tasks for the user.

A single unified parsing algorithm: separate parsers need to be created for hCalendar, hReview, hCard, etc, as each Microformat has its own unique parsing quirks. For example, hCard has N-optimisation and ORG-optimisation which aren't found in hCalendar. With RDFa, a single algorithm is used to parse everything: contacts, events, places, cars, songs, whatever.

Decentralised development is possible: if I want a way of marking up my iguana collection semantically, I can develop that vocabulary without having to go through a central authority. Because URIs are used to identify vocabulary terms, I can be sure that my vocabulary won't clash with other people's vocabularies. It can be argued that going through a community to develop vocabularies is beneficial, as it allows the vocabulary to be built by "many minds" - RDFa does not prevent this, it just gives people alternatives to community development.

Lastly, there are a lot of parsing ambiguities for many Microformats. One area which is especially fraught is that of scoping. The editors of many current draft Microformats[1] would like to allow page authors to embed licensing data - e.g. to say that a particular recipe for a pie is licensed under a Creative Commons licence. However, it has been noted that the current rel=license Microformat cannot be re-used within these drafts, because virtually all existing rel=license implementations will just assume that the license applies to the whole page rather than just part of it. RDFa has strong and unambiguous rules for scoping - a license, for example, could apply to a section of the page, or one particular image.
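The contrast above can be sketched in a few lines: a single attribute-driven pass extracts name-value pairs regardless of vocabulary, where microformats would need a dedicated parser per format. The `property` attribute and the sample markup below are illustrative only (real RDFa also has rel/rev, chaining, and prefix rules this sketch ignores).

```python
from html.parser import HTMLParser

class PropertyExtractor(HTMLParser):
    """One generic pass that collects (property, value) pairs from any
    vocabulary -- contacts, events, licences -- with no format-specific
    quirks such as hCard's N-optimisation."""
    def __init__(self):
        super().__init__()
        self.pairs = []
        self._prop = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "property" in attrs:
            self._prop = attrs["property"]
            self._buf = []

    def handle_data(self, data):
        if self._prop is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._prop is not None:
            self.pairs.append((self._prop, "".join(self._buf).strip()))
            self._prop = None

# Invented sample markup: any vocabulary term works, no new parser needed.
html_doc = ('<p><span property="name">Napoleon</span> was born in '
            '<span property="birthplace">Ajaccio</span>.</p>')
p = PropertyExtractor()
p.feed(html_doc)
print(p.pairs)  # [('name', 'Napoleon'), ('birthplace', 'Ajaccio')]
```

The same extractor would serve a review, a recipe, or an event unchanged; only the vocabulary terms differ.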

As a trivial example, it would be useful to me in working to improve the Web content we produce at Opera to have a nice mechanism for identifying the original source of various parts of a page.

Another use case is noting the source of data in mashups. This enables information to be carried about the licensing, the date at which the data was mashed (or smushed, to use the older terminology from the Semantic Web), and so on.

Provide an easy mechanism to encode new data in a way that can be machine-extracted without requiring any explanation of the data model.

Another example is that certain W3C pages (the list of specifications produced by W3C, for example, and various lists of translations) are produced from RDF data that is scraped from each page through a customised and thus fragile scraping mechanism. Being able to use RDFa would free authors of the draconian constraints on the source-code formatting of specifications, and merely require them to use the right attributes, in order to maintain this data.

An example of how this data can be re-used is that it is possible to determine many of the people who have translated W3C specifications or other documents - and thus to search for people who are familiar with a given technology at least at some level, and happen to speak one or more languages of interest.

Alternatively I could use the same information to seed a reputation manager, so I can determine which of the many emails I have no time to read in WHAT-WG might be more than usually valuable.

Sure. In which case the problem becomes "doing mashups where data needs to have different metadata associated is impossible", so the requirement is "enable mashups to carry different metadata about bits of the content that are from different sources".

There are mapping organisations and data producers and people who take photos, and each may place different policies. Being able to keep that policy information helps people with further mashups avoiding violating a policy. For example, if GreatMaps.com has a public domain policy on their maps, CoolFotos.org has a policy that you can use data other than images for non-commercial purposes, and Johan Ichikawa has a photo there of my brother's café, which he has licensed as "must pay money", then it would be reasonable for me to copy the map and put it in a brochure for the café, but not to copy the data and photo from CoolFotos. On the other hand, if I am producing a non-commercial guide to cafés in Melbourne, I can add the map and the location of the cafe photo, but not the photo itself.

Another use case: My wife wants to publish her papers online. She includes an abstract of each one in a page, but because they are under different copyright rules, she needs to clarify what the rules are. A harvester such as the Open Access project can actually collect and index some of them with no problem, but may not be allowed to index others. Meanwhile, a human finds it more useful to see the abstracts on a page than have to guess from a bunch of titles whether to look at each abstract.

I) User agents must allow users to see that there are "semantic links" (connections to semantically structured information) in an HTML document/application. Consequently, user agents must allow users to "follow" the semantic link (access/interact with the linked data, embedded or external), and this involves primarily the ability to: a) view the information; b) select the information; c) copy the information to the clipboard; d) drag and drop the information; e) send that information to another web application (or to OS applications) selected by the user.

II) User agents must allow users to "semantically annotate" an existing HTML document (insert a semantic link and linked data), and this involves primarily the ability to: a) edit the document to insert semantically structured information (starting from the existing text or from information already structured in the edited portion of the page); b) send the result of the editing to another web application (or to OS applications) selected by the user.

- Allow authors to embed annotations in HTML documents such that RDF triples can be unambiguously extracted from human-readable data without duplicating the data, and thus ensuring that the machine-readable data and the human-readable data remain in sync.

One problem this can solve is that an agent can, given a URL that represents a person, extract some basic profile information such as the person's name along with references to other people that person knows. This can further be applied to allow a user who provides his own URL (for example, by signing in via OpenID) to bootstrap his account from existing published data rather than having to re-enter it.

- Allow software agents to extract profile information for a person as often exposed on social networking sites from a page that "represents" that person. There are a number of existing solutions for this:
  • FOAF in RDF serialized as XML, Turtle, RDFa, eRDF, etc
  • The vCard format
  • The hCard microformat
  • The PortableContacts protocol[3]
  • Natural Language Processing of HTML documents
- Allow software agents to determine who a person lists as their friends given a page that "represents" that person. Again, there are competing solutions:
  • FOAF in RDF serialized as XML, Turtle, RDFa, eRDF, etc
  • The XFN microformat[4]
  • The PortableContacts protocol[3]
  • Natural Language Processing of HTML documents
- Allow the above to be encoded without duplicating the data in both machine-readable and human-readable forms.
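As a rough illustration of the XFN approach listed above, friendship links can be harvested with a generic HTML pass over rel attributes. The rel values shown are a simplified subset of the real XFN vocabulary, and the profile markup is invented.

```python
from html.parser import HTMLParser

class XFNFriends(HTMLParser):
    """Collect targets of links whose rel attribute includes an XFN
    relationship value (simplified subset of the XFN vocabulary)."""
    XFN_VALUES = {"friend", "acquaintance", "contact"}  # illustrative subset

    def __init__(self):
        super().__init__()
        self.friends = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = set((attrs.get("rel") or "").split())
        if tag == "a" and rel & self.XFN_VALUES and "href" in attrs:
            self.friends.append(attrs["href"])

# Hypothetical profile page: ordinary links are ignored, XFN links kept.
profile = ('<a rel="friend met" href="http://a.example/jo">Jo</a> '
           '<a href="http://b.example/ads">ad</a> '
           '<a rel="contact" href="http://c.example/sam">Sam</a>')
x = XFNFriends()
x.feed(profile)
print(x.friends)  # ['http://a.example/jo', 'http://c.example/sam']
```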

I'm a scholar and teacher. I want to be able to add structured data to my web site to denote who I am, what I have published, and what I teach, in such a way that other scholars (and potentially students) can easily extract that information to add it to their contact databases, or to their bibliographic applications, or whatever. This involves contact data, for sure, but also other, domain-specific, data as well, and so presumes a flexible and extensible model and syntax.

I might also want to publish scholarly documents or content that includes extensive citations that readers can then automatically extract or otherwise process (say, have a little extension in a browser that helps me find them in my local university library). These citations may come from a wide range of sources: an interview posted on YouTube, a legal opinion posted on the Supreme Court web site, a press release from the White House. Again, the data may well involve some basic properties of the sort that you might see in Dublin Core (title, date, creator, etc.), but it also will include more domain-specific data (information about court reporters, case numbers, etc.).

The use cases for RDF and RDFa are really that basic. Structured data is going to be increasingly important to the practical work that happens around the web, and an extensible system is essential to realizing that real-world potential.

Service and product providers can't include the meaning of the things they publish in HTML (how do you find out where the price of a book is located in, say, an Amazon page?) - people that want to use this data are forced to perform screen scraping (that is, the need for publisher-intended rather than consumer-guessed semantics).

People doing (data) mash-ups need to learn a plethora of APIs/formats while all they might want is one data format and a bunch of vocabularies

Expressing machine-readable copyright licensing terms and related information, in a way that is both easy for content creators and publishers to provide, and more convenient for users (and tools) to consume, extend, and redistribute.

I want to express structured data (who-am-i, who-do-i-know, how-do-you-contact-me, this-page describes_a Molecule & Molecule's name_is "Carbon")

I want to provide a human readable interpretation of my data

I want to provide a machine readable interpretation of my data

I do not want to write XSLT or server-side code to transform my data if I don't have to

I do not want to have two URLs with the exact same information, one for humans and one for robots

Wikipedia - why should Wikipedia and DBPedia have to exist? Why aren't they the same thing?

Any government site, scientific journal, or document which describes data which is more complicated than a TABLE tag can express - how come I'm living in a cut and paste generation, and cannot collect this information for later use? How come a search engine can't do it for me?

TV guide listings - how come my browser is too stupid to collect facts about when a TV show is on from a webpage when I'm staring at it? Why would I have to subscribe to an RSS or iCalendar feed, each with different odds and ends and formatting, for something as simple as that - if the people providing TV guides actually provide such a thing? More importantly, if the TV guide provider does not render a link to IMDB, how come I can't have a nice extension to my browser which recognises TV shows and gives me implicit links that work on all sites that buy into the tv-show vocabulary?

Problem: I need to assert that student blog X is part of course Y. (Current workaround: ugly tags/categories on posts like "engl101spr09sect05" (Notice the failure as soon as you look at more than one university.) Some people have started calling these "functional tags" -- a signal that people are collectively looking for some solution.) Compare also the tidier machine tags in flickr and elsewhere, or for: tags in delicious. At University of Mary Washington, many faculty encourage students to blog about their studies to encourage more discussion using our instance of WordPress MultiUser. And so a student might have a blog, and be writing posts relevant to more than one class. The professor then aggregates the relevant posts into one blog with a plugin like FeedWordpress, based on an agreed-upon tag like what I described. I think this might shed more light on your points (b) and (c), as well as the bigger question from Ian of why someone would do this.

At a recent unconference on using WordPress in a teaching environment, one of the big issues that the group of about 60 students and teachers identified was a need for ways to help students and teachers discover each other -- both within an institution and across institutions -- via their blogging [...]

Problem: I want to organize my blogs posts into topics (Current workaround: Tags and categories. Sometimes works great within a blog, but the inherent ambiguity makes it impossible to work reliably across blogs. (DBpedia and MOAT are our friends here))

Problem: I need to know the provenance of this blog post. This comes up in the increasing frequency of reposting blog content from their RSS feeds through things like the FeedAPI module for Drupal or the FeedWordpress and WP-O-Matic plugins for WP. I think that the Atom spec calls for identifiers for content, but that info doesn't get reproduced in the repostings. If a URI were part of the content, and that content can reliably be reposted -- specs are good here -- problem solved (well, not quite, but closer!).

"I have an account on social networking site A. I go to a new social networking site B. I want to be able to automatically add all my friends from site A to site B." There are presumably other requirements, e.g. "site B must not ask the user for the user's credentials for site A" (since that would train people to be susceptible to phishing attacks). Also, "site A must not publish the data in a manner that allows unrelated users to obtain privacy-sensitive data about the user", for example we don't want to let other users determine relationships that the user has intentionally kept secret [1]. [1] http://w2spconf.com/2008/papers/s3p2.pdf

For example, if I copy a sentence from Wikipedia and paste it in some word processor, it would be great if the word processor offered to automatically create a bibliographic entry.

If I copy the name of one of my Facebook "friends" and paste it into my OS address book, it would be cool if the contact information was imported automatically. Or maybe I pasted it in my webmail's address book feature, and the same import operation happened.

If I select an E-mail in my webmail and copy it, it would be awesome if my desktop mail client would just import the full E-mail with complete headers and different parts if I just switch to the mail client app and paste.

1. Service and product provider can't include the meaning of the things they publish in HTML. For example, how do you find out where the price of a book is located in, say, a page from Amazon? Now, people that want to use this data are forced to perform *screen scraping*, that is, there is a need for publisher-push rather than consumer-pull semantics.

2. People doing data mash-ups need to learn a plethora of APIs/formats while all they would likely want is *one data model* and a bunch of vocabularies covering the domain.

When writing HTML (by hand or indirectly via a program) I want to isolate and describe what the content is about in terms of people, places, and other real-world things. I want to isolate "Napoleon" from a paragraph or heading, and state that the aforementioned entity is of type "Person" and is associated with another entity, "France". The use-case above is like taking a highlighter and making notes while reading about "Napoleon". This is what we all do when studying, but when we were kids, we never actually shared that part of our endeavors, since it was typically the route to competitive advantage, i.e., being top student in the class.

Simple programs should thus be able to answer questions like:

  • Under what license has a copyright holder released her work, and what are the associated permissions and restrictions?
    • Can I redistribute this work for commercial purposes?
    • Can I distribute a modified version of this work?
    • How should I assign credit to the original author?

Paul maintains a blog and wishes to "mark up" his existing page with structure so that tools can pick up his blog post tags, authors, titles, and his blogroll, and so that he does not need to maintain a parallel version of his data in "structured format." His HTML blog should be usable as its own structured feed.

Paul sometimes gives talks on various topics, and announces them on his blog. He would like to mark up these announcements with proper scheduling information, so that his readers' software can automatically obtain the scheduling information and add it to their calendar. Importantly, some of the rendered data might be more informal than the machine-readable data required to produce a calendar event. Also of importance: Paul may want to annotate his event with a combination of existing vocabularies and a new vocabulary of his own design.

Tod sells an HTML-based content management system, where all documents are processed and edited as HTML, sent from one editor to another, and eventually published and indexed. He would like to build up the editorial metadata within the HTML document itself, so that it is easier to manage and less likely to be lost.

Tara runs a video sharing web site. When Paul wants to blog about a video, he can paste a fragment of HTML provided by Tara directly into his blog. The video is then available inline, in his blog, along with any licensing information (Creative Commons?) about the video.

Lucy is looking for a new apartment and some items with which to furnish it. She browses various web pages, including apartment listings, furniture stores, kitchen appliances, etc. Every time she finds an item she likes, she can point to it, extract the locally-relevant structured data, and transfer it to her apartment-hunting page, where it can be organized, sorted, and categorized.

Extracting relevant information from web pages is still a very manual process. Unless a particular site allows you to add items to a shopping cart or a "favorites list", it is very difficult to store relevant details for later use. The use of a web browser to remember items from multiple sites is even more daunting, usually resulting in dropping web tools in favor of desktop tools such as a text editor. There is no reason why copying concepts to a web-based clipboard should be so difficult - the idea has failed to gain traction until now because there has not been an easy-to-implement data model and mark-up mechanism allowing people to right-click and store items into a semantic clipboard.

Lucy could then use a website called TheBigMove.com to organize all aspects of her move, including items that she is tracking for the move. She would go to her "To Do" list and add the semantic objects she had cut from other places. To ensure that sites don't try and steal any of her web clipboard objects, she would be required to click a browser-activated button labeled "Upload Web Objects", which would ask her which web objects she would like to share with the web page.

Patrick keeps a list of his scientific publications on his web site. Using the BibTex vocabulary, he would like to provide structure within this publications page so that Ulrich, who browses the web with an RDFa-aware client, can automatically extract this information and use it to cite Patrick's papers without having to transcribe the bibliographic information.
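A sketch of the consuming side of this scenario, assuming Ulrich's client has already extracted the bibliographic name-value pairs from Patrick's page. The entry key, field names, and values below are invented, and real BibTeX has many more conventions than this covers.

```python
def to_bibtex(key, fields):
    """Format extracted name-value pairs as a BibTeX entry, so the
    bibliographic data never has to be transcribed by hand."""
    body = ",\n".join(f"  {name} = {{{value}}}"
                      for name, value in sorted(fields.items()))
    return f"@article{{{key},\n{body}\n}}"

# Hypothetical pairs an RDFa-aware client might pull from the page.
entry = to_bibtex("example2009", {
    "author": "Patrick Example",
    "title": "Structured Data in HTML",
    "year": "2009",
})
print(entry)
```

The same pairs could just as easily feed a citation manager or a reference list in another document format.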

A mechanism to mark up music, video and other digital content in a blog or website. The Bitmunk Firefox plug-in would then detect the purchase information required from the embedded meta-data in the same web page that the browser is viewing. For example, while browsing the Scissorkick website, it would be nice to be able to purchase the music directly from one's favorite online music store without leaving the page. Marking up the music information in a way that works across websites would hopefully help drive a universal set of tools to enable this use case.

It's difficult to price-check music and purchase it without having to go through a special website or application to acquire music.

A mechanism to annotate which proteins and genes one is referencing in a blog entry, so that colleagues can determine if they are talking about the same thing without having to read long series of numbers (or whatnot).

Paul wants to publish a large vocabulary in RDFS and/or OWL. Paul also wants to provide a clear, human readable description of the same vocabulary, that mixes the terms with descriptive text in HTML.

As a browser interface developer, I find it really annoying that I have to keep creating new screen scrapers for websites in order to build UIs that work with page data differently than the page developer intended. The data is all there on the page, but it takes a great amount of effort to extract it into a usable form. Even worse, the screen scraper breaks whenever a major update is made to the page, requiring me to solve the scraping problem yet again. Microformats were a step in the right direction, but I keep having to create a new parser and special rules for every new Microformat that is created. Every time I develop a new parser it takes precious time away from making the browser actually useful. Can we create a world where we don't have to worry about the data model anymore and instead focus on the UI?

Browser UIs for working with web page data suck. We can add an RSS feed and bookmark a page, but many other more complex tasks force us to tediously cut and paste text instead of working with the information on the page directly. It would increase productivity and reduce frustration for many people if we could use the data in a web page to generate a custom browser UI for calling a phone number using our cellphone, adding an event to our calendaring software, including a person in our address book, or buying a song from our favorite online music store.

How do you merge statements made in multiple languages about a single subject or topic into a particular knowledge base? If there are a number of thoughts about George W. Bush made by people that speak Spanish and there are a number of statements made by people that speak English, how do you coalesce those statements into one knowledge base? How do you differentiate those statements from statements made about George H.W. Bush? One approach would be to use a similar underlying vocabulary to describe each person and specify the person using a universally unique string. This would allow the underlying language to change, but ensure that the semantics of what is being expressed stays the same.
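The "universally unique string" approach above can be sketched directly: statements coalesce when keyed by a shared identifier, regardless of the language of the page that asserted them, while statements about a different identifier stay separate. The URIs and property names below are hypothetical.

```python
# Hypothetical identifiers that disambiguate the two Bushes.
GWB = "http://example.org/id/george_w_bush"
GHWB = "http://example.org/id/george_h_w_bush"

# Statements harvested from pages in different languages, all using a
# shared vocabulary plus a unique subject identifier.
statements = [
    (GWB, "ex:office", "43rd President"),   # asserted on an English page
    (GWB, "ex:birthplace", "New Haven"),    # asserted on a Spanish page
    (GHWB, "ex:office", "41st President"),
]

# Coalesce into one knowledge base keyed by subject URI.
kb = {}
for subject, name, value in statements:
    kb.setdefault(subject, []).append((name, value))

print(len(kb))       # 2 -- the two people remain distinct subjects
print(len(kb[GWB]))  # 2 -- claims from both languages coalesced
```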

When Google answers a search like "Who is Napoleon" you get an answer, but where is the disambiguation? How does it determine the context for the search? There are many dimensions to "Napoleon" and Google statistically guessed one based on link density of its subjectively assembled index and page rank algorithm. How do you as writer or reader efficiently navigate the many aspects/facets associated with the pattern: "Napoleon"? What if the answer you are looking for is in the statistically insignificant links and not the major links?

Sam has posted a video tutorial on how to grow tomatoes on his video blog. Jane uses the tutorial and would like to leave feedback to others that view the video regarding certain parts of the video she found most helpful. Since Sam has comments disabled on his blog, Jane cannot comment on the particular sections of the video other than linking to it from her blog and entering the information there. This is not useful to most people viewing the video as they would have to go to every blogger's site to read each comment. Luckily, Jane has a video player that is capable of finding comments distributed around blogs on the net. The video player shows the comments as a video is being watched (shown as sub-titles). How can Jane specify her comments on parts of a video in a distributed manner?


Arbitrarily extensible by authors

Mapping to RDF

Ability to create groups of name-value pairs (i.e. triples with a common subject) without requiring that the name-value pairs be given on elements with a common parent

Ability to have name-value pairs with values that are arbitrary strings, dates and times, URIs, and further groups of name-value pairs

Encoding of machine-readable equivalents for times, lengths, durations, telephone numbers, languages, etc


Discourage data duplication (e.g. discourage people from saying <title>...</title> <meta name="dc.title" content="..."> <h1 property="http://...dc...title">...</h1>)

The Microformats community has been struggling with the abbr design pattern when attempting to specify certain machine-readable object attributes that differ from the human-readable content. For example, when specifying times, dates, weights, countries and other data in French, Japanese, or Urdu, it is helpful to use the ISO format to express the data to a machine and associate it with an object property, but to specify the human-readable value in the speaker's natural language.
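HTML's <time> element illustrates one way out of the abbr design pattern: the machine-readable ISO value lives in an attribute while the element's content stays in the reader's natural language. A minimal extractor, assuming that attribute/content split:

```python
from html.parser import HTMLParser

class DateExtractor(HTMLParser):
    """Read the machine-readable value from the datetime attribute while
    keeping the rendered text untouched, so a French page can show
    '24 avril 2009' yet still expose an ISO date to tools."""
    def __init__(self):
        super().__init__()
        self.machine = None
        self.human = []
        self._in_time = False

    def handle_starttag(self, tag, attrs):
        if tag == "time":
            self._in_time = True
            self.machine = dict(attrs).get("datetime")

    def handle_data(self, data):
        if self._in_time:
            self.human.append(data)

    def handle_endtag(self, tag):
        if tag == "time":
            self._in_time = False

doc = '<p>Publié le <time datetime="2009-04-24">24 avril 2009</time>.</p>'
d = DateExtractor()
d.feed(doc)
print(d.machine)         # 2009-04-24
print("".join(d.human))  # 24 avril 2009
```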

  • Pages should be able to expose nested lists of name-value pairs on a page-by-page basis.
  • It should be possible to define globally-unique names, but the syntax should be optimised for a set of predefined vocabularies.
  • Adding this data to a page should be easy.
  • The syntax for adding this data should encourage the data to remain accurate when the page is changed.
  • The syntax should be resilient to intentional copy-and-paste authoring: people copying data into the page from a page that already has data should not have to know about any declarations far from the data.
  • The syntax should be resilient to unintentional copy-and-paste authoring: people copying markup from the page who do not know about these features should not inadvertently mark up their page with inapplicable data.
  • Generic syntax and parsing mechanism for Microformats
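The first two requirements in the list above can be sketched as a tiny parser that builds nested groups of name-value pairs from itemscope/itemprop-style attributes. This is a simplified model, not the microdata algorithm itself: it only covers groups nested under a common parent, and it ignores void elements and other edge cases.

```python
from html.parser import HTMLParser

class GroupParser(HTMLParser):
    """Build nested groups of name-value pairs: an element with
    itemscope opens a group, and an itemprop either names a string
    value or, combined with itemscope, names a nested group."""
    def __init__(self):
        super().__init__()
        self.items = []    # top-level groups found on the page
        self._groups = []  # stack of (depth, open group)
        self._depth = 0
        self._prop = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        self._depth += 1
        attrs = dict(attrs)
        if "itemscope" in attrs:
            group = {}
            if "itemprop" in attrs and self._groups:
                # A group used as the value of a name-value pair.
                self._groups[-1][1][attrs["itemprop"]] = group
            else:
                self.items.append(group)
            self._groups.append((self._depth, group))
        elif "itemprop" in attrs and self._groups:
            self._prop = attrs["itemprop"]
            self._buf = []

    def handle_data(self, data):
        if self._prop is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._prop is not None:
            self._groups[-1][1][self._prop] = "".join(self._buf).strip()
            self._prop = None
        if self._groups and self._groups[-1][0] == self._depth:
            self._groups.pop()  # the element that opened this group closed
        self._depth -= 1

# Invented markup: a cat with a nested group for its owner.
doc = ('<div itemscope>'
       '<span itemprop="name">Hedral</span>'
       '<div itemprop="owner" itemscope>'
       '<span itemprop="name">Sam</span>'
       '</div></div>')
g = GroupParser()
g.feed(doc)
print(g.items)  # [{'name': 'Hedral', 'owner': {'name': 'Sam'}}]
```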