Resource Identifiers

From WHATWG Wiki

Many websites make use of the same popular externally-loaded resources, such as images, style sheets, and JavaScript libraries. Many copies of these same exact files may be repeatedly loaded into a browser's cache, because although these resources are identical, they are locally served with different URLs. Thus, there is no way for the browser to know that:

http://www.my-site.com/javascript/scriptaculous-1.5.js

has the exact same content that it previously cached as:

http://www.your-site.com/js/script.aculo.us.js

This is a proposal for a new attribute of LINK, SCRIPT, IMG, and OBJECT elements (and possibly IFRAME elements?) that would allow unique, unchanging content to be identified and cached, helping to augment the normal caching mechanisms of web browsers (and speed up Web 2.0!).

In brief, my solution is the creation of a resource identifier:

<img src="my-site-images/valid-html5.png"
     res="org.html5.simon.valid-html5.png" />

Use Case Description

The most useful application of a resource identity attribute would be large, popular JavaScript libraries, but let us consider images first.

Icon Badges

Many websites today show icon badges, small common images that serve a variety of purposes, such as the "W3C valid" pictures, the "Get Adobe Reader" and "Get Adobe Flash Player" links, the Linux penguin, etc. Although these images are small, repeatedly downloading them adds to the overall amount of traffic on the Internet.

In the case of the W3C, they permit a direct link to their servers:

<p>
     <a href="http://validator.w3.org/check/referer"><img
        src="http://www.w3.org/Icons/valid-xhtml10"
        alt="Valid XHTML 1.0!" height="31" width="88" /></a>
 </p>

This is a good thing. It means that every single web page that displays a W3C validation badge is caching the image by the same URL, http://www.w3.org/Icons/valid-xhtml10. Thus, every website you navigate to that uses these images is essentially sharing one resource, which the browser knows it has already cached under the same resource ID, the URL.

The problem is that when a particular icon badge does happen to go out of cache, or when it is being downloaded for the first time, just one website has to handle the load of all calls to get that image. For a large and important site like the World Wide Web Consortium, this may not be a problem, but for many other sites, this can dramatically increase bandwidth costs.

The website thiefware.com, which promotes ethical uses of software, provides free "This site is ThiefWare Free!" icon badges. As a smaller site, they ask their users not to hotlink to these images, as they would not be able to handle the bandwidth costs. Thus, users of these images download them to their own websites and serve them locally. This is a good thing for the web servers, as no one site has to handle all requests for these images. However, a web browser now has no good way to know that one of these images is already in its cache when it encounters the image at two or more websites, so it downloads an image that it already has locally.

Style Sheets

Many webmasters today take advantage of open source software around which an entire site can be built, such as WordPress installations for blogging and MediaWiki installations for wikis. Although each installation can have its components modified, many are left unmodified, thus:

http://en.wikipedia.com/skins-1.5/monobook/main.css?97

and

http://wiki.whatwg.org/skins-1.5/monobook/main.css?97

may carry the exact same style information, but again, there is no way for a web browser to know (and be assured of) this. Thus, someone who navigates to many different blogging sites and wiki sites may be needlessly downloading the same style sheets over and over again, without the browser realizing that the exact same file is already available locally.

JavaScript Libraries

The best reason for enabling a common resource identifying solution would be to cut the costs associated with browsers downloading the large JavaScript libraries. One of the most oft-heard complaints about these libraries concerns their size. This has been a major hindrance to widespread adoption of these wonderful web application-building code repositories.

These libraries are very large for a number of reasons, including cross-scripting concerns, the long name identifiers specified by the ECMAScript and DOM specifications, and JavaScript not being quite as powerful "out of the box" as many scripters would like it to be.

If every website that made use of, for instance, Dojo Toolkit, Prototype JavaScript Framework, jQuery, or Script.aculo.us could rely on that library likely being cached already in the user's browser, webmasters would use these libraries more wholeheartedly.

Other Media

A reusable Flash "button" whose behavior is set via HTML would be another example. It would be nice for a browser to be able to cache a commonly used, HTML OBJECT-tag configurable Flash file.

Proposed Solutions

My Solution

I would like to propose a resource identifier attribute, named res for the sake of brevity:

<link rel="stylesheet" type="text/css"
      href="local/pretty-divs.css"
      res="com.pretty-divs.pretty-divs-1.5" />
<script type="text/javascript"
        src="my-scripts/prototype1.6.js"
        res="org.prototypejs.assets.2007.11.6.prototype.js"></script>
<img src="images/madeonamac20050720.gif"
     res="com.apple.images.web-badges.madeonamac20050720.gif" />
<object ... res="com.newgrounds.numa-numa-forever.swf" />
<iframe src="local/copy/of/something.html"
        res="org.gov.mil.something.html" />

So the idea is very simple. If a link, script, img, object, or iframe tag (any other tags?) has a res attribute, the browser would first check to see if it has that resource "filed" in its cache under the res value. If it does, great; if it doesn't, then it might look for that resource in its cache filed under the href/src attribute. Finally, having found the resource under neither value, it would download the resource from the href/src specified.
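
The lookup order described above can be sketched as a toy cache with two indexes. This is only an illustration under stated assumptions, not a browser implementation: the ResourceCache class, its load method, and the fetch callback are hypothetical names, and filing a freshly obtained copy under its res value is an assumption about how the res index would get populated.

```python
from typing import Callable, Dict, Optional


class ResourceCache:
    """Toy browser cache with two indexes: one by res ID, one by URL."""

    def __init__(self) -> None:
        self.by_res: Dict[str, bytes] = {}
        self.by_url: Dict[str, bytes] = {}

    def load(self, url: str, res: Optional[str],
             fetch: Callable[[str], bytes]) -> bytes:
        # 1. Prefer a copy filed under the res value, if one was given.
        if res is not None and res in self.by_res:
            return self.by_res[res]
        # 2. Otherwise fall back to the ordinary URL-keyed cache entry.
        if url in self.by_url:
            body = self.by_url[url]
        else:
            # 3. Download from href/src only; res is never dereferenced.
            body = fetch(url)
            self.by_url[url] = body
        # Assumption: file the copy under res so other sites can reuse it.
        if res is not None:
            self.by_res[res] = body
        return body


if __name__ == "__main__":
    downloads = []

    def fake_fetch(url: str) -> bytes:
        downloads.append(url)
        return b"/* prototype 1.6 */"

    cache = ResourceCache()
    res_id = "org.prototypejs.assets.2007.11.6.prototype.js"
    cache.load("http://www.my-site.com/my-scripts/prototype1.6.js",
               res_id, fake_fetch)
    cache.load("http://www.your-site.com/js/prototype.js",
               res_id, fake_fetch)
    print(len(downloads))  # how many requests reached the network
```

In this sketch the second site's request is answered from the res index, so only the first URL is ever fetched. Note that nothing here checks that the bytes filed under a res value are genuine.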

It would not try to download the resource from the original author based on the res attribute, for one of the design goals here is to prevent any one web server from bearing the majority of the bandwidth costs (though it might use the res attribute to obtain a checksum or signature of some kind; see #Security Concerns).

To emphasize that res is a static ID rather than a URI, I recommend the use of Java package-style namespaces:

com.(companyName|authorName|etc).resourceName.etc...
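
As a rough illustration of treating res as a static ID rather than a URI, the shape of a value could be checked mechanically. The function name looks_like_res_id and its rules (at least three non-empty dot-separated labels, no URI punctuation, no whitespace) are assumptions made for this sketch; the proposal itself does not define a grammar.

```python
def looks_like_res_id(res: str) -> bool:
    """Rough shape check: dotted labels, no URI punctuation."""
    # A res value is meant to be a static ID, so reject URI-like strings.
    if any(ch in res for ch in "/:?#") or any(ch.isspace() for ch in res):
        return False
    labels = res.split(".")
    # Expect at least a top-level label, an owner, and a resource name.
    return len(labels) >= 3 and all(labels)


if __name__ == "__main__":
    for candidate in (
        "org.prototypejs.assets.2007.11.6.prototype.js",
        "http://www.my-site.com/javascript/scriptaculous-1.5.js",
    ):
        print(candidate, looks_like_res_id(candidate))
```

This checks only the shape of the string. It says nothing about whether two sites using the same ID really serve the same bytes.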

Because it would be very important for two resource IDs to truly indicate the exact same resource, I imagine they should always include exact versioning information:

<script ... res="uk.gov.oxford.stiffUpperLip-1.5.js"></script>

Security Concerns

A malicious hacker might try to create "Trojan horses" along the following lines:

<script ... res="com.everyone.trusts.this.script-1.0.js"
            src="my/evil/javascript.js" ></script>

This Trojan could then perhaps emulate the real script (so that the user doesn't suspect anything is wrong) but might do other things for its own nefarious purposes. In order for this to be effective, the victim would have to visit the hacker's website before any other website using com.everyone.trusts.this.script-1.0.js, so that his script would be the one that gets cached under that resource ID.

I am not certain what a good solution to this would be. Perhaps a central registry of digital signatures and/or checksums for trusted resources would be the (very complicated) solution.
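
One way the checksum idea might look in practice is sketched below. Everything in it is hypothetical: the REGISTRY mapping stands in for the central registry, SHA-256 is chosen only for illustration, and may_cache_under_res is an invented helper. How a real registry would be hosted, trusted, and kept current is left open, as in the text above.

```python
import hashlib
from typing import Dict

# Hypothetical registry: res ID -> SHA-256 hex digest of the trusted bytes.
TRUSTED_SCRIPT = b"/* the real, trusted script */"
REGISTRY: Dict[str, str] = {
    "com.everyone.trusts.this.script-1.0.js":
        hashlib.sha256(TRUSTED_SCRIPT).hexdigest(),
}


def may_cache_under_res(res: str, body: bytes,
                        registry: Dict[str, str] = REGISTRY) -> bool:
    """Return True only if body matches the digest registered for res."""
    expected = registry.get(res)
    if expected is None:
        # Unknown ID: do not share this copy across sites.
        return False
    return hashlib.sha256(body).hexdigest() == expected


if __name__ == "__main__":
    res_id = "com.everyone.trusts.this.script-1.0.js"
    evil = b"/* my/evil/javascript.js */"
    print(may_cache_under_res(res_id, TRUSTED_SCRIPT))
    print(may_cache_under_res(res_id, evil))
```

Under these assumptions, the Trojan script from the example above would fail the digest comparison and would never be filed under the trusted res value, regardless of which site the victim visits first.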