The Evolution of RSS

The Standards and People Behind RSS

Before we delve into the various versions of RSS, let's first take a look at the standards and people behind RSS.

XML

RSS files are XML files. XML (Extensible Markup Language) is a set of rules for defining syntactic tags that break a document into parts and identify the different parts of the document. It is a meta-markup language that defines a syntax used to define other domain-specific, structured markup languages (like RSS). XML is designed to be easily processed by computers, for storing and exchanging data.

XML gives data structure and attaches meta information to it, which lets the same content be reused in a wide variety of applications. However, all parties that exchange XML data must agree on a shared vocabulary (the syntax and semantics of the data); otherwise a receiver cannot interpret what a sender emits. The XML 1.0 spec provides a mechanism for this in the DTD, which describes the tags and hierarchy that the XML data may include. Unlike HTML, which has a fixed set of tags, XML lets you create your own.
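To make the point about shared vocabularies concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The feed fragment and its tag names are illustrative assumptions, not a complete RSS document. A reader that expects the same tag names as the writer finds the data; a reader that expects a different name finds nothing.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment: the tag names are illustrative only.
FEED = """<channel>
  <title>WebReference</title>
  <item>
    <title>The Evolution of RSS</title>
    <link>https://www.webreference.com/</link>
  </item>
</channel>"""

root = ET.fromstring(FEED)

# Both sides agree that a channel has a <title> and a list of <item>s.
channel_title = root.findtext("title")
item_titles = [item.findtext("title") for item in root.findall("item")]

# A reader using a different vocabulary (<headline>) gets nothing back.
missing = root.findtext("headline")

print(channel_title, item_titles, missing)
```

The document is well-formed in both cases; what differs is whether the two parties use the same names for the same things.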

DTDs

RSS 0.91 and 0.92 are based on Document Type Definitions (DTDs). A DTD defines rules that constrain the structure of an XML document or set of XML documents. The DTD lists all legal markup and specifies where and how the markup may be included in a document. Example:

<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
    "https://my.netscape.com/publish/formats/rss-0.91.dtd">

Particular document instances can be compared to the DTD. A document that matches the constraints is said to be valid. This is how some RSS validators work.

RDF

In contrast to Netscape's RSS 0.91, which uses a DTD, RSS 1.0 is (as RSS 0.9 was) an application of the Resource Description Framework (RDF). RDF is a framework for describing and interchanging metadata. The framework is extensible and allows new types of entities to be added. It gives meaning to Web resources so that they can be processed automatically.

RDF is a declarative language and provides a standard way of using XML to represent metadata in the form of statements about properties and relationships of items on the Web. These items, known as resources, can be almost anything, provided they have a Web address. Resources are identified by a Uniform Resource Identifier (URI). The most common form of URI is a URL, like https://www.webreference.com/. You can associate metadata with a Web page, a graphic, an audio file, and so on.

Through RDF, independent communities can develop vocabularies that suit their specific needs, and share vocabularies with other communities. In order to share vocabularies, term meanings must be spelled out in detail. The descriptions of these vocabulary sets are called XML Schemas.

XML Schemas

XML Schemas may eventually supplant DTDs as the primary mechanism for constraining XML data. An XML Schema, which is itself written as an XML document, serves the same function as a DTD while correcting some of its limitations. While DTDs constrain the type of tag that goes into an XML document, they have no way to constrain the range of values of a given attribute (e.g., an age between 0 and 150 years). Schemas give developers more powerful data typing for both element content and attribute values.

XML Schemas supplement the basic DTD mechanism included in XML Version 1.0 with a more rigorous framework for declaring structure and contents of XML documents. XML Schemas provide much stronger data typing for attribute values that DTDs lump together as CDATA. Consider the following XML fragment:

<!ELEMENT birthday (#PCDATA)>
...
<birthday>Green</birthday>

Since XML 1.0 doesn't support the inclusion of semantic information about the format of character data, the XML parser wouldn't know that the value Green was supposed to be a date field. But the new XML Schema language allows extended information about the type of character data:

<element name="birthday" type="date"/>

disambiguating the type of the element "birthday." The date range could also be constrained.
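The effect of a typed element can be imitated in a few lines of Python. This is only a sketch of the idea, not an XML Schema validator: it assumes birthdays are written as YYYY-MM-DD and uses the standard library's date.fromisoformat as a stand-in for the schema's date type. The function names and the range limits are hypothetical.

```python
from datetime import date


def is_valid_date(text: str) -> bool:
    """Return True if date.fromisoformat accepts text as a calendar date."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


def birthday_in_range(text: str, earliest: date, latest: date) -> bool:
    """Check the type first, then the range constraint (bounds inclusive)."""
    return is_valid_date(text) and earliest <= date.fromisoformat(text) <= latest


# A DTD would accept both values as character data; a typed check does not.
print(is_valid_date("1970-05-12"))  # a well-formed date string
print(is_valid_date("Green"))       # rejected: not a date
print(birthday_in_range("1970-05-12", date(1900, 1, 1), date(2000, 1, 1)))
```

A real schema processor performs this kind of check for you while it reads the document, so the application never sees "Green" where a date was declared.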

A schema defines the meaning, characteristics, and relationships of a set of properties. The RDF language allows each document containing metadata to clarify which vocabulary is being used by assigning each vocabulary a Web address. The schema specification language is a declarative representation language influenced by ideas from knowledge representation (e.g. semantic nets, frames, predicate logic) as well as database schema specification languages, and graph data models. RDF uses the idea of the XML namespace to effectively allow RDF statements to reference a particular RDF vocabulary or "schema."

Namespaces

What if you wanted to include elements or attributes from different document types? You can't combine multiple DTDs for a single document, but you can use a feature called "namespaces." An XML namespace is a collection of names, identified by a URI reference, which are used in XML documents as element types and attribute names.

Namespaces disambiguate elements with the same name by assigning elements and attributes to URIs. They group all related elements and attributes from a single XML application together so software can recognize them easily. Namespaces help avoid element name collisions that would confuse XML applications. Namespace-based modularization provides compartmentalized extensibility, which is how RSS 1.0 can be extended.
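Here is a small sketch of that disambiguation, again with Python's xml.etree.ElementTree. The two namespace URIs (under example.org) and the element contents are hypothetical; the point is only that two elements both named "title" remain distinct once each is bound to its own URI.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, used purely as identifiers.
FEED_NS = "https://example.org/feed#"
BOOK_NS = "https://example.org/book#"

DOC = f"""<doc xmlns:feed="{FEED_NS}" xmlns:book="{BOOK_NS}">
  <feed:title>WebReference</feed:title>
  <book:title>A Hypothetical Book</book:title>
</doc>"""

root = ET.fromstring(DOC)
prefixes = {"feed": FEED_NS, "book": BOOK_NS}

# The same local name resolves to different elements per namespace.
feed_title = root.findtext("feed:title", namespaces=prefixes)
book_title = root.findtext("book:title", namespaces=prefixes)

# Internally the parser stores each tag as "{namespace-URI}local-name".
tags = [child.tag for child in root]

print(feed_title, book_title, tags)
```

Note that the prefix (feed, book) is only shorthand inside one document; software matches on the URI, so two documents may use different prefixes for the same vocabulary.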

R.V. Guha, co-creator of RDF, RSS 0.9 and 1.0, says, "Namespaces allow distributed extensibility, which avoids conflicts as different people add different tags."

Jonathan Eisenzopf, creator of the XML::RSS module and co-author of the RSS 1.0 spec, says, "Namespaces give people on the Web the ability to naturally extend the RSS spec to meet their specific needs as opposed to a process where we are sort of generally defining the tags we want to put in, and forcing everybody to use it. RSS 1.0 combines extensibility with simplicity."

Aaron Swartz, RSS-Dev working group member and co-author of the RSS 1.0 spec, says, "The first version of RSS (0.9) included namespaces and RDF. RSS 0.91 took those out, but added in more information like description and publisher. RSS didn't seem to be going anywhere, so an international group of RSS users and developers (RSS-DEV) formed to move it along. We took the best bits from 0.9 and from 0.91 and put them together for 1.0."

At its core, RDF data consists of elements and attached attribute/value pairs. Elements can be any Web resource described by a URI. Attributes are named properties of the elements, which have values. Here's an example RDF document:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="https://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:dc="https://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="https://webref.com/authoring/languages/xml/rss/1/">
    <dc:creator rdf:resource="https://www.aking.com/"/>
  </rdf:Description>
</rdf:RDF>

The subject is this article, which has a property of "creator" whose value is the resource identified by "https://www.aking.com/".

Triples of the data model

Here's a tabularized representation of the above RDF triple:

Number	Subject	Predicate	Object
1	https://webref.com/authoring/languages/xml/rss/1/	https://purl.org/dc/elements/1.1/creator	https://www.aking.com/
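The row in the table above can be recovered from the XML mechanically. The following is a hand-rolled sketch for this one document shape, not a general RDF parser: it assumes each rdf:Description carries its subject in an about attribute and that each child element points at its object with rdf:resource. The namespace URIs are copied as printed in this article; a real RDF tool compares namespace URIs character by character, so they have to match the published namespaces exactly. The helper names qname and triples are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as printed in the article's example.
RDF = "https://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "https://purl.org/dc/elements/1.1/"

DOC = f"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="{RDF}" xmlns:dc="{DC}">
  <rdf:Description rdf:about="https://webref.com/authoring/languages/xml/rss/1/">
    <dc:creator rdf:resource="https://www.aking.com/"/>
  </rdf:Description>
</rdf:RDF>"""


def qname(namespace: str, local: str) -> str:
    """Build the '{namespace}local' form that ElementTree uses for names."""
    return "{" + namespace + "}" + local


def triples(xml_text: str) -> list:
    """Return (subject, predicate, object) tuples for simple descriptions."""
    result = []
    root = ET.fromstring(xml_text)
    for desc in root.findall(qname(RDF, "Description")):
        # Accept both the qualified and the older unqualified attribute.
        subject = desc.get(qname(RDF, "about")) or desc.get("about")
        for prop in desc:
            # "{namespace}local" -> "namespacelocal", i.e. the predicate URI.
            predicate = prop.tag[1:].replace("}", "", 1)
            result.append((subject, predicate, prop.get(qname(RDF, "resource"))))
    return result


for row in triples(DOC):
    print("\t".join(row))
```

The predicate URI is simply the namespace URI with the element's local name appended, which is why the table shows the creator property as a single long URI.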

Directed Labeled Graphs

As you can see above, the basic unit of the RDF model is the triple, which expresses a relationship: a resource (the subject) is linked to another resource (the object) through an arc labeled with a third resource (the predicate). You can say that <subject> has a property <predicate> with a value of <object>. For example, the triple above could be read as "Andy King is the creator of this article." Taken together, the triples form a directed labeled graph whose nodes and arcs are all labeled with qualified URIs. Here's a visual representation of the above RDF triple (created by the W3C's SiRPAC):

[Figure: simple RDF graph]

Dan Libby says, "Graphs composed of triples are what RDF is all about. All the XML stuff (namespaces, etc.) is syntactic sugar, as Rael says. The same data you see in a .xml or .rdf file can usually be expressed more concisely as a set of tab or comma delimited values (triples) in a text file.

"Graphs are important because they define the data set and all the relations between the data. Without them, focusing solely on the XML representation, you just have an overly complex, verbose method of sending strings.

"I think it is this difference in focus (graph model vs. XML representation) that really spells the difference between the 0.92/3 people and the 1.0 people. Both groups have good goals, but they are slightly different, which makes communication difficult."

Aaron Swartz says, "Graphs are also very important for the Semantic Web, because they allow many RDF documents to be put together. Because of them, you can combine multiple RDF documents into a combined graph, containing the data of all of them and allowing you to see connections between documents that you wouldn't see otherwise. Tim Berners-Lee provides an example of this.

"It shows how three documents, containing different information (a person's homepage, information about a meeting, and a P3P policy) can be combined into one graph."
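Combining graphs in this way is easy to sketch if each graph is held as a set of triples: merging is a set union, and a shared URI is what joins the pieces. Everything in the data below (the example.org URIs, the property names, the person) is a hypothetical placeholder, and only two of the three kinds of document are modeled.

```python
# Each graph is a set of (subject, predicate, object) triples.
# All URIs and property names are hypothetical placeholders.
ALICE = "https://example.org/people/alice"
MEETING = "https://example.org/meetings/42"

homepage = {
    (ALICE, "name", "Alice"),
    (ALICE, "homepage", "https://example.org/~alice/"),
}
meeting = {
    (MEETING, "attendee", ALICE),
    (MEETING, "topic", "RSS 1.0"),
}


def attendee_names(graph: set) -> list:
    """Follow attendee -> person -> name; needs facts from both documents."""
    return sorted(
        name
        for (_, pred, person) in graph if pred == "attendee"
        for (subj, prop, name) in graph if subj == person and prop == "name"
    )


# Merging two graphs is a set union; the shared URI links them.
merged = homepage | meeting

print(attendee_names(homepage))  # the homepage alone knows no meetings
print(attendee_names(meeting))   # the meeting alone knows no names
print(attendee_names(merged))    # the merged graph connects the two
```

The connection between the meeting and the name exists in neither source document; it appears only because both use the same URI for the same person.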

RSS 0.9 forms an unconnected graph (e.g. image and textinput are "hanging") while 1.0 forms a connected graph, with everything connected together. This means for 0.9 there is no relationship between the channel and the image or textinput, but with 1.0 there is.

Dan Libby says, "If you had all this stuff in a big RDF database, and you entered a query such as 'give me the images for the WebReference channel,' it would fail. Of course, in 0.9, it would fail anyway, because there is no unique identifier for the channel that says it is the WebReference channel. In 1.0, the channel rdf:about serves this purpose."
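The failing query is easy to reproduce with the same set-of-triples model. This sketch is a deliberate simplification: the property names, the image URL, and the anonymous node labels are hypothetical, and the two sets only imitate the shape of an RSS 0.9 and an RSS 1.0 feed rather than reproduce either format.

```python
# Hypothetical identifiers for the sketch.
CHANNEL = "https://www.webreference.com/"          # plays the role of rdf:about
IMAGE = "https://www.webreference.com/logo.gif"    # made-up image URL

# 0.9-like: channel and image are separate, unnamed nodes with no arc between.
rss09_like = {
    ("_:channel", "title", "WebReference"),
    ("_:image", "url", IMAGE),
}

# 1.0-like: the channel has an identifier and an arc to its image.
rss10_like = {
    (CHANNEL, "title", "WebReference"),
    (CHANNEL, "image", IMAGE),
    (IMAGE, "url", IMAGE),
}


def images_for(graph: set, channel: str) -> list:
    """Return the objects of every 'image' arc leaving the given channel."""
    return sorted(o for (s, p, o) in graph if s == channel and p == "image")


print(images_for(rss09_like, CHANNEL))  # nothing to follow
print(images_for(rss10_like, CHANNEL))  # the image is reachable
```

In the 0.9-like graph the query has two independent reasons to fail: no node is identified as the WebReference channel, and no arc leads from the channel to the image.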

The Semantic Web

This RDF grammar gives meaning to resources, and will eventually give us what Tim Berners-Lee calls the "Semantic Web." This is a utopian Web with metadata and meaning attached to content so machines can process it, and not just for display purposes but for various applications. If the semanticians have their way, autonomous software agents will make your plane reservations, search engines will become more relevant, and the Web as a whole will become much more usable. We talked to Dan Libby about his original unreleased "futures" RSS document, and his vision for RSS:

WR: The "futures" document, tell us more about that, was that your first version of RSS?

DL: Yes. That was what I proposed as 0.9. By that time, I had been in several talks with Guha, who was championing RDF at Netscape, and he had gotten me in touch briefly with Eric Miller and Dan Brickley, so they both had a chance to review it.

Of course, then it had to be approved by the marketing folks at Netcenter, and they wanted a simpler format that could easily be read/written by hand and that was less error prone. 0.9 was the next proposal, as a compromise, and it was accepted. So it is still valid RDF, but it is not as useful for an RDF aggregation database.

WR: Your "futures" document appears remarkably similar to RSS 1.0, as it uses Dublin Core, rdf:Seq(uences) of items. Is RSS 1.0 close to what you had in mind from the start?

DL: Yes. The format itself is close, though I still haven't seen the types of applications that I was envisioning. I thought that if someone were to combine a real RDF database (e.g. Guha's rdfDB) with thousands of RSS site descriptions, that it would open the door to truly powerful filtering. Think if every site and every news organization published RSS feeds and then other sites aggregated them and allowed users to set up news filters the way your mail client lets you set up mail filters. Each of these highly customized filters could be called an "agent." For example, suppose I'm interested in XML-RPC, PHP and tennis. I could set up filters that search across all the aggregated news sources and present me with the things I'm interested in, regardless of source. Thus, we shift away from today's provider-centric model to a user-interest-centric model.

WR: What about aggregators like Meerkat and My Userland?

DL: I think they are both on the right track, but (to my knowledge) neither has packaged it up with the sort of filtering and user-centric or "agent" mechanism that I'm talking about. NewsIsFree still uses the concept of a "box," which represents a single RSS channel. I had hoped to move past this in My Netscape, but was never given the chance. I want to see a "my" page that is truly about my interests, regardless of source. I don't care if "Gardening Daily" published the RSS article, I still want to see it. Of course, I would still like to have the option of adding a channel by provider, but it should not be the only method available.

Further, because there is not yet any sort of Universal RDF Descriptions repository, it is difficult for providers to tag data with meaningful shared metadata. For example, if I want to use dc:subject to tag an article as belonging to "politics/libertarian/anarcho-capitalism," I can, but you (the receiver) won't necessarily know what that is, or have any means of finding out, because we don't have a shared classification system. Instead, I'd like to be able to point the dc:subject at a URI that represents "politics/libertarian/anarcho-capitalism" in a shared taxonomy.

WR: What are the advantages of RDF?

DL: As I see it, RDF enables computers to agree on a common description for complex things (see https://swag.semanticweb.org/, an effort to create a common language for the Semantic Web). Let's suppose we are talking about people. If an article were tagged with the keywords "President Bush," you would assume it refers to George W. Bush, the current president. A computer would have no way of distinguishing between George W. and his father, and it certainly couldn't tell you anything interesting about either. If instead, it were tagged with "https://taxonomy.rdf.org/people/usa/presidents/#52," there would be no ambiguity. If both my filter (above) and the RSS feed provided this URI, then there would be an exact match, and I would see the article. Otherwise, it might fall back to a keyword search.

It also enables one to define meaningful links between such objects. Thus, as I'm creating my filter, I would really be surfing through a multi-dimensional graph. When I come across the node representing George Bush, I can click on the "predecessor" link to find "Bill Clinton," or the "father" link to find "George Bush, Sr." Any relationships anyone has ever created to or from this "President Bush" are instantly accessible.

A more down-to-earth benefit of using RDF is the ability to reuse existing RDF vocabularies, such as Dublin Core. So there is greater interoperability, and we are using existing building blocks rather than reinventing the wheel.

Yet another is the flexibility that is gained by using XML namespaces together with the RSS modules concept to allow users to easily expand the RSS vocabulary, and to maintain backwards compatibility.
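The interest-based filter Libby describes can be sketched in a few lines. This is one possible reading of his description, not an implementation of any existing aggregator: items tagged with subject URIs are matched exactly, and only untagged items fall back to a keyword search. The Item class, the matches function, the sample feed, and the second taxonomy URI are all hypothetical; the first URI is the illustrative one from the interview.

```python
from dataclasses import dataclass, field

# The first URI is the interview's illustrative one; the second is made up.
PRESIDENT_A = "https://taxonomy.rdf.org/people/usa/presidents/#52"
PRESIDENT_B = "https://taxonomy.rdf.org/people/usa/presidents/#hypothetical"


@dataclass
class Item:
    title: str
    source: str
    subjects: set = field(default_factory=set)  # subject URIs, if tagged


def matches(item: Item, wanted_uris: set, keywords: list) -> bool:
    """Exact URI match for tagged items, keyword fallback for untagged ones."""
    if item.subjects:
        return bool(item.subjects & wanted_uris)
    title = item.title.lower()
    return any(word.lower() in title for word in keywords)


feed = [
    Item("President Bush signs bill", "Gardening Daily", {PRESIDENT_A}),
    Item("President Bush memoir published", "News A", {PRESIDENT_B}),
    Item("President Bush visits Texas", "News B"),
    Item("Tennis results", "Sports Wire"),
]

selected = [i.title for i in feed if matches(i, {PRESIDENT_A}, ["President Bush"])]
print(selected)
```

The second item shares every keyword with the first but carries a different URI, so it is excluded, while the source of each item plays no part in the decision.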

For more on the history of RSS see Dan Libby's post on the XML syndication discussion group.