This talk was given at Web Directions South 2009, on October 9th.
The session introduced RDFa, and then posed the question "what happens when web pages get smart?" as a way to look at the many benefits that arise from putting data into web pages.
Topics range from vertical search in specialist spheres such as chemistry, through searching for images and videos that have a particular license, to samples showing how to enrich the user interface with Tweets and book covers.
8. Improved search
• Precise meaning means better indexing.
• Additional data means more axes to search along.
• Additional data also makes for better search interface.
43. Mark Birbeck’s Twitter account name is
<span
typeof=""
rel="foaf:accountServiceHomePage"
resource="http://twitter.com"
property="foaf:accountName"
>markbirbeck</span>
and Ben Adida’s is
<span
typeof=""
rel="foaf:accountServiceHomePage"
resource="http://twitter.com"
property="foaf:accountName"
>benadida</span>.
45. Single extension point
• RDFa allows us to express any data
• Creates possibility of a generic binding technique
A quick example of RDFa. By creating a link to a CC license in your web page, you've essentially put in 'data'. But whilst the text "Feel free to take my images under the..." makes it clear to a human what is going on, it's difficult for a machine to understand.
By using @rel we can make it more explicit.
HTML already supported @rel on the 'a' element, but RDFa allows it everywhere. For example, we can use it to indicate the license of a number of images. As you can see, RDFa can be as simple as adding one or two attributes, but it also has the power to cope with very complex data structures.
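As a rough sketch of what a machine can do once @rel is in place, the following Python pulls license statements out of a page. It is a deliberate simplification rather than a full RDFa processor: it only looks at @rel, @about, @resource and @href on a single element, and the example.org URLs and the page fragment are placeholders invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the example.org URLs are placeholders.
PAGE = """
<p>Feel free to take my images under the
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC Attribution license</a>.</p>
<img about="http://example.org/img/1.jpg" src="http://example.org/img/1.jpg"
     rel="license" resource="http://creativecommons.org/licenses/by-sa/3.0/">
"""


class LicenseFinder(HTMLParser):
    """Collects (subject, license) pairs from elements carrying rel="license"."""

    def __init__(self, base):
        super().__init__()
        self.base = base   # used as the subject when an element has no @about
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # @rel can hold several space-separated values.
        if "license" not in (a.get("rel") or "").split():
            return
        target = a.get("resource") or a.get("href")
        if target:
            self.pairs.append((a.get("about", self.base), target))


def find_licenses(html, base):
    finder = LicenseFinder(base)
    finder.feed(html)
    return finder.pairs


pairs = find_licenses(PAGE, "http://example.org/")
for subject, license_url in pairs:
    print(subject, "->", license_url)
```

The link in the paragraph licenses the page as a whole, while the image carries its own license, which is the "allowed everywhere" point made above.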
Now we know roughly how to make our pages smart, let's look at what happens when we do.
The first thing that we can do is improve the search experience.
If Marie Curie were researching today, she might well use a blog and Twitter. She wouldn’t be writing about her breakfast, of course, but using these tools to pass around research results, speculate on theories, and so on. Funnily enough, that's what the web was devised for.
But the problem is, if we try to find Marie Curie's work, it will be difficult. To illustrate, if we search Google for a chemical like ‘benzene’, all we will get are very general results. They may be useful to the general public (e.g., Wikipedia descriptions, safety information, and so on) but they'll be of no interest to a chemist.
Here's a specialist blog by a chemist. It mentions benzene, but of course it's not going to appear anywhere when it comes to the search results on Google.
Sites do exist to try to solve this problem, but their approach is to only index pertinent content. But the web is changing all the time, so invariably these specialist search sites miss something. The problem almost certainly needs tackling at the level of the big search engines.
How do we improve search indexing, then? Simply by adding a more precise definition of a term. In this case we are providing the much more precise identifier '241', rather than relying on the more general term, 'benzene'.
The RDFa @content attribute can be used in all sorts of situations, to provide information that is not obvious from the prose. Imagine a newspaper article that simply says 'tomorrow'; we can use @content to give the precise date, so that the article still has meaning in five years' time.
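To make the @content idea concrete, here is a minimal sketch of how a consumer might read such values: when @content is present it wins over the visible text, otherwise the text itself is the value. The property names (`dc:date`, `ex:compoundId`, `dc:creator`), the date, and the sample sentence are assumptions made for illustration, and the reader only handles flat, un-nested elements.

```python
from html.parser import HTMLParser

# Hypothetical markup; property names and values are placeholders.
ARTICLE = """
<p>The results will be presented
<span property="dc:date" content="2009-10-10">tomorrow</span>.
The solvent was <span property="ex:compoundId" content="241">benzene</span>,
according to <span property="dc:creator">Marie Curie</span>.</p>
"""


class PropertyReader(HTMLParser):
    """Reads (property, value) pairs, preferring @content over element text."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._open = None  # (property, text parts) while collecting element text

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        prop = a.get("property")
        if prop is None:
            return
        if "content" in a:
            # The machine-readable value overrides the human-readable text.
            self.values.append((prop, a["content"]))
        else:
            self._open = (prop, [])

    def handle_data(self, data):
        if self._open:
            self._open[1].append(data)

    def handle_endtag(self, tag):
        if self._open:
            prop, parts = self._open
            self.values.append((prop, "".join(parts).strip()))
            self._open = None


reader = PropertyReader()
reader.feed(ARTICLE)
values = reader.values
print(values)
```

A human still reads 'tomorrow' and 'benzene', while an indexer gets the precise date and the identifier.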
By putting more information into our pages, we not only get more precise searching of the content, but we can also search for things like all images with a certain license, or all videos of a certain size.
We saw an example of adding licensing information, when I briefly introduced the idea of RDFa. But how does this play itself out with the search engines?
Here we see how these extra values manifest themselves in the search interface.
Google Video is slightly more complex, but it's a similar principle, so let's look at the markup. First, we have a 'normal' embedded video in our markup.
Now, we can add properties to the video. Helpfully, Google has chosen to use the same format that Yahoo! devised for video properties, which is great news for content publishers. Note the licensing information, but also the thumbnail, region information, duration, and so on. All of these properties could be searched.
As an aside, one consequence of attaching the licensing information to the actual object is that, as Joi Ito says, it makes it easy to move the information around when the object it refers to is moved. (Joi Ito was interviewed in the Guardian, 23rd September, 2009.)
Both Yahoo! and Google are moving forward on the idea of being able to 'get things done', right in the search results.
For example, we can see here that a search on Yahoo! for a particular film returns specially formatted results that contain extra information about the film. The film length, the release date, its rating, the number of reviews received, and how complimentary they were are all in the search result. Note also that there are extra links for things like showtimes. When the user clicks, their location can be used to determine film times in their area, dramatically reducing the number of clicks needed to get from the search to buying a ticket for a film.
Here's another example of how search results provide more information than a simple page link. (Apologies for the vanity search!)
Yahoo!'s search also has a customisation facility called SearchMonkey. Here we can see a sample application that I built for the UK's Foreign and Commonwealth Office (FCO). The application is invoked whenever a URL in the search results matches a predefined pattern. So in this case we've said, whenever someone does a search that turns up a page about an FCO job vacancy, then format the results in a more useful way than normal -- in this case, show the salary and location.
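The pattern-then-format idea can be sketched in a few lines of Python. This is not the SearchMonkey API: the URL pattern, the function name `format_result`, and the `salary` and `location` keys are all hypothetical, chosen only to show the shape of the logic.

```python
import re

# Hypothetical pattern for vacancy pages; the domain is a placeholder.
VACANCY_URL = re.compile(r"^https?://www\.example\.gov\.uk/jobs/\d+$")


def format_result(url, title, data):
    """Return an enhanced result line for vacancy pages, else the plain title.

    `data` stands in for whatever structured values were extracted from the page.
    """
    if VACANCY_URL.match(url) and "salary" in data and "location" in data:
        return f"{title} | Salary: {data['salary']} | Location: {data['location']}"
    return title


print(format_result(
    "http://www.example.gov.uk/jobs/123",
    "Policy Adviser",
    {"salary": "30,000 GBP", "location": "London"},
))
print(format_result("http://www.example.gov.uk/about", "About us", {}))
```

Results whose URL does not match, or whose page lacks the data, simply fall back to the normal presentation.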
Google is doing a similar thing, which it calls 'Rich Snippets'. Here we can see that a search for a restaurant also gives us a link to a map and an indication of the number of reviews and the ratings given by reviewers.
A little diversion here. An important side-effect of using RDFa is that it allows us to own our own data. With RDFa, anywhere that you can put HTML, you can put data -- your blog, company site, and so on.
Sticking with reviews, this site is an example of a blog where each individual blog-post is a film review.
This site does the same thing for book reviews.
Ordinarily you would join a centralised service like Yelp, which would then use your data to make its site more attractive.
That's fair enough, but if you look at the book review blog I just mentioned, it comes up very high in Google when you search on certain book titles. So why would they subsume their reviews into some other site? They've worked hard to get this ranking.
You could even make the same point about selling items; there's no reason that these items couldn't be marked for sale on your blog, and then picked up by Google and Yahoo!, to create their own sales sites.
On this blog, each separate blog post is an individual item for sale. This blog has no RDFa in it, but there is a new vocabulary that could be used, called Good Relations.
Here we see a tool from Yahoo! that allows you to easily generate the necessary RDFa to mark up a product. Once you have the HTML you can then insert it into your blog, or use it as a template in a CMS.
Publishing our own data, and then having it consumed by Google and Yahoo!, could potentially cut out middlemen like Yelp and eBay, or prompt them to start crawling our data, too.
Our final section is to look at how smart pages can be used to enhance the UI.
To give a sense of this, recall our chemistry example. Remember that we marked up the page by adding a precise code for benzene, in order to help the search engine. But we can use exactly the same information to show a tooltip to readers of the web-page.
Here we see Henry Rzepa's blog, and then I've taken a copy of it, and added a JavaScript RDFa parsing library, which searches out the data, and does things with it. In this case, after the parsing library has loaded and found the information, it creates tooltips based on the chemicals.
As another example, let's imagine that the author of the book review site has written their reviews using the Google Review format. Let's go further, and add an identifier for the book. We can now use this information to go out to the linked data cloud and get the full book title, a picture of the book cover, and so on.
Retrieving book information from the linked data cloud, using an Amazon service.
Similarly, we can go off to the linked data cloud to get the latest Tweets by a person.
Retrieving Tweets from the linked data cloud.
Rather than dropping widgets directly into a document, we put data in, and then bind to the data. By decoupling in this way, we create an enormous amount of flexibility.
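One way to picture this decoupling is a small registry that maps a property to whatever should happen when that property turns up in a page. The sketch below is an assumption about how such a binding layer might look, not the library used in the demos: the `bind` decorator, the `ex:isbn` property, and the placeholder strings standing in for real widgets are all invented for illustration.

```python
# Registry of property name -> function that builds a UI enhancement.
BINDINGS = {}


def bind(prop):
    """Decorator that registers a handler for one property."""
    def register(handler):
        BINDINGS[prop] = handler
        return handler
    return register


@bind("foaf:accountName")
def tweets_widget(value):
    # Placeholder for a widget that would fetch and show recent Tweets.
    return f"[latest tweets for @{value}]"


@bind("ex:isbn")  # hypothetical property for a book identifier
def book_cover_widget(value):
    # Placeholder for a widget that would look up a cover image.
    return f"[book cover for ISBN {value}]"


def enhance(data):
    """Apply every matching binding to (property, value) pairs found in a page."""
    return [BINDINGS[prop](value) for prop, value in data if prop in BINDINGS]


page_data = [
    ("foaf:accountName", "markbirbeck"),
    ("dc:title", "An unbound property"),
    ("foaf:accountName", "benadida"),
]
widgets = enhance(page_data)
print(widgets)
```

The page author only publishes data; which widgets appear, if any, is decided entirely by whoever registers the bindings.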
We have already seen what Google and Yahoo! are doing. Another place where there is quite a lot of momentum for RDFa is in publishing government data.
Here in Australia you have data.australia.
The data.australia site has lots of useful data, and usefully, each data set is marked up using RDFa.
The Central Office of Information in the UK was tasked with centralising all job vacancies and consultations across all government web-sites.
A job vacancy is the usual collection of job title, salary range, location and so on.
There are many different sites, each with their own vacancies, laid out in different ways.
Similarly, consultations have a ‘standard’ set of information; the title, opening and closing dates, who to contact with your feedback, and so on.
The usual ‘solution’ would be to get each department to put their vacancies or consultations into a centralised database, but that would require quite an upheaval. However, with RDFa, all that needs to happen is that the HTML publishing process is slightly modified to output RDFa. And note that this works, even though each department is using a different technique to publish their HTML.
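A small sketch shows why the differing layouts do not matter to the aggregator. Two hypothetical departments publish the same vacancy, one as a table and one as a heading plus list, and a single reader recovers the same record from both. The `ex:` property names and the sample vacancy are placeholders, not the vocabulary actually used, and the reader is a simplification that takes the first text inside each property-bearing element.

```python
from html.parser import HTMLParser

# Hypothetical output from two departments with different page layouts.
DEPT_A = """
<table>
<tr><th>Role</th><td property="ex:title">Policy Adviser</td></tr>
<tr><th>Pay</th><td property="ex:salary">30,000 GBP</td></tr>
<tr><th>Where</th><td property="ex:location">London</td></tr>
</table>
"""

DEPT_B = """
<h2 property="ex:title">Policy Adviser</h2>
<ul>
<li>Based in <span property="ex:location">London</span></li>
<li>Salary: <span property="ex:salary">30,000 GBP</span></li>
</ul>
"""


class VacancyReader(HTMLParser):
    """Builds a property -> value record, ignoring the surrounding layout."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._prop = None  # property awaiting its text value

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        self._prop = a.get("property")
        if self._prop and "content" in a:
            self.record[self._prop] = a["content"]
            self._prop = None

    def handle_data(self, data):
        if self._prop and data.strip():
            self.record[self._prop] = data.strip()
            self._prop = None


def read_vacancy(html):
    reader = VacancyReader()
    reader.feed(html)
    return reader.record


record_a = read_vacancy(DEPT_A)
record_b = read_vacancy(DEPT_B)
print(record_a == record_b)
```

Each department keeps its own templates and only adds attributes, while the central site runs one extractor over all of them.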
The final architecture gives great flexibility, and is quick to implement. In a talk at SemTech, Google indicated that their Rich Snippet launch partners (such as Yelp) had been able to add RDFa and Microformat support in little more than a day.