…News at 11…
One vision that many of us have is that the Web is evolving into a collective brain for human society, complete with long term memory (web pages), active behaviors (web services and agents), a stream of consciousness (the Blogosphere) and a nervous system (Internet protocols). So it’s in this context that I read the news from Google, Yahoo and Microsoft that the big search engines have agreed on the Sitemaps protocol. It’s a small step for a machine, but a bigger leap for machinekind.
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site. Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata.
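To make the metadata concrete, here is a minimal Sitemap file in the standard sitemaps.org format (the URL and dates are invented for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page the crawler should know about -->
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2006-11-15</lastmod>      <!-- when the page last changed -->
    <changefreq>daily</changefreq>     <!-- a hint, not a guarantee -->
    <priority>0.8</priority>           <!-- importance relative to other URLs on this site -->
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
    <lastmod>2006-01-02</lastmod>
    <changefreq>yearly</changefreq>
    <priority>0.3</priority>
  </url>
</urlset>
```

All three metadata elements are optional; a crawler treats them as hints when deciding what to fetch and how often.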
The Semantic Web community should consider doing something similar for Semantic Web data. Frédérick Giasson has addressed one aspect with his PingTheSemanticWeb protocol and site. Last year Li Ding worked out a sitemap-like format and protocol to make it easier for RDF crawlers like Swoogle to keep their collections current. We’ve not yet had a chance to experiment with or promote the idea; maybe the time is right now. It might be especially helpful in exposing pages that have embedded RDFa or other content that could be extracted via GRDDL. Currently, the only way to discover them is exhaustive crawling, something small systems like Swoogle cannot afford to do.
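As a purely hypothetical sketch of what a Semantic-Web-aware Sitemap entry might look like, one could imagine an extension namespace alongside the standard elements (the `sw:` namespace, element names, and URLs below are all invented, not part of any published spec):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sw="http://example.org/schemas/semweb-sitemap/0.1">  <!-- hypothetical namespace -->
  <url>
    <loc>http://www.example.com/people.html</loc>
    <lastmod>2006-11-15</lastmod>
    <!-- Hypothetical hints telling an RDF crawler what it would find here
         and how to extract it, so it can skip pages with no semantic content -->
    <sw:format>RDFa</sw:format>
    <sw:extraction>GRDDL</sw:extraction>
  </url>
</urlset>
```

The point is not these particular element names but the idea: a crawler like Swoogle could fetch one small file and learn which URLs carry RDF worth extracting, instead of crawling everything.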