Google sitemaps
Tim Finin, 7:07pm 3 June 2005
Google has published a sitemap protocol allowing site owners to inform crawlers of the URLs on the site that are available for crawling. Since the URLs can include parameters, this allows a site to expose all or parts of its “hidden web”.
“A Sitemap consists of a list of URLs and may also contain additional information about those URLs, such as when they were last modified, how frequently they change, etc.
Sitemaps are particularly beneficial when users can not reach all areas of a Web site through a browseable interface — i.e. users are unable to reach certain pages or regions of a site by following links. For example, any site where certain pages are only accessible via a search form would benefit from creating a Sitemap and submitting it to search engines.
…
Please note that the Sitemap Protocol supplements, but does not replace, the crawl-based mechanisms that search engines already use to discover URLs. By submitting a Sitemap (or Sitemaps) to a search engine, you will help that engine’s crawlers to do a better job of crawling your site.”
You can also define relevant attributes for each URL including how often the URL changes, when it was last modified, and its priority relative to other URLs on the same site.
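To make the per-URL attributes concrete, here is a minimal sketch of generating such a file in Python. It assumes the element names (`urlset`, `url`, `loc`, `lastmod`, `changefreq`, `priority`) and the namespace of the later sitemaps.org 0.9 version of the protocol, which may differ from the schema Google published with this announcement. The `build_sitemap` helper and the example.org URLs are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the sitemaps.org 0.9 protocol, not necessarily
# the one used by the original 2005 Google schema.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML (as a string) for a list of URL entries.

    Each entry is a dict with a required "loc" key and optional
    "lastmod", "changefreq" and "priority" keys.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        # Emit the optional attributes only when the entry provides them.
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical entries; the second URL carries query parameters, i.e. a
# "hidden web" page normally reachable only through a search form.
entries = [
    {
        "loc": "http://example.org/",
        "lastmod": "2005-06-03",
        "changefreq": "daily",
        "priority": 1.0,
    },
    {
        "loc": "http://example.org/search?q=rdf&page=2",
        "changefreq": "weekly",
        "priority": 0.5,
    },
]

sitemap_xml = build_sitemap(entries)
print(sitemap_xml)
```

ElementTree escapes the `&` in the parameterized URL when serializing, which a sitemap requires since the file must be valid XML.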
Li Ding defined a similar scheme for RDF documents some months ago as part of his work on Swoogle.
June 4th, 2005 at 12:45 pm
Filip added a Google sitemap page for the EBBlog.