Google’s HTML Statistics

January 25th, 2006

Google has published a study of Web Authoring Statistics in which they analyzed the HTML use of over one billion web pages.

“In December 2005 we did an analysis of a sample of slightly over a billion documents, extracting information about popular class names, elements, attributes, and related metadata.”

The results include lots of interesting data on which attributes are commonly used (and misused) with which classes and, for some, what the popular values are. No sign of embedded RDF or even of microformats. Maybe next year.
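To give a feel for this kind of tally, here is a minimal sketch in Python that counts elements, attributes and class names in a single page using only the standard library. The names `TallyParser` and `tally` are made up for illustration, and this is certainly not Google's actual tooling, just the general idea at toy scale.

```python
from collections import Counter
from html.parser import HTMLParser


class TallyParser(HTMLParser):
    """Hypothetical helper: counts elements, attributes and class names."""

    def __init__(self):
        super().__init__()
        self.elements = Counter()
        self.attributes = Counter()
        self.classes = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        self.elements[tag] += 1
        for name, value in attrs:
            self.attributes[name] += 1
            # A class attribute may hold several space-separated names.
            if name == "class" and value:
                self.classes.update(value.split())


def tally(html_text):
    """Parse one document and return the parser holding the counts."""
    parser = TallyParser()
    parser.feed(html_text)
    parser.close()
    return parser


if __name__ == "__main__":
    page = '<div class="nav main"><a href="/" class="nav">home</a><br/></div>'
    result = tally(page)
    print(result.elements.most_common())
    print(result.attributes.most_common())
    print(result.classes.most_common())
```

Summing such counters over a large crawl would give the sort of frequency tables the study reports, though a real run would also need to cope with encodings, broken markup and scale.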

We’ve used Swoogle to do a similar analysis for RDF documents in general, and for FOAF documents in particular. An interesting study would be to analyze which features of RDF, RDFS and OWL are used in Swoogle’s collection (about 850K documents with RDF content as of the beginning of this year). I don’t think we can do this from our database; we would have to go back and process the cached documents, probably with a special-purpose, lightweight parser/analyzer (which is what Google did in their HTML study).
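As a rough idea of what such a lightweight analyzer might look like, here is a sketch in Python that counts RDF, RDFS and OWL vocabulary terms in one RDF/XML document. It assumes the cached documents are well-formed RDF/XML and counts terms purely syntactically (element names, attribute names and rdf:resource targets) without building a graph. The function names are hypothetical, and this is not how Swoogle or Google actually do it.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard namespace URIs for the three vocabularies of interest.
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
}
RDF_RESOURCE = "{%s}resource" % NAMESPACES["rdf"]


def label(uri):
    """Return a prefixed name such as 'owl:Class', or None if out of scope."""
    for prefix, ns in NAMESPACES.items():
        if uri.startswith(ns):
            return f"{prefix}:{uri[len(ns):]}"
    return None


def clark_to_uri(name):
    """Turn ElementTree's '{namespace}local' form into a plain URI."""
    if name.startswith("{"):
        ns, local = name[1:].split("}", 1)
        return ns + local
    return name


def count_features(rdf_xml):
    """Count RDF/RDFS/OWL terms used in one RDF/XML document (a string)."""
    counts = Counter()
    for elem in ET.fromstring(rdf_xml).iter():
        # Terms can appear as element names or attribute names...
        uris = [clark_to_uri(n) for n in (elem.tag, *elem.attrib)]
        # ...or as the target of rdf:resource, e.g. rdf:type -> owl:Class.
        uris.append(elem.attrib.get(RDF_RESOURCE, ""))
        for uri in uris:
            term = label(uri)
            if term:
                counts[term] += 1
    return counts
```

Calling `count_features` on each cached document and adding up the resulting counters would give collection-wide usage figures. Documents in other serializations (N3, for instance) or with malformed XML would need separate handling.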