Google’s HTML Statistics
Tim Finin, 1:32pm 25 January 2006
Google has published a study of Web Authoring Statistics in which they analyzed the HTML use of over one billion web pages.
“In December 2005 we did an analysis of a sample of slightly over a billion documents, extracting information about popular class names, elements, attributes, and related metadata.”
The results have lots of interesting data on which attributes are commonly used (and misused) with which classes and, for some, what the popular values are. No sign of embedded RDF or even of microformats. Maybe next year.
We’ve used Swoogle to do a similar analysis for RDF documents in general, and for FOAF documents in particular. An interesting study would be to analyze which features of RDF, RDFS and OWL are used in Swoogle’s collection (about 850K documents with RDF content as of the beginning of this year). I don’t think we can do this from our database; we would have to go back and process the cached documents, probably with a special-purpose, lightweight parser/analyzer (which is what Google did in their HTML study).

January 25th, 2006 at 4:31 pm
Actually, there was some mention of microformats. For example, the second most popular rel value was ‘license’, from the rel-license microformat.
January 25th, 2006 at 4:47 pm
Microformats, or at least one (rel=tag), do appear in their hyperlink charts. A quote from their analysis, “The rel attribute is not used all that much, but it is still used enough to matter”, also acknowledges their significance. It would also be interesting to find what subset of their dataset is contributed by the blogosphere.
January 25th, 2006 at 7:06 pm
“No sign of embedded RDF”
true.
“or even of microformats.”
false.
1. The very page you linked to for “Web Authoring Statistics” http://code.google.com/webstats/index.html itself both mentions “microformats.org” and links to http://microformats.org
2. The “Page Headers” page http://code.google.com/webstats/2005-12/pageheaders.html notes that the XFN microformat is the most popular HTML metadata profile: “…people do use the profile attribute, though. The three most-often used values are http://gmpg.org/xfn/1, http://dublincore.org/documents/dcq-html/, and http://gmpg.org/xfn/11. This makes XFN the most popular HTML metadata profile!”
3. The “a element” page http://code.google.com/webstats/2005-12/element-a.html found that three of the most popular ‘rel’ attribute values were microformats: #1 rel-nofollow, #2 rel-license, #5 rel-tag (as mentioned by the previous commenter).
Tantek
January 26th, 2006 at 10:33 am
It’s nice to know we have readers. … and to know the truth. Mea culpa.