ICWSM-2007 Weblog dataset released

September 8th, 2006

The International Conference on Weblogs and Social Media (26-28 March 2007, Boulder, CO, USA) is offering a large blog dataset to conference participants. The release comprises a complete set of weblog posts collected by Nielsen BuzzMetrics for May 2006. It consists of about 14M weblog posts in XML format from 3M weblogs and is annotated with 1.7M blog-to-blog links. The marked-up fields include the date and time of posting, author name, post title, weblog URL, permalink, tags/categories, and outlinks classified by type. The compressed dataset is over 10 GB. In addition to the data, the conference organizers hope to release processing code and to set up a shared repository for those using the dataset. Details on requesting the dataset are available online.