Third Annual Conference on Weblogs and Social Media (ICWSM 2009)
The ICWSM 2009 Spinn3r Dataset
May 17, 2009
The dataset, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 2008. The post includes the text as syndicated, as well as metadata such as the blog's homepage, timestamps, etc. The data is formatted in XML and is further arranged into tiers approximating to some degree search engine ranking. The total size of the dataset is 142 GB uncompressed, (27 GB compressed). This dataset spans a number of big news events (the Olympics; both US presidential nominating conventions; the beginnings of the financial crisis; ...) as well as everything else you might expect to find posted to blogs. To get access to the Spinn3r dataset, please download and sign the usage agreement , and email it to dataset-request (at) icwsm.org. Once your form is processed (usually within 1-3 days), you will be sent a URL and password where you can download the collection. Here is a sample of blog posts from the collection. The XML format is described on the Spinn3r website.
San Jose, CA