BlogVox: Separating Blog Wheat from Blog Chaff

Authors: Akshay Java, Pranam Kolari, Tim Finin, James Mayfield, Anupam Joshi, and Justin Martineau

Book Title: Proceedings of the Workshop on Analytics for Noisy Unstructured Text Data, 20th International Joint Conference on Artificial Intelligence (IJCAI-2007)

Date: January 07, 2007

Abstract: Blog posts are often informally written, poorly structured, rife with spelling and grammatical errors, and feature non-traditional content. These characteristics make them difficult to process with standard language analysis tools. Performing linguistic analysis on blogs is plagued by two additional problems: (i) the presence of spam blogs and spam comments and (ii) extraneous non-content including blog-rolls, link-rolls, advertisements and sidebars. We describe techniques designed to eliminate noisy blog data developed as part of the BlogVox system - a blog analytics engine we developed for the 2006 TREC Blog Track. The findings in this paper underscore the importance of removing spurious content from blog collections.

Type: InProceedings

Number of Google Scholar citations: 3

Number of downloads: 670