2005-12-16T00:00:00-05:00

Welcome to the Splogosphere!
UMBC study estimates that 75% of posts to English-language weblogs are spam

Baltimore, December 16, 2005

A weblog monitoring system developed by UMBC Ph.D. student Pranam Kolari shows that a new form of spam -- spam blogs or splogs -- has quickly become a serious problem.

Splogs are "fake" weblog sites that have been set up to carry paid advertisements, promote affiliated web sites by increasing their PageRank, and get new sites noticed by search engines. The content included in the splogs is typically random nonsense text, text plagiarized from other websites, or content hijacked from other blogs. Most of these splogs are created and maintained automatically.

As part of his Ph.D. research, Kolari has implemented Memeta -- a system to discover blogs, monitor their activity and build up a database of metadata about them. Memeta currently has information on over six million blogs worldwide. As part of the metadata analysis, his system identifies each blog's language and categorizes it as either a legitimate blog or a splog. These modules were developed using machine learning techniques from artificial intelligence that base their judgment not only on a blog's text content but also on its structure and its relationships to other blogs and web sites. The machine learning approach allows these modules to be periodically retrained so that they adapt and maintain their accuracy as blog usage changes. Kolari estimates Memeta's current accuracy at 99% for language identification and about 90% for splog identification.

Using his system, Kolari analyzed all new blog posts collected using a web service offered by weblogs.com. Over the last four weeks, more than 40 million posts from about 14 million blogs were analyzed. The study shows that 75% of these posts were from blogs judged to be splogs. As shown in the charts below, pings from blogs average around 8K per hour, and those from splogs average around 25K.
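
The hourly averages quoted above can be checked against the 75% headline figure with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, and it assumes that each ping corresponds to one post, which the study does not state explicitly.

```python
# Average pings per hour, as quoted in the text above (rounded figures).
BLOG_PINGS_PER_HOUR = 8_000
SPLOG_PINGS_PER_HOUR = 25_000


def splog_share(blog_rate: float, splog_rate: float) -> float:
    """Return the fraction of all pings that come from splogs."""
    total = blog_rate + splog_rate
    if total <= 0:
        raise ValueError("total ping rate must be positive")
    return splog_rate / total


share = splog_share(BLOG_PINGS_PER_HOUR, SPLOG_PINGS_PER_HOUR)
print(f"Splog share of pings: {share:.1%}")  # prints: Splog share of pings: 75.8%
```

With these rounded averages the splog share comes out at roughly three quarters, which is consistent with the study's 75% estimate.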

25% of blog posts are from legitimate blogs

75% of posts are from spam blogs

Of the 14 million sources that pinged weblogs.com during the study, splogs made up more than half, as shown in this graph.

A paper on Memeta will be presented in March at the AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs. Hourly data from Memeta is available online. For more information, contact memeta@ebiquity.umbc.edu.

The techniques being explored by Kolari and others at the UMBC ebiquity research group can be used by web and blog search engines such as Google and Technorati to identify posts from splogs and remove them from their search results.

]]>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:cc="http://web.resource.org/cc/#" xmlns:news="http://ebiquity.umbc.edu/ontology/news.owl#" xmlns:assert="http://ebiquity.umbc.edu/ontology/assertion.owl#">

<news:News rdf:about="http://ebiquity.umbc.edu/getnews/html/id/31/Welcome-to-the-Splogosphere">

<rdfs:label>

<![CDATA[ Welcome to the Splogosphere ]]>

</rdfs:label>

<news:title>

<![CDATA[ Welcome to the Splogosphere ]]>

</news:title>

<news:publishedOn rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2005-12-16T00:00:00-05:00</news:publishedOn>

<news:description>

<![CDATA[ Welcome to the Splogosphere! UMBC study estimates that 75% of posts to English-language weblogs are spam Baltimore, December 16, 2005 A weblog monitoring system developed by UMBC Ph.D. student <a href="https://ebiquity.umbc.edu/person/html/Pranam//Kolari/">Pranam Kolari</a> shows that a new form of spam -- spam blogs or splogs -- has quickly become a serious problem. <a href="http://en.wikipedia.org/wiki/Splog">Splogs</a> are "fake" weblog sites that have been set up to carry paid advertisements, promote affiliated web sites by increasing their PageRank, and get new sites noticed by search engines. The content included in the splogs is typically random nonsense text, text plagiarized from other websites, or content hijacked from other blogs. Most of these splogs are created and maintained automatically. As part of his Ph.D. research, Kolari has implemented <a href="http://memeta.umbc.edu/">Memeta</a> -- a system to discover blogs, monitor their activity and build up a database of metadata about them. Memeta currently has information on over six million blogs worldwide. As part of the metadata analysis, his system identifies each blog's language and categorizes it as either a legitimate blog or a splog. These modules were developed using machine learning techniques from artificial intelligence that base their judgment not only on a blog's text content but also on its structure and its relationships to other blogs and web sites. The machine learning approach allows these modules to be periodically retrained so that they adapt and maintain their accuracy as blog usage changes. Kolari estimates Memeta's current accuracy at 99% for language identification and about 90% for splog identification. Using his system, Kolari analyzed all new blog posts collected using a web service offered by weblogs.com. Over the last four weeks, more than 40 million posts from about 14 million blogs were analyzed. The study shows that 75% of these posts were from blogs judged to be splogs. 
As shown in the charts below, pings from blogs average around 8K per hour, and those from splogs average around 25K. <center> <div align="center" style="width:80%;align:center;border-style:groove;padding:5px" > <img src="http://memeta.umbc.edu/stats/ebb.ping.blog.7.png" alt="Blog Pings" /> 25% of blog posts are from legitimate blogs <img src="http://memeta.umbc.edu/stats/ebb.ping.splog.7.png" alt="Splog Pings" /> 75% of posts are from spam blogs </div> </center> Of the 14 million sources that pinged <a href="http://weblogs.com"> weblogs.com </a> during the study, splogs made up more than half, as shown in this graph. <center> <div align="center" style="width:80%;align:center;border-style:groove;padding:5px" > <img src="http://www.cs.umbc.edu/~finin/images/splogpie.png" alt="Pings by Source" /> </div> </center> A <a href="https://ebiquity.umbc.edu/paper/html/id/269/">paper</a> on Memeta will be presented in March at the AAAI Spring Symposium on <a href="http://www.umbriacom.com/aaai2006_weblog_symposium/">Computational Approaches to Analyzing Weblogs</a>. Hourly data from Memeta is available <a href="http://memeta.umbc.edu/">online</a>. For more information, contact <a href="mailto:memeta@ebiquity.umbc.edu">memeta@ebiquity.umbc.edu</a>. The techniques being explored by Kolari and others at the <a href="https://ebiquity.umbc.edu">UMBC ebiquity research group</a> can be used by web and blog search engines such as Google and Technorati to identify posts from splogs and remove them from their search results. ]]>

</news:description>

<news:uri>

<![CDATA[ http://memeta.umbc.edu/ ]]>

</news:uri>

</news:News>

<rdf:Description rdf:about="">

<cc:License rdf:resource="http://creativecommons.org/licenses/by/2.0/"/>

</rdf:Description>

</rdf:RDF>