<?xml version="1.0"?>

<!DOCTYPE owl [
  <!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <!ENTITY rdfs "http://www.w3.org/2000/01/rdf-schema#">
  <!ENTITY xsd "http://www.w3.org/2001/XMLSchema#">
  <!ENTITY owl "http://www.w3.org/2002/07/owl#">
  <!ENTITY cc "http://web.resource.org/cc/#">
  <!ENTITY news "http://ebiquity.umbc.edu/ontology/news.owl#">
  <!ENTITY assert "http://ebiquity.umbc.edu/ontology/assertion.owl#">]>

<!--
  This ontology document is licensed under the Creative Commons
  Attribution License. To view a copy of this license, visit
  http://creativecommons.org/licenses/by/2.0/ or send a letter to
  Creative Commons, 559 Nathan Abbott Way, Stanford, California
  94305, USA.
-->

<rdf:RDF 
  xmlns:rdf = "&rdf;"
  xmlns:rdfs = "&rdfs;"
  xmlns:xsd = "&xsd;"
  xmlns:owl = "&owl;"
  xmlns:cc = "&cc;"
  xmlns:news = "&news;"
  xmlns:assert = "&assert;">
  <news:News rdf:about="http://ebiquity.umbc.edu/getnews/html/id/31/Welcome-to-the-Splogosphere">
    <rdfs:label><![CDATA[Welcome to the Splogosphere]]></rdfs:label>
    <news:title><![CDATA[Welcome to the Splogosphere]]></news:title>
    <news:publishedOn rdf:datatype="&xsd;dateTime">2005-12-16T00:00:00-05:00</news:publishedOn>
    <news:description><![CDATA[<p> </p>
<b>
<font size=x-large> Welcome to the Splogosphere!</font>
<br>
UMBC study estimates that 75% of posts to English-language weblogs are spam
</b>

<p>Baltimore, December 16, 2005</p>

<p> A weblog monitoring system developed by UMBC Ph.D. student <a
href="http://ebiquity.umbc.edu/person/html/Pranam//Kolari/">Pranam
Kolari</a> shows that a new form of spam -- spam blogs or splogs --
has quickly become a serious problem. </p>

<p> <a href="http://en.wikipedia.org/wiki/Splog">Splogs</a> are "fake"
weblog sites that have been set up to carry paid advertisements,
promote affiliated web sites by increasing their PageRank, and get
new sites noticed by search engines.  The content included in the
splogs is typically random nonsense text, text plagiarized from other
websites or content hijacked from other blogs.  Most of these splogs
are created and maintained automatically. </p>

<p> As part of his Ph.D. research, Kolari has implemented <a
href="http://memeta.umbc.edu/">Memeta</a> -- a system to discover
blogs, monitor their activity and build up a database of metadata
about them.  Memeta currently has information on over six million
blogs worldwide.  As part of the metadata analysis, his system
identifies each blog's language and categorizes it as either a
legitimate blog or a splog.  These modules were developed using
machine learning techniques from artificial intelligence that base
their judgment on a blog's text content as well as its structure and
relationships to other blogs and web sites.  The machine learning
approach allows these modules to be periodically retrained so that
they will adapt and maintain their accuracy as blog usage changes.
Kolari estimates Memeta's current accuracy at 99% for language
identification and about 90% for splog identification. </p>

<p> Using his system, Kolari analyzed all new blog posts collected
through a web service offered by weblogs.com. Over the last four weeks,
more than 40 million posts from about 14 million blogs were analyzed.  The
study shows that 75% of these posts were from blogs judged to be
splogs. As shown in the charts below, pings from blogs average around
8K per hour and those from splogs average around 25K per hour. </p> 
<center>
<div align="center" style="width:80%;align:center;border-style:groove;padding:5px" >
<img src="http://memeta.umbc.edu/stats/ebb.ping.blog.7.png" alt="Blog Pings" /><br>
<b>25% of blog posts are from legitimate blogs</b>
<p> </p>
<img src="http://memeta.umbc.edu/stats/ebb.ping.splog.7.png" alt="Splog Pings" /><br>
<b>75% of posts are from spam blogs</b>
</div>
</center>
<p>
Of the 14 million sources that pinged <a href="http://weblogs.com">weblogs.com</a> during the study, splogs made up more than half, as shown in this graph.
</p>
<center>
<div align="center" style="width:80%;align:center;border-style:groove;padding:5px" >
<img src="http://www.cs.umbc.edu/~finin/images/splogpie.png" alt="Pings by Source" /><br>
</div>
</center>

<p> A <a href="http://ebiquity.umbc.edu/paper/html/id/269/">paper</a>
on Memeta will be presented in March at the AAAI Spring Symposium on
<a
href="http://www.umbriacom.com/aaai2006_weblog_symposium/">Computational
Approaches to Analyzing Weblogs</a>.  Hourly data from Memeta is available <a href="http://memeta.umbc.edu/">online</a>.  For more information, contact <a href="mailto:memeta@ebiquity.umbc.edu">memeta@ebiquity.umbc.edu</a>.</p>


<p> The techniques being explored by Kolari and others at the <a
href="http://ebiquity.umbc.edu">UMBC ebiquity research group</a> can
be used by web and blog search engines such as Google and Technorati
to identify posts from splogs and remove them from their search
results. </p> 
]]></news:description>
    <news:uri><![CDATA[http://memeta.umbc.edu/]]></news:uri>
  </news:News>

  <rdf:Description rdf:about="">
    <cc:License rdf:resource="http://creativecommons.org/licenses/by/2.0/" />
  </rdf:Description>

</rdf:RDF>
