UMBC ebiquity research group Building intelligent systems in open, heterogeneous, dynamic, distributed environments
16 May 2008, 23:36:21 EDT  

Archive for the 'splog' Category

No spam on Twitter?!

February 25th, 2008, by Tim Finin, posted in Twitter, Social media, splog, Blogging, Web

Can it be true? Russell Beattie posts that on Twitter there are nearly a million users, and no spam or trolls. Spam does exist on Twitter, of course, but it does seem to be less of a problem than on the Blogosphere, Web or email. Maybe it’s because search engines don’t treat tweets like Web pages or blog posts.

Sifry’s state of the blogosphere

February 6th, 2006, by Tim Finin, posted in splog, memeta, Blogging

Technorati’s David Sifry has posted another State of the Blogosphere report with lots of interesting statistics. Highlights include

  • Technorati tracks 50K posts an hour from 27M blogs.
  • The number of blogs doubles every six months.
  • Splogs and spings are increasing.
  • Tagging is increasingly popular.

Half of Swoogle’s hits are from referer log spammers

February 4th, 2006, by Tim Finin, posted in splog, Swoogle, Blogging, Web, Semantic Web

We are using bbclone to generate reports on Swoogle access. Look at today’s top 10 referers as of 3:00pm:

  www.legaladvocate.net  246     26.14%
  www.myjavaserver.com   152     16.15%
  www.google.com         125     13.28%
  dannyayers.com         44      4.68%
  lucky7.to              34      3.61%
  ebiquity.umbc.edu      25      2.66%
  www.google.de          18      1.91%
  planetrdf.com          18      1.91%
  mail.google.com        18      1.91%
  groups.google.com      14      1.49%

Sites one and five are clearly spam sites, and site two is suspicious, too. The first, for example, appears to be about poker, though the site name is legaladvocate. The site’s text is obviously automatically generated nonsense. All of the links point to subpages in the same domain with a similar structure and content. I assume that once the site achieves a high PageRank, it will be repurposed or sold.

So, it seems like nearly 50% of our hits are due to referer log spamming. I’d guess Swoogle was picked by finding its URL on recent posts found on a blog search engine or a ping server.
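The "nearly 50%" estimate can be checked against the referer table above. This is only a minimal sketch: it copies the ten rows from the report and assumes, as the post judges, that the first, second and fifth referers are the spam or suspicious ones. The variable names are illustrative and are not part of bbclone.

```python
# Top-10 referers copied from the report above: (host, hits, percent of hits).
referers = [
    ("www.legaladvocate.net", 246, 26.14),
    ("www.myjavaserver.com", 152, 16.15),
    ("www.google.com", 125, 13.28),
    ("dannyayers.com", 44, 4.68),
    ("lucky7.to", 34, 3.61),
    ("ebiquity.umbc.edu", 25, 2.66),
    ("www.google.de", 18, 1.91),
    ("planetrdf.com", 18, 1.91),
    ("mail.google.com", 18, 1.91),
    ("groups.google.com", 14, 1.49),
]

# Assumption taken from the post: ranks 1 and 5 are spam, rank 2 is suspicious.
suspect_ranks = {1, 2, 5}

ranked = list(enumerate(referers, start=1))
suspect_hits = sum(hits for rank, (_, hits, _) in ranked if rank in suspect_ranks)
suspect_share = sum(pct for rank, (_, _, pct) in ranked if rank in suspect_ranks)

print(f"{suspect_hits} suspect hits, {suspect_share:.2f}% of hits in the report")
```

Under that assumption the three suspect referers account for roughly 46% of the reported hits, which is consistent with the rough figure in the text.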

Splogs, like spam, will be with us for a while

January 24th, 2006, by Tim Finin, posted in splog, Blogging, Web, GENERAL

Two years ago Bill Gates predicted that the spam problem would be solved by now, as this article in The Register reports.

Hey Bill, why am I still getting spam?
Junk mail outlives MS mortality prediction
By John Leyden, 24 January 2006

Two years ago today Bill Gates predicted that spam email would be eradicated as a problem within 24 months. The Microsoft chairman predicted the death of spam in a speech at the World Economic Forum on 24 January 2004.

Gates outlined a three-stage plan to eradicate spam within two years. Microsoft’s scheme calls for better filters to weed out spam messages and sender authentication via a form of challenge-response system. Secondly, Microsoft wants to see a form of tar-pitting so that emails coming from unknown senders are slowed down to a point where bulk mail runs become impractical.

Lastly, and most promisingly as far as Gates is concerned, is a digital equivalent of stamps for email, to be paid out only if the recipient considers an email to be spam. Blocking spam email would appear to be a simple problem but in practice is far trickier than Gates, or indeed the industry, first thought.

It’s tempting to think that we are close to solving the splog identification problem, which would enable blog search engines to weed the splogs out of their indices. But I’ll bet that splogs will be with us for a long time, as is the case with spam. Of course, we do have to work hard to keep them under control, just as we do with spam. If we don’t, the blogosphere will be quickly overrun and its promise squandered.

UMBC blog research on splogs in Baltimore Sun

January 17th, 2006, by Tim Finin, posted in splog, Ebiquity, Blogging, Web, GENERAL

Baltimore Sun’s Troy McCullough talks about Pranam Kolari’s work on detecting splogs in his column on Sunday, 15 January 2006. The column also has an associated podcast.

Fighting spam sites - latest battle in the blog wars
On Blogs: Troy McCullough, Jan 15, 2006

It seems that everyone has a blog these days - a spot that others can visit to find out what they have to say about something or nothing in particular. Some blogs are widely valued fonts of specialized wisdom, but many are viewed as uninteresting expressions of personal ego. The difficulty of sorting the good blogs from the bad can be a frustrating challenge - one that is seen as a serious threat to what has been viewed as a vital feature of the Internet.

Now, three University of Maryland, Baltimore County researchers have made a far more disturbing conclusion about blogs. After analyzing millions of blog posts, they have determined that the blogosphere is drowning in spam, the pejorative nickname given to unsolicited Internet advertising. Using data collected by weblogs.com, a prominent blog tracking service, doctoral student Pranam Kolari and professors Tim Finin and Anupam Joshi analyzed 40 million blog updates submitted from 14 million blogs.

Welcome to the Splogosphere: 75% of new pings are spings (splogs)

December 15th, 2005, by Pranam Kolari, posted in Blogging, memeta, splog, Technology, Web, Semantic Web, Machine Learning, GENERAL

In the blogosphere, pings are notifications sent by updated blogs to PingServers. A major issue recently has been unjustified pings, also known as Spings, sent by Splogs. Splogs have been discussed a lot recently, including an interesting thread on post piracy that Steve Rubel initiated on Micropersuasion.

The problem of splogs prompted us to analyze pings from weblogs.com, which publishes hourly pings as changes.xml. We have been collecting these pings over the last 4 weeks for a total of 40 million pings from around 14 million (so claimed) blogs. To begin with, we fetched these blogs and applied a language identification technique implemented by James Mayfield to determine the language of each one. As expected, most of the pings were from blogs authored in English. But we were able to identify blogs in many other languages as well. For instance, the charts below show the distribution of pings from blogs authored in Italian, over a day and over a week. Each bar denotes the number of pings per hour.


[Charts: Pings over a day; Pings over 8 days]

All times are in GMT; clearly Italian authored blogs display a specific blogging pattern.
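The per-hour bars described above amount to bucketing ping timestamps by hour of day. The following is a rough sketch of that idea, not the code behind the charts: the sample timestamps are invented for illustration, and `pings_per_hour` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical sample of ping timestamps; real input would be the timestamps
# of the collected pings for blogs in one language.
sample_pings = [
    datetime(2005, 12, 1, 8, 5, tzinfo=timezone.utc),
    datetime(2005, 12, 1, 8, 40, tzinfo=timezone.utc),
    datetime(2005, 12, 1, 9, 15, tzinfo=timezone.utc),
    datetime(2005, 12, 1, 21, 30, tzinfo=timezone.utc),
]

def pings_per_hour(timestamps):
    """Return a 24-element list of ping counts, one per hour of day in GMT."""
    counts = Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]

hourly = pings_per_hour(sample_pings)
print(hourly)
```

Each element of `hourly` corresponds to one bar in a one-day chart; keeping everything in GMT, as the charts do, makes patterns from different languages directly comparable.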

In the next step we used our work on splog detection to detect splogs (and hence spings) among the English blogs. Our detection mechanism is close to 90% accurate. As shown in the charts below, pings from blogs average around 8K per hour and those from splogs average around 25K per hour.


[Charts: Blog Pings; Splog Pings]

Clearly almost 3 out of 4 pings are spings! Going back further to the source of these spings, we observed that more than 50% of claimed blogs pinging weblogs.com are splogs.
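The "3 out of 4" figure follows from the two hourly averages quoted above. A quick check, using only those rounded averages (so the result is itself approximate, and it inherits the roughly 90% accuracy of the detector):

```python
# Approximate hourly averages quoted in the post.
blog_pings_per_hour = 8_000
splog_pings_per_hour = 25_000

# Fraction of all pings that come from splogs (spings).
sping_fraction = splog_pings_per_hour / (blog_pings_per_hour + splog_pings_per_hour)

print(f"{sping_fraction:.1%} of pings are spings")
```

With these inputs the fraction comes out at just over 75%, matching the headline figure.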

Given how interesting these preliminary statistics are, the scope for further analysis, and the interest in the resulting dataset, we decided to continuously monitor the pingosphere. So, we now do it “live” on updated blogs published by weblogs.com (delayed by an hour), and have made the results publicly available at http://memeta.umbc.edu. The site lists blogging patterns for many other languages, and compares splogs with blogs. All of our work is part of a larger project, memeta, aimed at analyzing the content and structure of the blogosphere.

We hope our effort is a good complement to existing services (e.g., FightSplog, SplogReporter and SplogSpot) towards combating splogs. We currently publish only simple ping statistics on this site, but do stay tuned for fresh splog and classified blog dumps and much more!

UPDATE: Matthew Hurst from BlogPulse points us to an interesting analysis he has done on a day of weblogs.com pings.
