Twitter, twitter, twitter..

April 26th, 2007

Just couldn’t resist reposting this –

GapingVoid, via kellypuffs.

Spam in Blogs and Social Media

April 3rd, 2007

We recently presented a tutorial on Spam in Blogs and Social Media at ICWSM.

Spam on the Internet dates back over a decade, with its earliest known appearance as an email about the infamous MAKE.MONEY.FAST. campaign. Spam has co-evolved with Internet applications and is now quite common on the World-Wide Web.

As social media systems such as blogs, wikis and bookmark sharing sites have emerged, spammers have quickly developed techniques to infect them as well. The very characteristics underlying the Web, be it version 1.0, 2.0 or 3.0, also enable new varieties of spam.

This tutorial will detail the problem of spam in social media, with an emphasis on spam in blogs. We will discuss different types of spam, the motivation for spammers and the seriousness and nature of the problem. We will then share some of our discussions with groups facing and tackling this issue. The second half of the tutorial will present an overview of spam detection and elimination efforts with an extensive survey of detecting blog spam. The tutorial will help researchers understand the unique aspects of spam in social media, recognize how it affects analytics and information extraction, and identify current research challenges. For practitioners, it will provide strategies for controlling the problem, and identify areas of collaboration across communities.

We have made the slides available online in PDF. As always, we welcome suggestions and/or comments.

Hitwise on Fast Growing Social Networks – Implications

March 14th, 2007

Hitwise is reporting numbers on social network usage among Web users. This is what stood out:

The market share of visits to the custom category of the top 20 social networking sites increased by 11.5% from January 2007 to February 2007. Year-over-year (February 2006 – February 2007) category traffic was up 87%.

This leads to an interesting question on evolving behavior of Web users. At any given point of time, consider user attention to be at one of these categories of content:

  1. Social Networking Sites
  2. Commerce Sites
  3. Feed Readers
  4. Social Content (Blogs, Wikipedia etc.)
  5. Contextual Advertisements
  6. Organic Search Results
  7. Rest of the Web

It’s well known that traffic to the first five categories is either growing or stable. So which of the last two categories is this growth biting into? In either case, we might soon see a headline from Hitwise that reads “Fewer users searching on the Web”, or something similar.

So what does this mean for Google et al.? Less revenue from self-hosted ads, of course, and consequently reduced margins. The solution: buy social networking sites and offer new services, a trend that will (and had better) continue.

(Via Micropersuasion)

eBiquity now on Twitter

March 6th, 2007

The growth of Twitter has been phenomenal over the last couple of months. While its utility is argued by some, current traction suggests this could be another Web 2.0 winner.

Though we were initially circumspect (as were many others), we decided to take the plunge last week. See what we are up to now on our blog sidebar, or follow us directly on Twitter.

The I’s of the Blogosphere

February 25th, 2007

The token “I” (1, 2) can provide interesting cues on the Blogosphere, other than signifying the obvious personal nature of blog posts. “I” sometimes use it to study the growth of the blogosphere (between David Sifry reports of course), or just for fun to see how frequently indices of blog search engines are updated and if any of them are in a “breather” mode.

Two charts on the distribution of “I” in blog posts, one from BlogPulse and the other from Technorati.

BlogPulse reports that around 45% of all postings feature an “I”. Technorati indexes around 400,000 posts featuring “I” per day. Merging the two data points, Technorati indexes around 900,000 posts per day, or around 40,000 posts per hour, a number which has seen no change for almost a year. Nothing new here, the English blogosphere has plateaued. What’s confusing is that this analysis does not agree with David Sifry’s number from October 2006, with around 1.3 million postings per day, putting my analysis off by around 50%. What am I missing here?
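The arithmetic behind this estimate can be sketched as follows. This is only a back-of-the-envelope check using the figures quoted above; it assumes (and this is an assumption, not something either service states) that BlogPulse’s 45% share also holds for Technorati’s index. The variable names are ours.

```python
# Back-of-the-envelope estimate from the figures quoted in the post.
# Assumption: the BlogPulse share of posts containing "I" also applies
# to Technorati's index.
share_with_i = 0.45               # BlogPulse: share of postings featuring "I"
posts_with_i_per_day = 400_000    # Technorati: posts featuring "I" per day

total_per_day = posts_with_i_per_day / share_with_i
total_per_hour = total_per_day / 24

sifry_per_day = 1_300_000         # David Sifry's figure, October 2006
gap = (sifry_per_day - total_per_day) / total_per_day

print(f"Estimated posts per day:  {total_per_day:,.0f}")
print(f"Estimated posts per hour: {total_per_hour:,.0f}")
print(f"Sifry's figure is {gap:.0%} above the estimate")
```

The unrounded estimate lands a little under 900,000 posts per day, so the gap to the Sifry figure comes out slightly below the “around 50%” quoted above.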

As an aside, this raises the question of the growth of blogs in non-US English-speaking geographies, India for instance.

Of course the same analysis can be done with other keywords, but none of them gives the same coverage, nor are they as temporally independent as “I”. Any other interesting uses of buzz charts?

AIRWeb 2007: Papers Due February 23

February 11th, 2007

AIRWeb 2007 is the third in a series of workshops on Adversarial Information Retrieval on the Web. This year the workshop also features a web spam challenge.

AIRWeb is a series of international workshops focusing on Adversarial Information Retrieval on the Web that brings together both researchers and industry practitioners, to present and discuss advances in the state of the art. This year, AIRWeb’2007 will be co-located with the WWW’07 conference in Banff, Canada. The workshop will include a Web Spam challenge that will test different spam detection techniques on a shared reference collection.

The call for papers lists an interesting set of problems, including new ones like malicious tagging.

* Link spam
* Content spam
* Cloaking
* Comment spam
* Spam-oriented blogging
* Click fraud detection
* Reverse engineering of ranking algorithms
* Web content filtering
* Advertisement blocking
* Stealth crawling
* Malicious tagging

Web spam is one area where research is highly influenced by discussions with practitioners. With sponsorship and involvement from industry leaders, this should be a great venue to seek input. We plan to submit a paper on our continuing work on splogs.

Pings, Spings, Splogs and the Splogosphere: 2007 Updates

February 1st, 2007

We present some updates on the Splogosphere as seen at a pingserver. This follows our study from a year earlier which reported on splogs in the English-speaking blogosphere. Our current update is based on 8.8 million pings seen between January 23rd and January 26th. Though not fully representative, it does give a good sense of spam in the indexed blogosphere.

(i) 53% of all pings are spam, and 64% of all pings from blogs in English are spam. A year earlier we found that close to 75% of all pings from English blogs were spings. Dave Sifry reported seeing 70% spings in his last report. Clearly the growth of spings has plateaued: one less thing to worry about.

(ii) 56% of all pinging blogs are spam. By collapsing these pings to their respective blogs, we chart the distribution of authentic blogs against splogs. These numbers have seen no change: 56% of all pinging blogs are splogs.

(iii) MySpace is now the biggest contributor to the blogosphere. The other key drivers, LiveJournal and blogs managed by SixApart (as seen at their update stream), contribute only 50-60% of what MySpace does. The growth of MySpace blogs has in fact dwarfed the growth of splogs! Further, if MySpace is discounted in our analysis, close to 84% of all pings are spings! Though MySpace is relatively splog free, we are beginning to notice splogs there, something blog harvesters should keep an eye on. [Note that not all blogspot blogs ping]

(iv) Blogspot continues to be heavily spammed. Most of this spam, however, is now detected by blog search engines, a point also shared by Matt Cutts and Randy Morin. In all of the pings we processed, 51% of blogspot blogs were spam!

(v) Most spam blogs are still hosted in the US. We ranked IPs associated with spam blogs based on their frequency of pings, and located them using ARIN.

1. Mountain View, CA
2. Washington DC
3. San Francisco, CA
4. Orlando, FL
5. Lansing, MI

Blogspot hosts the highest number of splogs, but we also found that most of the other top hosts were physically hosted in the US. Perhaps Jonathan Bailey knows more about the legal ramifications.
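The ranking step in (v) can be sketched roughly as below. This is a minimal illustration, not our actual pipeline: the ping records and the function name are hypothetical, the IP addresses come from the documentation range, and the ARIN lookup that maps each IP to a location is left out.

```python
from collections import Counter

# Hypothetical sample: (ip_address, blog_url) pairs for pings already
# labeled as coming from spam blogs.
spam_pings = [
    ("192.0.2.10", "http://splog-a.example/"),
    ("192.0.2.10", "http://splog-b.example/"),
    ("192.0.2.10", "http://splog-a.example/"),
    ("198.51.100.7", "http://splog-c.example/"),
    ("198.51.100.7", "http://splog-d.example/"),
    ("203.0.113.5", "http://splog-e.example/"),
]

def rank_ips_by_pings(pings):
    """Return (ip, ping_count) pairs sorted by descending ping count."""
    counts = Counter(ip for ip, _blog in pings)
    return counts.most_common()

ranked = rank_ips_by_pings(spam_pings)
for ip, count in ranked:
    print(ip, count)
```

Each top-ranked IP would then be looked up in a registry such as ARIN to obtain a location, which is how a list like the one above is produced.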

(vi) Content on .info domains continues to be a problem. 99.75% of all blogs hosted on these domains are spam. In other words, 1.65 million blogs were spam as opposed to only around 4K authentic blogs! As long as these domains are cheap and keyword rich, this trend is likely to continue. Sploggers are also exploiting private domain registration services (see here).

(vii) High PPC contexts remain the primary motivation to spam. We identified the top keywords associated with spam blogs and generated a tag cloud using keyword frequency.

***** auto big buy california cancer card casino cheap college consolidation credit debt diet digital discount dvd equipment estate finance florida forex free furniture gift girls golf health hotel info insurance jewelry lawyer loan loans medical money mortgage new online phone poker rental sale school *** small software texas **** trading travel used vacation video wedding

We link these keywords to depict an emerging problem that is quickly becoming serious. We posted on this recently, though references date to quite a while back. [See related tag spam notes on MyWeb and Technorati]
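The keyword-frequency step behind a tag cloud like the one in (vii) can be sketched as follows. The sample titles, the stopword list and both function names are hypothetical, and the linear font scaling is just one common choice, not necessarily the one we used.

```python
import re
from collections import Counter

# Hypothetical spam blog titles; real input would be text from labeled splogs.
splog_titles = [
    "Cheap car insurance online",
    "Free credit card debt consolidation",
    "Cheap hotel and vacation rental",
    "Online poker casino free",
]
STOPWORDS = {"and", "the", "of", "for"}

def keyword_frequencies(texts, stopwords=STOPWORDS):
    """Count lowercase word tokens across all texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts

def font_sizes(counts, min_pt=10, max_pt=30):
    """Scale each keyword's frequency linearly to a font size in points.

    Assumes counts is non-empty.
    """
    lo, hi = min(counts.values()), max(counts.values())
    span = max(hi - lo, 1)  # avoid division by zero when all counts are equal
    return {w: min_pt + (c - lo) * (max_pt - min_pt) / span
            for w, c in counts.items()}

freqs = keyword_frequencies(splog_titles)
sizes = font_sizes(freqs)
for word in sorted(freqs):
    print(f"{word}: count={freqs[word]}, size={sizes[word]:.0f}pt")
```

Sorting the keywords alphabetically and rendering each at its computed size gives the familiar tag cloud layout.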

We will continue our effort on tackling spam. Our ongoing research on spam is catalogued in our tagged splog resources, or better still check out our tutorial at ICWSM this March!

Tag Spam on the Rise

January 24th, 2007

I subscribe to and follow keywords of interest through RSS feeds, both on blog search engines and bookmarking tools. Though splogs have always been a problem, lately I have noticed increasing spam in bookmarking tools. What do we call it: b00kmarks? (read zero, zero)

In the more popular ones the LONG TAIL is highly compromised (1, 2), while in the less popular ones even the HEAD seems to have problems. The availability of many ready-to-use “tag and ping” tools is making things worse.

While my immediate response is to unsubscribe, being researchers we will of course investigate this further.

Splogs in the Non-English Blogosphere

January 20th, 2007

We are conducting research on the nature and seriousness of the splog problem in the non-English blogosphere. As contextual advertisements and affiliate marketing become more profitable in these other languages, splogs are bound to infiltrate and pollute them. We suspect this is already beginning to happen, in a limited way, and are interested in studying it.

From the research community, the only work related to non-English splogs is:

Detecting Blog Spams using the Vocabulary Size of All Substrings in Their Copies, by Kazuyuki Narisawa, Yasuhiro Yamada, Daisuke Ikeda and Masayuki Takeda
Most of this work is however based on synthetic data, not actual splogs.

We have also tried to detect splogs by querying blog search engines using translated spammy (profitable) advertising contexts like insurance, vacations, loans etc. Cultural differences indicate that this might not be the way to go. True, we haven’t come across many splogs.

This is what prompts us to seek suggestions from the blogging and research community. If you know of this problem, have seen splogs in other languages, know of spammy non-English advertising contexts, and would like to contribute or collaborate, please send either of us a note or comment below.

UPDATE: For our readers, is this a splog in Japanese?

Pew on Social Network Usage

January 7th, 2007

Pew Internet just released a survey on social network usage among American teens. While confirming the known, (i) MySpace is the most visited site (ii) girls are more active than boys, the report goes on to say that “74% of respondents post comments on friend’s blog”. I would have loved to see numbers on “x% of respondents have created or plan to create blog posts”.

MySpace has been a major contributor to the growth of blogs recently, and now forms a major chunk of a blog search result. The last we checked, MySpace blogs were contributing 15% to 20% of all pings, with numbers rising by the day.

Top Viewed Publications at Ebiquity in 2006

January 4th, 2007

Listed below are some of our top viewed papers in 2006:

  1. XPod a human activity and emotion aware mobile music player
  2. Swoogle: A Search and Metadata Engine for the Semantic Web
  3. SOUPA: Standard Ontology for Ubiquitous and Pervasive Applications
  4. SVMs for the Blogosphere: Blog Identification and Splog Detection
  5. Finding and Ranking Knowledge on the Semantic Web
  6. Detecting Spam Blogs: A Machine Learning Approach
  7. Information Retrieval and the Semantic Web
  8. An Intelligent Broker Architecture for Pervasive Context-Aware Systems
  9. Social Networking on the Semantic Web
  10. Secure Routing and Intrusion Detection in Ad Hoc Networks

This maps well onto the broad areas we work on. If you are interested, our group website also allows viewing papers by cumulative citations and downloads.

Top Blog Posts and Referrers for 2006

December 28th, 2006

Bloggers have been publishing their top viewed posts for this year. Here’s our contribution to the Blogosphere, top ten pages, posts or categories viewed in 2006:

  1. Splog Software From Hell
  2. Posts on Swoogle
  3. 100 Most common RDF Namespaces
  4. ICWSM 2007 Blogs Dataset
  5. Welcome to the Splogosphere
  6. EZ Google maps for your web page
  7. Untangling ontologies on the Semantic Web
  8. Thieves use Bluetooth to find laptops to steal
  9. Big OWL documents on the Semantic Web
  10. Big RDF documents on the Semantic Web

While we are pleased that we created useful content, we would also like to acknowledge the referrers who helped us reach our audience.

  1. Google
  2. Stumble Upon
  3. Yahoo
  4. Swoogle
  5. Wikipedia
  7. Slashdot
  8. MSN
  9. Technorati
  10. DIGG

In that order, with Google contributing more than 50%.