Just couldn’t resist reposting this –

GapingVoid, via kellypuffs.
We recently presented a tutorial on Spam in Blogs and Social Media at ICWSM.
Spam on the Internet dates back over a decade, with its earliest known appearance as an email about the infamous MAKE.MONEY.FAST. campaign. Spam has co-evolved with Internet applications and is now quite common on the World-Wide Web.
As social media systems such as blogs, wikis and bookmark sharing sites have emerged, spammers have quickly developed techniques to infect them as well. The very characteristics underlying the Web, be it version 1.0, 2.0 or 3.0, also enable new varieties of spam.
This tutorial will detail the problem of spam in social media, with an emphasis on spam in blogs. We will discuss different types of spam, the motivation for spammers and the seriousness and nature of the problem. We will then share some of our discussions with groups facing and tackling this issue. The second half of the tutorial will present an overview of spam detection and elimination efforts with an extensive survey of detecting blog spam. The tutorial will help researchers understand the unique aspects of spam in social media, recognize how it affects analytics and information extraction, and identify current research challenges. For practitioners, it will provide strategies for controlling the problem, and identify areas of collaboration across communities.
We have made the slides available online in PDF. As always, we welcome suggestions and/or comments.
Hitwise is reporting numbers on social network usage among Web users. This is what stood out:
The market share of visits to the custom category of the top 20 social networking sites increased by 11.5% from January 2007 to February 2007. Year-over-year (February 2006 – February 2007) category traffic was up 87%.
This leads to an interesting question on evolving behavior of Web users. At any given point of time, consider user attention to be at one of these categories of content:
It’s well known that traffic to the first five categories is either growing or stable. So which of the last two categories is this growth biting into? Either way, we might soon see a headline from Hitwise along the lines of “Fewer users searching on the Web”.
So what does this mean for Google et al.? Less revenue from self-hosted ads, of course, and consequently reduced margins. The solution: buy social networking sites and offer new services, a trend that will (and had better) continue.
(Via Micropersuasion)
The growth of Twitter has been phenomenal over the last couple of months. While its utility is argued by some, current traction suggests this could be another Web 2.0 winner.
The token “I” (1, 2) can provide interesting cues on the Blogosphere, beyond signifying the obvious personal nature of blog posts. “I” sometimes use it to study the growth of the blogosphere (between David Sifry’s reports, of course), or just for fun, to see how frequently the indices of blog search engines are updated and whether any of them are in a “breather” mode.
Two charts on the distribution of “I” in blog posts, one from BlogPulse and the other from Technorati.
BlogPulse reports that around 45% of all postings feature an “I”. Technorati indexes around 400,000 posts featuring “I” per day. Merging the two data points, Technorati indexes around 900,000 posts per day (400,000 / 0.45), or roughly 40,000 posts per hour, a number that has seen no change for almost a year. Nothing new here: the English blogosphere has plateaued. What’s confusing is that this analysis does not agree with David Sifry’s number from October 2006, around 1.3 million postings per day, which puts my estimate off by around 50%. What am I missing here?
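For anyone who wants to redo the back-of-the-envelope estimate, here is a minimal sketch in Python. It uses only the rounded figures quoted in this post (the 45% share from BlogPulse, the 400,000 posts per day from Technorati, and the 1.3 million from the Sifry report); the function and variable names are our own and purely illustrative.

```python
def estimate_total_posts(posts_with_token_per_day, token_share):
    """Estimate total daily posts from the number of posts containing a token
    and the fraction of all posts that contain that token."""
    if not 0 < token_share <= 1:
        raise ValueError("token_share must be in (0, 1]")
    return posts_with_token_per_day / token_share


# Rounded figures quoted in the post (not fresh measurements).
posts_with_i_per_day = 400_000   # Technorati: posts featuring "I" per day
share_with_i = 0.45              # BlogPulse: share of postings featuring "I"
reported_total = 1_300_000       # Sifry report, October 2006

estimated_total = estimate_total_posts(posts_with_i_per_day, share_with_i)
estimated_per_hour = estimated_total / 24
# How much larger the reported number is, relative to our estimate.
gap = (reported_total - estimated_total) / estimated_total

print(f"Estimated posts per day:  {estimated_total:,.0f}")
print(f"Estimated posts per hour: {estimated_per_hour:,.0f}")
print(f"Reported total exceeds the estimate by {gap:.0%}")
```

The estimate is only as good as the assumption that the BlogPulse share carries over to Technorati's index, which may be part of the discrepancy.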
As an aside this brings to question the growth of blogs in non-US English speaking geographies, India for instance.
Of course, the same analysis can be done with other keywords, but none of them gives the same coverage, nor are they as temporally independent as “I”. Any other interesting uses of buzz charts?
AIRWeb 2007 is third in a series of workshops on Adversarial Information Retrieval on the Web. This year the workshop also features a web spam challenge.
AIRWeb is a series of international workshops focusing on Adversarial Information Retrieval on the Web that brings together both researchers and industry practitioners, to present and discuss advances in the state of the art. This year, AIRWeb’2007 will be co-located with the WWW’07 conference in Banff, Canada. The workshop will include a Web Spam challenge that will test different spam detection techniques on a shared reference collection.
The call for papers lists an interesting set of problems, including new ones like malicious tagging.
* Link spam
* Content spam
* Cloaking
* Comment spam
* Spam-oriented blogging
* Click fraud detection
* Reverse engineering of ranking algorithms
* Web content filtering
* Advertisement blocking
* Stealth crawling
* Malicious tagging
Web spam is one area where research is highly influenced by discussions with practitioners. With sponsorship and involvement from industry leaders, this should be a great venue to seek inputs. We plan to submit a paper on our continuing work on splogs.
We present some updates on the Splogosphere as seen at a pingserver (weblogs.com). This follows our study from a year earlier which reported on splogs in the English speaking blogosphere. Our current update is based on 8.8 million pings on weblogs.com between January 23rd and January 26th. Though not fully representative, it does give a good sense of spam in the indexed blogosphere.
(i) 53% of all pings are spam, and 64% of all pings from blogs in English are spam. A year earlier we found that close to 75% of all pings from English blogs were spings. Dave Sifry reported seeing 70% spings in his last report. Clearly the growth of spings has plateaued: one less thing to worry about.
(v) Most spam blogs are still hosted in the US. We ranked IPs associated with spam blogs based on their frequency of pings, and located them using ARIN.
1. Mountain View, CA
2. Washington DC
3. San Francisco, CA
4. Orlando, FL
5. Lansing, MI
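The ranking step behind this list can be sketched in a few lines of Python. This is an illustration only, not the code we ran: the `(blog, ip)` input format, the function name, and the sample data (which uses reserved documentation IP ranges) are all assumptions, and the ARIN lookup that maps an IP to a location is not shown.

```python
from collections import Counter


def rank_ips_by_pings(pings, spam_blogs, top_n=5):
    """Count pings per IP, considering only blogs labelled as spam,
    and return the top_n (ip, ping_count) pairs, most frequent first."""
    counts = Counter(ip for blog, ip in pings if blog in spam_blogs)
    return counts.most_common(top_n)


# Hypothetical sample: (blog, ip) pairs as they might be extracted from a ping log.
sample_pings = [
    ("a.example", "192.0.2.1"),
    ("a.example", "192.0.2.1"),
    ("b.example", "198.51.100.7"),
    ("c.example", "203.0.113.9"),
    ("a.example", "192.0.2.1"),
    ("b.example", "198.51.100.7"),
]
sample_spam_blogs = {"a.example", "b.example"}  # hypothetical spam labels

for ip, n in rank_ips_by_pings(sample_pings, sample_spam_blogs):
    print(ip, n)
```

Each top-ranked IP would then be looked up separately to find where it is registered.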
(vi) Content on .info domains continues to be a problem. 99.75% of all blogs hosted on these domains are spam. In other words, 1.65 million blogs were spam as opposed to only around 4K authentic blogs! As long as these domains are cheap and keyword-rich, this trend is likely to continue. Sploggers are also exploiting private domain registration services (see here).
(vii) High PPC contexts remain the primary motivation to spam. We identified the top keywords associated with spam blogs and generated a tag cloud using keyword frequency.
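A frequency-based tag cloud boils down to counting keywords and scaling each count to a font size. The sketch below shows one simple way to do that in Python; it is not the tool we used, and the tokenizer, the linear scaling, the function name and the sample texts are all assumptions made for illustration.

```python
import re
from collections import Counter


def keyword_weights(texts, stopwords=frozenset(), top_n=10, min_size=10, max_size=40):
    """Return (keyword, count, font_size) for the top_n most frequent keywords.
    Font sizes are scaled linearly between min_size and max_size."""
    words = []
    for text in texts:
        # Crude tokenizer: lowercase runs of ASCII letters only.
        words.extend(w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords)
    top = Counter(words).most_common(top_n)
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    span = (highest - lowest) or 1  # avoid division by zero when all counts are equal
    return [
        (word, count, min_size + (count - lowest) * (max_size - min_size) / span)
        for word, count in top
    ]


# Hypothetical snippets standing in for text from spam blogs.
sample_texts = [
    "cheap loans and cheap insurance",
    "insurance quotes",
    "cheap vacations",
]

for word, count, size in keyword_weights(sample_texts, stopwords={"and"}, top_n=3):
    print(f"{word}: count={count}, font size={size:.0f}")
```

Linear scaling is the simplest choice; with heavily skewed counts, scaling on the logarithm of the count usually gives a more readable cloud.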
We will continue our effort on tackling spam. Our ongoing research on spam is catalogued in our tagged splog resources, or better still check out our tutorial at ICWSM this March!
I subscribe to and follow keywords of interest through RSS feeds, both on blog search engines and bookmarking tools. Though splogs have always been a problem, lately I have noticed increasing spam in bookmarking tools. What do we call it: b00kmarks? (read: zero, zero)
In the more popular ones (like del.icio.us) the LONG TAIL is highly compromised (1, 2), while in the less popular ones even the HEAD seems to have problems. The availability of many ready-to-use “tag and ping” tools is making things worse.
We are conducting research on the nature and seriousness of the splog problem in the non-English blogosphere. As contextual advertisements and affiliate marketing become more profitable in these other languages, splogs are bound to infiltrate and pollute them. We suspect this is already beginning to happen, in a limited way, and are interested in studying it.
From the research community, the only work related to non-English splogs is:
Detecting Blog Spams using the Vocabulary Size of All Substrings in Their Copies, by Kazuyuki Narisawa, Yasuhiro Yamada, Daisuke Ikeda and Masayuki Takeda
Most of this work is however based on synthetic data, not actual splogs.
We have also tried to find splogs by querying blog search engines using translated spammy (profitable) advertising contexts like insurance, vacations, loans etc. Cultural differences indicate that this might not be the way to go. True, we haven’t come across many splogs.
This is what prompts us to seek suggestions from the blogging and research community. If you know of this problem, have seen splogs in other languages, know of spammy non-English advertising contexts, and would like to contribute or collaborate please send either of us a note or comment below.
UPDATE: For our readers, is this a splog in Japanese? http://diet.newstanding.com/hcm/vpb/
Pew Internet just released a survey on social network usage among American teens. While confirming the known, (i) MySpace is the most visited site (ii) girls are more active than boys, the report goes on to say that “74% of respondents post comments on friend’s blog”. I would have loved to see numbers on “x% of respondents have created or plan to create blog posts”.
MySpace has been a major contributor to the growth of blogs recently, and now forms a major chunk of a blog search result. The last we checked MySpace blogs were contributing 15% to 20% of all pings to weblogs.com, with numbers rising by the day.
Listed below are some of our top viewed papers in 2006:
This maps well into the broad areas we work on. If you are interested, our group website also allows viewing papers by cumulative citations, and downloads.
Bloggers have been publishing their top viewed posts for this year. Here’s our contribution to the Blogosphere, top ten pages, posts or categories viewed in 2006:
While we are pleased that we created useful content, we would also like to acknowledge referrers who helped reach our audience.
..in that order, with Google contributing more than 50%.