Blog comment spam with plagiarized text: hard to spot
Tim Finin, 1:00pm 10 August 2006
An easy way to boost a Web page’s rank is to leave comments on blog posts with a link back to your page. Popular blogging systems let the commenter associate a link back to her Web page even when links in the comment body itself are disallowed, as they often are. So a typical spam comment has, in addition to the commenter’s name and link, some text. This text comes in many varieties, including:
- Obvious come-ons: Free ringtones!
- Generic statements: Looking for information and found it at this great site.
- Markovian delights: Uruguay’a sunlit straightened avers compellingly levelly
- Random letters: sherthul axrteg hioqurtch
Today I noticed that someone tried to post a comment on an ebiquity post on large RDF documents:
Name: Thommes | E-mail: info@linkfeed.de | URI: http://www.linkfeed.de | IP: 145.254.226.204 | Date: 8/9/2006
The future for RDF for learning technology specifications is bright, and the possibilities opened up by RDF and Semantic Web technologies promise to take learning technology project to a new level of applications.
It’s quite relevant to our post, but not to the commenter’s Web site. Google reveals that the comment text was plagiarized from the Web. I have not been able to reconstruct the strategy that the spammer or his bot must have used. Does it start with a random sentence selected from the web and then use a feed or blog search engine to find posts to comment on? Or does it find a post to comment on and then search on the post’s title to find a Web page from which to plagiarize a sentence?
In any case, this kind of comment spam is going to be hard to recognize automatically. Catching this requires correctly identifying the commenter’s link as pointing to a Web spam page. And for me, verifying that the comment text was plagiarized convinced me that the target was spam.
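The plagiarism check described above can be roughed out in code. The following is only a minimal sketch, assuming you already have the text of a candidate source page in hand (for example, one found by searching for a phrase from the comment, which this sketch does not do). The function names, the 5-word shingle size, and the 0.8 threshold are illustrative assumptions, not part of any existing spam filter, and the sketch does nothing about the other half of the problem, deciding whether the commenter's link points to a spam page.

```python
import re

def shingles(text, k=5):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Texts shorter than k words yield no shingles at all.
    return {tuple(words[i:i + k]) for i in range(max(0, len(words) - k + 1))}

def containment(comment, page, k=5):
    """Fraction of the comment's shingles that also occur in the page (0.0 to 1.0)."""
    comment_shingles = shingles(comment, k)
    if not comment_shingles:
        return 0.0
    return len(comment_shingles & shingles(page, k)) / len(comment_shingles)

def looks_plagiarized(comment, page, threshold=0.8, k=5):
    """Hypothetical decision rule: most of the comment appears verbatim in the page."""
    return containment(comment, page, k) >= threshold
```

Containment is used rather than symmetric similarity because the comment is typically one sentence lifted from a much longer page. A comment that has been lightly reworded would score lower, so the threshold is a judgment call.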

December 15th, 2007 at 7:25 am
You raise an interesting point here; and I may have an explanation for you.
First of all, in the webmaster/seo/sem (or what have you) world, back-links from .edu domains are really important (Google for example gives a lot of weight to these). Forgive me if you already knew this and I’m just stating the obvious but this brings me to my actual point.
These kinds of links are really hard to get for the simple reason that we search engine marketers generally don’t have websites in niches that identify with the .edu’s we’re trying to get back-links from (i.e., my website, for example, is geared towards myspace customization … absolutely no tie whatsoever to this blog).
Now, amongst SEMs who actually know what they’re doing, it’s common knowledge that the only way to easily get back-links from .edu’s is to sit down and post relevant content to said .edu (either through a comment or by submitting an article or whatever). This would be impossible to catch through comment spam systems such as Akismet (you might want to check that out btw: http://akismet.com), and the relevancy of the content would ensure that the submission would get past any sort of moderation.
I’m pretty confident that the comment you’re referring to in your article has been submitted by an actual person reading your blog — in the spirit of the technique I just mentioned, or perhaps with no interest other than to contribute his opinion.
My conviction that it’s an actual person and not a spam bot is because the majority of spammers are not sophisticated enough to employ semantic and contextual mechanisms to identify your blog’s topic and then scrape relevant and copyrighted content and run it through a Markov chain algorithm and then post it. Then again, some of them are.
However, the question I would ask myself if I were in your position is “what do I want to delete this relevant content for?”. Indeed some people may have posted that just to get a link, and it may or may not be spam that was generated through a complex contextual mechanism, but if it’s relevant, does it really matter?
Also, now that I think of what I just told you in my comment here in contrast to my site’s topic, you may see this post as being just an attempt to get a link.
December 25th, 2007 at 9:49 pm
I agree with the point that majority of blog comment spam with plagiarized text is not intentional. So it would be unfair to blame spam bots trying to spam blog comments because sophisticated technology is absent in them to identify a blog’s topic and posting copyrighted content matching the theme. Since very few websites are present in niches that identify with the .edu’s , an actual commenter ignores the usual practice of posting and posts relevant comment to the blog only after reading it with an intention to get backlinks.
April 6th, 2008 at 12:48 am
google
April 9th, 2008 at 11:11 am
I agree with you that the comment was likely copied off the web, but it is not likely automated. It probably was manually submitted by a live person and was pasted in by the person posting. I have had several of my writings and articles appear on other web sites with people passing off my work (or portions thereof) as their own. It is a shame people are not original in their writings, and then must use other people’s writings to post on other sites to get attention.
April 10th, 2008 at 2:11 pm
It is hard to tell if it was a bot or a human being that was behind the spamming. If I were to put my money on it, I would say it was a person. People will do anything these days to get more recognition for their website, and it doesn’t matter who they walk over to get it. It is unfortunate that people have no respect for other people and their websites. Plagiarism is huge on the internet, and I have emailed cease and desist orders to a few people who have copied me. What a shame.
May 11th, 2008 at 12:18 pm
Very thx…
“It is hard to tell if it was a bot or a human being that was behind the spamming. If I were to put my money on it, I would say it was a person. People will do anything these days to get more recognition for their website, and it doesn’t matter who they walk over to get it. It is unfortunate that people have no respect for other people and their websites. Plagiarism is huge on the internet, and I have emailed cease and desist orders to a few people who have copied me. What a shame.”
++1
May 12th, 2008 at 10:54 am
This is like a strategy game to pass the moderation test without failure. I would also say this is a real person who tries to put up some quality text from the web that matches the topic as closely as possible. But plagiarism is another issue of the internet, as stated by other commenters here: you are stealing not only people’s hard work but their passion to write original stuff. Maybe someone could find a way to spot this kind of spam comment and make a plugin for public use. It would be a very nice job for programmers.
May 12th, 2008 at 4:37 pm
I encounter this frequently with some of my websites and with articles I have written that allow comments and I do not own or post on any .edu domains.
I’ve found that the only way I can combat the comment spam is by filtering responses before they ever make it to being posted.
Stephanie
May 12th, 2008 at 4:40 pm
By the way, I found it interesting that after my first comment posted it took me back up to the top of the screen that showed this site’s Google ads. Two of the Google links were adult in nature haha. Adult Blog and Bad Girls Blog. You may want to mention to whoever maintains the Google AdSense on this account, if it is not yourself, to disallow adult websites from advertising on a university webpage.. just a thought.
June 3rd, 2008 at 7:59 am
Nice read.. I used to remove all those spam comments manually at first, then I got tired and started using some tools like Akismet. Mostly, the bots are doing all the spam comments and they learn new tricks every day. Like they’ll take a piece of the topic and say.. “I couldn’t understand the last part” where there isn’t any last part! It’s a picture :]
Now, to prevent spam perfectly, either moderate everything or let Akismet + manual work handle the spammers.
MoiN
June 4th, 2008 at 8:43 am
Akismet is good for WordPress; I hope there is something similar for Blogspot. Do you believe there are people using software to create hundreds of comments to spam blogs?
June 5th, 2008 at 12:47 pm
I have the same spam problem too. Fortunately, on Blogspot I’ve never received spam (I’m a lucky guy). Akismet is a great solution for WordPress. But for Blogspot? Is there a way to prevent spam (if it comes…)?
July 1st, 2008 at 8:14 am
[...] or comments. Here’s an example from today’s batch, a comment on a two-year old post Blog comment spam with plagiarized text: hard to spot from cameroun trying to promote the site africapresse.com. “spam is a real problem in this [...]
July 31st, 2008 at 5:09 pm
I wasn’t going to reply to this as the post is quite old. However, I see the discussion is still ongoing. The spammers are tweaking their bots and utilizing advanced techniques to completely mask their true intentions. I run a social bookmarking website in which spammers have introduced a new program that runs through proxies, creates unlimited accounts, spams thousands of social bookmarking websites in one run, breaks captcha codes with ease, and generally ruins the entire community with irrelevant, blacklisted trash websites in a matter of hours. You would think the people with the talent to create such programs would use their skills on more productive projects that actually helped fix a problem, instead of creating new ones. One word.. greed. They are charging almost $200 for this program, and the creators are going to collect quite a sum of money for ruining other people’s hard work.
August 5th, 2008 at 1:24 pm
I am using a challenge plugin for my WordPress to avoid spam comments; Akismet is not enough to block all the spam.
August 18th, 2008 at 10:55 am
I still haven’t seen an answer to the question raised earlier: if it is relevant and is not inflammatory, do you not run the risk of deleting valid commentary? As to human or bot, there should be a new category of hubot which would include those folk of third world status who, while not bots, are also not fully participatory in the forums (i.e., not human). As to duplicate checkers, all it takes is rewriting in a small way to beat them.
September 2nd, 2008 at 12:00 pm
It seems that in the fight against spam, short pointless comments for backlinks and now on topic plagiarized comments we will all have to find ways that are unobtrusive to normal real readers, and at the same time devise a way that is better than captcha but still involves the need for a human element.
If not then I guess I am doomed to answering what 4+4 is or if something is hot or cold, and what was that last one? Am I human? Are you serious?
I think we need to hold a conference on this and put pens to paper and see what we can come up with. Captcha is busted and we need something better.
September 8th, 2008 at 5:24 pm
At first I thought it looked kind of natural to me, but then the repetition of *RDF* and *Technology* suggested to me that they may have been the anchor keywords this bot was trying to gain popularity in. The scary thought is these bots can be outsourced on places like Rent A Coder and built for about $100, probably by a person whose first language is not even English, and here we are debating if it’s real or automated. Kinda funny.
September 13th, 2008 at 11:25 am
I keep seeing comment spam that looks more and more like real comments. Soon the technology the spammers use will be so good, it will get by the best filters.
September 30th, 2008 at 3:23 pm
Definitely an interesting debate as to whether these apparent spammers are human or not. I have seen blogs with so many comments that it is sometimes hard (except for the obvious – great post – nice post – where did you get this template, etc) to tell the good from the bad. Hard work, but comments can always be moderated, or trashed even after they have been published. Only filter I know that actually works.
October 1st, 2008 at 3:16 pm
I think there’s a physical spammer behind that kind of comment. Bots aren’t that sophisticated…
October 17th, 2008 at 5:47 pm
This is an interesting situation. We don’t know if the post was done by a person or a computer.
Alan Turing said that real artificial intelligence exists only if the output from a computer is indistinguishable from a human response.
Are we there yet?
November 19th, 2008 at 5:54 pm
I had the same situation too. Sorry for my bad English, but I am German. All I could understand was interesting. Nice post, cu
November 19th, 2008 at 6:35 pm
Hi there. I recently was reading a blog post on spamming. The writer mentioned your blog post and that’s how I came across this post. I too am a webmaster however I must disagree with this comment made by Gravity:
” … Gravity Says: August 5th, 2008 at 1:24 pm
I am using a challenge plugin for my WordPress to avoid spam comments; Akismet is not enough to block all the spam.
…”
I have been emailed by WordPress to moderate just three spam issues. About 98% of all my spam is caught automatically with Akismet. I love the Akismet WP plug-in!
Daniel,
Sidney, BC