UMBC ebiquity research group Building intelligent systems in open, heterogeneous, dynamic, distributed environments
Blog comment spam with plagiarized text: hard to spot

Tim Finin, 1:00pm 10 August 2006

An easy way to boost a Web page’s rank is to leave comments on blog posts with a link back to your page. Popular blogging systems let the commenter associate a link to her Web page with the comment, even when links in the comment body itself are disallowed, as they often are. So a typical spam comment has, in addition to the commenter’s name and link, some text. This text comes in many varieties, including:

  • Obvious come-ons: Free ringtones!
  • Generic statements: Looking for information and found it at this great site.
  • Markovian delights: Uruguay’a sunlit straightened avers compellingly levelly
  • Random letters: sherthul axrteg hioqurtch

Today I noticed that someone tried to post a comment on an ebiquity post on large RDF documents:

Name: Thommes | E-mail: info@linkfeed.de | URI: http://www.linkfeed.de | IP: 145.254.226.204 | Date: 8/9/2006

The future for RDF for learning technology specifications is bright, and the possibilities opened up by RDF and Semantic Web technologies promise to take learning technology project to a new level of applications.

It’s quite relevant to our post, but not to the commenter’s Web site. Google reveals that the comment text was plagiarized from the Web. I have not been able to reconstruct the strategy that the spammer or his bot must have used. Does it start with a random sentence selected from the web and then use a feed or blog search engine to find posts to comment on? Or does it find a post to comment on and then search on the post’s title to find a Web page from which to plagiarize a sentence?

In any case, this kind of comment spam is going to be hard to recognize automatically. Catching it requires correctly identifying the commenter’s link as pointing to a Web spam page. In this case, what convinced me that the target was spam was verifying that the comment text was plagiarized.
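The plagiarism check I did by hand with Google could be approximated in code. What follows is only a minimal sketch, not a description of any existing filter: it assumes you already have the text of some candidate source pages (for example, from a Web search on a phrase in the comment), and the function names, the five-word shingle size and the 0.8 threshold are all made up for illustration. It measures what fraction of the comment’s word sequences appear verbatim in a candidate page.

```python
import re


def shingles(text, n=5):
    """Return the set of n-word sequences in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(comment, candidate, n=5):
    """Fraction of the comment's shingles that also occur in the candidate page."""
    comment_shingles = shingles(comment, n)
    if not comment_shingles:
        # Comment is shorter than n words, so there is nothing to compare.
        return 0.0
    shared = comment_shingles & shingles(candidate, n)
    return len(shared) / len(comment_shingles)


def looks_plagiarized(comment, candidate_pages, n=5, threshold=0.8):
    """True if any candidate page contains most of the comment verbatim.

    The threshold is an illustrative guess, not a tuned value.
    """
    return any(containment(comment, page, n) >= threshold
               for page in candidate_pages)
```

A high score says only that the text was copied. It says nothing about whether the commenter’s link is spam, so a check like this could at best be one signal among several.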

24 Responses to “Blog comment spam with plagiarized text: hard to spot”

  1. Myspace Graphics Says:

    You raise an interesting point here; and I may have an explanation for you.

    First of all, in the webmaster/seo/sem (or what have you) world, back-links from .edu domains are really important (Google for example gives a lot of weight to these). Forgive me if you already knew this and I’m just stating the obvious but this brings me to my actual point.
    These kinds of links are really hard to get for the simple reason that we search engine marketers generally don’t have websites in niches that identify with the .edu’s we’re trying to get back-links from (i.e., my website, for example, is geared towards myspace customization … absolutely no tie whatsoever to this here blog).

    Now, amongst SEMs who actually know what they’re doing, it’s common knowledge that the only easy way to get back-links from .edu’s is to sit down and post relevant content to said .edu (either through a comment or by submitting an article or whatever). This would be impossible to catch with comment spam systems such as Akismet (you might want to check that out, by the way: http://akismet.com), and the relevancy of the content would ensure that the submission gets past any sort of moderation.
    I’m pretty confident that the comment you’re referring to in your article has been submitted by an actual person reading your blog — in the spirit of the technique I just mentioned, or perhaps with no interest other than to contribute his opinion.

    My conviction that it’s an actual person and not a spam bot comes from the fact that the majority of spammers are not sophisticated enough to employ semantic and contextual mechanisms to identify your blog’s topic, then scrape relevant and copyrighted content, run it through a Markov chain algorithm and post it. Then again, some of them are :)
    However, the question I would ask myself if I were in your position is “what do I want to delete this relevant content for?”. Indeed, some people may have posted that just to get a link, and it may or may not be spam that was generated through a complex contextual mechanism, but if it’s relevant, does it really matter?

    Also, now that I think of what I just told you in my comment here in contrast to my site’s topic, you may see this post as being just an attempt to get a link :)

  2. Ukwebco Says:

    I agree with the point that the majority of blog comment spam with plagiarized text is not intentional. So it would be unfair to blame spam bots for trying to spam blog comments, because they lack the sophisticated technology needed to identify a blog’s topic and post copyrighted content matching the theme. Since very few websites are present in niches that identify with the .edu’s, an actual commenter ignores the usual practice of posting and posts a relevant comment to the blog only after reading it, with the intention of getting backlinks.

  3. adam Says:

    google

  4. Adam Says:

    I agree with you that the comment was likely copied off the web, but it is not likely automated. It probably was manually submitted by a live person and was pasted in by the person posting. I have had several of my writings and articles appear on other web sites with people passing off my work (or portions thereof) as their own. It is a shame people are not original in their writings, and then must use other people’s writings to post on other sites to get attention.

  5. Sam Says:

    It is hard to tell if it was a bot or it was a human being that was behind the spamming. If I were to put my money on it, I would say it was a person. People will do anything these days to get more recognition for their website and it doesn’t matter who they walk over to get it. It is unfortunate that people have no respect for other people and their websites. Plagiarism is huge on the internet and I have emailed cease and desist orders to a few people who have copied me. What a shame.

  6. Tolga Çevik Says:

    Very thx…
    “It is hard to tell if it was a bot or it was a human being that was behind the spamming. If I were to put my money on it, I would say it was a person. People will do anything these days to get more recognition for their website and it doesn’t matter who they walk over to get it. It is unfortunate that people have no respect for other people and their websites. Plagiarism is huge on the internet and I have emailed cease and desist orders to a few people who have copied me. What a shame.”

    ++1

  7. Mert Says:

    This is like a strategy game to pass the moderation test without failure. I would also say this is a real person who tries to put up some quality text from the web that matches the topic as closely as possible. But plagiarism is another issue of the internet, as stated by other commenters here: you are stealing not only people’s hard work but their passion to write original stuff. Maybe someone could find a way to spot this kind of spam comment and make a plugin for public use. It would be a very nice job for programmers.

  8. Stephanie Says:

    I encounter this frequently with some of my websites and with articles I have written that allow comments and I do not own or post on any .edu domains.

    I’ve found that the only way I can combat the comment spam is by filtering responses before they ever make it to being posted.

    Stephanie

  9. Stephanie Says:

    By the way, I found it interesting that after my first comment posted it took me back up to the top of the screen, which showed this site’s Google ads. Two of the Google links were adult in nature, haha: Adult Blog and Bad Girls Blog. You may want to mention to whoever maintains the Google AdSense on this account, if it is not yourself, to disallow adult websites from advertising on a university webpage. Just a thought.

  10. MoiN Says:

    Nice read. I used to remove all those spam comments manually at first, then I got tired and started using some tools like Akismet and all. Mostly, the bots are doing all the spam comments and they learn new tricks every day. Like they’ll take a piece of the topic and say, “I couldn’t understand the last part” where there isn’t any last part! It’s a picture :]

    Now, to prevent spam perfectly, either moderate everything or let Akismet + manual work handle the spammers.

    MoiN

  11. Halim Says:

    Akismet is good for WordPress; I hope there is something similar for Blogspot. Do you believe there are people using software to create hundreds of comments to spam blogs?

  12. Enrico Says:

    I’ve the same spam problem too. Fortunately on blogspot I’ve never received spam ( I’m a lucky guy ;) ). You’re Akismet is a great solution for wordpress. But for blogspot? Is there a way to prevent spam? ( if it comes… )

  13. Blog comment spam magnet Says:

    [...] or comments. Here’s an example from today’s batch, a comment on a two-year old post Blog comment spam with plagiarized text: hard to spot from cameroun trying to promote the site africapresse.com. “spam is a real problem in this [...]

  14. LiveWire Says:

    I wasn’t going to reply to this as the post is quite old. However, I see the discussion is still ongoing. The spammers are tweaking their bots and utilizing advanced techniques to completely mask their true intentions. I run a social bookmarking website in which spammers have introduced a new program that runs through proxies, creates unlimited accounts, spams thousands of social bookmarking websites in one run, breaks captcha codes with ease, and generally ruins the entire community with irrelevant, completely blacklisted trash websites in a matter of hours. You would think the people with the talent to create such programs would use their skills on more productive projects that actually helped fix a problem, instead of creating new ones. One word: greed. They are charging almost $200 for this program, and the creators are going to collect quite a sum of money for ruining other people’s hard work.

  15. Gravity Says:

    I am using a challenge plugin for my WordPress to avoid spam comments; Akismet is not enough to block all the spam

  16. Frank Says:

    I still haven’t seen an answer to the question raised earlier: if it is relevant and is not inflammatory, do you not run the risk of deleting valid commentary? As to human or bot, there should be a new category of hubot which would include those folk of third world status who, while not bots are also not fully participatory in the forums (i.e., not human). As to duplicate checkers, all it takes is rewriting in a small way to beat them.

  17. Big Ben Patton Says:

    It seems that in the fight against spam, short pointless comments for backlinks and now on topic plagiarized comments we will all have to find ways that are unobtrusive to normal real readers, and at the same time devise a way that is better than captcha but still involves the need for a human element.

    If not then I guess I am doomed to answering what 4+4 is or if something is hot or cold, and what was that last one? Am I human? Are you serious?

    I think we need to hold a conference on this and put pens to paper and see what we can come up with. Captcha is busted and we need something better.

  18. Jay Says:

    At first I thought it looked kind of natural to me, but then the repetition of *RDF* and *Technology* made me think that they may have been the anchor keywords this bot was trying to gain popularity in. The scary thought is that these bots can be outsourced on places like rent a coder and built for about $100, probably by a person whose first language is not even English, and here we are debating if it’s real or automated. Kinda funny.

  19. Brad Blogging.com Says:

    I keep seeing comment spam that looks more and more like real comments. Soon the technology the spammers use will be so good, it will get by the best filters.

  20. diTesco Says:

    Definitely an interesting debate as to human or not, the act of these apparent spammers. I have seen blogs with so many comments that it is hard (except for the obvious – great post – nice post – where did you get this template, etc) to sometimes identify the good from the bad. Hard work, but comments can always be moderated, or trashed even after they have been published. Only filter I know that actually works.

  21. Pande Says:

    I think there’s a physical spammer behind that kind of comments. Bots aren’t that sophisticated…
    ;-)

  22. David Says:

    This is an interesting situation. We don’t know if the post was done by a person or a computer.
    Alan Turing said that real artificial intelligence exists only if the output from a computer is indistinguishable from a human response.
    Are we there yet?

  23. Sebastian Says:

    I had the same situation too. Sorry for my bad English, but I am German. All I could understand was interesting. Nice post, cu

  24. Daniel Tetreault Says:

    Hi there. I recently was reading a blog post on spamming. The writer mentioned your blog post and that’s how I came across this post. I too am a webmaster however I must disagree with this comment made by Gravity:

    ” … Gravity Says: August 5th, 2008 at 1:24 pm

    I am using a challenge plugin for my WordPress to avoid spam comments; Akismet is not enough to block all the spam
    …”

    I have been emailed by WordPress to moderate just three spam issues. About 98% of all my spam is caught automatically by Akismet. I love the Akismet WP plug-in!
    Daniel,
    Sidney, BC






