Archive for August, 2006
August 10th, 2006, by Tim Finin, posted in GENERAL
An easy way to boost a Web page’s rank is to leave comments on blog posts with a link back to your page. Popular blogging systems let the commenter associate a link back to her Web page even when links in the comment body itself are disallowed, as they often are. So a typical spam comment has, in addition to the commenter’s name and link, some text. This text comes in many varieties, including:
- Obvious come-ons: Free ringtones!
- Generic statements: Looking for information and found it at this great site.
- Markovian delights: Uruguay’a sunlit straightened avers compellingly levelly
- Random letters: sherthul axrteg hioqurtch
Today I noticed that someone tried to post a comment on an ebiquity post on large RDF documents:
Name: Thommes | E-mail: info@linkfeed.de | URI: http://www.linkfeed.de | IP: 145.254.226.204 | Date: 8/9/2006
The future for RDF for learning technology specifications is bright, and the possibilities opened up by RDF and Semantic Web technologies promise to take learning technology project to a new level of applications.
It’s quite relevant to our post, but not to the commenter’s Web site. Google reveals that the comment text was plagiarized from the Web. I have not been able to reconstruct the strategy that the spammer or his bot must have used. Does it start with a random sentence selected from the web and then use a feed or blog search engine to find posts to comment on? Or does it find a post to comment on and then search on the post’s title to find a Web page from which to plagiarize a sentence?
In any case, this kind of comment spam is going to be hard to recognize automatically. Catching it requires correctly identifying the commenter’s link as pointing to a Web spam page. For me, verifying that the comment text was plagiarized was what confirmed that the target was spam.
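The plagiarism check described above can be roughly automated. The following is only a minimal sketch, assuming a candidate source page has already been found (for example with a Web search, which is not shown). The function names `shingles` and `containment` and the sample strings are illustrative, not part of any existing spam filter.

```python
import re

def shingles(text, n=5):
    """Return the set of lowercase word n-grams (shingles) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(comment, page, n=5):
    """Fraction of the comment's shingles that also occur in the page."""
    comment_shingles = shingles(comment, n)
    if not comment_shingles:
        return 0.0
    return len(comment_shingles & shingles(page, n)) / len(comment_shingles)

# Hypothetical inputs: a comment, a page that contains it, and an unrelated page.
comment = "The future for RDF for learning technology specifications is bright"
page = ("Intro text. The future for RDF for learning technology "
        "specifications is bright, and more follows here.")
unrelated = "Free ringtones and other obvious come-ons fill most spam comments."

print(containment(comment, page))       # high when the comment was copied
print(containment(comment, unrelated))  # low when the texts share nothing
```

A high containment score only says the text was copied from somewhere. Deciding that the commenter’s link points to a spam page is a separate and harder problem.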
August 10th, 2006, by Tim Finin, posted in Uncategorized
Harry Chen writes about Gartner’s 2006 Hype Cycle for Emerging Technologies.
In July 2006, Gartner published a new report on the hype cycle of emerging technologies. Last year, Gartner published a similar report. Among those technologies mentioned in 2005’s report, Corporate Semantic Web, mesh sensor networks, and location-aware applications are the few that also appeared in this year’s report.
Harry notes that technologies newly mentioned in 2006 include Ajax, Web 2.0, folksonomies, social network analysis, offline Ajax, and Wiki.
August 9th, 2006, by Tim Finin, posted in Uncategorized
The IEEE FIPA organization will hold a regional meeting in conjunction with the Tenth International Workshop on Cooperative Information Agents (CIA 2006) at the University of Edinburgh, UK. The meeting will be held on 13 September 2006. You can register for CIA 2006 and/or the FIPA meeting. For more information, contact Stefan Poslad, the EU Chair of IEEE FIPA.
FIPA, the Foundation for Intelligent Physical Agents, is an IEEE Computer Society standards organization that promotes agent-based technology and the interoperability of its standards with other technologies. FIPA maintains a mature and implemented set of standards for communication languages, protocols and infrastructure for multiagent systems.
August 8th, 2006, by Tim Finin, posted in Uncategorized
AAAI-06 featured a successful special technical track on AI and the Web. The track committee selected 30 papers from the 106 submissions, 20 for oral presentations and 11 as posters. AAAI-07 will be held 22-26 July 2007 in Vancouver and will again include a track on AI and the Web. The deadlines for submitting abstracts and papers are February 1 and February 6, 2007, respectively.
The track invites research papers on AI techniques, systems and concepts involving or applied to the Web. Papers should either describe Web related research or clearly explain how the work addresses problems, opportunities or issues underlying the Web or Web-based systems. For details, see the AAAI-07 AI and the Web track Web site.
August 8th, 2006, by Tim Finin, posted in Uncategorized
Jonathan Shewchuk has a great page on giving an academic talk that is full of sound advice. He says:
“This is a sample of my opinions on how to give a talk (using slides or transparencies) in computer science, concisely distilled for my students and students attending Graphics Lunch. Most of these thoughts are based on my going to conferences and seeing the same mistakes repeated by a plurality of speakers. You are welcome to disagree with my opinions, as long as you think each issue through for yourself. The only sin is to make a choice without knowing you are making one.”
His page covers both preparation and delivery. I’d add to that a section on what you should do after you give your talk, including putting your slides online and making it easy for people to find them. I put together some thoughts on this and related topics for increasing research visibility on the Web for our graduate course on Basic Research Skills.
August 7th, 2006, by Tim Finin, posted in Uncategorized
Wikimapia has made it easy to add a Google map with Wikimapia data to any Web page. Their blog describes the process:
- Go to WikiMapia.org site and find a part of the map that you need.
- Click WikiMapia at the top right corner and choose map on your page link.
- Move and resize frame you see to desired view, adjust view setting if needed and copy given html code to your page or blog.
Here’s the area around UMBC.
August 7th, 2006, by Tim Finin, posted in Uncategorized
David Sifry’s latest quarterly State of the Blogosphere report has some comments on splogs:
“What we have found, after lots of analysis and spam elimination, is that we see about 8% of new blogs that get past our filters and make it into the index, even if it is only for a few hours or days.” … “About 70% of the pings Technorati receives are from known spam sources, for example, but we’re able to drop them before we even send out a spider to go and index the splog.”
This matches our own figures, although the splogs that go unrecognized contribute more than their share of posts. One reason why there are so many splog pings (spings) is that many splogs post every hour or even every five minutes. Apparently, splogger greed knows no bounds.
Other high points from his excellent and useful report are:
- The Blogosphere is over 100 times bigger than it was just 3 years ago.
- The blogosphere is currently doubling in size every 200 days. This is somewhat slower than past growth rates.
- As of July 2006, about 175,000 new weblogs were created each day.
- The total posting volume is up to ~1.6M postings per day, about double that of a year ago.
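The first two figures in the list can be checked against each other with a little arithmetic. This is a minimal sketch that assumes steady exponential growth and a 365-day year; the helper names are made up for illustration.

```python
import math

DAYS_PER_YEAR = 365  # assumption: leap days ignored

def growth_factor(days, doubling_days):
    """How many times larger something gets if it doubles every doubling_days."""
    return 2 ** (days / doubling_days)

def doubling_time(days, factor):
    """Doubling period implied by growing factor-fold over the given days."""
    return days / math.log2(factor)

three_years = 3 * DAYS_PER_YEAR

# Growth over three years at the current 200-day doubling rate.
print(round(growth_factor(three_years, 200), 1))

# Doubling period implied by 100-fold growth over three years.
print(round(doubling_time(three_years, 100)))
```

Under these assumptions, a 200-day doubling time gives roughly 44-fold growth over three years, while 100-fold growth over three years corresponds to doubling about every 165 days. That is consistent with the report’s remark that the current rate is somewhat slower than past growth.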
August 6th, 2006, by Tim Finin, posted in Uncategorized
Wikipedia is hot. Hot, hot, hot!
The New Yorker published a good article on Wikipedia, Know it all, Can Wikipedia conquer expertise? by Stacy Schiff two weeks ago. It contains a lot of good information, much of it new to me, and puts Wikipedia in a larger context.
I heard two interesting papers at AAAI 2006 on using Wikipedia as a knowledge source. It’s been very popular recently to use search engine access to the Web as a giant brain. The two Wikipedia papers showed that using just the Wikipedia subset of the Web gave better results.
Denny Vrandecic called Wikimania 2006 “Maybe the hottest conference ever” and really made me wish I had attended.
Last week I happened across Wikimapia, a “resource that combines Google Maps with a wiki system, allowing users to add information (in the form of a note) to any location on the globe”. It looks like the map is also populated with geo tagged objects extracted from Wikipedia. See the area around UMBC as an example.
August 6th, 2006, by Tim Finin, posted in Uncategorized
At the end of last week, AOL Research announced that it was releasing for research purposes several datasets from its search engine, including query streams for 500K users over three months. Adam D’Angelo points out that this could compromise the privacy of AOL users. The data has been anonymized, of course, by replacing user ids within a query session with a unique number. But some query streams might contain enough information to allow someone to make a good guess at the user’s identity. This sort of query data is one of the things that Google refused to provide the Department of Justice last Spring. On the other hand, Microsoft was offering researchers similar query data earlier this year. I think it’s a close call.
Update: As of 10:00pm Sunday night, the query stream data link is no longer there.
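To see why replacing user ids with numbers may not be enough, here is a minimal sketch of that kind of anonymization. It is not AOL’s actual procedure, which is not described here; the log entries, the e-mail style ids, and the `pseudonymize` function are all invented for illustration.

```python
import itertools
from collections import defaultdict

def pseudonymize(log):
    """Replace each user id with an arbitrary but consistent number."""
    counter = itertools.count(1)
    ids = defaultdict(lambda: next(counter))  # first-seen user gets 1, next 2, ...
    return [(ids[user], query) for user, query in log]

# Hypothetical query log of (user id, query) pairs.
log = [
    ("alice@example.com", "pizza near hypothetical street"),
    ("bob@example.com", "rdf parsers"),
    ("alice@example.com", "alice example family reunion"),
]

released = pseudonymize(log)

# The original ids are gone, but each user's queries are still linked together.
by_user = defaultdict(list)
for uid, query in released:
    by_user[uid].append(query)

for uid, queries in by_user.items():
    print(uid, queries)
```

Because all of one person’s queries stay grouped under the same number, a stream that mentions a name, a neighborhood, or other personal details can still point back to an individual.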
August 5th, 2006, by Tim Finin, posted in Uncategorized
AOL Research has released some interesting data collections, including:
- 20K hand labeled, classified queries
- 3.5M web Q/A queries (who, what, where, when …)
- Query streams for 500K users over three months (20M queries)
- Query arrival rates for queuing analysis
- 2M queries against US Government domains
Additional datasets are promised in the future.
A paper describing some measurements over this (or related?) data is available: A Picture of Search by G. Pass, A. Chowdhury and C. Torgeson.
August 5th, 2006, by Tim Finin, posted in Uncategorized
Yesterday we posted about a message from Google about our ads. We had guessed that our page on click fraud had triggered an automated bot looking for AdSense policy violations because it contained too many instances of the word click. We heard back from the Google AdSense team later that same day and our interpretation of the problem was wrong.
We found the following language above the Google ads on the page, which we feel brings unnatural attention to the ads: “These ads provide our students with coffee, an essential component of a healthy research laboratory”
We had added that comment because we felt vaguely guilty about putting ads on our academic site. We didn’t want people to think we were raking in large sums and spending them on luxuries, like donuts or tee-shirts. This modest income stream goes entirely toward a single, absolute essential: coffee.
August 5th, 2006, by Tim Finin, posted in Uncategorized
Google announced that it will share an enormous word n-gram dataset culled from a training corpus of one trillion words from public Web pages. In the 1T words of text, Google found 1.1B five-word sequences that appear at least 40 times and 13.8M words that appear at least 200 times. The dataset will be distributed by the Linguistic Data Consortium.
Google describes its motivation as follows:
“We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together.”
They are right — this will be a great resource that will help advance our capabilities in many language-oriented tasks and some related knowledge management problems. I guess it’s time for us to invest in more disk space…
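For readers who have not worked with n-gram data, this small sketch shows the idea behind such a dataset: count every word sequence of length n and keep only those that meet a frequency threshold. It is a toy in-memory version with made-up function names, not Google’s pipeline, which had to handle a trillion words.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def frequent(counts, min_count):
    """Keep only the n-grams that appear at least min_count times."""
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy corpus; the real dataset used thresholds of 40 and 200, not 2.
text = "to be or not to be that is the question to be or not"
tokens = text.split()

bigrams = frequent(ngram_counts(tokens, 2), min_count=2)
for gram, count in sorted(bigrams.items(), key=lambda item: -item[1]):
    print(" ".join(gram), count)
```

At Web scale the counts no longer fit in one machine’s memory, which is part of why a precomputed, shared dataset is so useful to smaller research groups.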