Fighting kleptotorial splogs

August 31st, 2006

Can CC licenses be used to fight splogs? Doc Searls blogs about kleptotorial splogs, a term I rather like, to describe splogs that use feed aggregation systems to appropriate posts to fill their splogs. This is done without permission and typically without attribution. One protection against this is adding a Creative Commons license (e.g., Attribution NonCommercial ShareAlike) to your blog. This gives you better standing to complain to the splogger’s hosting site, ISP and/or ad broker.

UMBC #3 in 2004 CS degrees produced

August 30th, 2006

NSF collects data on Science and Engineering education and research and makes it available through their WebCASPAR system. We’ve used this to generate statistics on Computer Science degree production. In 1998 we generated data that showed that UMBC was producing more CS degrees than other US peer institutions in the mid-1990s. We’ve just updated the figures, which show that among US research universities in 2004, UMBC ranks number three for Computer Science degrees produced.

Flickr geotagging is a big success

August 30th, 2006

Flickr’s new geotagging feature is a big success, with more than 1.6 million photos tagged in the first day. Geotagging a picture is easy: just drag it onto a Yahoo! map. The interface is well done. The search interface is also convenient; check out this search for boat tagged photos in Annapolis MD. Click on one of the numbered spots to see the photos. There is also a map for all of the geotagged photos for each Flickr account, like this one for our ebiquity photos. What’s even better is that all of this is available through Flickr’s API.

The downside is that we’ll have to go back and geotag all of our old photos.

Two Semantic Web visions: meaningful data or meaningful text?

August 29th, 2006

One of the minor frustrations about working on the Semantic Web is that most people imagine the phrase to mean something other than the idea we’re working toward. The W3C vision is a Web of data published in RDF, supported by RDF and OWL ontologies, and accessed by programs and agents. Scratch a random Web user, however, and you are likely to find someone who imagines a Google-like system that understands the full meaning of both the queries people type in and the conventional Web pages it indexes. Such a system would be wonderful, but it’s an ambitious project that will take decades or even generations to fully achieve. But maybe we can already build something that understands some of the meaning of natural language text in queries and on Web pages.

Hakia is yet another startup trying to build a better search engine that tries to capture a bit more of the meaning.

“hakia is building the Web’s first “meaning-based” search engine, one that will bring answers and meaningful results to questions on any topic. To achieve this goal, hakia uses a proprietary semantic system instead of a conventional index. Meaning-based search will appeal to all Web searchers – especially those engaged in research on complicated subjects, such as medicine, law, finance, science, and literature. hakia engine is designed to evolve and improve its capabilities with each user interaction and by crawling the Web.”

Is it any good? Well, some of the queries I tried showed some promise but often typing the same query into Google did just as well.

  • Q: is tylenol linked to heart problems? A: hakia, Google
  • Q: how long should shrimp be boiled, A: hakia, Google
  • Q: does ruby have mutable strings? A: hakia, Google
  • Q: is it legal to rip copy protected cds, A: hakia, Google
  • Q: What US zip code has the largest population of canadian citizens, A: hakia,

A system like hakia, if gotten right, would certainly be useful, but is not likely to meet the goals of the W3C’s Semantic Web vision in my lifetime.

Want a bite of Apple?

August 27th, 2006

Apple stock was trading in the 50s a month ago and is now approaching the 70s. In the past couple of months it has gained 30%. Apple reported increased sales of its MacBook series.
Apple has stepped up their ad campaigns, and these Apple videos are all the rage.

Now it is back-to-school season, and a quick look at Amazon’s bestsellers list shows that more than half the entries are Apple MacBooks or iMacs. MacBooks are definitely hot, wanna bite this AAPL ;) ?

Buy low and spam high, it works!!

August 25th, 2006

I was led to this amazing story by a report on NPR today (Penny-Stock Spam Yields Profits — for Some) and then a BBC article (Spammers manipulate stock markets). Researchers Laura Frieder and Jonathan Zittrain have analyzed the effectiveness of email spam touting penny stocks. You know the type:

From: Stock Gurus AA <>
Subject: Sleeper Stock Alert (GFPE)

*** EARLY BIRD INVESTOR ***

GFPE Huge Advertising Campaign That will run all week.. watch this stock move move move.. get in while you can at a price that is affordable and double mabye even triple your investment in a matter of 24/72 hrs. this is a once and a lifetime oppertunity.  this stock has some awsome news being released later tonight and is definatly not to be missed!!…

The authors estimate that about 15% of current email spam messages are stock touts.

Their recent article

Frieder, Laura and Zittrain, Jonathan, “Spam Works: Evidence from Stock Touts and Corresponding Market Activity” (July 25, 2006).

describes their analysis of more than 75,000 messages sent between January 2004 and July 2005. The results show that such spam can affect markets. Spammers who buy low-priced stock, flood the internet with spam touting it, and then immediately sell typically achieve a return of between 4.9% and 6%. The suckers who buy the stock drive up the price, only to see it drop by up to 8% as the spammers cash in.

Their raw data and interactive charts showing price and volume changes for individual touted and control stocks are available online.

How to get more comment spam

August 25th, 2006

Mark at Weblog Tools Collection makes this interesting observation about how the WordPress default first post is a spam magnet.

I monitor search engine hits for my various blogs and over the past couple of weeks, the predominant magnet for search hits on new blogs and consequently comment spam (attempts, thanks to Akismet) has been the results from the search linked above. It is strange to watch an IP visit a blog on my server from that search result and instantly access wp-comments-post.php. If you recently started a WordPress blog or are planning to start a new one, I believe that you can reduce some of the deluge by removing the default post that says “This is your first post. Edit or delete it, then start blogging!”.

It reminds me of our own experiments with splog bait.

Finding Feeds That Matter on the Blogosphere

August 24th, 2006

There are a whole LOT of blogs out there! As the blogosphere gets larger, it is becoming increasingly difficult to find good feeds. “FTM! Feeds That Matter” is a prototype service that was developed out of a need to find interesting feeds on a topic.

The system was built by analyzing the publicly listed Bloglines feed subscriptions. For details, see Feeds That Matter: A Study of Bloglines Subscriptions and some of the related posts here and here. Using the subscription information, we describe techniques to induce an intuitive set of topics for feeds and blogs. These topic categories, and their associated feeds, are key to a number of blog-related applications, including the compilation of a list of feeds that matter for a given topic. The FTM! site was implemented to help users browse and subscribe to an automatically generated catalog of popular feeds for different topics.

Wikipedia to experiment with trust

August 24th, 2006

The German version of Wikipedia will implement a simple trust-based system to improve quality and combat vandalism, according to a CNET article based on an interview with Wikipedia founder Jimmy Wales.

Can German engineering fix Wikipedia?, Daniel Terdiman, 8/23/06

“An experimental feature planned for the German version of Wikipedia could eventually improve the quality of editing for the online encyclopedia and open its front page to public edits for the first time in years.

As always, anyone will be able to make article edits. But it would take someone who has been around Wikipedia for some yet-to-be-determined period of time–and who, therefore, has passed a threshold of trustworthiness–to make the edits live on the public site. If someone vandalizes an article, the edits would not be approved.”

While this scheme is overly simplistic, it’s a good start. I imagine that the policy could evolve and improve rapidly once it is put into place and people try to get around it. There is room to game any trust-based approach, of course, but having a well-designed one could help a lot. Wikipedia would be a great testbed for trying out the many ideas for computational trust that researchers are generating (e.g., see Investigations into Trust for Collaborative Information Repositories: A Wikipedia Case Study). Perhaps different segments of Wikipedia, either by language or topic, could use different mechanisms, providing a way to find out which ones work best in practice.
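To make the quoted proposal concrete, here is a minimal sketch of a tenure-gated approval queue: anyone can submit an edit, but it only goes live when a sufficiently long-standing user approves it. This is only an illustration of the idea as described in the article, not Wikipedia's actual design. The names User, Article and MIN_TENURE are hypothetical, and the 30-day threshold is an arbitrary placeholder, since the article says the real period is yet to be determined.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Hypothetical threshold; the article says the real period is undecided.
MIN_TENURE = timedelta(days=30)


@dataclass
class User:
    name: str
    joined: date

    def is_trusted(self, today: date) -> bool:
        # Trust here is nothing more than time since joining.
        return today - self.joined >= MIN_TENURE


@dataclass
class Article:
    live_text: str
    pending: list = field(default_factory=list)

    def submit_edit(self, author: User, new_text: str) -> None:
        # Anyone may submit; the edit waits in a queue and is not yet public.
        self.pending.append((author.name, new_text))

    def approve(self, reviewer: User, index: int, today: date) -> bool:
        # Only a trusted reviewer can make a pending edit live.
        if not reviewer.is_trusted(today):
            return False
        _, new_text = self.pending.pop(index)
        self.live_text = new_text
        return True


today = date(2006, 8, 24)
veteran = User("veteran", date(2005, 1, 1))
newcomer = User("newcomer", date(2006, 8, 20))

article = Article("Original text")
article.submit_edit(newcomer, "Improved text")
```

A scheme this simple is easy to game (just wait out the tenure period), which is why swapping is_trusted for a richer computational trust measure would be the interesting experiment.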

Proposed: a consortium to address search query privacy policy

August 23rd, 2006

Privacy advocate Lauren Weinstein has proposed the creation of a high-level working group/consortium to address fundamental aspects of the technology and policy issues surrounding search query privacy and related topics.

“Participation by all stakeholders would be invited. Representatives of the major search engine firms and concerned government agencies, outside technologists and other persons involved in privacy and search issues, and other entities as appropriate would all play important roles.”

This sounds like a good idea, although I suspect that it may be hard to delineate search query privacy from related issues.

100 most common RDF namespaces

August 23rd, 2006

Here is some data we collected from Swoogle on 22 August 2006 for Frederick Giasson’s Ping the Semantic Web project. This table shows the 100 most common RDF namespaces measured by the number of Semantic Web Documents (SWDs) that use them. For each one, we give the most common abbreviation, the percent of SWDs using the namespace that use the most common abbreviation, and the number of SWDs using the namespace as an absolute number and as a percent of all SWDs. You can download a larger table as a spreadsheet here.
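For readers who want to compute the same columns over their own collection, here is a rough sketch of the counting involved. It is not the code Swoogle uses. It assumes each document has already been reduced to a prefix-to-namespace mapping, and the function name namespace_stats and the four sample documents are made up for illustration.

```python
from collections import Counter, defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
DC = "http://purl.org/dc/elements/1.1/"

# Made-up sample: one prefix -> namespace mapping per document.
sample_docs = [
    {"rdf": RDF, "foaf": FOAF},
    {"rdf": RDF, "foaf": FOAF, "dc": DC},
    {"rdf": RDF, "dc": DC},
    {"r": RDF},
]


def namespace_stats(documents, top_n=100):
    """Rank namespaces by the number of documents that use them."""
    total = len(documents)
    doc_counts = Counter()                # namespace -> number of documents
    prefix_counts = defaultdict(Counter)  # namespace -> prefix -> documents

    for prefix_map in documents:
        # Count each namespace once per document, under the first
        # prefix the document binds it to.
        first_prefix = {}
        for prefix, namespace in prefix_map.items():
            first_prefix.setdefault(namespace, prefix)
        for namespace, prefix in first_prefix.items():
            doc_counts[namespace] += 1
            prefix_counts[namespace][prefix] += 1

    rows = []
    for namespace, n_docs in doc_counts.most_common(top_n):
        prefix, n_prefix = prefix_counts[namespace].most_common(1)[0]
        rows.append({
            "namespace": namespace,
            "prefix": prefix,                           # most common abbreviation
            "prefix_pct": 100.0 * n_prefix / n_docs,    # share using that abbreviation
            "docs": n_docs,                             # absolute document count
            "docs_pct": 100.0 * n_docs / total,         # share of all documents
        })
    return rows


rows = namespace_stats(sample_docs)
for row in rows:
    print(row)
```

Counting a namespace once per document, rather than once per occurrence, matches the "number of SWDs that use them" measure described above.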

VCs see opportunities in Blogosphere

August 22nd, 2006

An article, VCs see opportunity in blogosphere (also carried by CNET), reports on VCs investing in blog sites. Note that we are not talking about investments in Blogosphere infrastructure, but in content-producing blogging sites.

I liked this quote from the article:

“We think the news of the future will look like The Huffington Post. It includes breaking news, instant commentary, blogs and community, with a comments section that can be almost like a miniblog.”

from Eric Hippeau, a managing partner of SoftBank Capital. The article points out that HuffingtonPost is already quite popular:

“Statistics from’s Alexa Internet show that HuffingtonPost is nearing Salon and Slate in Web traffic.”

Of course, what’s new and exciting about the Blogosphere is that it gives us all a voice, or at least a chance for a voice. If it becomes dominated and controlled by mega, moneyed blogs like Huffington or Gawker’s blogs, then we’ve lost a good thing.