EBB upgraded to 2.0.4

July 31st, 2006

Thanks to Harry Chen and especially to Filip Perich, the ebiquity blog has been upgraded to WordPress 2.0.4. You won’t see many changes, but the new release fixes some serious security problems and many minor bugs. Both Harry and Filip completed their Ph.D. degrees at UMBC in 2004 and have continued to be involved, both with UMBC and the ebiquity group.

How the W3C geo vocabulary is used

July 27th, 2006

The geo vocabulary (wgs84_pos) is a very basic RDF vocabulary developed by the W3C to provide the Semantic Web community with a namespace for representing latitude, longitude and other information about spatially-located things. Geo is currently the 10th highest ranked vocabulary according to Swoogle’s special ontology ranking algorithm. Last month the W3C created a Geospatial Incubator Group to begin addressing issues of location and geographical properties of resources, starting with a study of geo. Dan Brickley asked if we could use Swoogle to collect some data on how the geo vocabulary is being used. See this note for our attempt to answer his questions.
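
For readers who have not seen the vocabulary, here is a minimal sketch of what a geo description looks like. It assumes the standard namespace http://www.w3.org/2003/01/geo/wgs84_pos# and its Point, lat and long terms; the helper name geo_point_turtle, the example.org URI and the coordinates are made up for illustration and are not taken from the Swoogle data.

```python
# Namespace of the W3C Basic Geo (WGS84 lat/long) vocabulary.
GEO_NS = "http://www.w3.org/2003/01/geo/wgs84_pos#"


def geo_point_turtle(subject_uri, lat, long):
    """Return a Turtle snippet describing subject_uri as a geo:Point.

    Hypothetical helper for illustration only; it builds the text by hand
    rather than using an RDF library.
    """
    if not -90 <= lat <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= long <= 180:
        raise ValueError("longitude must be between -180 and 180")
    return (
        f"@prefix geo: <{GEO_NS}> .\n\n"
        f"<{subject_uri}> a geo:Point ;\n"
        f'    geo:lat "{lat}" ;\n'
        f'    geo:long "{long}" .\n'
    )


# Illustrative values only: a made-up URI and approximate coordinates.
snippet = geo_point_turtle("http://example.org/places/campus", 39.2555, -76.7113)
print(snippet)
```

The resulting triples say that the resource is a geo:Point with a latitude and a longitude, which is essentially all the vocabulary is meant to express.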

Revealed: how Google manages click fraud

July 25th, 2006

In February 2005, Google, Yahoo, and Time Warner were sued by Lane’s Gifts & Collectibles in a class-action lawsuit over click fraud. The company alleged that Google and the other companies had been improperly billing for pay-per-click ads that were not viewed by legitimate potential customers. The case was settled earlier this year, and as part of the settlement Google agreed to have an independent expert examine their click fraud detection methods, policies, and procedures and determine whether they were reasonable measures to protect advertisers. The expert was Alexander Tuzhilin, a Professor of Information Systems at NYU. A summary of his report has been posted on Google’s official blog:

“The bottom-line conclusion of the report is that Google’s efforts against click fraud are in fact reasonable. At several points in his report, he calls out the quality of our inspection systems and notes their constant improvement. It is an independent report, so not surprisingly there are other aspects of it with which we don’t fully agree. But overall it is a validation of what we have said for some time about our work against invalid clicks.”

The full report contains lots of interesting information on Google’s approach to dealing with click fraud. Here’s the high-level description of the approach:

“Google has built the following four ‘lines of defense’ for detecting invalid clicks: pre-filtering, online filtering, automated offline detection and manual offline detection, in that order. Google deploys different detection methods in each of these stages: the rule-based and anomaly-based approaches in the pre-filtering and the filtering stages, the combination of all the three approaches in the automated offline detection stage, and the anomaly-based approach in the offline manual inspection stage. This deployment of different methods in different stages gives Google an opportunity to detect invalid clicks using alternative techniques and thus increases their chances of detecting more invalid clicks in one of these stages, preferably proactively in the early stages.”

The report also has a good overview of the pay-per-click model and Google’s AdSense program. Professor Tuzhilin concluded that the basic approach is sound and that the thresholds are set entirely by the engineering team, with no input from financial officers. Here’s his bottom line:

“In summary, I have been asked to evaluate Google’s invalid click detection efforts and to conclude whether these efforts are reasonable or not. Based on my evaluation, I conclude that Google’s efforts to combat click fraud are reasonable.”

If you don’t have time to read the full 47-page report, the Search Engine Watch blog has summarized some of the most interesting findings.

While this report is very interesting to the technical community, not all of the plaintiffs are happy with the $90M settlement and some are fighting it. In a related development, CNET reports that Google has announced it will show advertisers the number of invalid clicks on their ads starting Tuesday.

Full disclosure department: the UMBC ebiquity site carries AdSense ads and uses the resulting income to support Mr. Capresso.

Hackers demo cloning implanted RFID chip

July 25th, 2006

Annalee Newitz and Jonathan Westhues demonstrated how easy it is to clone an implanted RFID chip at the HOPE number 6 Conference in New York last week. Newitz had a VeriChip RFID chip implanted in her arm and Westhues read the tag’s unique ID with a portable reader hooked up to his laptop, which recorded the ID. It’s easy to program a new tag with that unique ID, effectively cloning the original tag. See this Reuters article for more information.

(Spotted on engadget)

Tagging surgical equipment with RFID

July 25th, 2006

Several stories appeared this month on a study published in the journal Archives of Surgery on the use of RFID tags to detect surgical items left behind in patients. For example, from the LA Times:

No sponge left behind
It’s rare, but it happens: surgical staff lose track of an item in a patient. But a new scan could eliminate such gaffes.
By Susan Brink, Times Staff Writer July 24, 2006

A new study holds the promise that technology will soon help doctors and nurses, with the wave of a wand, make sure they have taken everything out of a patient’s surgical cavity that they brought in — including the gauze pads that may not show up on a post-surgical X-ray.

In the study, published last week in the journal Archives of Surgery, eight patients undergoing abdominal or pelvic surgery at Stanford University School of Medicine agreed to have surgeons use gauze sponges tagged with radio-frequency identification chips during their procedures. After the operation was complete, and before the patient’s wound was closed, one surgeon turned away while another placed a tagged sponge inside the cavity.

When the first surgeon then passed a hand-held, wand-like scanning device over each patient, he or she could correctly pinpoint the location of the tagged sponge left behind. Within three seconds, the sponges were found and removed.

This was one of the use cases we had for a DOD-sponsored research project. We encountered two issues: tag size and reading problems. You need tiny passive tags to make this work on the small items that might be left inside a patient, but smaller tags have smaller antennas and are harder to read. A second problem is that RFID signals are attenuated or blocked by metal and liquids. This matters because the surgical equipment that might be tagged includes a lot of metallic items and, to a first approximation, human bodies are bags of water. I’m not sure if the current generation of RFID is going to provide a general solution for this use case. These kinds of problems are a motivation for developing new tagging technologies like RuBee.

(Spotted on Schneier on Security)

Invisible phone or invisible friend?

July 24th, 2006

Modern life has problems with which our ancestors didn’t have to contend, like how not to look crazy when using your Bluetooth headset in public. Darragh Johnson’s article A Tough Call: Invisible Phone Or Invisible Friend in Sunday’s Washington Post discusses it at length.

Google and the Semantic Web

July 23rd, 2006

Tim Berners-Lee’s keynote talk at AAAI last Tuesday generated a lot of interest in the Semantic Web. The next day, many people visited our demonstrations of several Semantic Web related projects, including Spire, Swoogle and Semnews, to find out more. Based on my conversations with people at the demonstrations and more generally at the conference, I am surprised at how many of the AAAI attendees knew relatively little about the Semantic Web.

Of course, many articles seized on the questions that Peter Norvig raised at the end of Tim’s talk. Editors write headlines, not reporters, and many tried to frame the stories as Google challenges Web inventor. The funniest post I saw referred to Peter as a “Google suit”. While he is a senior executive at Google, has anyone ever seen him wearing a suit? His normal dress is a Hawaiian shirt, which is what he wore at AAAI.

Peter’s questions to Tim were reasonable ones from the perspective of a company with an established Web business. They are also easily answered. Unfortunately, there wasn’t enough time after the keynote talk for any discussion. My own answers to Peter’s three questions would have been along these lines.

  1. Yes, the technologies needed to support the Semantic Web are complex and new to most of us. Some aspects (e.g., parts of OWL) may be too much for the near term. However, most of the technology is no more complex than that which supports the current web, e.g., relational databases, Web servers with php, servlets, etc., web clients with applets and javascript, etc. As the software matures and people become more familiar with the systems, it will be easily managed.
  2. There is always a struggle to overcome proprietary resistance and get new standards adopted. I rather liked Tim’s answer to this: that this resistance will erode bit by bit as one competitor after another gives up pieces of its own proprietary stance. He gave some good examples from the evolution of data sharing on the Web in the 90s.
  3. Google already uses largely automated techniques to identify and deal with Web spam, email spam in gmail, click fraud, etc. We won’t begin by using completely automated techniques to process and make decisions based on data found on the Semantic Web and will be able to develop partly automated systems to decide what data can and should be trusted and by how much.

Each of these areas requires research and exploration and it is going on in the Semantic Web community and, I suspect, within Google, in one form or another.

Semantic Web terms: defined and used

July 17th, 2006

The 1.6M Semantic Web documents that Swoogle has discovered on the Web include about 10,000 ‘ontologies’ that define one or more terms. These Semantic Web ontologies define a total of 1,576,927 named terms (RDF classes or properties). Most of these have never been directly used to encode data. We consider a class to be directly used if it has at least one immediate instance and a property to have been directly used (or populated) if it has been used in a triple to assert a value for an RDF instance. We consider a term to be directly defined if it is the subject of a triple that asserts definitional properties, such as subclass for a class or range for a property.

Analyzing Swoogle’s metadata on terms shows some interesting things. First, there are more than a few terms that appear on the Web as both a class and a property. Second, we can look at the distribution of terms across the four categories based on whether or not they’ve been defined and used. We can summarize the results in the following table.

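[Table not recovered: counts of classes and properties by whether they have been defined and directly used. Based on data from Swoogle on 7/16/06.]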

Here are some categories we can identify:

  • The green cells are what we might consider ideal, including classes and properties that are both defined and have been directly used to encode data.
  • The pink cells are mostly errors: terms that have been defined/used as both a class and a property.
  • The yellow cells are terms that have been defined but not directly used to encode any data. The majority are terms that we believe were intended to be used to describe instances and data, but never were, for one reason or another. Many ontologies have been created but never really used. Some ontologies have been extensively used to describe data, but not all of the terms have turned out to be useful. WordNet may deserve special mention. Several of the encodings represent each lexical entry as a class and most have not been used to create instances.
  • The blue cells are terms that have not been explicitly defined but have been used. Note that this is much more common for properties than for classes. It’s common to attach a class somewhere in some taxonomy, but many people will invent and use new properties without providing any definitional features (e.g., domain or range).
  • The gray cell represents terms that have neither been defined nor used to encode data. While this sounds strange and may reflect problematic terms, the group includes some ordinary terms that represent XML datatypes. These are often used as the value of a domain or range assertion and show up in Swoogle as terms.
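
The definitions of “directly defined” and “directly used” given earlier can be sketched in a few lines of Python. This is only an illustration of the idea, not how Swoogle computes its statistics: the function name classify_terms, the choice of which properties count as definitional, and the sample triples with their ex: terms are all assumptions made for the example.

```python
RDF_TYPE = "rdf:type"

# Assumption: an illustrative set of definitional properties.
DEFINITIONAL_PROPS = {"rdfs:subClassOf", "rdfs:subPropertyOf",
                      "rdfs:domain", "rdfs:range"}

# Assumption: typing a term with one of these also counts as defining it.
META_CLASSES = {"rdfs:Class", "owl:Class", "rdf:Property",
                "owl:ObjectProperty", "owl:DatatypeProperty"}


def classify_terms(triples):
    """Map each term to one of four defined/used categories.

    Simplification: the RDF/RDFS/OWL language terms themselves
    (rdf:type, owl:Class, ...) are not counted as terms.
    """
    defined, used, seen = set(), set(), set()
    for subj, pred, obj in triples:
        if pred in DEFINITIONAL_PROPS:
            # The subject is being defined; the object (e.g. a range
            # datatype) is a term that is merely mentioned.
            defined.add(subj)
            seen.update((subj, obj))
        elif pred == RDF_TYPE and obj in META_CLASSES:
            defined.add(subj)
            seen.add(subj)
        elif pred == RDF_TYPE:
            # A class is directly used if it has an immediate instance.
            used.add(obj)
            seen.add(obj)
        else:
            # A property is directly used if it asserts a value.
            used.add(pred)
            seen.add(pred)

    labels = {
        (True, True): "defined and used",
        (True, False): "defined only",
        (False, True): "used only",
        (False, False): "neither",
    }
    return {term: labels[(term in defined, term in used)] for term in seen}


# Made-up sample data for illustration.
sample = [
    ("ex:Person", "rdf:type", "owl:Class"),
    ("ex:Student", "rdfs:subClassOf", "ex:Person"),
    ("ex:age", "rdfs:range", "xsd:integer"),
    ("ex:alice", "rdf:type", "ex:Student"),
    ("ex:alice", "ex:nick", "Al"),
]
result = classify_terms(sample)
for term in sorted(result):
    print(term, "->", result[term])
```

In this toy data, a property that is used without ever being declared lands in the “used only” group, and a datatype that appears only as the value of a range assertion lands in the “neither” group, mirroring the blue and gray cases described above.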

Big spike in blog comment spam?

July 15th, 2006

For some reason the number of spam blog comments making it through our Akismet plugin has gone way up in the past two weeks. According to Akismet’s stats page, it’s not due to an overall increase in comment spam and our blog has not suddenly gotten more popular.

For the past year, we’ve been getting only two to four spam comments a week that weren’t immediately identified as spam by Akismet and required moderation. During the last two weeks we’ve gotten more than ten a day! I’ve not been paying attention to any changes to the rate of spam comments trapped by Akismet. Since I’ve never seen a false positive, I’ve gotten into the habit of ignoring them. It looks like we’re getting ~300 a day. I think that’s much higher than we were experiencing, maybe by a factor of four.

I do notice that we’re getting copies of the same comment submitted to a large number of posts going deep into our archives. While I’ve not been paying close attention, I think in the past it was typical for spam comments to be submitted to relatively new posts. Maybe there is new comment spam software out there that uses Google blog search to access RSS feeds of the 100 most recent posts for a given blog.

But why isn’t this showing up in Akismet’s statistics?

If you can shed some light on this, please do.

W3C Rule Interchange Format WG publishes usecase and requirements study

July 12th, 2006

A natural and intuitive form to encode much knowledge is as a set of rules. Rule based languages, frameworks and systems come in many varieties and differ in many important characteristics, yet they also enjoy many similarities. Ever since the development of KIF in the early 1990s, people have worked toward developing a good rule interlingua that could be used as a high-level specification for a set of rules and also to support translation from one rule-based system to another. Other major efforts include Common Logic and RuleML. The newest effort is being undertaken by a World Wide Web Consortium group that is working to define a rule interlingua based on W3C standards.

The W3C’s Rule Interchange Format Working Group has published a second draft of its RIF Use Cases and Requirements document. The RIF group’s charter is to develop a rule language that will allow rules to be translated between rule languages and thus transferred between rule systems. As a first step, this document describes ten use cases representative of the application scenarios that the RIF is intended to support and identifies 17 requirements derived from them.

You can get a detailed view of the workings of the group and the issues it’s been wrestling with from the RIF-WG Wiki or the public-rif-wg@w3.org archive. Comments on the use cases document are sought through September 8, 2006.

Myspace.com most visited Internet domain

July 11th, 2006

More evidence that blogs continue to grow in importance: according to a report from Hitwise, Myspace has become the most visited Internet domain.

“www.myspace.com has surpassed Yahoo! Mail as the most visited domain on the Internet for US Internet users. To put MySpace’s growth in perspective, if we look back to July 2004 myspace.com represented only .1% of all Internet visits. This time last year myspace.com represented 1.9% of all Internet visits. With the week ending July 8, 2006 market share figure of 4.5% of all the US Internet visits, myspace.com has achieved a 4300% increase in visits over two years and 132% increase in visits since the same time last year.”

The Hitwise post also has some interesting statistics on myspace’s dominance in the most frequent search terms.

What ever happened to KQML?

July 11th, 2006

Several times a year I get email messages asking about the status of KQML (Knowledge Query and Manipulation Language), which was an agent communication language developed as part of the DARPA Knowledge Sharing Effort. Below is a slightly revised version of my response to the latest inquiry.

What ever happened to KQML?

There are no organized efforts to further develop KQML or even maintain a list of resources about it. The KQML effort was pretty much subsumed by FIPA’s activities. Many of us who developed KQML worked with FIPA beginning in the late 90s. FIPA continues as an IEEE standards effort. If you want to work with a message-oriented ACL framework, I recommend you go with FIPA.

KQML and FIPA-ACL were the ACLs that most people used. There are some others, of course. SRI’s OAA is one that is used by a community beyond its developers (e.g., in CALO). With support from the DARPA CoABS program, Global Infotek developed infrastructure for a hybrid KQML/FIPA system that is available as the Intelligent Service Layer. DARPA also funded the development of the Cognitive Agent Architecture (Cougaar) framework that has much in common with KQML and FIPA.

For the most part, multiagent systems researchers today are focused on issues other than communication and often use an ad hoc communication language and infrastructure in any implementations. In the 1990s we thought that a standard ACL and associated infrastructure like KQML or FIPA-ACL would enable a computing paradigm based on distributed, autonomous agents. Like most technology visions, it didn’t quite work out the way we imagined it.

Building intelligent distributed systems is very hard and the immediate benefits weren’t dramatic. Industry and many researchers moved on to explore other approaches for the next generation of distributed computing: grid computing, the semantic web, web services, SOA, Web 2.0, etc. The autonomous agent paradigm is still compelling and will be an important part of future systems. We weren’t ready to deploy it widely in the 1990s and probably still are not. It remains an excellent framework for research and experimentation on some of the fundamental issues that must be addressed in order to realize more intelligent distributed systems.

If you want to build a multiagent system and need to communicate at the knowledge level, I recommend using the FIPA standards. They are mature, well documented and supported by open source software packages, such as JADE.