A new measure of a researcher’s impact

August 29th, 2005

UCSD physicist Jorge Hirsch has proposed the h-index as a new bibliometric measure of a scholar’s impact, based on the number of publications and how often each is cited. See this story in Physics World for an overview. The h-index can be defined as follows:

A person who has published N papers has h-index H iff H of those papers each have at least H citations and the remaining N-H papers each have fewer than H citations.

You can easily estimate an author’s h-index using Google Scholar, since the results are ranked (more or less) by the number of citations, which are shown in the summaries. Try looking for papers authored by Turing. His 15 most cited papers all had at least 17 citations; his 16th most cited paper had only 13 citations. So Alan Turing’s h-index is 15.
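Under this definition, the index can be computed by sorting a scholar’s per-paper citation counts in decreasing order and finding the largest rank r such that the paper at rank r has at least r citations. A minimal sketch in Python (the function name is mine):

```python
def h_index(citations):
    """Compute the h-index from a list of per-paper citation counts.

    A scholar has index h if h of their papers each have at least
    h citations and the rest have fewer than h citations.
    """
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c >= rank:
            h = rank     # at least `rank` papers have >= rank citations
        else:
            break
    return h

# The Turing example from the text: 15 papers with at least 17
# citations and a 16th paper with only 13 gives h = 15.
```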

This example, of course, shows one problem with basing this on Google Scholar — it only takes into account papers it finds on the Web, a disadvantage for Turing. Another is that Google doesn’t eliminate “self citations” — citations where there is an author common to both the cited and citing papers. Accepting self citations invites gaming the system by always citing all of your earlier publications. CiteSeer is a web-based system that does eliminate self citations, as does ISI’s venerable citation database. But CiteSeer doesn’t rank author queries by citation count and also weights them by year. ISI’s coverage for Computer Science is not comprehensive and access costs money. So Google Scholar seems to be the easiest way to play with the h-index idea for CS at present.
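The self-citation test described above — an author common to both the cited and citing papers — is simple to state as code. A sketch, with made-up author lists (the function names are mine, not CiteSeer’s or ISI’s):

```python
def is_self_citation(citing_authors, cited_authors):
    """True if the citing and cited papers share at least one author,
    the definition of a self citation used in the text."""
    return bool(set(citing_authors) & set(cited_authors))

def independent_citation_count(paper_authors, citing_author_lists):
    """Count a paper's citations after discarding self citations,
    as CiteSeer and ISI's database do."""
    return sum(1 for authors in citing_author_lists
               if not is_self_citation(authors, paper_authors))
```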

Google Scholar and CiteSeer automatically discover and index papers of all types — journal, conference, book chapter and even technical report — unlike traditional citation databases like ISI’s. Should all of these contribute to a scholarly output metric? I think it’s not unreasonable. A technical report cited by 50 other papers has obviously had impact. Moreover, a paper’s visibility on the Web may become the dominant factor in its significance.

Hirsch argues that h is better than other commonly used single-number criteria for measuring a scholar’s output. He’s even suggested it could be used for tenure and promotion.

Moreover, he goes on to propose that a researcher should be promoted to associate professor when they achieve an h-index of around 12, and to full professor when they reach an h of about 18. (Link)

What counts as a high number will vary across disciplines and even sub-fields within disciplines. Moshe Vardi tells me that Computer Scientists with h>50 are rare and Jeff Ullman’s number in the mid-60s is the highest he’s seen.

Finally, single number measures like this are always just shadows cast on the wall of a cave.

Popular Terms for the Semantic Web

August 26th, 2005

A recent text-mining-based approach builds the “Semantic Web Encyclopedia of Terms”, listing interesting terms about the Semantic Web.
Terms are ranked by their relevance to the Semantic Web and categorized by a hierarchical taxonomy. Each term comes with “popularity” and “density” ratings and a list of relevant terms. The top five are the following:

  1. Semantic Web – Popularity: 89.62%, Density: 4.69%
  2. Web Services – Popularity: 32.28%, Density: < 1%
  3. Tim Berners-Lee – Popularity: 25.28%, Density: < 1%
  4. World Wide Web – Popularity: 20.32%, Density: < 1%
  5. Resource Description – Popularity: 16.93%, Density: < 1%

The Semantic Web’s place on the Hype Cycle

August 25th, 2005

I saw a link to Gartner’s Hype Cycle pages and thought I’d see what they said about the Semantic Web technologies. You’ve seen these graphs before — they chart the ups and downs of the ‘visibility’ for an idea or technology over time.

Gartner’s roller coaster ride works like this. An idea first appears after a technological trigger and begins a steep rise to the top — the “peak of inflated expectations”. Not being a hill climber, or maybe just having no brakes, it just as quickly descends into the “trough of disillusionment”. Screaming, I suppose. Sadder, but wiser, it makes a slow gentle climb up the “slope of enlightenment” to reach its final place in life, on the “plateau of productivity”. This plateau seems to be only about half as high as the initial peak. Eventually, the idea must disappear off the chart entirely, just as the shepherd’s sling vanished from the WMD hype chart.

The Semantic Web is mentioned on their “Hype Cycle for XML Technologies, 2005” dated 8 July 2005. One thing you should know is that it will cost you $495.00 to see the document, although you can be teased by viewing the table of contents for free. Another thing to know is that this hype cycle is just one of 15 in the Emerging Technology category, which itself is just one of 26 categories! At $495 each, the full set would cost over $100K! What a business model!! (Btw, annoyingly, most of their links are calls to javascript, making it hard to reference their content.)

The table of contents appears to mention the key items on the chart. For the Semantic Web these are:

On the rise: XML Topic Map
At the peak: Public Semantic Web
Sliding into the trough: OWL, RSS
Climbing the slope: RDF
Off the Hype Cycle: Semantic Web

Of course, you get what you pay for, and this much is free. Who knows what these terms mean or on what basis these predictions are made. I was surprised to see topic maps just starting out, though. Maybe it bought another ticket to ride.

UK tests active RFID license plates

August 23rd, 2005

The prospect of every licensed vehicle being required to have an active RFID tag raises lots of privacy issues, although in many ways we have them already with visual tags and modern image processing. It also opens the door to many new opportunities.

Brit License Plates Get Chipped, Mark Baard, Wired News, 9 August 2005

The British government is preparing to test new high-tech license plates containing microchips capable of transmitting unique vehicle identification numbers and other data to readers more than 300 feet away.

Proponents argue that making such RFID tags mandatory and ubiquitous is a logical move to counter the threat of terrorists using the roadways, and that it will scoop up insurance and registration scofflaws in the process.

The U.K. Department for Transport gave the official go-ahead for the microchipped number plates (as they are called in the United Kingdom) last week, and the trial is expected to begin later this year. The government has been tight-lipped about the details. One of the vendors bidding to participate in the trial said it would start with smartplates added to some police cars.

The point of the test is to see whether microchips will make number plates harder to tamper with and clone, said U.K. Department for Transport spokesman Ian Weller-Skitt. Many commuters use counterfeit plates to avoid the London congestion charge, a fee imposed on passenger vehicles entering central London during busy hours.

MORE (via Bruce Schneier)

Thieves use Bluetooth to find laptops to steal

August 22nd, 2005

UK thieves are using Bluetooth phones to scan for and detect Bluetooth-enabled laptops left in the trunks of cars. Detective Sergeant Al Funge, from Cambridge’s crime investigation unit, said:

“There have been a number of instances of this new technology being used to identify cars which have valuable electronics, including laptops, inside. The thieves are taking advantage of a relatively new technology, and people need to be aware that this is going on.”

MORE (via Schneier on Security).

Blog spam considered dangerous

August 21st, 2005

Spam blogs (splogs) and spam comments on blogs are a growing problem. Most splogs seem to be hosted by Blogger, which has made it easy to automatically generate and populate them. Comment spam is a bane to all. Now Google is introducing some features to fight this, including a tool to require word verification for comments and a “flag as objectionable” feature on the Blogger Navbar that could be used to slam splogs. However, including the Navbar is optional on Blogger blogs, and it remains to be seen whether such a reputation-based scheme will work in this environment outside the lab. Spammers might try to defeat it with false accusations against good blogs if they can manage to have them come from many IP addresses. (via Slashdot)

We think there will be lots of research opportunities here as spammers continue to adapt and evolve their techniques to counter each new anti-spam measure. We’re developing a new project to study and model the structure and content of blogs, with just one application being to recognize splogs and comment spam.

Tracking web site visitor locations with gvisit

August 20th, 2005

We’ve added a service (link) that shows the locations of recent visitors to our web site. You can get to it by clicking on the ABOUT US link in the header of any page and then clicking on the Recent web visitors link in the navigation menu on the left.

It’s fascinating to see the distribution and to zoom in and try to guess where each visitor is really from. Can you find your own tracks?

Here’s how I think it works, more or less. We put the required javascript code in the template used for each EBWEB page, which pings the gvisit service for us. They show the locations of recent “unique” visitors. The javascript must get the visitor’s IP address and send it to gvisit with our registered ID. Gvisit uses an IP geolocation database or service (e.g., IP2location) to get the visitor’s long/lat and a string describing the location. For each registered site, once an hour, gvisit gets the recent visitors with unique long/lats from its database (maybe deleting them afterwards) and builds the XML to add to the Google map using Google’s API.
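That hourly step — pull recent visitors, geolocate them, and keep one entry per unique long/lat — might look something like the sketch below. The lookup table, IP blocks, and function names are my guesses to illustrate the flow, not gvisit’s actual code:

```python
# Hypothetical IP -> (lat, long, label) table standing in for a real
# geolocation database or service such as IP2location.
GEO_DB = {
    "130.85.0.0": (39.29, -76.61, "Baltimore, MD"),   # made-up campus block
    "18.0.0.0":   (42.36, -71.09, "Cambridge, MA"),   # made-up MIT block
}

def recent_unique_locations(visitor_ips, geolocate=GEO_DB.get):
    """Map recent visitor IPs to lat/long and a location string,
    keeping only one entry per unique lat/long pair."""
    seen = {}
    for ip in visitor_ips:
        loc = geolocate(ip)
        if loc is not None:
            seen.setdefault((loc[0], loc[1]), loc)  # first hit wins
    return list(seen.values())
```

Each returned entry would then become one marker in the XML handed to the Google Maps API.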

How well does it work at localizing your visitors? If 50 visitors from different IP addresses, all on UMBC’s campus, access our web site, only one shows up, since all of us are reduced to a single long/lat, which is in downtown Baltimore. I guess this is where we connect to the backbone. My home machine gets mapped to Arlington, Virginia, presumably along with all 10,000 (est.) Comcast broadband customers in the greater Baltimore-DC area. A simple improvement to the gvisit service would be to keep a counter of the number of hits from a given long/lat, so I could see that, say, 5 hits came from MIT, 43 from UMBC and 54 from Comcast.
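The suggested improvement is just a counter keyed on location. A sketch in Python, with hypothetical location labels:

```python
from collections import Counter

def hits_per_location(geolocated_hits):
    """Count hits per location label, so 50 campus visitors show up
    as one pin with a count of 50 rather than a single dot.

    geolocated_hits: iterable of (lat, long, label) tuples, one per
    hit, as a geolocation lookup might produce them.
    """
    return Counter(label for _lat, _long, label in geolocated_hits)
```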

I’ve noticed some mismatches between the long/lat values and string names — e.g., a location in Maryland that’s said to be in Florida. I’ve also noted some locations that surely must be off the grid — like one at the northernmost bit of Norway. More noise, I’m guessing, but please correct me if you are reading this and hail from there.

Google changes the presentation of its web search result

August 18th, 2005

When I typed ‘rdf’ in google search, I got something different! A box has been inserted in the middle of the result page (see here).

Web Results 1 – 10 of about 22,900,000 for rdf [definition]. (0.57 seconds)

Resource Description Framework (RDF) / W3C Semantic Web Activity

Official pages from the World Wide Web Consortium; includes the specification, resources and news, and a links collection.
www.w3.org/RDF/ – 51k – Cached – Similar pages

See results for: rdf media

RDF Media | Home
RDF Media is one of Britain’s leading independent television production companies,
responsible for hit shows such as Wife Swap, Faking It, …

RDF Media Ltd (PACT) – London, UK
Send an e-mail to RDF Media Ltd (PACT) Email this company. My email address:.
My company:. I’d like to: … RDF Media Ltd (PACT). Telephone and fax …

www.rdf.co.uk. … CO.UK . . . []. Showing 0 – 0 of 0 Items for. terms &
conditions :: © 2004 www.webpark.co.uk :: www.rdf.co.uk :: contact us.

The box demonstrates query expansion — Google sends you back the first three results of a highly relevant alternative query. This change, I believe, is likely based on long-term tracking of user behavior.
You can try more queries categorized as error correction, e.g., flicker, alo. A similar feature is the addition of image search results to web search results, e.g., person.

Meet Trump University’s CLO: Roger Schank

August 17th, 2005

Back in May, Donald Trump announced the establishment of Trump University as “a new business education company focused on providing lifelong learning programs for business professionals.”

Trump University will offer a rich mix of products and services, including online e-learning courses, multimedia home study programs, and a series of publications.

Most importantly, Trump University will deliver the experience, knowledge, and wisdom of Donald Trump himself. The real-estate mogul’s personal teachings, experiences and philosophies will be fully integrated into the curriculum. He will be featured in online video, the Trump University website, multimedia learning programs, and much more.

If that were not amazing enough, TU’s Chief Learning Officer (Provost?) is Roger Schank, the famous and sometimes controversial AI scientist.

Roger Schank, professor emeritus and founder of the Institute for Learning Sciences at Northwestern University and one of the world’s top researchers of artificial intelligence, learning theory and cognitive science, has been appointed Chief Learning Officer.

As CLO, Schank will oversee the design and implementation of the e-learning curriculum. Schank, who has also taught at Yale University among other top institutions and is author of some 25 books, is a pioneer and innovator in applying the Learning by Doing philosophy to online education. “People know that they learn by doing precisely because they know that they can learn nothing of value without constant practice” said Schank.

I wonder how one would represent “You’re fired!” using Schank’s conceptual dependency formalism? I think it would involve an MTRANS of an EXPEL of an ATRANS of … The rest is left as an exercise for the reader.

Where are your readers from?

August 16th, 2005

Gvisit.com offers a clever service that shows the locations of your web site’s recent visitors using Google Maps. After registering your site, you put a bit of javascript on each page you want to track. Your map shows the last 20 or 100 unique visitors, depending on whether you are using the free or paid version. Here’s the visitors map we set up for EBWEB, our ebiquity site.

Yea, but was it a fair fight?

August 16th, 2005

Several people have pointed out some interesting issues in the methodology used to estimate whether Google or Yahoo has indexed more documents.

Seth Finkelstein points out that using a random two-word test search like “alkaloid’s observance” results in 15 hits on Google and none on Yahoo. But not one of the 15 pages Google found is really of interest — they are copies of word lists or spam blogs. It hardly seems fair to call a foul on Yahoo for not indexing useless documents. I’d pay extra for that service!

Eric Glover observes on Dave Farber’s IP list two implicit assumptions in the experiment:

#1: That both Google and Yahoo use the same relevance function to decide which results to include – or there is some way to post process this to compare equally.

#2: That the Yahoo crawler is biased in a way that is equally probable for results which are returned for the keywords in the study – or at least close.

I wonder if we could ever develop a consensus technique for such an experiment.

WSDL-S: lighter weight approach to semantic web services

August 15th, 2005

Researchers at the University of Georgia’s LSDIS Lab have developed WSDL-S as a lightweight approach for adding semantics to Web services. This is an alternative to other schemes, including OWL-S and WSMO.

They have also released Radient, an Eclipse plugin that provides a UI for annotating existing WSDL documents into WSDL-S via an OWL ontology. Radient uses Harry Chen’s Cobra Ontology Viewer plugin.