Google, HMOs and Sicko

June 30th, 2007

I don’t know about you, but I found this Boing Boing post (Google to HMOs: pay us and we’ll defuse “Sicko”) disturbing.

“Google’s “Health Advertising Team” is trying to sell the health industry on buying ads to be shown opposite searches for “Sicko.” The idea is to counter Michael Moore’s amazing, enraging, must-see indictment of the health industry’s grip on American society by running ads over search results for Sicko.”

Google’s health marketeers wonder in their blog “Does negative press make you Sicko?” and are ready to help, just in case you get a little queasy.

“We can place text ads, video ads, and rich media ads in paid search results or in relevant websites within our ever-expanding content network. Whatever the problem, Google can act as a platform for educating the public and promoting your message. We help you connect your company’s assets while helping users find the information they seek.”

It’s good to have an open market, and companies as well as individuals should enjoy the right of free speech. Michael Moore certainly does, and some think he abuses it. But excessive corporate advertising and propaganda, especially if it misleads, dissembles or lies, can do a lot of harm. It will be a bad thing if Google enables it on the Web. I don’t think we are there yet, but I’m not sure about the direction we are headed in or why we are in this handbasket.

Powerset all set to power up NLP based Search

June 30th, 2007

As a member of the Powerlabs community (sign up here if you haven’t already), last night I had the privilege of being part of a select audience that was shown an exclusive preview of Powerset’s natural language search engine.

What they have achieved is truly amazing! With some of the parsing and Natural Language Processing (NLP) technology licensed from PARC, Powerset has the ability to semantically process not only the queries but also entire documents (and in fact the whole Web). What this means is that, in contrast to purely statistical approaches to building search indices, Powerset is betting that computational linguistics and NLP can add richer semantics to the text. By treating words purely as literals, most search engines cannot “understand” their “meaning”. In contrast, Powerset’s approach is to figure out parts of speech, named entities and relations, and to add “facts” to a repository of knowledge by mapping words to their meanings in an ontology (using Freebase, WordNet and other knowledge resources). Consider the following queries that were demonstrated:

– which companies were acquired by Peoplesoft

– Which company acquired Peoplesoft?

– Acquisitions in 2001

Note that queries don’t have to be purely question based: this is a confusion that most people seem to have about NLP-based search vs. Question Answering. Traditional search engines would really have a hard time disambiguating the semantics of such queries, since they ignore terms like “in”, “by”, etc. Search engines also cannot map phrases like “bought over”, “taken over”, etc. to their conceptual meaning of “being acquired”, not to mention the difference between “being acquired” and “acquiring”. Based on our group’s experience with language technology related research and SemNews, we can really appreciate the complexity and significance of these tasks. It’s a really hard problem, and Powerset is trying to ride Moore’s law to make it feasible to semantically annotate and index large collections.
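To make the “acquired” vs. “acquired by” distinction concrete, here is a toy sketch in Python. It is emphatically not Powerset’s technology: the phrase list, the regular expressions, the function names and the company names are all made up for illustration. It only shows how different surface phrases can be normalized to one relation while keeping track of who did the acquiring.

```python
import re

# Hypothetical surface phrases, all normalized to the single relation "acquire".
ACQUIRE_PHRASES = ["acquired", "bought over", "taken over", "took over", "bought"]
_ALTERNATIVES = "|".join(re.escape(p) for p in ACQUIRE_PHRASES)

# Passive voice: "X was acquired by Y" means Y acquired X.
_PASSIVE = re.compile(rf"(.+?) (?:was|were) (?:{_ALTERNATIVES}) by (.+)")
# Active voice: "Y acquired X".
_ACTIVE = re.compile(rf"(.+?) (?:{_ALTERNATIVES}) (.+)")


def extract_fact(sentence):
    """Return (acquirer, "acquire", acquired) for a toy sentence, or None."""
    sentence = sentence.strip().rstrip(".")
    passive = _PASSIVE.fullmatch(sentence)
    if passive:
        # The grammatical subject of a passive sentence is the acquired company.
        return (passive.group(2), "acquire", passive.group(1))
    active = _ACTIVE.fullmatch(sentence)
    if active:
        return (active.group(1), "acquire", active.group(2))
    return None


def acquired_by(facts, company):
    """Answer "which companies were acquired by <company>"."""
    return sorted(obj for (subj, _rel, obj) in facts if subj == company)


def acquirers_of(facts, company):
    """Answer "which company acquired <company>"."""
    return sorted(subj for (subj, _rel, obj) in facts if obj == company)


# Made-up sentences with made-up company names.
SENTENCES = [
    "Acme acquired Widgetco.",
    "Gadgetly was bought over by Acme.",
    "Widgetco took over Smallfirm.",
]
FACTS = {extract_fact(s) for s in SENTENCES}

print(acquired_by(FACTS, "Acme"))
print(acquirers_of(FACTS, "Widgetco"))
```

A keyword index that drops “by” would see the two questions as the same bag of words. Here they read opposite ends of the same stored fact.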

Powerset is also introducing Powerlabs, where its community of users can provide feedback and ideas and actually participate (Digg style) in the product’s development (and earn points to become experts on a topic). This would be a fresh change from the typical launch strategies of most startups these days. There will surely also be APIs, widgets and new mashups coming out of this.
Overall, I was pretty impressed and think that if Powerset gets it right this would be a huge leap in terms of search. However, one challenge would be to retrain users on how they think about queries. Powerset is trying to overcome this with new interfaces for search engines and interestingly will let the user community on Powerlabs decide what they like and dislike – a Social Media approach to product launch!

Kevin Burton, Dan Farber and others have also chimed in with their thoughts from the demo.

Humans vs the hairless bipeds in OWL

June 29th, 2007

On the W3C Owl Development mailing list there has been an interesting discussion raised by a query from Denny Vrandecic about annotation properties. He asked if ex:A rdfs:label “X” and ex:A owl:sameAs ex:B entail ex:B rdfs:label “X”. The issue is that rdfs:label is defined in OWL as an annotation property. These are special properties that can be used to assert values for classes without requiring the result to be in OWL Full.

The annotation property concept dates back to KL-ONE, I think, and comes with the notion that such properties don’t normally participate in entailments — they’re just extra annotations, useful for properties like lastEditedBy and such.

Denny framed his question as “are annotation property instances connected to the URI or the underlying individual?”. The upshot seems to be that the entailment holds for owl:sameAs but not owl:equivalentClass, which Denny found unintuitive. I thought that an explanation from Pat Hayes was marvelously clear and shed light on the more philosophical underpinnings of OWL.

“Allow me to suggest the appropriate intuition. In RDFS and OWL-Full there is a distinction between a class and the (set which is the) extension of the class. So two different classes might have the same sets of instances and yet still be distinct classes. (This is often described by saying that RDFS and OWLFull classes are ‘intensional’ as opposed to ‘extensional’. For example, the classes of Human and HairlessBiped have the same members, but one might want to distinguish them since they have different defining conditions on membership. OWL-DL refuses to countenance such a possibility, although this may be rectified in OWL 1.1.) Thus there is an intuitive distinction in meaning between equivalentClass (having the same instances) and sameAs (being exactly the same thing). When, as in RDFS and OWL-Full, classes can have properties, one wants to preserve this distinction by saying that if A sameAs B then all the properties of A are also properties of B (since A and B are the very same thing); but this does not follow for A equivalentClass B, since A and B might still be distinct even if they do have the same members.”
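Here is a toy sketch of that intuition in Python. It is not an OWL reasoner and covers only a sliver of the semantics: it copies subject-position property assertions across owl:sameAs and deliberately does nothing for owl:equivalentClass. The function name and the sample triples (beyond Denny’s ex:A and ex:B) are my own assumptions.

```python
def same_as_closure(triples):
    """Copy property assertions across owl:sameAs pairs until nothing new appears.

    Simplification: only the subject position is substituted, and
    owl:equivalentClass is intentionally given no such rule.
    """
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        same = {(s, o) for (s, p, o) in facts if p == "owl:sameAs"}
        same |= {(o, s) for (s, o) in same}  # sameAs is symmetric
        for a, b in same:
            for (s, p, o) in list(facts):
                if s == a and p != "owl:sameAs" and (b, p, o) not in facts:
                    facts.add((b, p, o))
                    changed = True
    return facts


# Denny's example: A and B are the very same thing, so B gets A's label.
individuals = same_as_closure({
    ("ex:A", "rdfs:label", "X"),
    ("ex:A", "owl:sameAs", "ex:B"),
})
print(("ex:B", "rdfs:label", "X") in individuals)

# Two classes with the same members may still be distinct classes,
# so the label does not carry over.
classes = same_as_closure({
    ("ex:Human", "rdfs:label", "Human"),
    ("ex:Human", "owl:equivalentClass", "ex:HairlessBiped"),
})
print(("ex:HairlessBiped", "rdfs:label", "Human") in classes)
```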

I’m just glad no one brought up birds.

Faceted search for DBLP bibliographic data

June 29th, 2007

Michael Ley started DBLP in 1993 as an experiment in providing Web access to bibliographic information on database systems and logic programming using the then new Web infrastructure. Over the past 14 years it has grown to be an important resource for Computer Science, with high quality information on more than 900K articles from selected journals and conferences across the discipline. To reflect the broader scope, its acronym is now taken to stand for Digital Bibliography and Library Project. (Btw, it has always seemed ironic to me that the DBLP data is not stored in a database system, but rather in a large collection of files glued together by a set of scripts, sort of like the Universe.)

DBLP has also been a great dataset for many research projects since its information can be freely downloaded as an XML document (a big one!). For example, our research group has used it for data mining, visualization, social networking, and some semantic web work.
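Because the file is big, it is worth streaming it rather than loading it whole. Here is a minimal sketch using Python’s standard library. The element names (article, inproceedings, author) and the tiny inline sample are assumptions for illustration, and the real download may need extra handling (for example, for character entities declared in its DTD) that this sketch ignores.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny made-up sample standing in for the real (much larger) XML file.
SAMPLE = """<dblp>
  <article key="a1">
    <author>Ann Author</author><author>Bob Writer</author>
    <title>First made-up paper</title><year>2005</year>
  </article>
  <inproceedings key="p1">
    <author>Ann Author</author>
    <title>Second made-up paper</title><year>2006</year>
  </inproceedings>
</dblp>"""

# Assumed record element names; adjust to whatever the real file uses.
RECORD_TAGS = {"article", "inproceedings"}


def papers_per_author(source):
    """Stream through the XML and count records per author name."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in RECORD_TAGS:
            for author in elem.findall("author"):
                counts[author.text] += 1
            elem.clear()  # drop the record's children to keep memory use low
    return counts


counts = papers_per_author(io.BytesIO(SAMPLE.encode("utf-8")))
print(counts.most_common())
```

With a real file you would pass its path (or an open file object) to papers_per_author instead of the in-memory sample.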

The L3S Research Center at University of Hannover has a new DBLP faceted search service that lets users do keyword searches over all of the metadata and also supports more elaborate navigational access to the collection. Searchers can restrict queries by topic, where paper topic keywords are automatically generated using “higher-order co-occurrences of author keywords”. The system uses a novel Semantic GrowBag algorithm that uses the “structure of the semantic network induced by the usage of keywords over the document corpus”.

This looks quite useful.

Doctors cautious about RFID implants

June 28th, 2007

Here’s an item from the New York Times’s new Bits (for Business, Innovation, Technology, Society) blog.

The A.M.A. Gets Under Your Skin, Barnaby Feder
“On Monday, the delegates to the annual meeting in Chicago of the American Medical Association endorsed what could only be described as a cautious committee report on the role of implanted radio identification tags in patient care. The three recommendations were that doctors make sure that they get informed consent from patients before using such tags, that they not program any medical information into them without making sure it would be as secure as normal medical records, and that they support research into uses of the devices. …more…

A coming World Wide Web of virtual worlds

June 18th, 2007

A recent article, Online gaming’s Netscape moment?, carries this summary: Video games: Existing virtual worlds are built on closed, proprietary platforms, like early online services. Might they now open up, like the web?

The premise of the article is that virtual world game engines like Multiverse are not only making it easier to create new worlds but will also allow characters and game entities to move from world to world. The authors liken this to the change that happened as the Internet moved from being a series of “walled gardens” based on proprietary online services like The WELL, AOL and Prodigy to the Web and its standard languages, protocols and middleware.

Read the rest of this entry »

Wikipedia banned from Universities

June 17th, 2007

An article in the Vallejo Times Herald, Wikipedia banned from UCSC class, sounds very alarming, but it turns out not to be so bad. Maybe even reasonable.

“SANTA CRUZ – UC Santa Cruz professor Dan Wirls adopted a policy banning students in his American government class from citing Wikipedia in research papers. It’s not that the collaborative online encyclopedia is bad or wrong – though inaccurate information is always a risk, says Wirls and other UCSC faculty who are noticing a growing number of students using Wikipedia. The main gripe from Wirls, chairman of the politics department, is that students “are entering college with almost no research skills beyond their rudimentary use of the Internet. “They do not know how to use a library,” he said.

The article notes that Middlebury College’s history department was the first known department to ban Wikipedia citations when it did so this past February. Wikipedia’s position on this is supportive:

“Wikipedia is the ideal place to start your research and get a global picture of a topic. However, it is not an authoritative source,” said Sandra Ordonez, a Wikipedia spokeswoman. “We recommend that students check the facts they find in Wikipedia against other sources. It’s usually not advisable, particularly at the university level, to cite an encyclopedia.”

The larger problem here is that the Web is the first place most of us turn to when looking for information. For an increasingly large set of topics, a Wikipedia article is not only one of the first few results returned by a search engine, it’s also one of the best, at least for a reasonably objective, concise and trustworthy overview. Tracking down information to a primary source is difficult, requiring skill, experience and time. More often than not, the primary source requires considerable background knowledge to read and understand. It’s hard work!

How owl:imports is used

June 15th, 2007

On the W3C’s public-owl-dev mailing list, Jeremy Carroll mentioned an example that involved the set of documents that import at least four ontologies. I’ve been wanting to look at some data on how people are using owl:imports for a while and this prompted me to do it. Swoogle has data on just over 2.3M RDF documents it’s found on the web. These include about 11K imports relations. If we ignore a handful of documents that import RDF, RDFS or OWL, there are 6661 unique documents that import at least one document and 1613 documents that are imported by at least one document.

How common is it for a document to import many ontologies? Not very, it turns out. Of documents that import any at all, 71% import just one, 89% import one or two and 97% import five or fewer. The five documents that import the most ontologies are these:

Read the rest of this entry »
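For anyone who wants to run the same kind of tally on their own crawl, here is a small Python sketch. It is not the query I ran against Swoogle. It assumes the imports relations are available as (importing document, imported ontology) pairs, and the function names and sample data are made up.

```python
from collections import Counter


def import_stats(import_pairs):
    """Return (imports per importing document, set of imported documents)."""
    pairs = set(import_pairs)  # ignore duplicate relations
    per_importer = Counter(src for src, _dst in pairs)
    imported = {dst for _src, dst in pairs}
    return per_importer, imported


def share_importing_at_most(per_importer, k):
    """Fraction of importing documents that import k or fewer ontologies."""
    total = len(per_importer)
    return sum(1 for n in per_importer.values() if n <= k) / total


# Made-up imports relations; "d*" are documents and "o*" are ontologies.
PAIRS = [
    ("d1", "o1"), ("d1", "o1"),  # a duplicate relation, counted once
    ("d2", "o1"), ("d2", "o2"),
    ("d3", "o1"),
    ("d4", "o3"),
]

per_importer, imported = import_stats(PAIRS)
print(len(per_importer), "documents import at least one ontology")
print(len(imported), "documents are imported at least once")
print(share_importing_at_most(per_importer, 1))
print(per_importer.most_common(5))  # the heaviest importers
```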

Visualizing the growth of neighborhoods

June 13th, 2007

Trulia is a real estate search engine with some interesting features. It says that it “helps you find homes for sale and provides real estate information at the local level” and its business model is based on advertising. It has a good map-based interface.

Matt Hurst blogged about their Hindsight tool, which is a fascinating visualization of historical data.

“Trulia Hindsight is an animated map of homes in the United States from Trulia. The animations use the year the properties were built to show the growth of streets, neighborhoods and cities over time.”

The visualization is based on public property assessor records for properties in Trulia’s database, presumably because they have been offered for sale in the past five years. These records typically include the date that a house was built. As an example, look at this animation for the Baltimore area.

CS PhD production up 26% in North America

June 12th, 2007

The Computing Research Association reports that CS PhD production in the US and Canada jumped 26% last year.

According to CRA’s Taulbee Survey, the number of doctorates granted by CS and CE departments in the US and Canada increased 26% between academic year 2005 and 2006, to 1,499. This is the third year of strong growth in degree production. Even more doctorates are likely to be granted in 2007: the number of students who passed their thesis candidacy exams increased 19% from the previous year. Further along, there are some signs that the growth in production will ease.

CRA’s Taulbee Survey is *the* source for data on the enrollment, production, and employment of CS and CE students and also salary and demographic data for CS/CE faculty in North America.

Twitter: talk is cheep

June 11th, 2007

An article (see free version) by Mike Butcher in the UK magazine New Media Age mentioned the work of ebiquity PhD candidate Akshay Java in analyzing the usage of Twitter, a popular new ‘microblogging’ system. Using Twitter, people can write and read short blog-like posts that are distributed via mobile phones, instant messaging systems and the Web. As part of his research on social media, Java has constructed Twitterment, the first search engine for Twitter posts. NMA is a weekly magazine covering the business of interactive media — the Internet, wireless internet and interactive TV.

Bloggin bout my g-g-generation

June 8th, 2007

XKCD blogging bout my generation