Deadlines for submitting papers, Doctoral Consortium applications and tutorial proposals for the Seventh International Semantic Web Conference are fast approaching. ISWC ‘08 will be held 26-30 October 2008 in Karlsruhe, Germany. Key upcoming dates include:
Research papers: due May 9 (titles and abstracts), May 16 (full papers)
Semantic Web in Use papers: due May 16
Tutorial proposals: May 16
Doctoral Consortium applications: due May 16
Posters & Demo proposals: due July 25
Workshop papers (13 workshops): mid-summer
Semantic Web & Billion Triples challenge: Oct 1
ISWC 2008 CONFERENCE: October 26-30
See the ISWC 2008 site for CFPs and other details. Inquiries about specific tracks should be sent to the appropriate chairs. Send general questions and suggestions for panel topics, invited speakers, birds of a feather meetings, etc. to iswc08@gmail.com.
“We invite submissions to the sixth annual Semantic Web Challenge, the premiere event for demonstrating practical progress towards achieving the vision of the Semantic Web. The central idea of the Semantic Web is to extend the current human-readable web by encoding some of the semantics of resources in a machine-processable form. Moving beyond syntax opens the door to more advanced applications and functionality on the Web. Computers will be better able to search, process, integrate and present the content of these resources in a meaningful, intelligent manner.
As the core technological building blocks are now in place, the next challenge is to show off the benefits of semantic technologies by developing integrated, easy to use applications that can provide new levels of Web functionality for end users on the Web or within enterprise settings. Applications submitted should demonstrate clear practical value that goes above and beyond what is possible with conventional web technologies alone.
Unlike in previous years, the Semantic Web Challenge of 2008 will consist of two tracks: the Open Track and the Billion Triples Track. The key difference between the two tracks is that the Billion Triples Track requires the participants to make use of the data set –a billion triples– provided by the organizers. The Open Track has no such restrictions.
As before, the Challenge is open to everyone from academia and industry. The authors of the best applications will be awarded prizes and featured prominently at special sessions during the conference”
“Research in data mining has led to advanced knowledge discovery technologies and applications. In this talk, we will discuss some emerging research issues for advanced technologies and applications in data mining and discuss some recent progress in this direction, including (1) exploration of the power of pattern mining, (2) analysis of multidimensional, heterogeneous and evolving information network, (3) mining of fast changing data streams, (4) mining of moving object data, RFID data, and data from sensor networks, (5) spatiotemporal and multimedia data mining, (6) biological data mining, (7) text and Web mining, (8) data mining for software engineering and computer system analysis, and (9) data cube-oriented multidimensional online analytical analysis.”
Tomas Rokicki has written up a proof that any Rubik’s Cube configuration can be solved in 25 or fewer moves. In his paper, Twenty-Five Moves Suffice for Rubik’s Cube, Rokicki proves that no configuration requires 26 moves. Since earlier results had already ruled out positions requiring 27 or more moves, this means that 25 moves suffice for any configuration.
“How many moves does it take to solve Rubik’s Cube? Positions are known that require 20 moves, and it has already been shown that there are no positions that require 27 or more moves; this is a surprisingly large gap. This paper describes a program that is able to find solutions of length 20 or less at a rate of more than 16 million positions a second. We use this program, along with some new ideas and incremental improvements in other techniques, to show that there is no position that requires 26 moves.”
“Rokicki’s proof is a neat piece of computer science. He’s used the symmetry of the cube to study transformations of the cube in sets, rather than as individual moves. This allows him to separate the “cube space” into 2 billion sets each containing 20 billion elements. He then shows that a large number of these sets are essentially equivalent to other sets and so can be ignored. Even then, to crunch through the remaining sets, he needed a workstation with 8GB of memory and around 1500 hours of time on a Q6600 CPU running at 1.6GHz.”
Rokicki is working to establish a bound of 24 moves and thinks that a bound of 20 can eventually be proved.
Language models are widely used in processing both written and spoken language. They are used for part-of-speech tagging, sense tagging, disambiguation, text similarity metrics, and many other tasks, including predicting the words a person intends when typing on a telephone keypad. The last application has some interesting wrinkles, as this video we spotted on Language Log explains.
The most popular predictive text system in use today is T9, developed by Nuance Communications. You can check out the video’s examples using this T9 demo.
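To make the keypad ambiguity concrete, here is a minimal sketch of dictionary-based disambiguation. It is not Nuance's T9 algorithm: it assumes a simple unigram frequency model, and the word counts and the names `to_digits`, `build_index` and `predict` are made up for illustration.

```python
from collections import defaultdict

# Standard telephone keypad letter groups.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}


def to_digits(word):
    """Map a word to the key sequence a user would press for it."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.lower())


def build_index(word_counts):
    """Group words by key sequence, most frequent first (ties alphabetical)."""
    index = defaultdict(list)
    for word, count in word_counts.items():
        index[to_digits(word)].append((count, word))
    return {
        digits: [w for _, w in sorted(pairs, key=lambda cw: (-cw[0], cw[1]))]
        for digits, pairs in index.items()
    }


def predict(index, digits):
    """Return candidate words for a key sequence, best guess first."""
    return index.get(digits, [])


# Hypothetical counts; a real system would estimate them from a corpus.
WORD_COUNTS = {"good": 50, "home": 40, "gone": 30, "hood": 5, "hello": 20}
INDEX = build_index(WORD_COUNTS)
print(predict(INDEX, "4663"))
```

Four different words share the key sequence 4663 here, so the language model decides which one is offered first. A unigram model always makes the same guess for a given sequence; using the preceding words as context would let the ranking change from sentence to sentence.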
Social networks and Web graphs exhibit certain typical properties. The classic work by Barabási and Albert showed how nodes in such networks link preferentially: popular nodes often gain a disproportionately large share of the links. This is also known in other fields as the 80/20 rule or simply the “rich get richer” phenomenon. Another early work by Steve Borgatti studied social networks and found that they exhibit a core-periphery property. A small set of (popular) nodes forms the core and the rest make up the periphery. To the best of my knowledge, community detection algorithms have often worked independently of such underlying network properties.
I have been exploring an idea that can utilize the core-periphery structure of social networks to approximately compute the communities in the graph. The intuition behind this method is really quite simple. The basic idea boils down to the following:
“The core of the social network typically defines the communities present in it. By looking at the link structure of the core and identifying how the rest of the network connects to the core we can efficiently compute communities in large graphs.”
This idea can be easily explained by considering the following network of email communication (obtained from Dr. Mark Newman’s site). The original adjacency matrix was permuted to order the nodes by degree. The core is thus represented by submatrix A, which is quite dense. Submatrix B corresponds to how the rest of the network links to the core. Submatrix C is a very sparse matrix that consists of links between nodes in the long tail. Since C is quite sparse, it can be ignored without much degradation of the clustering/community detection results, which saves a significant amount of computation and storage. By utilizing just the core of the social network (matrix A) and how other nodes link to the core (matrix B), we can approximate the community structure of the entire graph much more efficiently.
The rest boils down to the mathematical formulation of the above idea using spectral clustering techniques. You can read more about it in my poster paper that was recently accepted to ICWSM. (A tech report version with a more detailed analysis will be available shortly.)
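As a rough illustration of the A/B decomposition, here is a small NumPy sketch. It is not the formulation from the poster paper. It makes these assumptions: there are only two communities, the core is split by the sign of the Fiedler vector of A's graph Laplacian, and each peripheral node is assigned by majority vote over its links into the core. The function name, the `core_size` parameter and the toy graph are all hypothetical.

```python
import numpy as np


def core_periphery_communities(adj, core_size):
    """Approximate a two-way community split using only the core (A)
    and the periphery-to-core links (B); submatrix C is never read."""
    degree = adj.sum(axis=1)
    # Order nodes by decreasing degree; the top `core_size` form the core.
    order = np.argsort(-degree, kind="stable")
    core, periphery = order[:core_size], order[core_size:]

    A = adj[np.ix_(core, core)]        # dense core-core block
    B = adj[np.ix_(periphery, core)]   # periphery-to-core block

    # Spectral bisection of the core: sign of the Fiedler vector, i.e. the
    # eigenvector for the second-smallest eigenvalue of the Laplacian of A.
    laplacian = np.diag(A.sum(axis=1)) - A
    _, vectors = np.linalg.eigh(laplacian)
    core_labels = (vectors[:, 1] > 0).astype(int)

    # Each peripheral node joins the community holding most of its core links.
    votes_one = B @ core_labels
    votes_zero = B @ (1 - core_labels)

    labels = np.empty(adj.shape[0], dtype=int)
    labels[core] = core_labels
    labels[periphery] = (votes_one > votes_zero).astype(int)
    return labels


# Toy graph: two triangles joined by one bridge form the core (nodes 0-5);
# nodes 6-9 are peripheral and each link to two core nodes.
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3),
         (6, 0), (6, 1), (7, 0), (7, 1), (8, 4), (8, 5), (9, 4), (9, 5)]
ADJ = np.zeros((10, 10))
for u, v in EDGES:
    ADJ[u, v] = ADJ[v, u] = 1.0

LABELS = core_periphery_communities(ADJ, core_size=6)
print(LABELS)
```

Which community gets label 0 and which gets label 1 depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful. The eigendecomposition runs on the small core block rather than the full adjacency matrix, which is where the savings in computation and storage come from.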
“At Money:Tech yesterday, I did an on-stage interview with Devin Wenig, the charismatic CEO-to-be of Reuters (following the still-not completed merger with Thomson). Devin highlighted what he considers two big trends hitting financial (and other professional) data: … The end of benefits from decreasing the time it takes for news to hit the market. … he increasingly sees Reuters’ job to be making connections, going from news to insight. He sees semantic markup to make it easier to follow paths of meaning through the data as an important part of Reuters’ future. … Ultimately, Reuters’ news is the raw material for analysis and application by investors and downstream news organizations. Adding metadata to make that job of analysis easier for those building additional value on top of your product is a really interesting way to view the publishing opportunity. If you don’t think of what you produce as the “final product” but rather as a step in an information pipeline, what do you do differently to add value for downstream consumers? In Reuters’ case, Devin thinks you add hooks to make your information more programmable.”
This provides some background for their recent announcement of the Reuters Calais information extraction service. It extracts named entities, events and relations from text and returns the information as RDF data.
“DARPA, the Pentagon’s mad science division, got a $324 million boost in the Defense Department’s new budget — a ten percent increase. Which means lots more cash for giant blimps, next-gen wireless networks, Mach 6 planes, shape-shifting drones, and improvised bomb-beaters. … But not everything in the DARPA budget got bumped up. The agency’s much-ballyhooed efforts at “Cognitive Computing” took at $30 million cut, to $145 million. Which could mean that even the Pentagon’s most wide-eyed visionaries see thinking machines are still far, far off in the distance.” (link)
DARPA has traditionally been an important funding source for basic computer science research. While the ORCA program got a healthy increase of $53M, this is the only CS-related program mentioned.
“Their innovations transformed this approach from a theoretical technique to a highly effective verification technology that enables computer hardware and software engineers to find errors efficiently in complex system designs. This transformation has resulted in increased assurance that the systems perform as intended by the designers. … Clarke of Carnegie Mellon University, and Emerson of the University of Texas at Austin, working together, and Sifakis, working independently for the Centre National de la Recherche Scientifique at the University of Grenoble in France, developed this fully automated approach that is now the most widely used verification method in the hardware and software industries.” (link)
Reuters has released an API for its Calais Web service. The free service discovers entities, events and relations in text and returns the results in the form of RDF data. The service uses information extraction technology from ClearForest, which Reuters acquired in April 2007.
“The Calais web service automatically attaches rich semantic metadata to the content you submit – in well under a second. Using natural language processing, machine learning and other methods, Calais categorizes and links your document with entities (people, places, organizations, etc.), facts (person ‘x’ works for company ‘y’), and events (person ‘z’ was appointed chairman of company ‘y’ on date ‘x’). The metadata results are stored centrally and returned to you as industry-standard RDF constructs accompanied by a Globally Unique Identifier (GUID). Using the Calais GUID, any downstream consumer is able to retrieve this metadata via a simple call to Calais.” (link)
The semantic types it recognizes and uses in its annotations are a basic set typical of information extraction systems and include entities, facts, events and categories. See, for example, the description of the person entity type. The brief API documentation describes how to call the web services and interpret the results. As an example of the semantic metadata types supported by Calais, a preprocessed sample content set of about 350 Business and Economic news articles from WikiNews for the year 2007 is available.
The service is free for both commercial and non-commercial purposes with a limit, but a generous one, on the number of service calls a registered developer can make in a day. A sample Java application is available that reads input from STDIN, writes output to STDOUT and takes processing parameters from a configuration file.
Update: The sample application requires Java 6 to run! Here’s an example of input and the RDF output.
Making such a service freely available on the Web has the potential to be a disruptive move. Reuters will sponsor “a number of contests and bounties for applications developed using the Calais API.” An initial “bounty” of $5,000 is offered for “A highly configurable plugin for WordPress that enriches a blog with several capabilities” based on OpenCalais.
The kind of content extraction that Calais does falls considerably short of full language understanding. However, it does represent the state of the art in scalable, domain-independent information extraction, is immediately useful, and is an important step toward the ultimate goal of full NLP.
On Tuesday 22 January the agents mailing list (agents@cs.umbc.edu) will be offline between 21:00 and 23:00 UTC as we transition from Majordomo to GNU Mailman. Mail sent to the list at this time will bounce.
The agents list was begun in 1994 by Ray Johnson, then at the Lockheed Palo Alto AI Center and moved to UMBC in 1996. Majordomo represented the state of the art for mailing list software in 1996, but development stopped sometime around 2001. Moving to Mailman will make it easier for us to manage the list and let users manage a wider range of their own subscription options. The list currently has about 2000 subscribers.
If you are a subscriber to either the UMBC agents or agents-digest lists, your subscription will be transferred to the new Mailman-supported list. Subscribers to the old agents-digest list will get a daily digest of messages. Using the agents administration page you can elect to receive messages as they are sent or to get them in digest form. We’ve assigned subscribers random passwords, so you will need to recover your password before making any changes.
You can edit your Mailman configuration now, but we won’t start sending out mail using Mailman until Tuesday evening. I’ll send out an announcement via the re-hosted list when I know it’s enabled.
An address entered in the Mailman admin page must match your subscribed address exactly. If you are not sure which of your email addresses is subscribed, check the message headers to see if that reveals it. Failing that, you can try asking the old system by sending poor old majordomo@cs.umbc.edu an email message whose body contains the command “which” followed by a string you believe to be in your subscribed address. As a last resort, ask me for help (finin@cs.umbc.edu).
You can continue to send mail to the agents mailing list using the address agents at cs.umbc.edu. If the sending address is recognized as a subscriber, your message will be distributed immediately and without moderation. Otherwise, you will be notified that it awaits moderation, which might take a day or two.
In our old majordomo system, we maintained a separate list of additional pre-approved sending addresses. In general, if your sending address is not the same as your subscribed address, you should change the subscribed address. If you want to be able to send unmoderated messages from several accounts (e.g., your .edu and gmail accounts), you can always subscribe all of your accounts and disable email delivery for all but one.
Messages sent through the Mailman system will be available in an archive. The archive of old majordomo-era traffic is in disarray, but I think we have virtually all of the messages from 1994-2007. Eventually we’ll get it sorted out and online for posterity.
Our old moderation list was so inundated with spam and bounces from bad addresses that it became virtually impossible to moderate effectively. We anticipate that the new system will address both of these problems well and that we will thus be able to manage the moderation process better.
You can get more information about the list as well as manage subscriptions on the admin page and from the Mailman user guide. There are sure to be a few issues when we start using Mailman. If you have questions or suggestions about the list configuration, please let me know or send a message to the list if you think it should be of interest to the community.
January 18th, 2008, by Tim Finin, posted in Humor, AI, Agents
Guaranteeing that you can take a hot shower is NP-complete, at least in one formalization of the problem by Christina Matzke and Damien Challet in a recent paper.
Christina Matzke, Damien Challet, Taking a shower in Youth Hostels: risks and delights of heterogeneity, arXiv:0801.1573v1 , 10 January, 2008. … Tuning one’s shower in some hotels may turn into a challenging coordination game with imperfect information. The temperature sensitivity increases with the number of agents, making the problem possibly unlearnable. Because there is in practice a finite number of possible tap positions, identical agents are unlikely to reach even approximately their favorite water temperature. Heterogeneity allows some agents to reach much better temperatures, at the cost of higher risk.