SemNews: NLP system generates Semantic Web representation of news summaries

January 12th, 2006

SemNews is a prototype application being developed by UMBC Ph.D. student Akshay Java that uses a sophisticated text understanding system to interpret summaries of news stories, publishes the results on the Semantic Web, and provides browsing and query services over them. The project is a collaboration between UMBC’s Institute for Language and Information Technologies and its Ebiquity Laboratory, with partial support from the Lockheed Martin Corporation.

SemNews monitors a number of news source RSS feeds and processes new stories as they are published. After extracting a story’s metadata, its news summary is interpreted by the OntoSem text analyzer which does a syntactic, semantic, and pragmatic analysis of the text, resulting in its text meaning representation or TMR. A TMR is a language-neutral description (an interlingua) of the meaning conveyed in a natural language text. In addition to providing information about the lexical-semantic dependencies in the text, the TMR represents stylistic factors, discourse relations, speaker attitudes, and other pragmatic factors present in the discourse structure. In doing so, the TMR captures not only the meaning of individual elements in the text, but also the relations between those elements, and captures both propositional and non-propositional components of textual meaning. OntoSem’s TMRs are represented in a custom frame-based representation language and grounded in the Mikrokosmos ontology, an extensive ontology with over 30K concepts and nearly 400K entities.
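As a rough illustration of what a frame-based TMR might contain, here is a hypothetical, much-simplified Python rendering for a sentence such as “A bombing occurred in Karbala.” The concept and slot names are invented for exposition; they are not OntoSem’s actual output format.

```python
# A hypothetical, much-simplified TMR-like frame.  Concept and slot
# names are invented for illustration; they are not OntoSem's output.
tmr = {
    "BOMBING-EVENT-1": {
        "instance-of": "BOMBING",      # concept from the ontology
        "location": "CITY-1",          # lexical-semantic dependency
        "time": {"relation": "before", "relative-to": "speech-time"},
        "modality": {"type": "epistemic", "value": 1.0},  # speaker attitude
    },
    "CITY-1": {
        "instance-of": "CITY",
        "name": "Karbala",             # entity from the onomasticon
    },
}
print(tmr["BOMBING-EVENT-1"]["location"])  # CITY-1
```

Note how the representation captures not just the individual elements (a bombing, a city) but the relations between them and the non-propositional components (time, modality).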

Each story’s metadata and TMR are translated into the Semantic Web language OWL via the OntoSem2OWL translator developed for this project. The results are then added to a special collection indexed by the Swoogle search engine and also put into an RDF triple store. These are used to support several services that let people and agents semantically browse, query, and visualize the stories in the collection, providing access to information that would otherwise be hard to find using simple keyword-based search.

For example, one can browse through the story collection via the ontology to find stories that involve certain concepts, such as a terrorist organization; find all stories that involve an entity in OntoSem’s onomasticon, such as Al Qaeda or Karbala; visualize the stories on a map based on the locations they reference; or construct an arbitrary query, such as finding “stories in which the nation named Afghanistan was the location of a bombing event.” Users can also define semantic “alerts” as queries over the RDF triple store and/or the Swoogle collection. For each alert, SemNews will generate an RSS feed of the results.
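To make the kind of query above concrete, here is a minimal Python sketch of matching such a pattern against a toy in-memory triple store. The triples and predicate names are invented stand-ins, not SemNews’s actual OWL vocabulary.

```python
# Toy triple store; identifiers and predicates are invented examples.
triples = [
    ("story-17", "describes", "event-1"),
    ("event-1", "instance-of", "BOMBING"),
    ("event-1", "location", "nation-1"),
    ("nation-1", "name", "Afghanistan"),
    ("story-22", "describes", "event-2"),
    ("event-2", "instance-of", "ELECTION"),
]

def stories_with_bombing_in(place):
    """Find stories describing a BOMBING event located in `place`."""
    nations = {s for (s, p, o) in triples if p == "name" and o == place}
    bombings = {s for (s, p, o) in triples
                if p == "instance-of" and o == "BOMBING"}
    located = {s for (s, p, o) in triples
               if p == "location" and o in nations and s in bombings}
    return [s for (s, p, o) in triples if p == "describes" and o in located]

print(stories_with_bombing_in("Afghanistan"))  # ['story-17']
```

In SemNews the same pattern would be expressed as a query over the OWL/RDF triple store rather than Python comprehensions, but the joins involved are the same.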

The SemNews system is currently a research prototype that is being used to refine the underlying technologies and to explore how the sophisticated automatic linguistic processing of text can be integrated into the Semantic Web and conventional web applications. Ongoing work on SemNews includes an evaluation of its semantic recall and precision as well as a service that can group and cluster stories based on their semantic representations.

For more information

RDF molecules and lossless decompositions of RDF graphs

July 31st, 2005

Some RDF graphs can be viewed as making assertions about the world. Suppose you were given a graph, G, and asked to find supporting evidence on the web.

One approach is to search for documents with RDF graphs containing G as a sub-graph, adhering to RDF’s semantics for blank nodes and maybe applying some RDFS and OWL semantics. Even after doing that, few or maybe no RDF documents may contain *all* of G as a subgraph.

Another approach is to decompose G into its constituent triples and for each, use a Swoogle-like system to find documents containing it. But then what? The presence of blank nodes makes it difficult or impossible to assemble the support for G.

We’ve been exploring a third way using the notion of an RDF molecule. We start by computing a lossless decomposition of G into a set of subgraphs M. The decomposition is lossless in that combining M’s elements reproduces the original graph G, even if their blank nodes have been renamed apart. We can then use a Swoogle-like system to search for documents supporting each molecule in M. If we find support for every molecule, we have support for G.
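A naive version of this decomposition is easy to sketch in Python: ground triples become singleton molecules, and triples connected through shared blank nodes are grouped together. This is only the basic idea, not the full algorithm from the report cited below, which also treats subtler cases.

```python
from collections import defaultdict

def is_blank(node):
    """Blank nodes are written here as strings starting with '_:'."""
    return isinstance(node, str) and node.startswith("_:")

def decompose(graph):
    """Naive lossless decomposition of a graph (a list of (s, p, o)
    triples) into molecules, via union-find: triples sharing a blank
    node end up in the same molecule."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for i, (s, p, o) in enumerate(graph):
        for node in (s, o):
            if is_blank(node):
                union(("triple", i), ("bnode", node))

    molecules = defaultdict(list)
    for i, t in enumerate(graph):
        molecules[find(("triple", i))].append(t)
    return list(molecules.values())

g = [
    ("ex:TimFinin", "ex:affiliation", "ex:UMBC"),  # ground: own molecule
    ("_:b1", "ex:name", "Li Ding"),
    ("_:b1", "ex:coauthorOf", "_:b2"),             # links _:b1 and _:b2
    ("_:b2", "ex:name", "Yun Peng"),
]
print(sorted(len(m) for m in decompose(g)))  # [1, 3]
```

The decomposition is lossless in the sense above: the union of the molecules is exactly the original set of triples.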

We suspect that the RDF molecule concept has other potential uses. For details, see

Tracking RDF Graph Provenance using RDF Molecules, Li Ding, Tim Finin, Yun Peng, Paulo Pinheiro da Silva, and Deborah McGuinness, report TR-CS-05-06, Computer Science and Electrical Engineering, University of Maryland, Baltimore County, April 30, 2005.

The Semantic Web facilitates integrating partial knowledge and finding evidence for hypotheses from web knowledge sources. However, the appropriate level of granularity for tracking the provenance of an RDF graph remains a matter of debate. An RDF document is too coarse, since it could contain irrelevant information. An RDF triple fails when two triples share the same blank node. Therefore, this paper investigates the lossless decomposition of RDF graphs and tracking the provenance of an RDF graph using RDF molecules, the finest lossless components of an RDF graph. A sub-graph is lossless if it can be used to restore the original graph without introducing new triples. A sub-graph is finest if it cannot be further decomposed into lossless sub-graphs. The lossless decomposition algorithms and RDF molecules have been formalized and implemented in a prototype RDF graph provenance service in the Swoogle project.

Stress test your RDF triple store

June 16th, 2005

A colleague has been testing the scalability of a triple store using synthetic triples. He asked if we could package up a large collection of real triples caught in the wild by Swoogle. After talking a bit, we decided that a simple SQL database dump would be the most convenient form.

10M Triples is an SQL database dump containing a table of about 10.4M RDF triples extracted from the Swoogle cache on June 15, 2005. The compressed file is 162M; uncompressed, it is 1.7G.

Nature red in tooth and claw

May 13th, 2005

Two of our AIX boxes were compromised this week, including the machine that runs most of Swoogle’s services. So, Swoogle and a few of our other research systems will be off line until sometime next week. We’re reorganizing our systems and putting more of them behind the campus firewall, leaving only the interfaces outside the firewall. This isn’t the first time we’ve had such incidents and it won’t be the last. I’m resigned that it will just be this way until the end of time — a constant struggle between the system builders and the crackers. It’s kind of depressing, and maybe that’s why humans tend to believe in an ultimate, apocalyptic day of reckoning — Armageddon, Ragnarok, Yawmid Din, Acharit Hayami — in which Good will finally triumph over Evil. I wonder what the Internet version of this would be like — I hope it’s not a darker version, like Night of the Living Dead. Anyway, look for Swoogle to be up next week.

Finding RDF instance data with Swoogle

April 24th, 2005

Someone on the Yahoo semanticWeb mailing list asked for “a populated ontology for countries”. I thought “Ha! This is just what Swoogle is designed for — finding RDF documents”. It turned out not to be as easy as I expected, prompting us to add a new feature. You can now use Swoogle to find RDF documents instantiating a given class or property. The results are ranked by the number of instances.

So, here are two ways to find populated country ontologies. The first approach is to search for ontologies that appear to be about countries, select one, and then find documents that use it as a namespace. The second is to find classes that represent countries, select one, and find documents that instantiate it.

Searching for country ontologies. Start by finding ontologies that seem to be about countries and pick one that looks promising. This query asks Swoogle for ontologies (i.e., RDF documents that mostly *define* classes and properties) with RDF terms whose local names contain the lexemes ‘country’, ‘capital’, and ‘population’. The results are ranked by Swoogle’s ontology ranking algorithm, which takes into account how much each is used, so working down the list is a good strategy.

Let’s suppose we like the first one, which is based on the CIA Factbook. Looking at the document view, you can see a bit more about it. By entering a Swoogle namespace search, you can find all 28 documents using it as a namespace. Scanning the result summaries, you can see how many instances each defines and investigate the promising ones.

Note to self: we should add a “documents using this namespace” link to both the document view and the document result summary.

Searching for country classes. Another approach is to search by terms (i.e., classes or properties). This query asks for all classes that contain the lexeme ‘country’, ranking the results by the number of instances. Select one of the results that looks interesting, say the first. Click on the definition link to bring up a page about that term. At the top of this page there is a link ‘Documents populating this term as a class’ that, when followed, leads to a page listing documents ranked by instances of this term.

Swoogle and Swangling demonstration

April 7th, 2005

We will demonstrate Swoogle and Swangling at the 2005 Semantic Web for National Security (SWANS) conference. The concepts and features to be demonstrated are all in the Swoogle tour. You can also see the Swoogle poster and the swangling poster that we will use.

Swoogle cheat sheet

March 3rd, 2005

The Swoogle Cheat Sheet is a concise summary of Swoogle’s search syntax — i.e., what you can type into Swoogle’s search box and what it does.

Swoogle Firefox Search Plugin

March 2nd, 2005

You can now add a Swoogle search plugin for Firefox. Open this link in Firefox and it should automatically install the plugin in your browser. See here for more information.

Swoogle’s cheat sheet

February 23rd, 2005

Swoogle’s Cheat Sheet has just been added to the Swoogle website. It’s a summary of the search syntax you can use with Swoogle’s search engine, along with some other interesting services, such as the Ontology Dictionary, Swoogle Statistics, and Swoogle’s RDF Site Map. The cheat sheet can be printed on two pages, but the orientation has to be landscape.

Here are some things you may be unfamiliar with:

  • [..] in vocabulary search: search "[cat]" for all terms whose local name is exactly "cat"; search "[cat" for all terms with "cat" as a prefix of the local name, such as "category"; search "cat]" for all terms with "cat" as a suffix of the local name, such as "Domestic_cat".
  • ns: in document search: search for documents by the namespaces they use, as described in a previous post.
  • Swoogle’s RDF Site Map: search for a website and the RDF documents it hosts, with HTML and RDF output.
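As a concrete illustration of the bracket syntax, here is a small Python sketch of the matching rules above. The function is illustrative, not Swoogle’s implementation.

```python
def bracket_match(pattern, localname):
    """Match a vocabulary-search pattern against a term's local name,
    following the "[..]" rules from the cheat sheet."""
    name = localname.lower()
    if pattern.startswith("[") and pattern.endswith("]"):
        return name == pattern[1:-1]         # "[cat]" : exact match
    if pattern.startswith("["):
        return name.startswith(pattern[1:])  # "[cat"  : prefix match
    if pattern.endswith("]"):
        return name.endswith(pattern[:-1])   # "cat]"  : suffix match
    return pattern in name                   # plain substring match

print(bracket_match("[cat", "category"))      # True
print(bracket_match("cat]", "Domestic_cat"))  # True
print(bracket_match("[cat]", "category"))     # False
```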

Swoogle namespace searches

February 15th, 2005

We’ve added a new feature to Swoogle’s web interface that allows one to search for RDF documents that use a particular namespace. To use this, include a search term of the form ns:<NS> where <NS> is either a URI for the namespace or an abbreviation for one of the most common namespaces.

This example query searches for all RDF documents that use the CoBrA namespace (using the ns: qualifier with the namespace URI). A second example (i.e., pet person ns:foaf) finds RDF documents using the FOAF namespace and containing the lexemes ‘pet’ and ‘person’. (The ‘lexemes’ are word-like components in the local name part of URIs. Swoogle maintains indexes between URIs and documents and between URIs and lexemes. Lexemes are recognized by a kind of morphological analysis in which, for example, favoritePetFood is decomposed into {favorite, pet, food}.)
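This morphological decomposition of local names into lexemes can be approximated in a few lines of Python. This is a sketch of the idea, not Swoogle’s actual analyzer.

```python
import re

def lexemes(localname):
    """Split a URI local name into word-like lexemes: camelCase
    boundaries, digit runs, and punctuation separate words, and the
    result is lowercased, e.g. favoritePetFood -> {favorite, pet, food}."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", localname)
    return {p.lower() for p in parts}

print(sorted(lexemes("favoritePetFood")))  # ['favorite', 'food', 'pet']
```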

Thanks to Ryusuke Masuoka for prompting us to add this namespace search feature. The namespace abbreviations that we currently recognize are:

It’s easy to add more — so let us know if you have favorites you recommend adding.

Swoogle’s database contains much more metadata about the documents it has discovered than it exposes in its simple web interface. We are always interested in improving the interface and have found it pretty easy to add features. We are anxious to hear from users or potential users who want to do searches they don’t find possible or easy. If that’s you, please let us know by posting a comment to one of the Swoogle forums or send email to

FOAF dataset available

January 25th, 2005

We’ve published a FOAF dataset extracted from FOAF files collected during the Fall of 2004 in our work on Swoogle. The data represents 7118 FOAF documents collected from 2044 sites (identified by their symbolic IP addresses). A total of 201,612 RDF triples with provenance information are included. The FOAF files were selected from larger datasets described in several recent papers (1, 2) to represent an interesting and balanced selection of FOAF documents. This dataset is distributed under the Creative Commons Attribution (v2.0) license and packaged as a ZIP file of an SQL database export.

On finding semantic web documents

January 14th, 2005

After looking at the piece on Peter Norvig’s views on the semantic web (Semantic Web Ontologies: What Works and What Doesn’t), I realized that he’s talking about a request we made when we started developing Swoogle:

“A friend of mine just asked can I send him all the URLs on the web that have dot-RDF, dot-OWL, and a couple other extensions on them; he couldn’t find them all. I looked, and it turns out there’s only around 200,000 of them. That’s about 0.005% of the web. We’ve got a ways to go.”

We never did get any help from Google. What we did do was develop a workaround for Google’s restriction of returning at most 1000 results for any query, enabling us to more effectively use Google to find a set of initial seed URLs of semantic web documents (SWDs) to bootstrap the Swoogle crawler. Using these initial seeds, we employ a custom SWD crawler to crawl through SWDs and a custom focused crawler to dig through HTML documents and directories. Using Swoogle, we have found on the order of two million SWDs (RDF files in XML or N3) publicly accessible on the web.

The hack we employ is to use Google’s ‘site:’ qualifier to narrow the search. So we query on “filetype:owl” and get back 1000 results drawn from many different sites. After filtering out the non-OWL documents, we extract a list of the sites from which the valid ones came. For example, if is in the initial result set, we note that ‘’ is a site that had at least one OWL file. For each new site S we encounter, we give Google the narrower query ‘filetype:owl site:S’, which will often end up getting some additional results not included in the earlier query.

For Google, a site qualifier specifies a suffix of the server’s symbolic address, so a simple refinement generates other potential site specifiers, e.g., if we find an OWL file at ‘’, we can generate the other sites (‘’, ‘edu’) and add them to the potential site table for querying. So, an important part of Swoogle’s database is the list of sites where we’ve found at least one SWD. Swoogle maintains a list of the top 500 sites from which we’ve extracted the most SWDs.
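The suffix-generation step above can be sketched in a few lines of Python. The hostname used here is a hypothetical example, not one taken from Swoogle’s site table.

```python
def suffix_sites(hostname):
    """All site qualifiers implied by a hostname: Google's 'site:'
    qualifier matches any suffix of the symbolic address."""
    labels = hostname.split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]

# A hypothetical host where an OWL file was found:
host = "ebiquity.umbc.edu"
queries = ["filetype:owl site:" + s for s in suffix_sites(host)]
print(queries[0])   # filetype:owl site:ebiquity.umbc.edu
print(queries[-1])  # filetype:owl site:edu
```

Each generated site then yields a narrower ‘filetype:owl site:S’ query, from the most specific host down to the top-level domain.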

There are many wrinkles to this process. For example, not every SWD uses a file suffix that indicates or even suggests its type. Swoogle can also produce a current analysis of the distribution of its documents by suffix: the second most common suffix is no suffix at all, and the fourth is ‘.xml’. And of course, some suffixes, like ‘.rss’, only imply that the file might be an RDF file.

While Google will only give you at most 1000 results for a query, it tries to be helpful by estimating the total number of results it could return. (Or is it taunting us?) We could use this information to tell Swoogle’s focused web crawler how much effort to spend rooting around in a site looking for SWDs. Currently, Swoogle’s focused crawler searches to a fixed depth and does not use this information.

As of this writing, I’d guess there are at least two million SWDs accessible on the web. Most of these are FOAF or RSS documents. In order to keep Swoogle’s collection more interesting and representative, we’ve limited the number of documents we collect from any given site, so it purposely ignores many FOAF documents it discovers. We have developed specialized datasets with many of these ignored SWDs. Currently Swoogle has about 340K SWDs indexed.

Note that we have a pretty narrow definition of a semantic web document — an RDF document encoded in XML or N3. There are lots of other uses of RDF content: embedded RDF in HTML documents, in other document types (e.g., PDF, JPG), in databases, etc. I think it’s hard to predict what the most important use cases will be for semantic web technologies.