UMBC ebiquity research group Building intelligent systems in open, heterogeneous, dynamic, distributed environments

Archive for the 'Swoogle' Category

RDF molecules and lossless decompositions of RDF graphs

July 31st, 2005, by Tim Finin, posted in Semantic Web, Swoogle

Some RDF graphs can be viewed as making assertions about the world. Suppose you were given a graph, G, and asked to find supporting evidence on the web.

One approach is to search for documents with RDF graphs containing G as a sub-graph, adhering to RDF’s semantics for blank nodes and maybe applying some RDFS and OWL semantics. Even after doing that, few or maybe no RDF documents may contain *all* of G as a subgraph.

Another approach is to decompose G into its constituent triples and for each, use a Swoogle-like system to find documents containing it. But then what? The presence of blank nodes makes it difficult or impossible to assemble the support for G.

We’ve been exploring a third way using the notion of an RDF molecule. We start by computing a lossless decomposition of G into a set of subgraphs M. The decomposition is lossless in that combining M’s elements produces the original graph G, even if their blank nodes have been renamed apart. We can then use a Swoogle-like system to search for documents supporting each molecule in M. If we find support for every molecule, we have support for G.
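To make the idea concrete, here is a minimal sketch of one simple decomposition: triples that share a blank node stay together, and every triple without blank nodes becomes a molecule on its own. This is only an illustration and not necessarily the algorithm from the report. The triple representation (plain tuples) and the `_:` prefix convention for blank nodes are assumptions of the sketch.

```python
from collections import defaultdict


def is_blank(term):
    """Treat any term written as '_:name' as a blank node (an assumption of this sketch)."""
    return isinstance(term, str) and term.startswith("_:")


def decompose(triples):
    """Group triples that are connected through shared blank nodes."""
    triples = list(triples)
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Two blank nodes appearing in the same triple must end up in the same molecule.
    for s, p, o in triples:
        blanks = [t for t in (s, o) if is_blank(t)]
        for b in blanks:
            find(b)
        if len(blanks) == 2:
            parent[find(blanks[0])] = find(blanks[1])

    molecules = []
    groups = defaultdict(set)
    for triple in triples:
        s, p, o = triple
        blanks = [t for t in (s, o) if is_blank(t)]
        if blanks:
            groups[find(blanks[0])].add(triple)
        else:
            molecules.append({triple})  # no blank nodes: a molecule by itself
    molecules.extend(groups.values())
    return molecules
```

Because no blank node is split across two molecules, taking the union of the molecules gives back the original set of triples, which is the sense in which this grouping is lossless.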

We suspect that the RDF molecule concept has other potential uses. For details, see

Tracking RDF Graph Provenance using RDF Molecules, Li Ding, Tim Finin, Yun Peng, Paulo Pinheiro da Silva, and Deborah McGuinness, report TR-CS-05-06, Computer Science and Electrical Engineering, University of Maryland, Baltimore County, April 30, 2005.

The Semantic Web facilitates integrating partial knowledge and finding evidence for hypothesis from web knowledge sources. However, the appropriate level of granularity for tracking provenance of RDF graph remains in debate. RDF document is too coarse since it could contain irrelevant information. RDF triple will fail when two triples share the same blank node. Therefore, this paper investigates lossless decomposition of RDF graph and tracking the provenance of RDF graph using RDF molecule, which is the finest and lossless component of an RDF graph. A sub-graph is lossless if it can be used to restore the original graph without introducing new triples. A sub-graph is finest if it cannot be further decomposed into lossless sub-graphs. The lossless decomposition algorithms and RDF molecule have been formalized and implemented by a prototype RDF graph provenance service in Swoogle project.

Stress test your RDF triple store

June 16th, 2005, by Tim Finin, posted in Semantic Web, Swoogle, Web

A colleague has been testing the scalability of a triple store using synthetic triples. He asked if we could package up a large collection of real triples caught in the wild by Swoogle. After talking a bit, we decided that a simple SQL database dump would be the most convenient form.

10M Triples is an SQL database dump containing a table of about 10.4M RDF triples extracted from the Swoogle cache on June 15, 2005. The compressed file is 162M, and it is 1.7G when uncompressed.

Nature red in tooth and claw

May 13th, 2005, by Tim Finin, posted in Ebiquity, Semantic Web, Swoogle, Web

Two of our AIX boxes were compromised this week, including the machine that runs most of Swoogle’s services. So, Swoogle and a few of our other research systems will be off line until sometime next week. We’re reorganizing our systems and putting more of them behind the campus firewall, leaving only the interfaces outside the firewall. This isn’t the first time we’ve had such incidents and it won’t be the last. I’m resigned that it will just be this way until the end of time — a constant struggle between the system builders and the crackers. It’s kind of depressing, and maybe that’s why humans tend to believe in an ultimate, apocalyptic day of reckoning — Armageddon, Ragnarok, Yawmid Din, Acharit Hayami — in which Good will finally triumph over Evil. I wonder what the Internet version of this would be like — I hope it’s not a darker version, like Night of the Living Dead. Anyway, look for Swoogle to be up next week.

Finding RDF instance data with Swoogle

April 24th, 2005, by Tim Finin, posted in Semantic Web, Swoogle, Web

Someone on the yahoo semanticWeb mailing list asked for “a populated ontology for countries”. I thought, “Ha! This is just what Swoogle is designed for: finding RDF documents.” It turned out not to be as easy as I expected, prompting us to add a new feature. You can now use Swoogle to find RDF documents instantiating a given class or property. The results are ranked by the number of instances.

So, here are two ways to find populated country ontologies. The first approach is to search for ontologies that appear to be about countries, select one, and then find documents that use it as a namespace. The second is to find classes that represent countries, select one, and then find documents that instantiate it.

Searching for country ontologies. Start by finding ontologies that seem to be about countries and pick one that looks promising. This query asks Swoogle for ontologies (i.e., RDF documents that mostly *define* classes and properties) with RDF terms whose local names contain the lexemes ‘country’ and ‘capital’ and ‘population’. The results are ranked by Swoogle’s ontology ranking algorithm, which takes into account how much each is used, so working down the list is a good strategy.

Let’s suppose we like the first one, which is based on the CIA factbook. Looking at the document view, you can see a bit more about it. By entering a Swoogle namespace search, you can find all 28 documents using it as a namespace. Scanning the result summaries, you can see how many instances each defines and investigate the promising ones.

Note to self: we should add a “documents using this namespace” link to both the document view and the document result summary.

Searching for country classes. Another approach is to search by terms (i.e., classes or properties). This query asks for all classes that contain the lexeme ‘country’, ranking the results by the number of instances. Select one of the results that looks interesting, say the first. Click on the definition link to bring up a page about that term. At the top of this page there is a link ‘Documents populating this term as a class’ that, when followed, leads to a page listing documents ranked by instances of this term.

Swoogle and Swangling demonstration

April 7th, 2005, by Tim Finin, posted in Ontologies, Semantic Web, Swoogle, Web

We will demonstrate Swoogle and Swangling at the 2005 Semantic Web for National Security (SWANS) conference. The concepts and features to be demonstrated are all in the Swoogle tour. You can also see the Swoogle poster and the swangling poster that we will use.

Swoogle cheat sheet

March 3rd, 2005, by Tim Finin, posted in Semantic Web, Swoogle

The Swoogle Cheat Sheet is a concise summary of Swoogle’s search syntax, i.e., what you can type into Swoogle’s search box and what it does.

Swoogle Firefox Search Plugin

March 2nd, 2005, by Akshay Java, posted in Swoogle, Web

You can now add a Swoogle search plugin for Firefox. Open this link in Firefox and it should automatically install the plugin in your browser. See here for more information.

Swoogle’s cheat sheet

February 23rd, 2005, by panrong, posted in Swoogle

Swoogle’s Cheat Sheet has just been added to our Swoogle website. It lists the syntax you can use with Swoogle’s search engine, along with some other interesting services, such as the Ontology Dictionary, Swoogle Statistics, and Swoogle’s RDF Site Map. The cheat sheet prints on two pages, but only in landscape orientation.

Here are some things that you may be unfamiliar with:

  • [..] in vocabulary search: Search "[cat]" for all terms of "cat"; search "[cat" for all terms with "cat" as a prefix in the local name, such as "category". See the previous description for this.
  • Swoogle’s RDF Site Map: Search for a website and the RDF documents it hosts, with HTML and RDF output.

    Swoogle namespace searches

    February 15th, 2005, by Tim Finin, posted in Semantic Web, Swoogle

    We’ve added a new feature to Swoogle’s web interface that allows one to search for RDF documents that use a particular namespace. To use this, include a search term of the form ns:<NS> where <NS> is either a URI for the namespace or an abbreviation for one of the most common namespaces.

    This example query searches for all RDF documents that use the cobra namespace (ns:http://daml.umbc.edu/ontologies/cobra/0.4/). A second example (i.e. pet person ns:foaf) finds RDF documents using the FOAF namespace and containing the lexemes ‘pet’ and ‘person’. (The ‘lexemes’ are word-like components in the local name part of URIs. Swoogle maintains indexes between URIs and documents and between URIs and lexemes. Lexemes are recognized by a kind of morphological analysis in which, for example, favoritePetFood is decomposed into {favorite, pet, food}).
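As a rough illustration of the kind of decomposition described above, the following sketch splits a local name on case changes, digits, and punctuation. It is an approximation written for this post, not Swoogle’s actual analysis. The function names `local_name` and `lexemes` are made up for the example.

```python
import re


def local_name(uri):
    """Return the part of a URI after the last '#' or '/'."""
    return re.split(r"[#/]", uri)[-1]


def lexemes(name):
    """Split a local name such as 'favoritePetFood' into lowercase word-like parts."""
    # Alternatives, in order: an acronym run not followed by a lowercase letter,
    # a capitalized or lowercase word, or a run of digits.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]
```

For example, `lexemes("favoritePetFood")` yields `['favorite', 'pet', 'food']`, matching the decomposition given above.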

    Thanks to Ryusuke Masuoka for prompting us to add this namespace search feature. The namespace abbreviations that we currently recognize are:

    rdf http://www.w3.org/1999/02/22-rdf-syntax-ns
    dc http://purl.org/dc/elements/1.1
    rss1 http://purl.org/rss/1.0
    mvcb http://webns.net/mvcb
    rdfs http://www.w3.org/2000/01/rdf-schema
    foaf http://xmlns.com/foaf/0.1
    dcterms http://purl.org/dc/terms
    dctype http://purl.org/dc/dcmitype
    owl http://www.w3.org/2002/07/owl
    daml http://www.daml.org/2001/03/daml+oil

    It’s easy to add more — so let us know if you have favorites you recommend adding.
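For readers who want to build such queries programmatically, here is a small sketch of how a client might separate the `ns:` terms from the ordinary lexemes and expand the abbreviations listed above. The function `parse_query` is hypothetical and says nothing about how Swoogle itself parses queries. The table simply copies the abbreviations from this post.

```python
# Abbreviations copied from the list above.
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns",
    "dc": "http://purl.org/dc/elements/1.1",
    "rss1": "http://purl.org/rss/1.0",
    "mvcb": "http://webns.net/mvcb",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema",
    "foaf": "http://xmlns.com/foaf/0.1",
    "dcterms": "http://purl.org/dc/terms",
    "dctype": "http://purl.org/dc/dcmitype",
    "owl": "http://www.w3.org/2002/07/owl",
    "daml": "http://www.daml.org/2001/03/daml+oil",
}


def parse_query(query):
    """Split a query into (namespace URIs, plain search words)."""
    namespaces, words = [], []
    for term in query.split():
        if term.startswith("ns:"):
            value = term[len("ns:"):]
            # Known abbreviations are expanded; anything else is taken as a URI.
            namespaces.append(NAMESPACES.get(value, value))
        else:
            words.append(term)
    return namespaces, words
```

With this sketch, `parse_query("pet person ns:foaf")` separates the FOAF namespace URI from the words `pet` and `person`.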

    Swoogle’s database contains much more metadata about the documents it’s discovered than it exposes in its simple web interface. We are always interested in improving the interface and have found it pretty easy to add features. We are anxious to hear from users or potential users who want to do searches they don’t find possible or easy. If that’s you, please let us know by posting a comment to one of the Swoogle forums or sending email to swoogle-developer@cs.umbc.edu.

    FOAF dataset available

    January 25th, 2005, by Tim Finin, posted in Semantic Web, Swoogle

    We’ve published a foaf dataset extracted from FOAF files collected during the Fall of 2004 from our work on Swoogle. The data represents 7118 foaf documents collected from 2044 sites (identified by their symbolic IP address). A total of 201,612 RDF triples with provenance information are included. The foaf files were selected from larger datasets described in several recent papers (1, 2) to represent an interesting and balanced selection of foaf documents. This dataset is distributed under the Creative Commons Attribution (v2.0) license and packaged as a ZIP file of a SQL database export.

    On finding semantic web documents

    January 14th, 2005, by Tim Finin, posted in Semantic Web, Swoogle

    After looking at the piece on Peter Norvig’s views on the semantic web (Semantic Web Ontologies: What Works and What Doesn’t), I realized that he’s talking about a request we made when we started developing Swoogle:

    “A friend of mine just asked can I send him all the URLs on the web that have dot-RDF, dot-OWL, and a couple other extensions on them; he couldn’t find them all. I looked, and it turns out there’s only around 200,000 of them. That’s about 0.005% of the web. We’ve got a ways to go.”

    We never did get any help from Google. What we did do was develop a workaround for Google’s restriction of returning only 1000 results for any query, enabling us to use Google more effectively to find an initial set of seed URLs of semantic web documents (SWDs) to bootstrap the Swoogle crawler. Using these initial seeds, we employ a custom SWD crawler to crawl through SWDs and a custom focused crawler to dig through HTML documents and directories. Using Swoogle, we have found on the order of two million SWDs (RDF files in XML or N3) publicly accessible on the web.

    The hack we employ is to use Google’s ‘site:’ qualifier to narrow the search. So we query on “filetype:owl” and get back 1000 results drawn from many different sites. After filtering out the non-OWL documents, we extract a list of the sites from which the valid ones came. For example, if http://ebiquity.umbc.edu/ontologies/event.owl is in the initial result set, we note that ‘ebiquity.umbc.edu’ is a site that had at least one OWL file. For each new site S we encounter, we give Google the narrower query ‘filetype:owl site:S’, which will often end up getting some additional results not included in the earlier query.
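The bookkeeping behind this hack can be sketched in a few lines. The code below only builds the narrower query strings from a list of result URLs and does not call any search API. The helper names are invented for the example, and the file-suffix check is a crude stand-in for actually validating that a result is an OWL document.

```python
from urllib.parse import urlparse


def sites_from_results(urls, suffix=".owl"):
    """Collect the host names of result URLs that look like OWL files.

    Checking the path suffix is only a placeholder for real validation,
    which would need to fetch and parse each document.
    """
    sites = set()
    for url in urls:
        parsed = urlparse(url)
        if parsed.hostname and parsed.path.lower().endswith(suffix):
            sites.add(parsed.hostname)
    return sites


def narrowed_queries(sites, filetype="owl"):
    """Build one 'filetype:... site:...' query per site."""
    return [f"filetype:{filetype} site:{site}" for site in sorted(sites)]
```

Given a result set containing http://ebiquity.umbc.edu/ontologies/event.owl, this produces the follow-up query `filetype:owl site:ebiquity.umbc.edu`.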

    For Google, a site qualifier specifies a suffix of the server’s symbolic address, so a simple refinement generates other potential site specifiers, e.g., if we find an OWL file at ‘ebiquity.umbc.edu’, we can generate the other sites (‘umbc.edu’, ‘edu’) and add them to the potential site table for querying. So, an important part of Swoogle’s database is the list of sites where we’ve found at least one SWD. Swoogle maintains a list of the top 500 sites from which we’ve extracted the most SWDs.
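Generating the candidate site specifiers from a host name is a one-liner. The function name `site_suffixes` is made up for this illustration.

```python
def site_suffixes(host):
    """Return the host name and each of its dot-separated suffixes, longest first."""
    labels = host.lower().strip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


# The example from the text: the host itself, then 'umbc.edu', then 'edu'.
candidates = site_suffixes("ebiquity.umbc.edu")
```

Each entry in `candidates` can then be added to the table of potential sites if it is not already there.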

    There are many wrinkles to this process. For example, not every SWD uses a file suffix that indicates or even suggests its type. Swoogle can also produce a current analysis of the distribution of Swoogle’s documents by suffix. The second most common suffix is nothing and the fourth is ‘.xml’. And of course, some suffixes, like ‘.rss’, only imply that the file might be an RDF file.
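A suffix analysis of this kind is easy to approximate for any list of URLs. The sketch below is not Swoogle’s own report, just a simple tally using the Python standard library. The function name is hypothetical, and documents with no suffix are counted under the empty string.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def suffix_distribution(urls):
    """Count URLs by the file suffix of their path, most common first."""
    counts = Counter()
    for url in urls:
        path = urlparse(url).path
        # PurePosixPath.suffix is '' when the last path segment has no dot.
        counts[PurePosixPath(path).suffix.lower()] += 1
    return counts.most_common()
```

Note that a suffix only describes how a URL is spelled, so a tally like this says nothing about whether a document is actually RDF.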

    While Google will give you at most 1000 results for a query, it tries to be helpful by estimating the total number of results it could return. (Or is it taunting us?) We could use this information to inform Swoogle’s focused web crawler about how much effort to spend in rooting around in a site looking for SWDs. Currently, Swoogle’s focused crawler searches to a fixed depth and does not use this information.

    As of this writing, I’d guess there are at least two million SWDs accessible on the web. Most of these are FOAF or RSS documents. In order to keep Swoogle’s collection more interesting and representative, we’ve limited the number of documents we collect from any given site, so it purposely ignores many FOAF documents it discovers. We have developed specialized datasets with many of these ignored SWDs. Currently Swoogle has about 340K SWDs indexed.

    Note that we have a pretty narrow definition of a semantic web document — an RDF document encoded in XML or N3. There are lots of other uses of RDF content: embedded RDF in HTML documents, in other document types (e.g., PDF, JPG), in databases, etc. I think it’s hard to predict what the most important use cases will be for semantic web technologies.

    Swoogle is dead, long live Swoogle

    December 9th, 2004, by Tim Finin, posted in GENERAL, Swoogle

    Swoogle is off line for a day or two. We discovered that our Swoogle server (pear) was compromised. (Yet another PHP-Nuke vulnerability.) We have a plan to bring it back up in a more secure configuration with the database behind a firewall and only the web interface exposed to the elements. A consequence is that only the official URL, HTTP://SWOOGLE.UMBC.EDU/, will work. It’s a jungle out there.
