UMBC ebiquity research group Building intelligent systems in open, heterogeneous, dynamic, distributed environments
16 May 2008, 23:44:58 EDT  

Archive for the 'Swoogle' Category

Joel Sachs on Linked Data, 10:30am Oct 1, ITE 325b

September 27th, 2007, by Tim Finin, posted in RDF, Swoogle, Web, Semantic Web

I thought I would start blogging about our weekly ebiquity meetings, at least for those that might be of interest to people outside of our group. Our meetings are, in general, open and we are happy to have visitors, with or without warning. We meet on Monday mornings, from 10:30 to 11:30 or Noon, depending on the topic, in our department’s large conference room (325b ITE Building). This coming week (October 1) Joel Sachs will give us a tutorial on Linked Data. Here’s his abstract.

Linked Data refers to a collection of best practices for publishing data on the semantic web. It is also, in part, a re-branding of the semantic web itself, with less emphasis on semantics, and more on RDF linkages amongst data sources. Also heavily emphasized is the proper role of web architecture (http requests and responses; 303 redirects; etc.), and the distinction between information resources (those that physically reside on the web), and non-information resources (those that exist in the so-called real world). I’ll give a brief overview of Linked Data, followed by a discussion of some issues that Linked Data raises for the SPIRE project. These issues include how Swoogle should handle information sources such as DBpedia, and how to link ETHAN to other sources of taxonomic and natural history information.
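As a rough illustration of the 303 convention mentioned in the abstract, here is a minimal Python sketch. It assumes the common Linked Data reading of HTTP status codes: a 2xx answer means the URI names an information resource, while a 303 points to a separate document that describes the thing, which may then be a non-information resource. The function name and the example.org URIs are hypothetical and not taken from the talk.

```python
def interpret_response(status, location=None):
    """Interpret the answer to an HTTP GET on a URI, Linked Data style.

    Returns a (kind, description_url) pair. This is a sketch of the
    convention only; it performs no network access.
    """
    if 200 <= status < 300:
        # The server returned a representation, so the URI identifies
        # a document on the web (an information resource).
        return ("information resource", None)
    if status == 303:
        # "See Other": the URI may name a real-world thing; the Location
        # header says where a description of it can be found.
        return ("possibly non-information resource", location)
    # Any other status tells us nothing about the kind of resource.
    return ("unknown", None)


# Hypothetical exchange: a URI for a person redirects to a document about her.
kind, description = interpret_response(303, "http://example.org/doc/alice")
print(kind, description)
```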

Swoogle: over 1,000,000 Semantic Web documents

February 6th, 2006, by Tim Finin, posted in OWL, RDF, Swoogle, Web, Semantic Web

Sometime today the UMBC Swoogle Semantic Web search engine discovered and indexed its millionth document. Of these, about 77% are valid RDF documents, 15% are HTML documents with embedded RDF, and 8% appear to be RDF documents but cannot be parsed.

Is my document indexed by Swoogle?

February 6th, 2006, by li ding, posted in Swoogle, Ontologies, AI, Web, Semantic Web

“Swoogle has indexed millions of Semantic Web Documents, but how do I know that mine has been indexed?” Here is a simple way: try your URL with the Swoogle Track Back Service. Here are several examples that show how it works:

  • It helps us track the evolution of an ontology - say the protégé ontology
  • http://protege.stanford.edu/plugins/owl/protege
    ——————————————————————————–
    About this URL
    The latest ping on [2006-01-29] shows its status is [Succeed, changed into SWD].
    Its latest cached original snapshot is [2006-01-29 (3373 bytes)]
    Its latest cached NTriples snapshot is [2006-01-29 (41 triples)].
    ——————————————————————————–
    We have found 7 cached versions.
    2006-01-29: Original Snapshot (3373 bytes), NTriples Snapshot (41 triples)
    2005-08-25: Original Snapshot (3373 bytes), NTriples Snapshot (41 triples)
    2005-07-16: Original Snapshot (2439 bytes), NTriples Snapshot (35 triples)
    2005-05-20: Original Snapshot (2173 bytes), NTriples Snapshot (30 triples)
    2005-04-10: Original Snapshot (1909 bytes), NTriples Snapshot (28 triples)
    2005-02-25: Original Snapshot (1869 bytes), NTriples Snapshot (27 triples)
    2005-01-24: Original Snapshot, NTriples Snapshot (31 triples)

  • We may also check the growth of FOAF documents.
  • http://www.csee.umbc.edu/~dingli1/foaf.rdf
    ——————————————————————————–
    About this URL
    The latest ping on [2006-01-29] shows its status is [Succeed, changed into SWD].
    Its latest cached original snapshot is [2006-01-29 (6072 bytes)]
    Its latest cached NTriples snapshot is [2006-01-29 (98 triples)].
    ——————————————————————————–
    We have found 6 cached versions.
    2006-01-29: Original Snapshot (6072 bytes), NTriples Snapshot (98 triples)
    2005-07-16: Original Snapshot (6072 bytes), NTriples Snapshot (98 triples)
    2005-06-19: Original Snapshot (5053 bytes), NTriples Snapshot (80 triples)
    2005-04-17: Original Snapshot (3142 bytes), NTriples Snapshot (50 triples)
    2005-04-01: Original Snapshot (1761 bytes), NTriples Snapshot (29 triples)
    2005-01-24: Original Snapshot, NTriples Snapshot (29 triples)

  • Finally, this service may also help us learn the life cycle of a semantic web document: it was created, actively maintained, lingered around for a while and finally died (i.e. went offline).
  • http://simile.mit.edu/repository/fresnel/style.rdfs.n3
    ——————————————————————————–
    About this URL
    The latest ping on [2006-02-02] shows its status is [Failed, http code is not 200 (or406)].
    Its latest cached original snapshot is [2005-03-09 (15809 bytes)]
    Its latest cached NTriples snapshot is [2005-03-09 (149 triples)].
    ——————————————————————————–
    We have found 3 cached versions.
    2005-03-09: Original Snapshot (15809 bytes), NTriples Snapshot (149 triples)
    2005-02-25: Original Snapshot (12043 bytes), NTriples Snapshot (149 triples)
    2005-01-26: Original Snapshot, NTriples Snapshot (145 triples)
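The version listings above are regular enough to process by script. The following Python sketch parses the Protégé ontology listing and reports how the triple count changed between snapshots. The regular expression is inferred from the sample output shown here, not from any documented format, so treat it as an assumption.

```python
import re

# Lines copied from the track back output for the Protégé ontology above.
LISTING = """\
2006-01-29: Original Snapshot (3373 bytes), NTriples Snapshot (41 triples)
2005-08-25: Original Snapshot (3373 bytes), NTriples Snapshot (41 triples)
2005-07-16: Original Snapshot (2439 bytes), NTriples Snapshot (35 triples)
2005-05-20: Original Snapshot (2173 bytes), NTriples Snapshot (30 triples)
2005-04-10: Original Snapshot (1909 bytes), NTriples Snapshot (28 triples)
2005-02-25: Original Snapshot (1869 bytes), NTriples Snapshot (27 triples)
2005-01-24: Original Snapshot, NTriples Snapshot (31 triples)
"""

# The byte count is optional because the oldest entries omit it.
LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}): Original Snapshot"
    r"(?: \((?P<bytes>\d+) bytes\))?"
    r", NTriples Snapshot \((?P<triples>\d+) triples\)"
)


def parse_listing(text):
    """Return (date, bytes or None, triples) tuples, oldest first."""
    versions = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            size = int(m["bytes"]) if m["bytes"] else None
            versions.append((m["date"], size, int(m["triples"])))
    # ISO dates sort correctly as plain strings.
    return sorted(versions)


versions = parse_listing(LISTING)
# Change in triple count from each snapshot to the next one.
changes = [(b[0], b[2] - a[2]) for a, b in zip(versions, versions[1:])]
for date, delta in changes:
    print(date, f"{delta:+d} triples")
```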

NOTICE: Yesterday we posted a form that directs you to the Swoogle trackback service. Unfortunately, the form failed when it was called from outside our firewall because a Swoogle API key is required. We didn’t notice at first, because we were inside the firewall when we tested it. When we did, we deleted the post, but PlanetRDF had already picked up the post and it was still in our database. Now the form has been removed, but you can still go to the Swoogle web site and try the trackback service there.

Well, Is my document indexed by Swoogle or not?!

February 6th, 2006, by Tim Finin, posted in RDF, Swoogle, Semantic Web

Yesterday we posted directions on how to tell if your Semantic Web document is in Swoogle’s database. Unfortunately, our directions suggested using a service that, if called from outside our firewall, requires a Swoogle API key. (This is separate from being a registered Swoogle user.) We didn’t notice at first, because we were inside the firewall when we tested it. When we did, we deleted the post, but PlanetRDF had already picked up the post and it was still in our database. We’re working to straighten this out and hope to have the service available soon.

Half of Swoogle’s hits are from referer log spammers

February 4th, 2006, by Tim Finin, posted in splog, Swoogle, Blogging, Web, Semantic Web

We are using bbclone to generate reports on Swoogle access. Look at today’s top 10 referers as of 3:00pm:

  www.legaladvocate.net  246     26.14%
  www.myjavaserver.com   152     16.15%
  www.google.com         125     13.28%
  dannyayers.com         44      4.68%
  lucky7.to              34      3.61%
  ebiquity.umbc.edu      25      2.66%
  www.google.de          18      1.91%
  planetrdf.com          18      1.91%
  mail.google.com        18      1.91%
  groups.google.com      14      1.49%

One and five are clearly spam sites and two is suspicious, too. The first, for example, appears to be about poker, though the site name is legaladvocate. The site’s text is obviously automatically generated nonsense. All of the links point to subpages in the same domain with a similar structure and content. I assume that once the site achieves a high PageRank, it will be repurposed or sold.

So, it seems like nearly 50% of our hits are due to referer log spamming. I’d guess Swoogle was picked by finding its URL on recent posts found on a blog search engine or a ping server.
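The "nearly 50%" figure can be checked from the table. This small Python sketch adds up the hits and percentages for the three referers flagged above (numbers one, two and five); the variable names are just for illustration.

```python
# (hits, percent) per referer, copied from the bbclone report above.
REFERERS = {
    "www.legaladvocate.net": (246, 26.14),
    "www.myjavaserver.com": (152, 16.15),
    "www.google.com": (125, 13.28),
    "dannyayers.com": (44, 4.68),
    "lucky7.to": (34, 3.61),
    "ebiquity.umbc.edu": (25, 2.66),
    "www.google.de": (18, 1.91),
    "planetrdf.com": (18, 1.91),
    "mail.google.com": (18, 1.91),
    "groups.google.com": (14, 1.49),
}

# The two clear spam sites plus the suspicious one.
SUSPECTED_SPAM = ["www.legaladvocate.net", "www.myjavaserver.com", "lucky7.to"]

spam_hits = sum(REFERERS[site][0] for site in SUSPECTED_SPAM)
spam_share = sum(REFERERS[site][1] for site in SUSPECTED_SPAM)
print(f"{spam_hits} hits, {spam_share:.2f}% of all hits")
```

The three sites account for 432 hits, or about 45.9% of the day's total.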

Finding FOAF instances

February 4th, 2006, by li ding, posted in RDF, Swoogle, Web, Semantic Web, GENERAL

FOAF is a well-known Semantic Web practice, and we know that there are millions of FOAF instances on the Web. A scutter can help us recursively find FOAF documents online by following hyperlinks in FOAF documents; however, how to obtain the initial seeds is still a big issue.
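To make the seed problem concrete, here is a minimal Python sketch of a scutter as a breadth-first traversal. It works on an in-memory map of links instead of fetching documents, and all of the example.org URLs and the `scutter` function itself are hypothetical. A real scutter would fetch each document and extract its links (for example rdfs:seeAlso statements).

```python
from collections import deque


def scutter(seeds, links, limit=1000):
    """Breadth-first walk over FOAF documents, starting from seed URLs.

    `links` maps a document URL to the URLs it points to. Returns the
    documents visited, in the order they were discovered.
    """
    seen = set()
    queue = deque(seeds)
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        if url in seen:
            continue  # already visited via another path
        seen.add(url)
        visited.append(url)
        queue.extend(links.get(url, []))
    return visited


# Hypothetical link structure: erin's document is not linked from anywhere.
LINKS = {
    "http://example.org/alice/foaf.rdf": [
        "http://example.org/bob/foaf.rdf",
        "http://example.org/carol/foaf.rdf",
    ],
    "http://example.org/bob/foaf.rdf": [
        "http://example.org/alice/foaf.rdf",
        "http://example.org/dave/foaf.rdf",
    ],
    "http://example.org/erin/foaf.rdf": [],
}

found = scutter(["http://example.org/alice/foaf.rdf"], LINKS)
print(found)
```

Starting from alice's document, the walk reaches bob, carol and dave but never erin, which is why the choice of seeds matters.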

In addition, many Semantic Web users would like to find out the population of an ontology, e.g. the instances of a defined class such as foaf:Person, or where foaf:email has been populated as a predicate.

Therefore, Swoogle provides an interface for finding instances of a class such as foaf:Person.

This Swoogle query searches for usages of a Semantic Web term, foaf:Person.

Meaning?
Its result consists of six exclusive categories:

  • definesClass: the term has been defined as a class in the ontology
  • definesProperty: the term has been defined as a property in the ontology
  • populatesClass: there is a class-instance of that term in the document
  • populatesProperty: the term has been used as a predicate (i.e. populated) in the document
  • usesClass: the term has been used (neither defined nor populated) as a class in a document, e.g. when an ontology asserts myns:Person rdfs:subClassOf foaf:Person.
  • usesProperty: the term has been used as a property

Note that a document might have multiple usage relations with a term, e.g. a document may define a term as a class, use it to define other classes and properties, and populate its class-instances.
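The six categories can be illustrated with a small Python sketch that inspects the triples of one document. The detection rules below are my own simplified guesses (for instance, which rdf:type objects count as defining a class), not Swoogle's actual implementation, and triples are represented as plain tuples of prefixed names to keep the example self-contained.

```python
RDF_TYPE = "rdf:type"
# Assumed meta-classes; a real system would cover more of RDFS and OWL.
CLASS_TYPES = {"rdfs:Class", "owl:Class"}
PROPERTY_TYPES = {"rdf:Property", "owl:ObjectProperty", "owl:DatatypeProperty"}
# Predicates whose object is expected to be a class or a property.
CLASS_CONTEXT = {"rdfs:subClassOf", "rdfs:domain", "rdfs:range"}
PROPERTY_CONTEXT = {"rdfs:subPropertyOf"}


def usage(term, triples):
    """Return the set of usage relations a document has with `term`."""
    found = set()
    for s, p, o in triples:
        if s == term and p == RDF_TYPE and o in CLASS_TYPES:
            found.add("definesClass")
        elif s == term and p == RDF_TYPE and o in PROPERTY_TYPES:
            found.add("definesProperty")
        elif p == RDF_TYPE and o == term:
            found.add("populatesClass")       # an instance of the class
        elif p == term:
            found.add("populatesProperty")    # the term used as a predicate
        elif o == term and p in CLASS_CONTEXT:
            found.add("usesClass")            # referenced while defining others
        elif o == term and p in PROPERTY_CONTEXT:
            found.add("usesProperty")
    return found


# A hypothetical document that both extends and populates foaf:Person.
doc = [
    ("myns:Person", "rdfs:subClassOf", "foaf:Person"),
    ("_:a", "rdf:type", "foaf:Person"),
    ("_:a", "foaf:name", "Alice"),
]
print(sorted(usage("foaf:Person", doc)))
print(sorted(usage("foaf:name", doc)))
```

Each triple falls into exactly one category, but a document as a whole can have several relations with the same term, as the example document shows for foaf:Person.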

How to get there
To reach that page, follow these steps:

  1. start from the Swoogle home page and choose “search term”
  2. type the local name or the entire URI surrounded by double quotes, and move to the search result page
  3. click the “metadata” link under your URI, and move to the term’s metadata page
  4. click the “related documents” link (a grey block) at the top of the page, and move to the wanted page

NOTE: advanced users may use swoogle web service APIs to retrieve more results.

Swooglers group for Swoogle users

February 2nd, 2006, by Tim Finin, posted in RDF, OWL, Swoogle, Ontologies, Web, Semantic Web

We’ve set up a Google group, Swooglers, for users of the Swoogle Semantic Web search engine. Anyone can browse the archive and join, but only members can post messages. Replies are sent to the whole group. We’re not exactly sure what Swooglers will have to talk about, but it might be a place to share your experiences in using Swoogle, ask other users for advice, etc.

Swoogle, Groundhog Day edition

February 2nd, 2006, by Tim Finin, posted in OWL, RDF, Swoogle, Web, Semantic Web

If you go to Swoogle on this Groundhog Day you will see a change. We’ve released a new version, Swoogle 2006, that is a nearly complete rewrite of Swoogle Classic, which now answers to Swoogle 2005. While Swoogle 2006 is currently missing some of Swoogle 2005’s features, it enjoys a cleaner and simpler model and foundation. We will be adding in some of these features as well as new ones over the next few months. Here are some of Swoogle 2006’s highlights:

  • New hardware. Swoogle 2006 is running on a set of three machines: EB2 is a two-processor Sun v20z with 4G of memory that runs the crawler, DBMS and development web interfaces; LOGOS is an IBM eserver that runs the production web interfaces; and NATRAJ is the file server for the SW cache and archive.
  • More data. Swoogle 2006 has over 850K documents in its index compared to Swoogle 2005’s 340K. The documents include about 700K RDF documents and 140K HTML documents with embedded RDF.
  • Better ranking. Swoogle 2006 uses the improved ranking algorithms reported on in our ISWC 2005 paper.
  • Better crawling. Swoogle 2006 now does a better job of crawling new URLs, including those submitted by people.
  • Web services. Swoogle 2006 exposes a set of 17 web services, currently with simple CGI interfaces that return their results as RDF graphs. Using the web services requires the use of a key, so we can track usage and possible abuses.
  • RDF output. All query results, whether via a web service call or through the browser interface, are available in RDF. For browser-based queries, look for the RDF VERSION link in the upper left corner of the page.
  • Simpler interface. The human web interface is simpler and cleaner.
  • Cache and archive. Swoogle 2006 maintains a cache of the SW documents it finds and also keeps copies of older versions in its Semantic Web Archive.
  • Registered user services. Swoogle 2006 has a better system for user accounts that includes a CAPTCHA to keep out spambots. Anonymous users only see a limited number of query results, whereas registered users can see them all.
  • Development wiki. We have a wiki for swoogle development ideas and discussion.

Some of the Swoogle 2005 features currently missing from Swoogle 2006 are the shopping cart and triple shop; the ontology dictionary; swoogle statistics and swoogle’s top ten. We plan to add these back into Swoogle 2006 over the next few months. Send any comments to swoogle-developers at ebiquity.umbc.edu.

Large RDF and OWL documents on the Semantic Web

January 26th, 2006, by Tim Finin, posted in RDF, OWL, Swoogle, Ontologies, Web, Semantic Web

Recently Cláudio Fernandes asked on several semantic web mailing lists

“Can someone point me to some huge owl/rdf files? I’m writing a owl parser with different tools, and I’d like to benchmark them all with some really really big files.”

I just ran some queries over Swoogle’s collection of 850K RDF documents collected from the web. Here are the 100 largest RDF documents and OWL documents, respectively. Document size was measured in terms of the number of triples. For this query, a document was considered to be an OWL document if it used a namespace that contained the string OWL.
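The selection criteria described above can be sketched in a few lines of Python. The `Doc` record, the sample data and the case-insensitive match on "owl" are all assumptions made for illustration; the actual query ran against Swoogle's database, whose schema is not shown here.

```python
from typing import NamedTuple


class Doc(NamedTuple):
    url: str
    triples: int          # document size, measured in triples
    namespaces: tuple     # namespace URIs used by the document


def is_owl_document(namespaces):
    """Heuristic from the post: some namespace contains the string OWL.

    Matching case-insensitively is an assumption on my part.
    """
    return any("owl" in ns.lower() for ns in namespaces)


def largest(docs, n=100, owl_only=False):
    """Return the n largest documents by triple count, largest first."""
    pool = [d for d in docs if is_owl_document(d.namespaces)] if owl_only else list(docs)
    return sorted(pool, key=lambda d: d.triples, reverse=True)[:n]


# Hypothetical records, not real Swoogle data.
DOCS = [
    Doc("http://example.org/a.rdf", 120000, ("http://xmlns.com/foaf/0.1/",)),
    Doc("http://example.org/b.owl", 95000, ("http://www.w3.org/2002/07/owl#",)),
    Doc("http://example.org/c.rdf", 300, ("http://purl.org/dc/elements/1.1/",)),
]

print([d.url for d in largest(DOCS, n=2)])
print([d.url for d in largest(DOCS, n=2, owl_only=True)])
```

Note that a substring test like this is deliberately loose: it will also match namespaces that merely contain those three letters.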

Currently, the version of Swoogle you get by going to http://swoogle.umbc.edu/ is Swoogle 2. Its database has been trapped in amber since last summer, when it was corrupted, preventing us from adding new data. We put our efforts into a reimplementation, Swoogle 3, which will be released early next week. The data reported here is from Swoogle 3’s database.

SemNews: NLP system generates Semantic Web representation of news summaries

January 12th, 2006, by Tim Finin, posted in Swoogle, NLP, Ontologies, AI, Web, Semantic Web

SemNews is a prototype application being developed by UMBC Ph.D. student Akshay Java that uses a sophisticated text understanding system to interpret summaries of news stories, publishes the results on the semantic web and provides browsing and query services over them. The project is the result of a collaboration between the UMBC’s Institute for Language and Information Technologies and Ebiquity Laboratory with partial support from the Lockheed Martin Corporation.

SemNews monitors a number of news source RSS feeds and processes new stories as they are published. After extracting a story’s metadata, its news summary is interpreted by the OntoSem text analyzer which does a syntactic, semantic, and pragmatic analysis of the text, resulting in its text meaning representation or TMR. A TMR is a language-neutral description (an interlingua) of the meaning conveyed in a natural language text. In addition to providing information about the lexical-semantic dependencies in the text, the TMR represents stylistic factors, discourse relations, speaker attitudes, and other pragmatic factors present in the discourse structure. In doing so, the TMR captures not only the meaning of individual elements in the text, but also the relations between those elements, and captures both propositional and non-propositional components of textual meaning. OntoSem’s TMRs are represented in a custom frame-based representation language and grounded in the Mikrokosmos ontology, an extensive ontology with over 30K concepts and nearly 400K entities.

Each story’s metadata and TMR are translated into the Semantic Web language OWL via the OntoSem2OWL translator developed for this project. The results are then added to a special collection indexed by the Swoogle search engine and also put into a RDF triple store. These are used to support several services enabling people and agents to semantically browse, query and visualize the stories in the collection, enabling access to information that would otherwise not be easy to find using simple keyword based search.

For example, one can browse through the story collection via the ontology to find stories that involve certain concepts, such as a terrorist organization; find all stories that involve entities in OntoSem’s onomasticon, such as al qaeda or Karbala; visualize the stories on a map based on the locations they reference; or construct an arbitrary query, such as finding “stories in which the nation named Afghanistan was the location of a bombing event.” Users can also define semantic “alerts” as queries over the RDF triple store and/or the Swoogle collection. For each alert, SemNews will generate an RSS feed of the results.

The SemNews system is currently a research prototype that is being used to refine the underlying technologies and to explore how the sophisticated automatic linguistic processing of text can be integrated into the Semantic Web and conventional web applications. Ongoing work on SemNews includes an evaluation of its semantic recall and precision as well as a service that can group and cluster stories based on their semantic representations.

For more information

RDF molecules and lossless decompositions of RDF graphs

July 31st, 2005, by Tim Finin, posted in Swoogle, Semantic Web

Some RDF graphs can be viewed as making assertions about the world. Suppose you were given a graph, G, and asked to find supporting evidence on the web.

One approach is to search for documents with RDF graphs containing G as a sub-graph, adhering to RDF’s semantics for blank nodes and maybe applying some RDFS and OWL semantics. Even after doing that, few or maybe no RDF documents may contain *all* of G as a subgraph.

Another approach is to decompose G into its constituent triples and for each, use a Swoogle-like system to find documents containing it. But then what? The presence of blank nodes makes it difficult or impossible to assemble the support for G.

We’ve been exploring a third way using the notion of an RDF molecule. We start by computing a lossless decomposition of G into a set of subgraphs M. The decomposition is lossless in that combining the M’s elements produces the original graph G, even if their blank nodes have been renamed apart. We can then use a Swoogle-like system to search for documents supporting each molecule in M. If we find support for every molecule, we have support for G.
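One simple way to obtain a lossless decomposition is to keep together all triples that are connected through shared blank nodes and to let every triple without blank nodes stand alone. The Python sketch below does this with a small union-find structure. It is a simplified illustration under my own assumptions (triples as tuples of strings, blank nodes written with a "_:" prefix); the technical report cited below may define finer-grained molecules than this.

```python
def is_blank(node):
    """Blank nodes are written with a '_:' prefix in this sketch."""
    return node.startswith("_:")


def decompose(triples):
    """Split a graph into subgraphs that share no blank nodes."""
    parent = {}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Pass 1: blank nodes that occur in the same triple belong together.
    for s, _, o in triples:
        blanks = [n for n in (s, o) if is_blank(n)]
        for b in blanks:
            parent.setdefault(b, b)
        if len(blanks) == 2:
            parent[find(blanks[0])] = find(blanks[1])

    # Pass 2: ground triples stand alone; the rest are grouped by component.
    ground, groups = [], {}
    for t in triples:
        blanks = [n for n in (t[0], t[2]) if is_blank(n)]
        if blanks:
            groups.setdefault(find(blanks[0]), set()).add(t)
        else:
            ground.append({t})
    return ground + list(groups.values())


# A hypothetical graph G with two connected blank nodes and one ground triple.
G = [
    ("_:a", "foaf:name", "Alice"),
    ("_:a", "foaf:mbox", "mailto:alice@example.org"),
    ("_:a", "foaf:knows", "_:b"),
    ("_:b", "foaf:name", "Bob"),
    ("ex:doc", "dc:creator", "ex:alice"),
]

molecules = decompose(G)
# Lossless: the union of the parts gives back exactly the original triples.
assert set().union(*molecules) == set(G)
print(len(molecules), "subgraphs")
```

Because no blank node appears in more than one subgraph, renaming the blank nodes of each part independently cannot change what the recombined graph says.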

We suspect that the RDF molecule concept has other potential uses. For details, see

Tracking RDF Graph Provenance using RDF Molecules, Li Ding, Tim Finin, Yun Peng, Paulo Pinheiro da Silva, and Deborah McGuinness, report TR-CS-05-06, Computer Science and Electrical Engineering, University of Maryland, Baltimore County, April 30, 2005.

The Semantic Web facilitates integrating partial knowledge and finding evidence for hypothesis from web knowledge sources. However, the appropriate level of granularity for tracking provenance of RDF graph remains in debate. RDF document is too coarse since it could contain irrelevant information. RDF triple will fail when two triples share the same blank node. Therefore, this paper investigates lossless decomposition of RDF graph and tracking the provenance of RDF graph using RDF molecule, which is the finest and lossless component of an RDF graph. A sub-graph is lossless if it can be used to restore the original graph without introducing new triples. A sub-graph is finest if it cannot be further decomposed into lossless sub-graphs. The lossless decomposition algorithms and RDF molecule have been formalized and implemented by a prototype RDF graph provenance service in Swoogle project.

Stress test your RDF triple store

June 16th, 2005, by Tim Finin, posted in Swoogle, Web, Semantic Web

A colleague has been testing the scalability of a triple store using synthetic triples. He asked if we could package up a large collection of real triples caught in the wild by Swoogle. After talking a bit, we decided that a simple SQL database dump would be the most convenient form.

10M Triples is an SQL database dump containing a table of about 10.4M RDF triples extracted from the Swoogle cache on June 15, 2005. The compressed file is 162M; uncompressed, it is 1.7G.
