On finding semantic web documents

January 14th, 2005

After looking at the piece on Peter Norvig’s views on the semantic web (Semantic Web Ontologies: What Works and What Doesn’t), I realized that he’s talking about a request we made when we started developing Swoogle:

“A friend of mine just asked can I send him all the URLs on the web that have dot-RDF, dot-OWL, and a couple other extensions on them; he couldn’t find them all. I looked, and it turns out there’s only around 200,000 of them. That’s about 0.005% of the web. We’ve got a ways to go.”

We never did get any help from Google. What we did do was develop a workaround for Google’s restriction of returning at most 1,000 results per query, letting us use Google more effectively to find an initial set of seed URLs of semantic web documents (SWDs) to bootstrap the Swoogle crawler. Using these initial seeds, we employ a custom SWD crawler to crawl through SWDs and a custom focused crawler to dig through HTML documents and directories. Using Swoogle, we have found on the order of two million SWDs (RDF files in XML or N3) publicly accessible on the web.

The hack we employ is to use Google’s ‘site:’ qualifier to narrow the search. So we query on “filetype:owl” and get back up to 1,000 results drawn from many different sites. After filtering out the non-OWL documents, we extract a list of the sites from which the valid ones came. For example, if https://ebiquity.umbc.edu/ontologies/event.owl is in the initial result set, we note that ‘ebiquity.umbc.edu’ is a site that had at least one OWL file. For each new site S we encounter, we give Google the narrower query ‘filetype:owl site:S’, which often returns additional results not included in the broader query.
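Here is a minimal sketch of that site-narrowing loop. The `google_search` and `looks_like_owl` helpers are hypothetical stand-ins (the actual Swoogle code and whatever search API it used are not shown here); only the overall control flow reflects the trick described above.

```python
from urllib.parse import urlparse

def google_search(query, max_results=1000):
    """Hypothetical helper: return up to `max_results` result URLs for `query`.
    Stands in for whatever search API or scraper is actually used."""
    raise NotImplementedError

def looks_like_owl(url):
    """Hypothetical filter: fetch the document and verify it really is OWL/RDF."""
    raise NotImplementedError

def harvest_owl_seeds():
    seen_sites = set()
    seeds = set()
    pending = ["filetype:owl"]            # start with the broad query
    while pending:
        query = pending.pop()
        for url in google_search(query):
            if not looks_like_owl(url):   # drop false positives
                continue
            seeds.add(url)
            site = urlparse(url).netloc   # e.g. 'ebiquity.umbc.edu'
            if site not in seen_sites:
                seen_sites.add(site)
                # the narrower per-site query often surfaces URLs
                # that the broad query's 1,000-result cap hid
                pending.append(f"filetype:owl site:{site}")
    return seeds
```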

For Google, a site qualifier specifies a suffix of the server’s symbolic address, so a simple refinement generates other potential site specifiers, e.g., if we find an OWL file at ‘ebiquity.umbc.edu’, we can generate the other sites (‘umbc.edu’, ‘edu’) and add them to the potential site table for querying. So, an important part of Swoogle’s database is the list of sites where we’ve found at least one SWD. Swoogle maintains a list of the top 500 sites from which we’ve extracted the most SWDs.
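The suffix-based refinement is simple string manipulation on the hostname. A sketch of the idea (in practice you would probably skip or down-rank overly broad specifiers like bare ‘edu’ or ‘com’, and rank candidates by how many SWDs they have yielded):

```python
def candidate_sites(host):
    """Generate progressively broader site: specifiers from a hostname,
    e.g. 'ebiquity.umbc.edu' -> ['ebiquity.umbc.edu', 'umbc.edu', 'edu']."""
    parts = host.split(".")
    return [".".join(parts[i:]) for i in range(len(parts))]

# candidate_sites("ebiquity.umbc.edu")
# ['ebiquity.umbc.edu', 'umbc.edu', 'edu']
```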

There are many wrinkles to this process. For example, not every SWD uses a file suffix that indicates or even suggests its type. Swoogle can also produce a current analysis of the distribution of its documents by suffix. The second most common suffix is no suffix at all, and the fourth is ‘.xml’. And of course, some suffixes, like ‘.rss’, only imply that the file might be an RDF file.
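In other words, the suffix is only a hint, not a verdict. The groupings below are illustrative only (the real suffix lists and their yields come from Swoogle’s own statistics, not this sketch); anything that isn’t clearly RDF by suffix still has to be fetched and parsed.

```python
from os.path import splitext
from urllib.parse import urlparse

# Illustrative groupings; '' means the URL path has no suffix at all.
STRONG_SUFFIXES = {".rdf", ".owl", ".n3", ".daml"}
WEAK_SUFFIXES = {".rss", ".xml", ""}       # might be RDF; must be parsed to tell

def suffix_hint(url):
    """Return 'likely', 'maybe', or 'unlikely' based only on the URL suffix."""
    suffix = splitext(urlparse(url).path)[1].lower()
    if suffix in STRONG_SUFFIXES:
        return "likely"
    if suffix in WEAK_SUFFIXES:
        return "maybe"
    return "unlikely"
```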

While Google will only give you at most 1,000 results for a query, it tries to be helpful by estimating the total number of results it could return. (Or is it taunting us?) We could use this information to tell Swoogle’s focused web crawler how much effort to spend rooting around in a site looking for SWDs. Currently, Swoogle’s focused crawler searches to a fixed depth and does not use this information.
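One way the estimate could be used, purely as a hypothetical heuristic (again, not what Swoogle does today, which is a fixed depth): scale the crawl depth with the order of magnitude of the estimated hit count for a site.

```python
import math

def crawl_depth_for(estimated_hits, base_depth=2, max_depth=6):
    """Hypothetical heuristic: dig deeper on sites where the search engine
    estimates many matching documents; thresholds are made up for illustration."""
    if estimated_hits <= 0:
        return base_depth
    return min(max_depth, base_depth + int(math.log10(estimated_hits)))

# crawl_depth_for(5000) -> 5
```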

As of this writing, I’d guess there are at least two million SWDs accessible on the web. Most of these are FOAF or RSS documents. In order to keep Swoogle’s collection more interesting and representative, we’ve limited the number of documents we collect from any given site, so it purposely ignores many of the FOAF documents it discovers. We have developed specialized datasets containing many of these ignored SWDs. Currently Swoogle has about 340K SWDs indexed.

Note that we have a pretty narrow definition of a semantic web document — an RDF document encoded in XML or N3. There are lots of other uses of RDF content: embedded RDF in HTML documents, in other document types (e.g., PDF, JPG), in databases, etc. I think it’s hard to predict what the most important use cases will be for semantic web technologies.