Is there real world RDF-S/OWL instance data?

August 12th, 2006

Sören Auer started a thread on the semantic-web@w3.org mailing list asking “Is there real world RDF-S/OWL instance data?”:

“As an argument to stress the importance of the Semantic Web and as example data to evaluate tools it would be nice to have a library of real world RDF-S/OWL instance data available. My impression is, that there are many schemas around, but it’s harder to find real life instance data. This might be due to copyright issues or the fact, that the borderline between classes/schema and instances/data is not always clearly marked and some projects rather use a representation as classes than a representation as instance data.”

There are some tricky questions underlying this, like what counts as “real world” and what counts as “data”, but from our perspective, virtually all of the RDF content out there is at the instance level, not the schema level, at least formally.

Swoogle has a collection of over 1M error-free RDF documents collected from the Web and an additional ~700K documents that have embedded RDF, are malformed but appear to be RDF, or are no longer accessible. We’ve intentionally limited the number of simple RSS and FOAF documents in the current collection.

Only about 5% of these documents contain *any* triples that contribute to a definition; the rest consist entirely of data. We’ve determined that most of the 5% that do contain definitional triples include them incorrectly and should be treated as pure data. Of the remaining ones, many are duplicates or copies. We estimate that only about 1% of the documents in Swoogle’s collection are proper ‘ontologies’ intended to (partially) define at least one named term.

For the ~1.7M Semantic Web Documents (SWDs), the following table shows the number and percentage of SWDs by the percent of their triples that are at the schema level.

% definitional triples   # SWDs       % of all SWDs
0%                       1,676,874    94.70
0-10%                    1,679,153    94.83
10-20%                   2,512        0.14
20-30%                   3,209        0.18
30-40%                   35,526       2.01
40-50%                   16,384       0.93
50-60%                   1,817        0.10
60-70%                   5,556        0.31
70-80%                   4,063        0.23
80-90%                   1,599        0.09
90-100%                  5,108        0.29
100%                     15,756       0.89
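
To make the "percent of triples at the schema level" measure concrete, here is a minimal Python sketch of how such a figure could be computed for one document. It is not Swoogle's actual classifier. The set of predicates and types treated as definitional, the `example.org` names, and the bucket boundaries are all assumptions chosen for illustration, and triples are represented as plain string tuples so that no RDF library is needed.

```python
# Namespace prefixes as plain strings, so the sketch needs no RDF library.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
EX = "http://example.org/"  # hypothetical namespace for the example data

# ASSUMPTION: predicates whose triples count as schema-level.
DEFINING_PREDICATES = {
    RDFS + "subClassOf", RDFS + "subPropertyOf",
    RDFS + "domain", RDFS + "range",
}
# ASSUMPTION: rdf:type objects that declare a class or property.
DEFINING_TYPES = {
    RDFS + "Class", OWL + "Class", RDF + "Property",
    OWL + "ObjectProperty", OWL + "DatatypeProperty",
}

def is_definitional(triple):
    """Return True if the (subject, predicate, object) triple helps define a term."""
    _subject, predicate, obj = triple
    if predicate in DEFINING_PREDICATES:
        return True
    return predicate == RDF + "type" and obj in DEFINING_TYPES

def definitional_percent(triples):
    """Percentage of a document's triples that are definitional (0.0 if empty)."""
    triples = list(triples)
    if not triples:
        return 0.0
    count = sum(1 for t in triples if is_definitional(t))
    return 100.0 * count / len(triples)

def bucket(percent):
    """Map a percentage to a row label; the boundary handling is an assumption."""
    if percent == 0:
        return "0%"
    if percent == 100:
        return "100%"
    lower = min(int(percent // 10) * 10, 90)
    return f"{lower}-{lower + 10}%"

# A tiny document: one class declaration and three instance-level triples.
doc = [
    (EX + "Person", RDF + "type", RDFS + "Class"),
    (EX + "alice", RDF + "type", EX + "Person"),
    (EX + "alice", EX + "name", "Alice"),
    (EX + "alice", EX + "knows", EX + "bob"),
]
print(definitional_percent(doc), bucket(definitional_percent(doc)))
```

Under these assumptions the example document has one definitional triple out of four, so it would fall in the 20-30% row.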

That said, the vast majority of defined classes have no immediate instances, and the majority of properties have never been used to assert a value. The following table shows, for both classes and properties, the number that have been defined (either explicitly or implicitly through reference), the number that have been populated, and the percentage that have been populated.

type         defined/referenced   populated   % populated
classes      1,386,272            34,018      2.45%
properties   156,131              42,839      27.44%
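
The populated percentage is simply the populated count divided by the defined-or-referenced count. A short Python check using the counts reported above (the function and variable names are ours, for illustration only):

```python
def percent_populated(defined, populated):
    """Share of defined/referenced terms that have at least one use, in percent."""
    return round(100.0 * populated / defined, 2)

# (defined or referenced, populated) counts taken from the table above.
counts = {
    "classes": (1_386_272, 34_018),
    "properties": (156_131, 42_839),
}

for kind, (defined, populated) in counts.items():
    print(f"{kind}: {percent_populated(defined, populated):.2f}%")
```

This reproduces the 2.45% and 27.44% figures, which means a property is roughly eleven times as likely as a class to have been populated.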

Based on this data, many classes have been introduced but never directly instantiated. Properties, once introduced, are much more likely to be used to assert values for an instance. These statistics are probably influenced by a few large SWDs that define many classes intended to be used as data; WordNet is a good example. The usage pattern for RDF terms is not so surprising when you compare it to word frequency in natural language (e.g., see Zipf’s law), which follows a power-law curve.