UMBC ebiquity
Semantic Web

Archive for the 'Semantic Web' Category

An ontology of social media data for better privacy policies

August 15th, 2010, by Tim Finin, posted in Policy, Privacy, Security, Semantic Web, Social media

Privacy continues to be an important topic surrounding social media systems. A big part of the problem is that virtually all of us have a difficult time thinking about what information about us is exposed and to whom and for how long. As UMBC colleague Zeynep Tufekci points out, our intuitions in such matters come from experiences in the physical world, a place whose physics differs considerably from the cyber world.

Bruce Schneier offered a taxonomy of social networking data in a short article in the July/August issue of IEEE Security & Privacy. A version of the article, A Taxonomy of Social Networking Data, is available on his site.

“Below is my taxonomy of social networking data, which I first presented at the Internet Governance Forum meeting last November, and again — revised — at an OECD workshop on the role of Internet intermediaries in June.

  • Service data is the data you give to a social networking site in order to use it. Such data might include your legal name, your age, and your credit-card number.
  • Disclosed data is what you post on your own pages: blog entries, photographs, messages, comments, and so on.
  • Entrusted data is what you post on other people’s pages. It’s basically the same stuff as disclosed data, but the difference is that you don’t have control over the data once you post it — another user does.
  • Incidental data is what other people post about you: a paragraph about you that someone else writes, a picture of you that someone else takes and posts. Again, it’s basically the same stuff as disclosed data, but the difference is that you don’t have control over it, and you didn’t create it in the first place.
  • Behavioral data is data the site collects about your habits by recording what you do and who you do it with. It might include games you play, topics you write about, news articles you access (and what that says about your political leanings), and so on.
  • Derived data is data about you that is derived from all the other data. For example, if 80 percent of your friends self-identify as gay, you’re likely gay yourself.”

I think most of us understand the first two categories and can easily choose or specify a privacy policy to control access to information in them. The rest, however, are more difficult to think about and can lead to a lot of confusion when people are setting up their privacy preferences.

As an example, I saw some nice work at the 2010 IEEE International Symposium on Policies for Distributed Systems and Networks on “Collaborative Privacy Policy Authoring in a Social Networking Context” by Ryan Wishart et al. from Imperial College that addressed the problem of incidental data in Facebook. If I post a picture and tag others in it, for instance, each of the tagged people can contribute additional policy constraints that can narrow access to it.
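The narrowing idea in the tagging example can be sketched in a few lines of Python. This is only an illustration and not the mechanism from the Wishart et al. paper: it assumes each policy is simply a set of user names allowed to view the photo, and the function name `effective_audience` and all of the user names are made up.

```python
def effective_audience(owner_allowed, tagged_constraints):
    """Return the viewers permitted by the owner and by every tagged person."""
    audience = set(owner_allowed)
    for allowed in tagged_constraints:
        # A tagged person's constraint can only remove viewers, never add them.
        audience &= set(allowed)
    return audience


# Hypothetical policies: the poster's own list, then one set per tagged person.
owner_policy = {"alice", "bob", "carol", "dave"}
tagged_policies = [{"alice", "bob", "carol"}, {"bob", "carol", "erin"}]
print(sorted(effective_audience(owner_policy, tagged_policies)))
```

Using set intersection means the result can never be wider than what the person who posted the picture originally allowed, which is the "narrow access" property described above.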

Lorrie Cranor gave an invited talk at the workshop on Building a Better Privacy Policy and made the point that even P3P privacy policies are difficult for people to comprehend.

Having a simple ontology for social media data could help us move forward toward better privacy controls for online social media systems. I like Schneier’s broad categories and wonder what a more complete treatment defined using Semantic Web languages might be like.

Papers with more references are cited more often

August 15th, 2010, by Tim Finin, posted in Semantic Web, Social media

The number of citations a paper receives is generally thought to be a good and relatively objective measure of its significance and impact.

Researchers naturally are interested in knowing how to attract more citations to their papers. Publishing the results of good work helps of course, but everyone knows there are many other factors. Nature News reports on research by Gregory Webster that analyzed the 53,894 articles and review articles published in Science between 1901 and 2000.

The advice the study supports is “cite and you shall be cited”.

A long reference list at the end of a research paper may be the key to ensuring that it is well cited, according to an analysis of 100 years’ worth of papers published in the journal Science.
     The research suggests that scientists who reference the work of their peers are more likely to find their own work referenced in turn, and the effect is on the rise, with a single extra reference in an article now producing, on average, a whole additional citation for the referencing paper.
     “There is a ridiculously strong relationship between the number of citations a paper receives and its number of references,” Gregory Webster, the psychologist at the University of Florida in Gainesville who conducted the research, told Nature. “If you want to get more cited, the answer could be to cite more people.”

A plot of the number of references listed in each article against the number of citations it eventually received reveals that almost half of the variation in citation rates among the Science papers can be attributed to the number of references that they include. And, contrary to what people might predict, the relationship is not driven by review articles, which could be expected, on average, to be heavier on references and to garner more citations than standard papers.

Swoogle has five faces

August 13th, 2010, by Tim Finin, posted in Semantic Web, Swoogle

Seen on the Web: “Swoogle is an alien from outer space send out to spy on the modnation circuit. He got five faces so he can watch them from all angles without turning his head. However only his front shows many emotions. His right face is always angry, his left face is always in awe for some reason.”

Semantic Web seen as a disruptive technology

August 1st, 2010, by Tim Finin, posted in Semantic Web

Washington Technology, which describes itself as “the online authority for government contractors and partners,” has an article by Carlos A. Soto on 5 technologies that will change the market. They are:

  1. Mobile
  2. Search and the Semantic Web
  3. Search and the Semantic Web
  4. Virtualization and cloud computing
  5. Virtualization and cloud computing

These are reasonable choices, though I’d not have done the double counting and would have added “machine learning applied to the massive amounts of Web data now available” and “social computing”.

But it’s gratifying to see the Semantic Web in the list. Here’s some of what he has to say about search and the Semantic Web.

“The relationship between search technology and the Semantic Web is a perfect illustration of how a small sustaining technology, such as a basic search feature on an operating system, will eventually be eaten up by a larger disruptive technology, such as the Semantic Web. The Semantic Web has the potential of acting like a red giant star by expanding at exponential rates, swallowing whole planets of existing technology in the process.

The technology started as a simple group of secure, trusted, linked data stores. Now Semantic Web technologies enable people to create data stores on the Web and then build vocabularies or write rules for handling the data. Because all the data by definition is trusted, security is often less of a problem.

The task of turning the World Wide Web into a giant dynamic database is causing a shift among traditional search engines because products such as Apture, by Apture Inc. of San Francisco, Calif., let content publishers include pop-up definitions, images or data whenever a user scrolls over a word on a Web site. The ability to categorize content in this manner could have significant implications not only for Web searches but also for corporate intranets and your desktop PC.

These types of products will continue to expand, initially in the publishing industry and then to most industries on the Web in the next two to three years.

For example, human resources sites could use them to pop up a picture and a résumé blip when a recruiter drags a mouse over an applicant’s name. Medical and financial sites such as the National Institutes of Health could use it to break down jargon and help with site exploration.

Government sites around the world, such as Zaragoza, Spain, and medical facilities, such as the Cleveland Medical Clinic, are using the vocabulary features of the Semantic Web to create search engines that reach across complex jargon and tech silos to offer a high degree of automation, full integration with external systems and various terminologies, in addition to the ability to accurately answer users’ queries.
…”

(h/t @FrankVanHarmele)

What is up with Clearspring and malware?

July 31st, 2010, by Tim Finin, posted in Semantic Web

Google Chrome has been showing me a malware warning page today as I try to visit normally trusted and benign sites. I got this one just now as I tried to go to Planet RDF.

Warning: Visiting this site may harm your computer!

The website at planetrdf.com contains elements from the site bin.clearspring.com, which appears to host malware – software that can hurt your computer or otherwise operate without your consent. Just visiting a site that contains malware can infect your computer.

For detailed information about the problems with these elements, visit the Google Safe Browsing diagnostic page for bin.clearspring.com.

Learn more about how to protect yourself from harmful software online.

[ ] I understand that visiting this site may harm my computer. PROCEED

Clearspring claims it’s a technical problem, although they admit they were using a service that was compromised with files redirecting users to a certain malware domain. I’m a bit fuzzy on what Clearspring does and where it is being used on the Planet RDF site. I don’t see it in the page source, for example.

Update: Maybe the problem stems from blog content syndicated by Planet RDF that contains Flash objects mediated by Clearspring and their Flash cookies.

W3C EmotionML provides markup for emotions

July 31st, 2010, by Tim Finin, posted in KR, Semantic Web, Social media, Web

The W3C has published a second working draft of EmotionML, or the emotion markup language. Here’s how it’s described.

As the web is becoming ubiquitous, interactive, and multimodal, technology needs to deal increasingly with human factors, including emotions. The present draft specification of Emotion Markup Language 1.0 aims to strike a balance between practical applicability and scientific well-foundedness. The language is conceived as a “plug-in” language suitable for use in three different areas: (1) manual annotation of data; (2) automatic recognition of emotion-related states from user behavior; and (3) generation of emotion-related system behavior.

Unfortunately EmotionML is not built on RDF. If it were, I would have marked up this post in RDFa using it!

The working draft identifies concrete examples where EmotionML might be useful, including as a markup or representation for systems that do opinion mining, sentiment analysis, affect monitoring, and emotion recognition. A list of 39 individual use cases for EmotionML is given in an appendix.

EmotionML markup explicitly refers to one or more separate vocabularies used for representing emotion-related states. However, the group has defined some default vocabularies that can be used. An example is the Ekman “big six” basic emotions (anger, disgust, fear, happiness, sadness, and surprise). Another is a set of appraisal terms defined by Ortony et al. (desirability, praiseworthiness, appealingness, desirability-for-other, deservingness, liking, likelihood, effort, realization, strength-of-identification, expectation-of-deviation and familiarity).

Here’s an example from the working draft where a static image is annotated with several emotion categories with different intensities.

<emotionml xmlns="http://www.w3.org/2009/10/emotionml"
           xmlns:meta="http://www.example.com/metadata"
           category-set="http://www.example.com/custom/hall-matsumoto-emotions.xml">
   <info>
      <meta:media-type>image</meta:media-type>
      <meta:media-id>disgust</meta:media-id>
      <meta:media-set>JACFEE-database</meta:media-set>
      <meta:doc>Example adapted from (Hall and Matsumoto 2004)
          http://www.davidmatsumoto.info/Articles/2004_hall_and_matsumoto.pdf
      </meta:doc>
   </info>

   <emotion>
       <category name="Disgust"/>
       <intensity value="0.82"/>
   </emotion>
   <emotion>
       <category name="Contempt"/>
       <intensity value="0.35"/>
   </emotion>
   <emotion>
       <category name="Anger"/>
       <intensity value="0.12"/>
   </emotion>
   <emotion>
       <category name="Surprise"/>
       <intensity value="0.53"/>
   </emotion>
</emotionml>
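To give a feel for how an application might consume such an annotation, here is a minimal Python sketch that reads the category and intensity pairs with the standard library's `xml.etree.ElementTree`. It assumes the element names and namespace of this particular working-draft example (later drafts may differ), uses a trimmed copy of the markup without the `<info>` block, and the function name `emotion_intensities` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the working-draft example; an assumption for other drafts.
EMOTIONML_NS = {"em": "http://www.w3.org/2009/10/emotionml"}

SAMPLE = """<emotionml xmlns="http://www.w3.org/2009/10/emotionml">
   <emotion><category name="Disgust"/><intensity value="0.82"/></emotion>
   <emotion><category name="Contempt"/><intensity value="0.35"/></emotion>
   <emotion><category name="Anger"/><intensity value="0.12"/></emotion>
   <emotion><category name="Surprise"/><intensity value="0.53"/></emotion>
</emotionml>"""


def emotion_intensities(xml_text):
    """Map each emotion category name to its intensity as a float."""
    root = ET.fromstring(xml_text)
    result = {}
    for emotion in root.findall("em:emotion", EMOTIONML_NS):
        category = emotion.find("em:category", EMOTIONML_NS)
        intensity = emotion.find("em:intensity", EMOTIONML_NS)
        if category is None or intensity is None:
            continue  # skip annotations that lack either part
        result[category.get("name")] = float(intensity.get("value"))
    return result


scores = emotion_intensities(SAMPLE)
strongest = max(scores, key=scores.get)
print(scores)
print(strongest)
```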

rdfs:seeAlso the short article by InfoQ on the EmotionML working draft.

Google acquires Metaweb and Freebase

July 16th, 2010, by Tim Finin, posted in Database, Google, Search, Semantic Web, Social media, Web

Google announced today that it has acquired Metaweb, the company behind Freebase — a free, semantic database of “over 12 million people, places, and things in the world.” This is from their announcement on the Official Google blog:

“Over time we’ve improved search by deepening our understanding of queries and web pages. The web isn’t merely words — it’s information about things in the real world, and understanding the relationships between real-world entities can help us deliver relevant information more quickly. … With efforts like rich snippets and the search answers feature, we’re just beginning to apply our understanding of the web to make search better. Type [barack obama birthday] in the search box and see the answer right at the top of the page. Or search for [events in San Jose] and see a list of specific events and dates. We can offer this kind of experience because we understand facts about real people and real events out in the world. But what about [colleges on the west coast with tuition under $30,000] or [actors over 40 who have won at least one oscar]? These are hard questions, and we’ve acquired Metaweb because we believe working together we’ll be able to provide better answers.”

In their announcement, Google promises to continue to maintain Freebase “as a free and open database for the world” and invites other web companies to use and contribute to it.

Freebase is a system very much in the linked open data spirit, even though RDF is not its native representation. Its content is available as RDF and there are many links that bind it to the LOD cloud. Moreover, Freebase has a very good wiki-like interface allowing people to upload, extend and edit both its schema and data.

Here’s a video on the concepts behind Metaweb which are, of course, also those underlying the Semantic Web. What’s the difference? I’d say a combination of representational details and a centralized (Metaweb) vs. distributed (Semantic Web) approach.

Search neutrality: Google and Danny Sullivan weigh in

July 16th, 2010, by Tim Finin, posted in Google, Semantic Web, Social media, Web

Web search guru Danny Sullivan has a great response to the NYT editorial on regulating search engine algorithms: The New York Times Algorithm and Why It Needs Government Regulation. Here’s how it starts:

“The New York Times is the number one newspaper web site. Analysts reckon it ranks first in reach among US opinion leaders. When the New York Times editorial staff tweaks its supersecret algorithm behind what to cover and exactly how to cover a story — as it does hundreds of times a day — it can break a business that is pushed down in coverage or not covered at all.”

Google published its own response to the Times piece as a Financial Times op-ed and also posted it to the Google public policy blog: regulating what is “best” in search?

“Search engines use algorithms and equations to produce order and organisation online where manual effort cannot. These algorithms embody rules that decide which information is “best”, and how to measure it. Clearly defining which of any product or service is best is subjective. Yet in our view, the notion of “search neutrality” threatens innovation, competition and, fundamentally, your ability as a user to improve how you find information.”

The penultimate paragraph gives what they say is their strongest argument against mandating “search neutrality”.

“But the strongest arguments against rules for “neutral search” is that they would make the ranking of results on each search engine similar, creating a strong disincentive for each company to find new, innovative ways to seek out the best answers on an increasingly complex web. What if a better answer for your search, say, on the World Cup or “jaguar” were to appear on the web tomorrow? Also, what if a new technology were to be developed as powerful as PageRank that transforms the way search engines work? Neutrality forcing standardised results removes the potential for innovation and turns search into a commodity.”

This assumes, of course, that there is real competition among Internet search engines. Microsoft has been putting a lot of research and development into Bing with good results and it’s been gaining market share. Yahoo is doing very interesting things as well. Consumer choice among a handful of competitors would be the best way to ensure that none abuse their customers.

Barry Smith short course online: An Introduction to ontology

July 15th, 2010, by Tim Finin, posted in AI, KR, Semantic Web, Web

Here’s a great resource if you want to come up to speed on ontologies and their importance today.

Professor Barry Smith of the University at Buffalo held a two-day course, An Introduction to Ontology: From Aristotle to the Universal Core, in 2009, to introduce ontologies and their applications to both philosophers and computer scientists. It consisted of eight lectures for which slides and downloadable videos are available. Paul Alexander has also made the videos available in streaming form here if you want to view them without downloading.

The lectures are all either 60 or 90 minutes. Here are links to the streaming videos, thanks to Paul Alexander:

  • Ontology as a Branch of Philosophy
  • Ontology and Logic
  • The Ontology of Social Reality
  • Why I Am Not a Philosopher (or: Ontology Leaving the Mother Ship of Philosophy)
  • Why Computer Science Needs Philosophy
  • Ontology and the Semantic Web
  • Towards a Standard Upper Level Ontology
  • The Universal Core: Ontology and the US Federal Government Data Integration Initiative

New York Times editorializes about the Google search ranking algorithm

July 15th, 2010, by Tim Finin, posted in Google, Semantic Web, Social media, Web

In what may be a first, today’s New York Times has an editorial about an algorithm. No, they haven’t waded into the P=NP issue; they have commented on Google’s algorithm for ranking search results and on accusations that Google unfairly biases it in its own self-interest.

“In the past few months, Google has come under investigation by antitrust regulators in Europe. Rivals have accused Google of placing the Web sites of affiliates like Google Maps or YouTube at the top of Internet searches and relegating competitors to obscurity down the list. In the United States, Google said it expects antitrust regulators to scrutinize its $700 million purchase of the flight information software firm ITA, with which it plans to enter the online travel search market occupied by Expedia, Orbitz, Bing and others.”

This issue will become more important as the companies dominating Web search (Google, Microsoft and Yahoo) continue to increase their importance and also broaden their acquisition of companies offering web services.

The NYT’s position is moderate, recommending:

“Google provides an incredibly valuable service, and the government must be careful not to stifle its ability to innovate. Forcing it to publish the algorithm or the method it uses to evaluate it would allow every Web site to game the rules in order to climb up the rankings — destroying its value as a search engine. Requiring each algorithm tweak to be approved by regulators could drastically slow down its improvements. Forbidding Google to favor its own services — such as when it offers a Google Map to queries about addresses — might reduce the value of its searches. With these caveats in mind, if Google is to continue to be the main map to the information highway, it concerns us all that it leads us fairly to where we want to go.”

Google Open Spot Android app finds parking

July 9th, 2010, by Tim Finin, posted in Google, Mobile Computing, Semantic Web, Social media

Google’s Open Spot Android app lets people leaving parking spots share the information with others searching for parking nearby. Running the app shows you open parking spots within 1.5 km. New parking spots are assumed to be gone after 20 minutes and are removed from the system.
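The two rules just described (a 1.5 km radius and a 20 minute lifetime) are easy to sketch in Python. This is a guess at the logic, not Google's implementation: the `Spot` class, the function names, and the use of the haversine great-circle formula for distance are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0
MAX_DISTANCE_KM = 1.5   # radius mentioned in the post
MAX_AGE_MINUTES = 20    # lifetime mentioned in the post


@dataclass
class Spot:
    lat: float
    lon: float
    age_minutes: float  # time since the spot was reported


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def visible_spots(spots, lat, lon):
    """Keep spots that are both fresh enough and close enough to the driver."""
    return [
        s for s in spots
        if s.age_minutes <= MAX_AGE_MINUTES
        and haversine_km(lat, lon, s.lat, s.lon) <= MAX_DISTANCE_KM
    ]


# Made-up coordinates near downtown Baltimore.
here = (39.2904, -76.6122)
spots = [
    Spot(39.2904, -76.6122, 5),    # fresh and nearby
    Spot(39.2954, -76.6122, 25),   # nearby but too old
    Spot(39.3404, -76.6122, 1),    # fresh but several km away
]
print(visible_spots(spots, *here))
```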

People who announce open spots gain karma points, while those who report false spots, known as griefers, are on notice:

“We’re watching for behavior that looks like a griefer spoofing parking spots. We have a couple of mechanisms available to make sure someone can’t leave a bunch of fake parking spots. If we see this happening we will take steps to fix it.”

This is a simple example of a context-aware mobile app that can further benefit from also knowing that you are driving, as opposed to riding, in your car and likely to want to find a parking spot, as opposed to doing 70mph on I-95 as it goes through Baltimore. Moreover, context would also inform the app that you are probably leaving a public parking spot, so it could mark the spot automatically. However, such a feature should be smart enough to keep you from being tagged by Google as a griefer and finding out what punishment Google has in store for you.

USCYBERCOM secret revealed

July 8th, 2010, by Tim Finin, posted in GENERAL, Mobile Computing, Security, Semantic Web

The secret message embedded in the USCYBERCOM logo

     9ec4c12949a4f31474f299058ce2b22a

is what the md5sum function returns when applied to the string that is USCYBERCOM’s official mission statement. Here’s a demonstration of this fact done on a Mac. On Linux, use the md5sum command instead of md5.

~> echo -n "USCYBERCOM plans, coordinates, integrates, \
synchronizes and conducts activities to: direct the \
operations and defense of specified Department of \
Defense information networks and; prepare to, and when \
directed, conduct full spectrum military cyberspace \
operations in order to enable actions in all domains, \
ensure US/Allied freedom of action in cyberspace and \
deny the same to our adversaries." | md5
9ec4c12949a4f31474f299058ce2b22a
~>

md5sum is a standard Unix command that computes a 128-bit “fingerprint” of a string of any length. It is a well-designed hashing function that has the property that it’s very unlikely that any two non-identical strings in the real world will have the same md5sum value by chance. Such functions have many uses in cryptography.
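The same check can be sketched in Python with the standard `hashlib` module, which avoids the md5 versus md5sum difference between platforms. The mission statement below is retyped from this post, so the comparison on the last line should come out true only if the text matches the official statement byte for byte; the variable names are made up for illustration.

```python
import hashlib

# Mission statement as quoted in this post; adjacent literals are concatenated.
MISSION = (
    "USCYBERCOM plans, coordinates, integrates, "
    "synchronizes and conducts activities to: direct the "
    "operations and defense of specified Department of "
    "Defense information networks and; prepare to, and when "
    "directed, conduct full spectrum military cyberspace "
    "operations in order to enable actions in all domains, "
    "ensure US/Allied freedom of action in cyberspace and "
    "deny the same to our adversaries."
)

LOGO_VALUE = "9ec4c12949a4f31474f299058ce2b22a"

# hexdigest() returns the 128-bit hash as 32 lowercase hex characters.
digest = hashlib.md5(MISSION.encode("utf-8")).hexdigest()
print(digest)
print(digest == LOGO_VALUE)
```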

Thanks to Ian Soboroff for spotting the answer on Slashdot and forwarding it.

Someone familiar with md5 would recognize that the secret string has the same length and character mix as an md5 value: 32 hexadecimal characters. Each of the possible hex characters (0123456789abcdef) represents four bits, so 32 of them is a way to represent 128 bits.

We’ll leave it as an exercise for the reader to compute the 128-bit sequence that our secret code corresponds to.
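For readers who would rather check their answer than work it out by hand, here is a minimal Python sketch of the conversion. It expands each hex character into its four-bit form; the function name `hex_to_bits` is made up for illustration.

```python
SECRET_HEX = "9ec4c12949a4f31474f299058ce2b22a"


def hex_to_bits(hex_string):
    """Expand each hex character into its four-bit binary form."""
    return "".join(format(int(ch, 16), "04b") for ch in hex_string)


bits = hex_to_bits(SECRET_HEX)
print(len(bits))  # 32 hex characters times 4 bits each
print(bits)
```

The `04b` format keeps leading zeros, so a character such as 4 becomes 0100 rather than 100 and the total length stays at 128.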
