Google announced support for embedding and using semantic information in Gmail messages in either Microdata or JSON-LD. They currently handle several use cases (actions) that use the semantic markup and provide a way to define new actions.
Providing a way to embed data in JSON-LD should open up the system for experimentation with vocabularies beyond schema.org. Since the approach just leverages the general ability to embed semantic data in HTML, it is not restricted to Gmail and can be used by any email environment that supports messages whose content is encoded as HTML.
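Here is a rough sketch of what the embedding looks like: a JSON-LD block inside an HTML email body, built with Python's standard email library. The schema.org action markup and URL are illustrative; consult Google's documentation for the exact types and actions Gmail recognizes.

```python
# A sketch of embedding JSON-LD in an HTML email body. The
# schema.org markup and URL are illustrative, not necessarily
# the exact form Gmail expects.
import json
from email.mime.text import MIMEText

markup = {
    "@context": "http://schema.org",
    "@type": "EmailMessage",
    "action": {
        "@type": "ViewAction",
        "url": "https://example.com/orders/123",  # hypothetical URL
        "name": "View Order",
    },
}

html_body = """<html><head>
<script type="application/ld+json">
{}
</script>
</head><body><p>Your order has shipped.</p></body></html>
""".format(json.dumps(markup, indent=2))

message = MIMEText(html_body, "html")
message["Subject"] = "Your order has shipped"
print(message.as_string())
```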
We hope that this will lead to many exciting ideas, as developers experiment with applications that use the mechanism to embed and understand concepts, entities and facts in email messages.
A Semantic Resolution Framework for Manufacturing Capability Data Integration
10:30am Tuesday, May 14, 2013, ITE 346, UMBC
Building flexible manufacturing supply chains requires interoperable and accurate manufacturing service capability (MSC) information from all supply chain participants. Today, MSC information, which is typically published either on a supplier’s web site or registered at an e-marketplace portal, has been shown to fall short of the interoperability and accuracy requirements. This issue can be addressed by annotating the MSC information using shared ontologies. However, ontology-based approaches face two main challenges: 1) the lack of an effective way to transform the large amount of complex MSC information hidden in the web sites of manufacturers into a representation with shared semantics, and 2) difficulty in the adoption of ontology-based approaches by supply chain managers and users, because of their unfamiliarity with the syntax and semantics of formal ontology languages such as OWL and RDF and the lack of tools friendly to inexperienced users.
The objective of our research is to address the main challenges of ontology-based approaches by developing an innovative approach that can effectively extract a large volume of manufacturing capability instance data, accurately annotate these instance data with semantics, and integrate these data under a formal manufacturing domain ontology. To achieve this objective, a Semantic Resolution Framework is proposed to guide every step of the manufacturing capability data integration process and to resolve semantic heterogeneity with minimal human supervision. The key innovations of this framework include: 1) three assisting systems, a Triple Store Extractor, a Triple Store to Ontology Mapper and an Ontology-based Extensible Dynamic Form, that can efficiently and effectively perform the automatic processes of extracting, annotating and integrating manufacturing capability data; 2) a Semantic Resolution Knowledge Base (SR-KB) that is incrementally filled with, among other things, rules and patterns learned from errors; this SR-KB, together with an Upper Manufacturing Domain Ontology (UMO), provides knowledge for resolving semantic differences in the integration process; and 3) an evolution mechanism that enables the SR-KB to continuously improve itself and gradually reduce human involvement by learning from mistakes.
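To give a flavor of what ontology-based MSC annotation involves, here is a minimal, hypothetical sketch using Python's rdflib; the msc namespace and its terms are invented for illustration and are not the framework's actual UMO.

```python
# Illustrative only: representing a supplier's manufacturing service
# capability as RDF triples under a shared ontology. The "msc"
# namespace and its terms are invented, not the talk's actual UMO.
from rdflib import Graph, Literal, Namespace, RDF

MSC = Namespace("http://example.org/msc#")  # hypothetical ontology
g = Graph()
g.bind("msc", MSC)

supplier = MSC.AcmeMachining
g.add((supplier, RDF.type, MSC.Supplier))
g.add((supplier, MSC.providesProcess, MSC.CNCMilling))
g.add((supplier, MSC.maxPartLengthMM, Literal(1200)))
print(g.serialize(format="turtle"))
```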
Committee: Yun Peng (chair), Charles Nicholas, Tim Finin, Yaacov Yesha, Boonserm Kulvatunyou (NIST)
“President Obama signed an Executive Order directing historic steps to make government-held data more accessible to the public and to entrepreneurs and others as fuel for innovation and economic growth. Under the terms of the Executive Order and a new Open Data Policy released today by the Office of Science and Technology Policy and the Office of Management and Budget, all newly generated government data will be required to be made available in open, machine-readable formats, greatly enhancing their accessibility and usefulness, while ensuring privacy and security.”
While the policy doesn’t mention adding semantic markup to enhance machine understanding, calling for machine-readable datasets with persistent identifiers is a big step forward.
The UMBC WebBase corpus is a dataset of high quality English paragraphs containing over three billion words derived from the Stanford WebBase project’s February 2007 Web crawl. Compressed, its size is about 13GB. We have found it useful for building statistical language models that characterize English text found on the Web.
The February 2007 Stanford WebBase crawl is one of their largest collections and contains 100 million web pages from more than 50,000 websites. The Stanford WebBase project did an excellent job of extracting textual content from HTML tags, but there are still many instances of duplicated text, truncated text, non-English text and strange characters.
We processed the collection to remove undesired sections and produce high-quality English paragraphs. We detected paragraphs using heuristic rules and retained only those whose length was at least two hundred characters. We eliminated non-English text by checking the first twenty words of a paragraph to see if they were valid English words. We used the percentage of punctuation characters in a paragraph as a simple check for typical text. We removed duplicate paragraphs using a hash table. The result is a corpus with approximately three billion words of good-quality English.
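Here is a simplified sketch of those filtering heuristics in Python; the thresholds and the tiny word list are stand-ins for the values and lexicon actually used.

```python
# Sketch of the paragraph-filtering heuristics described above.
# The word list and thresholds are illustrative stand-ins.
import string

# Stand-in lexicon; a real run would load a large English word list.
ENGLISH = {"the", "a", "of", "and", "to", "in", "is", "was", "for", "that"}
seen = set()  # hashes of paragraphs already kept

def keep_paragraph(p, min_len=200, min_english=0.8, max_punct=0.10,
                   check_words=20):
    if len(p) < min_len:
        return False                      # too short to be a paragraph
    words = [w.strip(string.punctuation)
             for w in p.lower().split()[:check_words]]
    if words and sum(w in ENGLISH for w in words) < min_english * len(words):
        return False                      # first words mostly not English
    if sum(c in string.punctuation for c in p) > max_punct * len(p):
        return False                      # atypical, punctuation-heavy text
    h = hash(p)
    if h in seen:
        return False                      # duplicate paragraph
    seen.add(h)
    return True
```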
The corpus is available as a 13GB compressed tar file, which expands to about 48GB when uncompressed. It contains 408 files with paragraphs extracted from web pages, one paragraph per line with blank lines between them. A second set of 408 files has the same paragraphs, but with the words tagged with their part of speech (e.g., The_DT Option_NN draws_VBZ on_IN modules_NNS from_IN all_PDT the_DT).
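Reading the tagged files is straightforward, since each token is a word_TAG pair; for example:

```python
# Each token in the tagged files is a word_TAG pair; rpartition
# splits on the last underscore so words containing "_" still parse.
def parse_tagged_line(line):
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

print(parse_tagged_line("The_DT Option_NN draws_VBZ on_IN modules_NNS"))
# [('The', 'DT'), ('Option', 'NN'), ('draws', 'VBZ'), ...]
```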
The dataset has been used in several projects. If you use the dataset, please refer to it by citing the following paper, which describes it and its use in a system that measures the semantic similarity of short text sequences.
Users can enter a question like Which of my friends who went to school at the University of Illinois live in California?, which is translated into a query over Facebook’s Open Graph. That data structure is an RDF-like graph of millions of entities and objects of various types that are connected by thousands of types of relations. This is a very interesting application of current human language technology to a highly visible and useful task!
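As a rough illustration (Facebook's Open Graph is not actually RDF and its query language is not SPARQL), here is how that example question might look as a structured query over a toy graph, using rdflib:

```python
# Illustrative only: the example question expressed as a SPARQL
# query over a tiny RDF graph of friends, schools and locations.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
for name, school, state in [("Ann", "UIUC", "California"),
                            ("Bob", "UIUC", "Texas"),
                            ("Eve", "UMBC", "California")]:
    friend = EX[name]
    g.add((EX.me, EX.friend, friend))
    g.add((friend, EX.school, Literal(school)))
    g.add((friend, EX.livesIn, Literal(state)))

q = """
SELECT ?f WHERE {
  <http://example.org/me> <http://example.org/friend> ?f .
  ?f <http://example.org/school> "UIUC" .
  ?f <http://example.org/livesIn> "California" .
}
"""
for row in g.query(q):
    print(row.f)   # only Ann matches
```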
“The prediction that search would become increasingly semantic and graph-based has certainly proven to be more than true. Not only have the search engines since adopted schema.org as a standard along with microdata as a syntax (Facebook RDFa and Open Graph are examples), but things are now elevated to the next level in this process of adoption.”
The schema.org vocabulary has been a big success and is being used by many popular content providers, but I’m less sure that Microdata is winning out over RDFa. I’ve seen reports that there is more data on the Web encoded in RDFa than Microdata.
It seems like an easy choice to use RDFa Lite over Microdata, since it’s just as simple and easy to use and lets you later add more features from full RDFa. The biggest RDFa feature is, of course, the ability to include statements from multiple vocabularies.
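For example, here is a small RDFa Lite snippet that mixes schema.org with a FOAF property, together with the triples it encodes, asserted by hand with rdflib rather than assuming an RDFa parser is installed (the person and URL are made up):

```python
# RDFa Lite markup mixing schema.org with a second vocabulary (FOAF),
# the kind of combination Microdata cannot express in a single item.
rdfa_lite = """
<div vocab="http://schema.org/" typeof="Person"
     prefix="foaf: http://xmlns.com/foaf/0.1/">
  <span property="name">Jane Smith</span>
  <a property="foaf:homepage" href="http://example.org/jane">home</a>
</div>
"""

# The triples the markup above encodes, asserted manually:
from rdflib import BNode, Graph, Literal, Namespace, RDF, URIRef

SCHEMA = Namespace("http://schema.org/")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")
g = Graph()
person = BNode()
g.add((person, RDF.type, SCHEMA.Person))
g.add((person, SCHEMA.name, Literal("Jane Smith")))
g.add((person, FOAF.homepage, URIRef("http://example.org/jane")))
print(g.serialize(format="nt"))
```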
In the spirit of eating our own dog food, I hope to work on upgrading the ebiquity web site and blog to make fuller use of RDFa this summer.
There has recently been a spike in the number of compromised Twitter accounts, which has increased concerns about the trustworthiness of information broadcast on Twitter and other social networks. Just yesterday, the Associated Press Twitter account (@AP) was hacked and used to send out a false Twitter post about explosions at the White House. Last weekend saw Twitter accounts of CBS News (@60minutes & @48hours) compromised. Corporate accounts belonging to Burger King and Jeep were also hacked in February this year.
We are working on techniques to predict whether a given account is “fake” (falsely appears to represent a person or organization) or has been compromised and is being used to spread malicious content. Our approach analyzes the account’s metadata, properties, network structure and the content of its posts. We also use both content and network analysis to identify the “real” account handle when multiple accounts appear or claim to represent the same person or organization on Twitter.
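Our approach isn't published yet, so the sketch below only illustrates the general flavor of metadata-based features such a system might compute; the fields, weights and thresholds are all invented.

```python
# Illustrative only: a few metadata features one might use when
# scoring whether an account looks fake. The fields, weights and
# thresholds are invented; a real system would also use network
# structure and tweet content.
def suspicion_score(account):
    score = 0.0
    if not account.get("verified", False):
        score += 1.0
    if account.get("age_days", 0) < 30:           # very new account
        score += 2.0
    followers = account.get("followers", 0)
    following = account.get("following", 1)
    if followers < 0.01 * following:              # follows many, followed by few
        score += 1.5
    if account.get("default_profile_image", False):
        score += 1.0
    return score

acct = {"verified": False, "age_days": 12,
        "followers": 40, "following": 5000,
        "default_profile_image": True}
print(suspicion_score(acct))   # higher means more suspicious
```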
We recently analyzed a case where both @DeltaAssist and @flydeltassist appeared to represent Delta Airlines. In February 2013, @flydeltassist, which turned out not to be associated with Delta, began tweeting an offer of free tickets if users “followed” it. Eventually, the account was banned as a fake handle by Twitter. Our approach was able to answer the question “Which one of them belongs to the real Delta Airlines?” by analyzing the tweets and social networks of these handles.
We are still in the process of writing up our research and evaluation results and hope to be able to post more about it soon.
Heather McIlvaine from enterprise software giant SAP blogs about open data: “How are mobile apps, Big Data, and civic hacking changing the nature of open data in government? The Center for Technology in Government took a look at this topic and presents its findings”. See The Future Of Open Data.
If you are interested in the semantic web and linked data, data.ac.uk looks like a site worth investigating.
“This is a landmark site for academia providing a single point of contact for linked open data development. It not only provides access to the know-how and tools to discuss and create linked data and data aggregation sites, but also enables access to, and the creation of, large aggregated data sets providing powerful and flexible collections of information. Here at Data.ac.uk we’re working to inform national standards and assist in the development of national data aggregation subdomains.”
The results of the 2013 Semantic Textual Similarity task (STS) are out. We were happy to find that our system did very well on the core task, placing first out of the 35 participating teams. The three runs we submitted were ranked first, second and third in the overall summary score.
Congratulations are in order for Lushan Han and Abhay Kashyap, the two UMBC doctoral students whose research and hard work produced a very effective system.
The STS task
The STS core task is to take two sentences and return a score between 0 and 5 representing how similar they are, with a larger number meaning higher similarity. Compared with word similarity, sentence similarity is more difficult to define, and different people may have different views of it.
The STS task provides a reasonable and interesting definition. More importantly, the Pearson correlation scores for human raters using Amazon Mechanical Turk on the 2012 STS gold standard datasets are about 0.90, almost the same as the inter-rater agreement level, 0.9026, on the well-known Miller-Charles word similarity dataset. This shows that human raters largely agree on the definitions used in the scale:
5: The sentences are completely equivalent, as they mean the same thing, e.g., “The bird is bathing in the sink” and “Birdie is washing itself in the water basin”.
4: The sentences are mostly equivalent, but some unimportant details differ, e.g., “In May 2010, the troops attempted to invade Kabul” and “The US army invaded Kabul on May 7th last year, 2010”.
3: The sentences are roughly equivalent, but some important information differs or is missing, e.g., “John said he is considered a witness but not a suspect.” and “‘He is not a suspect anymore.’ John said.”
2: The sentences are not equivalent, but share some details, e.g., “They flew out of the nest in groups” and “They flew into the nest together”.
1: The sentences are not equivalent, but are on the same topic, e.g., “The woman is playing the violin” and “The young lady enjoys listening to the guitar”.
0: The sentences are on different topics, e.g., “John went horse back riding at dawn with a whole group of friends” and “Sunrise at dawn is a magnificent view to take in if you wake up early enough for it”.
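Both the inter-rater agreement figures above and the systems themselves are scored with the Pearson correlation between predicted and gold-standard ratings; here is a minimal example of the computation, with made-up scores:

```python
# Evaluating STS output: Pearson correlation between system scores
# and gold-standard human ratings (the numbers here are made up).
from scipy.stats import pearsonr

gold   = [5.0, 3.2, 1.0, 4.4, 0.0, 2.5]
system = [4.6, 3.0, 1.8, 4.9, 0.4, 2.2]
r, _ = pearsonr(gold, system)
print("Pearson r = %.4f" % r)
```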
The STS datasets
There were 86 runs submitted from more than 35 teams. Each team could submit up to three runs over sentence pairs drawn from the following four datasets.
Headlines (750 pairs): a collection of pairs of headlines mined from several news sources by European Media Monitor using the RSS feed, e.g., “Syrian rebels move command from Turkey to Syria” and “Free Syrian Army moves headquarters from Turkey to Syria”.
SMT (750 pairs): a collection of sentence pairs from the DARPA GALE program, where one sentence is the output of a machine translation system and the other is a reference translation provided by a human, for example, “The statement, which appeared on a website used by Islamists, said that Al-Qaeda fighters in Islamic Maghreb had attacked three army centers in the town of Yakouren in Tizi-Ouzo” and the sentence “the pronouncement released that the mujaheddin of al qaeda in islamic maghreb countries attacked 3 stations of the apostates in city of aekorn in tizi ouzou , which was posted upon the web page used by islamists”.
OnWN (561 pairs): a collection of sentence pairs describing word senses, one from OntoNotes and another from WordNet, e.g., “the act of advocating or promoting something” and “the act of choosing or selecting”.
FNWN (189 pairs): a collection of pairs of sentences describing word senses, one from FrameNet and another from WordNet, for example: “there exist a number of different possible events that may happen in the future. in most cases, there is an agent involved who has to consider which of the possible events will or should occur. a salient_entity which is deeply involved in the event may also be mentioned” and “doing as one pleases or chooses;”.
Our three systems
We used a different system for each of our three allowed runs: PairingWords, Galactus and Saiyan. While they shared much of the same infrastructure, each used a different mix of ideas and features.
PairingWords was built using hybrid word similarity features derived from LSA and WordNet. It used a simple algorithm to pair words and phrases in the two sentences and computed the average word similarity of the resulting pairs, imposing penalties on unmatched words weighted by their part of speech and log frequency. No training data was used. An online demonstration system is available for experimenting with the underlying word similarity model used by this approach.
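Here is a rough sketch of the pairing idea, not the actual implementation: word_sim() stands in for the hybrid LSA/WordNet model, and a fixed penalty stands in for the PoS- and frequency-weighted penalties.

```python
# A rough sketch of the word-pairing idea behind PairingWords.
# word_sim() is a placeholder for the hybrid LSA/WordNet similarity
# model, and the fixed penalty stands in for the PoS- and log-
# frequency-weighted penalties applied to unmatched words.
def word_sim(w1, w2):
    return 1.0 if w1 == w2 else 0.0   # placeholder similarity

def sentence_sim(s1, s2, penalty=0.2, threshold=0.1):
    words1, words2 = s1.lower().split(), s2.lower().split()
    remaining = list(words2)
    scores = []
    for w1 in words1:
        best = max(remaining, key=lambda w2: word_sim(w1, w2), default=None)
        if best is not None and word_sim(w1, best) > threshold:
            scores.append(word_sim(w1, best))
            remaining.remove(best)
        else:
            scores.append(-penalty)             # unmatched word penalty
    scores.extend(-penalty for _ in remaining)  # leftover words in s2
    return sum(scores) / len(scores) if scores else 0.0

print(sentence_sim("the bird is bathing in the sink",
                   "birdie is washing itself in the water basin"))
```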
Galactus used unigrams, bigrams, trigrams and skip bigrams derived from the two sentences and paired them by highest similarity based on exact string match and corpus- and WordNet-based similarity metrics. These, along with contrast scores derived from antonym pairs, were used as features to train a support vector regression model to predict the similarity scores.
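A minimal sketch of the regression step with scikit-learn; extract_features() is a toy stand-in for the n-gram, WordNet and contrast features just described.

```python
# Minimal sketch of the regression step: train a support vector
# regression model on similarity features. extract_features() is a
# toy stand-in for the real n-gram, WordNet and contrast features.
from sklearn.svm import SVR

def extract_features(s1, s2):
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    overlap = len(w1 & w2) / max(len(w1 | w2), 1)    # unigram Jaccard
    len_ratio = min(len(w1), len(w2)) / max(len(w1), len(w2), 1)
    return [overlap, len_ratio]

train_pairs = [("the cat sat", "the cat sat", 5.0),
               ("the cat sat", "a dog ran off", 0.5)]
X = [extract_features(a, b) for a, b, _ in train_pairs]
y = [score for _, _, score in train_pairs]

model = SVR().fit(X, y)
print(model.predict([extract_features("the cat sat", "the cat sat down")]))
```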
Saiyan was a fine-tuned version of Galactus that used domain-specific features and training data to train a support vector regression model to predict the similarity scores. (Its scores for FNWN were taken directly from the PairingWords run.)
Here’s how our three runs ranked (out of 86) on each of the four different data sets and on the overall task (mean).
A post on Microsoft’s Bing blog, Understand Your World with Bing, announced that an update to their Satori knowledge base allows Bing to do a better job of identifying queries that mention a known entity, i.e., a person, place or organization. Bing’s use of Satori parallels the efforts of Google and Facebook in developing graph-based knowledge bases to move from “strings” to “things”.
Microsoft is using data from Satori to provide “snapshots” with data about an entity when it detects a likely mention of it in a query. This is very similar to how Google is using its Knowledge Graph KB.
One interesting thing that Satori is now doing is importing data from LinkedIn — data that neither Google’s Knowledge Graph nor Facebook’s Open Graph has. Another difference is that Satori uses RDF as its native model, or at least appears to, based on this description from 2012.
We felt a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were about to be silenced. We fear something terrible is about to happen. We went to access the blogs we follow on Google Reader and found this.
Powering Down Google Reader
3/13/2013 04:06:00 PM
Posted by Alan Green, Software Engineer
We have just announced on the Official Google Blog that we will soon retire Google Reader (the actual date is July 1, 2013). We know Reader has a devoted following who will be very sad to see it go. We’re sad too.
There are two simple reasons for this: usage of Google Reader has declined, and as a company we’re pouring all of our energy into fewer products. We think that kind of focus will make for a better user experience.
To ensure a smooth transition, we’re providing a three-month sunset period so you have sufficient time to find an alternative feed-reading solution. If you want to retain your Reader data, including subscriptions, you can do so through Google Takeout.
Thank you again for using Reader as your RSS platform.
Labels: reader, sunset
Where is old Bloglines now that we need him again? We should not have been so disloyal.