A post on Google’s research blog lists the major datasets for NLP and KB processing that Google has released in the past year. They include datasets to help in entity linking, relation extraction, concept spotting and syntactic analysis. Subscribe to the Knowledge Data Releases mailing list for updates.
The UMBC Computer Science and Electrical Engineering department is searching for new full-time faculty: two in Computer Science, one in Electrical and Computer Engineering, one Computer Science professor of the practice, and one Computer Science/Information Systems lecturer. See the CSEE Jobs page for detailed information on the positions, preferred specializations and the application process.
The third Mid-Atlantic Student Colloquium on Speech, Language and Learning will be held at UMBC on Fri. 11 Oct 2013, bringing together students, postdocs, faculty and researchers from universities in the Mid-Atlantic area doing research on speech, language or machine learning. It is an opportunity for students and postdocs to present preliminary, ongoing or completed work and to network with other researchers working in related fields.
The first MASC-SLL was held in 2011 at Johns Hopkins University and the second in 2012 at the University of Maryland, College Park. This year the event will be held at the University of Maryland, Baltimore County (UMBC) in Baltimore, MD from 9:30 to 5:00 on Friday, 11 October 2013. There will be no registration charge and lunch and refreshments will be provided.
Students and postdocs are encouraged to submit abstracts describing ongoing, planned, or completed research projects, including previously published results and negative results. Research in any field applying computational methods to any aspect of human language, including speech and learning, from all areas of computer science, linguistics, engineering, neuroscience, information science, and related fields, is welcome. All accepted submissions will be presented as posters and some will also be invited for short oral presentations. Student-led breakout sessions will also be held to discuss papers or topics of interest and stimulate interaction and discussion. Suggest breakout session topics via easychair.
UMBC's Center for Hybrid Multicore Productivity Research, an NSF Industry & University Cooperative Research Center, is holding its Industry Advisory Board meeting at UMBC 12-14 June. Students from UMBC and UCSD will present tutorials on a number of the technologies underlying ongoing CHMPR projects in a session from 1:00-5:00 on Wednesday June 12 in ITE 456. The tutorial session is free and open to the public.
3-D Printing – Timothy Blattner (UMBC)
Semantic Table Information – Varish Mulwad (UMBC)
Social Media Elastic Search – Oleg Aulov (UMBC)
Machine Learning for Social Media – Han Dong (UMBC)
Virtual World Interactions – Erik Hill (UCSD)
Directions and parking information is available here.
Top Charts is a new feature for Google Trends that identifies the popular searches within a category, e.g., books or actors. What’s interesting about it, from a technology standpoint, is that it uses Google’s Knowledge Graph to provide a universe of things and the categories into which they belong. This is a great example of “Things, not strings”, Google’s clever slogan to explain the importance of the Knowledge Graph.
“Top Charts relies on technology from the Knowledge Graph to identify when search queries seem to be about particular real-world people, places and things. The Knowledge Graph enables our technology to connect searches with real-world entities and their attributes. For example, if you search for ice ice baby, you’re probably searching for information about the musician Vanilla Ice or his music. Whereas if you search for vanilla ice cream recipe, you’re probably looking for information about the tasty dessert. Top Charts builds on work we’ve done so our systems do a better job finding you the information you’re actually looking for, whether tasty desserts or musicians.”
One thing to note about the Knowledge Graph, which is said to have more than 18 billion facts about 570 million objects, is that its objects include more than the traditional named entities (e.g., people, places, things). For example, there is a top chart for Animals showing that dogs are the most popular animal in Google searches, followed by cats (no surprises here), with chickens at number three on the list (could their high rank be due to recipe searches?). The dog object, in most knowledge representation schemes, would be modeled as a concept or class rather than as an object or instance. In some representation systems, the same term (e.g., dog) can be used to refer both to a class of instances (a class that includes Lassie) and to an instance (e.g., an instance of the class of animal types). Which sense of the term dog is meant (class vs. instance) is determined by the context. In the semantic web representation language OWL 2, the ability to use the same term to refer to a class or a related instance is called punning.
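To make punning concrete, here is a minimal sketch using plain string triples (not Google's internal representation, and the terms are illustrative): the same term "Dog" appears as a class in one triple and as an instance in another, and an OWL 2 reasoner treats the two uses as distinct entities that happen to share a name.

```python
# Toy knowledge base as a set of (subject, predicate, object) triples.
# "Dog" is used both as a class (Lassie is a Dog) and as an instance
# (Dog is a member of the class AnimalType) -- that is punning.
triples = {
    ("Lassie", "rdf:type", "Dog"),      # "Dog" in class position
    ("Dog", "rdf:type", "AnimalType"),  # "Dog" in instance position
}

def used_as_class(term, kb):
    """True if some individual is declared to be of type `term`."""
    return any(p == "rdf:type" and o == term for (s, p, o) in kb)

def used_as_instance(term, kb):
    """True if `term` itself appears as the subject of an rdf:type triple."""
    return any(s == term and p == "rdf:type" for (s, p, o) in kb)

print(used_as_class("Dog", triples), used_as_instance("Dog", triples))
```

Both checks return True for "Dog", which is exactly the class/instance ambiguity the context of a query has to resolve.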
A second observation is that once you have a nice knowledge base like the Knowledge Graph, you have a new problem: how can you recognize mentions of its instances in text? In the DBpedia knowledge base (derived from Wikipedia) there are nine individuals named Michael Jordan, and two of them were professional basketball players in the NBA. So, when you enter a search query like “When did Michael Jordan play for Penn”, the system has to use information in the query, its context and what it knows about the possible referents (e.g., those nine Michael Jordans) to decide (1) whether this is likely to be a reference to any of the objects in the knowledge base, and (2) if so, to which one. This task, which is a fundamental one in language processing, is not trivial, but luckily, in applications like Top Charts, we don’t have to do it with perfect accuracy.
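A very simple version of this entity-linking step can be sketched as scoring each candidate referent by the overlap between the query's words and words associated with the candidate in the knowledge base. The candidate names and context words below are made up for illustration, not pulled from DBpedia, and real systems use much richer models.

```python
# Hypothetical candidate entities, each with a bag of associated context words.
candidates = {
    "Michael_Jordan_(basketball)": {"nba", "bulls", "basketball", "chicago"},
    "Michael_Jordan_(footballer)": {"soccer", "goalkeeper", "england"},
    "Michael_I._Jordan":           {"machine", "learning", "berkeley"},
}

def link_entity(query_words, candidates, threshold=1):
    """Pick the candidate with the largest word overlap with the query.
    Return None when no candidate reaches the threshold -- i.e., the
    mention probably refers to nothing in the knowledge base (decision 1
    in the text); otherwise return the best referent (decision 2)."""
    scores = {name: len(query_words & ctx) for name, ctx in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

query = {"when", "did", "michael", "jordan", "play", "basketball", "nba"}
print(link_entity(query, candidates))
```

With this query the basketball player wins on overlap; a query with no matching context words maps to no entity at all.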
Google’s Top Charts is a simple, but effective, example that demonstrates the potential usefulness of semantic technology to make our information systems better in the near future.
Google announced support for embedding and using semantic information in gmail messages in either Microdata or JSON-LD. They currently support several use cases (actions) that use the semantic markup and provide a way to define new actions.
Providing a way to embed data in JSON-LD should open up the system for experimentation with other vocabularies beyond schema.org. Since the approach just leverages the general ability to embed semantic data in HTML it is not restricted to gmail and can be used by any email environment that supports messages whose content is encoded as HTML.
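As a rough illustration of the general pattern, a schema.org object serialized as JSON-LD can be dropped into an HTML email body inside a script tag. The sketch below is an assumption-laden example, not Google's exact action vocabulary: the reservation type, names and dates are invented for illustration.

```python
import json

# Hypothetical schema.org markup for an email confirming an event reservation.
# The specific type and field values here are illustrative assumptions.
action = {
    "@context": "http://schema.org",
    "@type": "EventReservation",
    "reservationStatus": "http://schema.org/Confirmed",
    "underName": {"@type": "Person", "name": "Jane Smith"},
    "reservationFor": {
        "@type": "Event",
        "name": "MASC-SLL 2013",
        "startDate": "2013-10-11T09:30",
    },
}

# Embed the JSON-LD in an ordinary HTML message body.
html_message = (
    "<html><body>"
    '<script type="application/ld+json">'
    + json.dumps(action)
    + "</script>"
    "<p>See you at the colloquium!</p>"
    "</body></html>"
)
print(html_message)
```

Because the data rides along inside standard HTML, any mail client that renders HTML could in principle extract and act on it, which is the point made above.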
We hope that this will lead to many exciting ideas, as developers experiment with applications that use the mechanism to embed and understand concepts, entities and facts in email messages.
A Semantic Resolution Framework for
Manufacturing Capability Data Integration
10:30am Tuesday, May 14, 2013, ITE 346, UMBC
Building flexible manufacturing supply chains requires interoperable and accurate manufacturing service capability (MSC) information of all supply chain participants. Today, MSC information, which is typically published either on the supplier’s web site or registered at an e-marketplace portal, has been shown to fall short of the interoperability and accuracy requirements. This issue can be addressed by annotating the MSC information using shared ontologies. However, ontology-based approaches face two main challenges: 1) the lack of an effective way to transform a large amount of complex MSC information hidden in the web sites of manufacturers into a representation with shared semantics and 2) difficulties in the adoption of ontology-based approaches by supply chain managers and users because of their unfamiliarity with the syntax and semantics of formal ontology languages such as OWL and RDF and the lack of tools friendly to inexperienced users.
The objective of our research is to address the main challenges of ontology-based approaches by developing an innovative approach that can effectively extract a large volume of manufacturing capability instance data, accurately annotate these instance data with semantics and integrate these data under a formal manufacturing domain ontology. To achieve this objective, a Semantic Resolution Framework is proposed to guide every step of the manufacturing capability data integration process and to resolve semantic heterogeneity with minimal human supervision. The key innovations of this framework include 1) three assisting systems, a Triple Store Extractor, a Triple Store to Ontology Mapper and an Ontology-based Extensible Dynamic Form, that can efficiently and effectively perform the automatic processes of extracting, annotating and integrating manufacturing capability data; 2) a Semantic Resolution Knowledge Base (SR-KB) that is incrementally filled with, among other things, rules and patterns learned from errors. This SR-KB, together with an Upper Manufacturing Domain Ontology (UMO), provides knowledge for resolving semantic differences in the integration process; 3) an evolution mechanism that enables the SR-KB to continuously improve itself and gradually reduce human involvement by learning from mistakes.
Committee: Yun Peng (chair), Charles Nicholas, Tim Finin, Yaacov Yesha, Boonserm Kulvatunyou (NIST)
“President Obama signed an Executive Order directing historic steps to make government-held data more accessible to the public and to entrepreneurs and others as fuel for innovation and economic growth. Under the terms of the Executive Order and a new Open Data Policy released today by the Office of Science and Technology Policy and the Office of Management and Budget, all newly generated government data will be required to be made available in open, machine-readable formats, greatly enhancing their accessibility and usefulness, while ensuring privacy and security.”
While the policy doesn’t mention adding semantic markup to enhance machine understanding, calling for machine-readable datasets with persistent identifiers is a big step forward.
The UMBC WebBase corpus is a dataset of high quality English paragraphs containing over three billion words derived from the Stanford WebBase project’s February 2007 Web crawl. Compressed, its size is about 13GB. We have found it useful for building statistical language models that characterize English text found on the Web.
The February 2007 Stanford WebBase crawl is one of their largest collections and contains 100 million web pages from more than 50,000 websites. The Stanford WebBase project did an excellent job in extracting textual content from HTML tags but there are still many instances of text duplications, truncated texts, non-English texts and strange characters.
We processed the collection to remove undesired sections and produce high quality English paragraphs. We detected paragraphs using heuristic rules and only retained those whose length was at least two hundred characters. We eliminated non-English text by checking the first twenty words of a paragraph to see if they were valid English words. We used the percentage of punctuation characters in a paragraph as a simple check for typical text. We removed duplicate paragraphs using a hash table. The result is a corpus with approximately three billion words of good quality English.
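The filtering pipeline just described can be sketched in a few lines. This is a simplified illustration, not our actual code: the length threshold (200 characters) and the first-20-words English check follow the text, but the tiny word list and the punctuation ratio cutoff stand in for the real dictionary and tuned thresholds.

```python
import string

# Stand-in for a real English dictionary (illustrative only).
ENGLISH_WORDS = {"the", "a", "of", "and", "to", "in", "is", "was", "dogs",
                 "are", "most", "popular", "pets", "united", "states"}

def is_english(paragraph, n=20, min_ratio=0.5):
    """Check whether most of the first n words are valid English words."""
    words = [w.strip(string.punctuation).lower() for w in paragraph.split()[:n]]
    known = sum(1 for w in words if w in ENGLISH_WORDS)
    return bool(words) and known / len(words) >= min_ratio

def punctuation_ok(paragraph, max_ratio=0.2):
    """Reject paragraphs with an atypically high share of punctuation."""
    punct = sum(1 for c in paragraph if c in string.punctuation)
    return punct / max(len(paragraph), 1) <= max_ratio

def clean(paragraphs):
    """Apply the length, language, punctuation and duplicate filters."""
    seen, out = set(), []
    for p in paragraphs:
        if len(p) < 200:                          # too short
            continue
        if not is_english(p) or not punctuation_ok(p):
            continue
        h = hash(p)                               # duplicate removal via hash table
        if h in seen:
            continue
        seen.add(h)
        out.append(p)
    return out
```

Running `clean` over a list of raw paragraphs keeps one copy of each sufficiently long, English-looking paragraph, mirroring the four filters in the order described above.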
The corpus is available as a 13G compressed tar file which is about 48G when uncompressed. It contains 408 files with paragraphs extracted from web pages, one to a line with blank lines between them. A second set of 408 files contains the same paragraphs, but with each word tagged with its part of speech (e.g., The_DT Option_NN draws_VBZ on_IN modules_NNS from_IN all_PDT the_DT).
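The tagged files pair each token with its part-of-speech tag via an underscore, so splitting them back into (word, tag) pairs is straightforward; a small helper (a sketch, not part of the corpus distribution) might look like:

```python
def parse_tagged(line):
    """Split 'word_TAG' tokens into (word, tag) pairs.
    rsplit on the last underscore handles tokens whose word part
    itself contains an underscore."""
    return [tuple(tok.rsplit("_", 1)) for tok in line.split()]

print(parse_tagged("The_DT Option_NN draws_VBZ on_IN modules_NNS"))
```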
The dataset has been used in several projects. If you use the dataset, please refer to it by citing the following paper, which describes it and its use in a system that measures the semantic similarity of short text sequences.
Users can enter a question like Which of my friends who went to school at the University of Illinois live in California?, which is translated into a query over Facebook’s Open Graph. That data structure is an RDF-like graph of millions of entities and objects of various types connected by thousands of types of relations. This is a very interesting application of current human language technology to a highly visible and useful task!
“The prediction that search would become increasingly semantic and graph-based has certainly proven to be more than true. Not only have the search engines since adopted schema.org as a standard along with microdata as a syntax (Facebook RDFa and Open Graph are examples), but things are now elevated to the next level in this process of adoption.”
The schema.org vocabulary has been a big success and is being used by many popular content providers, but I’m less sure that Microdata is winning out over RDFa. I’ve seen reports that there is more data on the Web encoded in RDFa than Microdata.
It seems like an easy choice to use RDFa Lite over Microdata, since it’s just as simple and easy to use and lets you later add more features from full RDFa. The biggest RDFa feature is, of course, the ability to include statements from multiple vocabularies.
In the spirit of eating our own dog food, I hope to work on upgrading the ebiquity web site and blog to make fuller use of RDFa this summer.
There has recently been a spike in the number of compromised Twitter accounts, which has increased concerns about the trustworthiness of information broadcast on Twitter and other social networks. Just yesterday, the Associated Press Twitter account (@AP) was hacked and used to send out a false Twitter post about explosions at the White House. Last weekend saw Twitter accounts of CBS News (@60minutes & @48hours) compromised. Corporate accounts belonging to Burger King and Jeep were also hacked in February this year.
We are working on techniques to predict that a given account is “fake” (falsely appears to represent a person or organization) or has been compromised and is being used to spread malicious content. Our approach analyzes the account’s metadata, properties, network structure and the content of its posts. We also use both content and network analysis to identify the “real” account handle when multiple accounts appear or claim to represent the same person or organization on Twitter.
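To give a flavor of the metadata side of such an analysis (this is an illustrative sketch, not our actual feature set or classifier, and the field names are hypothetical):

```python
def account_features(account):
    """Derive a few simple signals from hypothetical account metadata.
    Fake or compromised accounts often have skewed follower ratios,
    young account ages or missing profile details."""
    return {
        "has_profile_image": account.get("profile_image", False),
        "follower_following_ratio":
            account.get("followers", 0) / max(account.get("following", 1), 1),
        "account_age_days": account.get("age_days", 0),
        "verified": account.get("verified", False),
    }

print(account_features({"followers": 10, "following": 1000,
                        "age_days": 3, "verified": False}))
```

Features like these would feed a classifier alongside network-structure and post-content signals.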
We recently analyzed a case where both @DeltaAssist and @flydeltassist appeared to represent Delta Airlines. In February 2013, @flydeltassist, which turned out not to be associated with Delta, began tweeting an offer of free tickets if users “followed” them. Eventually, the account was banned as a fake handle by Twitter. Our approach was able to answer the question “Which one of them belongs to the real Delta Airlines?” by analyzing the tweets and social network of these handles.
We are still in the process of writing up our research and evaluation results and hope to be able to post more about it soon.