UMBC ebiquity
NLP

Archive for the 'NLP' Category

Google Top Charts uses the Knowledge Graph for entity recognition and disambiguation

May 23rd, 2013, by Tim Finin, posted in AI, Google, KR, NLP, OWL, Semantic Web

Top Charts is a new feature for Google Trends that identifies the popular searches within a category, e.g., books or actors. What’s interesting about it, from a technology standpoint, is that it uses Google’s Knowledge Graph to provide a universe of things and the categories into which they belong. This is a great example of “Things, not strings”, Google’s clever slogan to explain the importance of the Knowledge Graph.

Here’s how it’s explained in the Trends Top Charts FAQ.

“Top Charts relies on technology from the Knowledge Graph to identify when search queries seem to be about particular real-world people, places and things. The Knowledge Graph enables our technology to connect searches with real-world entities and their attributes. For example, if you search for ice ice baby, you’re probably searching for information about the musician Vanilla Ice or his music. Whereas if you search for vanilla ice cream recipe, you’re probably looking for information about the tasty dessert. Top Charts builds on work we’ve done so our systems do a better job finding you the information you’re actually looking for, whether tasty desserts or musicians.”

One thing to note about the Knowledge Graph, which is said to have more than 18 billion facts about 570 million objects, is that its objects include more than the traditional named entities (e.g., people, places, things). For example, there is a top chart for Animals that shows that dogs are the most popular animal in Google searches, followed by cats (no surprises here), with chickens at number three on the list (could their high rank be due to recipe searches?). The dog object, in most knowledge representation schemes, would be modeled as a concept or class as opposed to an object or instance. In some representation systems, the same term (e.g., dog) can be used to refer both to a class of instances (a class that includes Lassie) and to an instance (e.g., an instance of the class animal types). Which sense of the term dog is meant (class vs. instance) is determined by the context. In the semantic web representation language OWL 2, the ability to use the same term to refer to a class or a related instance is called punning.

Of course, when doing this kind of mapping of terms to objects, we only want to consider concepts that commonly have words or short phrases used to denote them. Not all concepts do, such as animals that from a long way off look like flies.

A second observation is that once you have a nice knowledge base like the Knowledge Graph, you have a new problem: how can you recognize mentions of its instances in text? In the DBpedia knowledge base (derived from Wikipedia) there are nine individuals named Michael Jordan, and two of them were professional basketball players in the NBA. So, when you enter a search query like “When did Michael Jordan play for Penn”, we have to use information in the query, its context and what we know about the possible referents (e.g., those nine Michael Jordans) to decide (1) if this is likely to be a reference to any of the objects in our knowledge base, and (2) if so, to which one. This task, which is a fundamental one in language processing, is not trivial, but luckily, in applications like Top Charts, we don’t have to do it with perfect accuracy.
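The two-step decision described above can be sketched in a few lines of Python. This is only an illustration and not how Google or DBpedia do it: the candidate table, the context words, the prior weights and the threshold are all invented, and the example reuses the Vanilla Ice queries from the FAQ.

```python
# Toy knowledge base. The ids, priors and context words are invented
# for illustration; a real system would derive them from its data.
CANDIDATES = {
    "vanilla ice": [
        {"id": "Vanilla Ice (musician)", "prior": 0.6,
         "context": {"ice", "baby", "rapper", "song", "music", "album"}},
        {"id": "vanilla ice cream (dessert)", "prior": 0.4,
         "context": {"cream", "recipe", "dessert", "scoop", "flavor"}},
    ],
}


def link(query, mention, threshold=0.5):
    """Return the id of the best referent for `mention`, or None."""
    words = set(query.lower().split())
    best_id, best_score = None, 0.0
    for cand in CANDIDATES.get(mention, []):
        # Evidence = popularity prior scaled by how many query words
        # also appear among the candidate's context words.
        score = cand["prior"] * len(words & cand["context"])
        if score > best_score:
            best_id, best_score = cand["id"], score
    # Step (1): is this a reference to anything in the knowledge base?
    # Step (2): if so, return the highest-scoring candidate.
    return best_id if best_score >= threshold else None
```

Calling link("vanilla ice cream recipe", "vanilla ice") picks the dessert, while a query with no supporting context words falls below the threshold and returns None, which corresponds to answering "no" at step (1).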

Google’s Top Charts is a simple, but effective, example that demonstrates the potential usefulness of semantic technology to make our information systems better in the near future.

UMBC WebBase corpus of 3B English words

May 1st, 2013, by Tim Finin, posted in Machine Learning, NLP, Semantic Web

The UMBC WebBase corpus is a dataset of high quality English paragraphs containing over three billion words derived from the Stanford WebBase project’s February 2007 Web crawl. Compressed, its size is about 13GB. We have found it useful for building statistical language models that characterize English text found on the Web.

The February 2007 Stanford WebBase crawl is one of their largest collections and contains 100 million web pages from more than 50,000 websites. The Stanford WebBase project did an excellent job in extracting textual content from HTML tags but there are still many instances of text duplications, truncated texts, non-English texts and strange characters.

We processed the collection to remove undesired sections and produce high quality English paragraphs. We detected paragraphs using heuristic rules and only retained those whose length was at least two hundred characters. We eliminated non-English text by checking the first twenty words of a paragraph to see if they were valid English words. We used the percentage of punctuation characters in a paragraph as a simple check for typical text. We removed duplicate paragraphs using a hash table. The result is a corpus with approximately three billion words of good quality English.
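The filtering steps above could look roughly like the following Python sketch. It is not the code we ran: the 200-character minimum and the twenty-word check come from the description, but the 80% valid-word ratio, the 15% punctuation cutoff and the function names are assumptions made for the example.

```python
import hashlib
import string


def is_english(paragraph, vocab, n_words=20, min_ratio=0.8):
    """Check whether most of the first n_words are in the vocabulary.

    min_ratio is an assumed cutoff, not a value from the original work.
    """
    words = [w.strip(string.punctuation).lower()
             for w in paragraph.split()[:n_words]]
    words = [w for w in words if w]
    if not words:
        return False
    return sum(w in vocab for w in words) / len(words) >= min_ratio


def punctuation_ratio(paragraph):
    """Fraction of characters that are ASCII punctuation."""
    if not paragraph:
        return 0.0
    return sum(c in string.punctuation for c in paragraph) / len(paragraph)


def clean(paragraphs, vocab, min_chars=200, max_punct=0.15):
    """Yield paragraphs that pass the length, language, punctuation
    and duplicate checks, in their original order."""
    seen = set()  # hash table of paragraph digests already emitted
    for p in paragraphs:
        p = p.strip()
        if len(p) < min_chars:
            continue
        if not is_english(p, vocab):
            continue
        if punctuation_ratio(p) > max_punct:  # assumed cutoff
            continue
        digest = hashlib.sha1(p.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield p
```

Storing digests rather than the paragraphs themselves keeps the duplicate table small, which matters at the scale of a hundred million pages.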

The corpus is available as a 13GB compressed tar file which is about 48GB when uncompressed. It contains 408 files with paragraphs extracted from web pages, one to a line with blank lines between them. A second set of 408 files has the same paragraphs, but with the words tagged with their part of speech (e.g., The_DT Option_NN draws_VBZ on_IN modules_NNS from_IN all_PDT the_DT).
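Reading the tagged files takes very little code. Here is a minimal Python sketch, assuming only the word_TAG token format shown in the example above; the function names are made up for illustration.

```python
from collections import Counter


def parse_tagged(line):
    """Split a line of word_TAG tokens into (word, tag) pairs.

    The split is on the last underscore, so a word that itself
    contains an underscore keeps it. Tokens without a tag are skipped.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("_")
        if sep:
            pairs.append((word, tag))
    return pairs


def tag_counts(lines):
    """Count how often each part-of-speech tag occurs."""
    counts = Counter()
    for line in lines:
        counts.update(tag for _, tag in parse_tagged(line))
    return counts
```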

The dataset has been used in several projects. If you use the dataset, please refer to it by citing the following paper, which describes it and its use in a system that measures the semantic similarity of short text sequences.

Download the corpus from here.

The NLP behind Facebook’s graph search

April 29th, 2013, by Tim Finin, posted in NLP, Semantic Web, Social media

Facebook engineers Xiao Li and Maxime Boucher describe the language processing techniques used to implement Facebook’s graph search in a recent post on the Facebook Engineering page (alternative for non-facebook-users via VentureBeat).

Users can enter a question like Which of my friends who went to school at the University of Illinois live in California? which is translated into a query over Facebook’s Open Graph. That data structure is an RDF-like graph of millions of entities and objects of various types that are connected by thousands of types of relations. This is a very interesting application of current human language technology to a highly visible and useful task!

Results of the 2013 Semantic Textual Similarity task

March 25th, 2013, by Tim Finin, posted in Machine Learning, NLP, Semantic Web

The results of the 2013 Semantic Textual Similarity task (STS) are out. We were happy to find that our system did very well on the core task, placing first out of the 35 participating teams. The three runs we submitted were ranked first, second and third in the overall summary score.

Congratulations are in order for Lushan Han and Abhay Kashyap, the two UMBC doctoral students whose research and hard work produced a very effective system.

The STS task

The STS core task is to take two sentences and to return a score between 0 and 5 representing how similar the sentences are, with a larger number meaning a higher similarity. Compared with word similarity, the definition of sentence similarity tends to be more difficult and different people may have different views.

The STS task provides a reasonable and interesting definition. More importantly, the Pearson correlation scores are about 0.90 [1] for human raters using Amazon Mechanical Turk on the 2012 STS gold standard datasets, almost the same as the inter-rater agreement level, 0.9026 [2], on the well-known Miller-Charles word similarity dataset. This shows that human raters largely agree on the definitions used in the scale.

  • 5: The sentences are completely equivalent, as they mean the same thing, e.g., “The bird is bathing in the sink” and “Birdie is washing itself in the water basin”.
  • 4: The sentences are mostly equivalent, but some unimportant details differ, e.g., “In May 2010, the troops attempted to invade Kabul” and “The US army invaded Kabul on May 7th last year, 2010”.
  • 3: The sentences are roughly equivalent, but some important information differs/missing, e.g., “John said he is considered a witness but not a suspect.” and “‘He is not a suspect anymore.’ John said.”
  • 2: The sentences are not equivalent, but share some details, e.g., “They flew out of the nest in groups” and “They flew into the nest together”.
  • 1: The sentences are not equivalent, but are on the same topic, e.g., “The woman is playing the violin” and “The young lady enjoys listening to the guitar”.
  • 0: The sentences are on different topics, e.g., “John went horse back riding at dawn with a whole group of friends” and “Sunrise at dawn is a magnificent view to take in if you wake up early enough for it”.
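The agreement figures quoted above are Pearson correlations between two lists of scores on this 0 to 5 scale. For readers who want to compute one themselves, here is a small Python sketch; the function name is ours and the code is the textbook formula, not the official evaluation script.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means the two sets of scores rise and fall together, even if one rater is consistently more generous than the other.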

The STS datasets

There were 86 runs submitted from more than 35 teams. Each team could submit up to three runs over sentence pairs drawn from four datasets, which included the following.

  • Headlines (750 pairs): a collection of pairs of headlines mined from several news sources by European Media Monitor using the RSS feed, e.g., “Syrian rebels move command from Turkey to Syria” and “Free Syrian Army moves headquarters from Turkey to Syria”.
  • SMT (750 pairs): a collection of sentence pairs from the DARPA GALE program, where one sentence is the output of a machine translation system and the other is a reference translation provided by a human, for example, “The statement, which appeared on a website used by Islamists, said that Al-Qaeda fighters in Islamic Maghreb had attacked three army centers in the town of Yakouren in Tizi-Ouzo” and the sentence “the pronouncement released that the mujaheddin of al qaeda in islamic maghreb countries attacked 3 stations of the apostates in city of aekorn in tizi ouzou , which was posted upon the web page used by islamists”.
  • OnWN (561 pairs): a collection of sentence pairs describing word senses, one from OntoNotes and another from WordNet, e.g., “the act of advocating or promoting something” and “the act of choosing or selecting”.
  • FNWN (189 pairs): a collection of pairs of sentences describing word senses, one from FrameNet and another from WordNet, for example: “there exist a number of different possible events that may happen in the future. in most cases, there is an agent involved who has to consider which of the possible events will or should occur. a salient_entity which is deeply involved in the event may also be mentioned” and “doing as one pleases or chooses;”.

Our three systems

We used a different system for each of our allowed runs, PairingWords, Galactus and Saiyan. While they shared a lot of the same infrastructure, each used a different mix of ideas and features.

  • PairingWords was built using hybrid word similarity features derived from LSA and WordNet. It used a simple algorithm to pair words/phrases in two sentences and compute the average word similarity of the resulting pairs. It imposed penalties on unmatched words, weighted by their PoS and log frequency. No training data was used. An online demonstration system is available to experiment with the underlying word similarity model used by this approach.
  • Galactus used unigrams, bigrams, trigrams and skip bigrams derived from the two sentences and paired them with the highest similarity based on exact string match and on corpus-based and WordNet-based similarity metrics. These, along with contrast scores derived from antonym pairs, were used as features to train a support vector regression model to predict the similarity scores.
  • Saiyan was a fine-tuned version of Galactus which used domain-specific features and training data to train a support vector regression model to predict the similarity scores. (Scores for FNWN were taken directly from the PairingWords run.)
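To give a feel for the pairing idea behind the first run, here is a heavily simplified Python sketch. It is not our system: the word similarity table is a hand-made stand-in for the LSA and WordNet model, the stopword list is tiny, and a flat penalty replaces the PoS and log-frequency weighting. All names and numbers in it are invented for the example.

```python
import string

# Hand-made stand-in for a learned word similarity model.
TOY_SIM = {
    frozenset(("bird", "birdie")): 0.9,
    frozenset(("bathing", "washing")): 0.8,
    frozenset(("sink", "basin")): 0.7,
}
STOPWORDS = {"the", "a", "an", "is", "in", "of", "itself"}


def tokenize(sentence):
    """Lowercase, strip punctuation and drop stopwords."""
    words = (w.strip(string.punctuation).lower() for w in sentence.split())
    return [w for w in words if w and w not in STOPWORDS]


def word_sim(a, b):
    """Similarity in [0, 1]; identical words score 1."""
    return 1.0 if a == b else TOY_SIM.get(frozenset((a, b)), 0.0)


def directed(src, tgt, min_sim=0.3, penalty=0.2):
    """Average best-match similarity of src words against tgt words,
    subtracting a flat penalty for each word left unmatched."""
    if not src:
        return 0.0
    total = 0.0
    for w in src:
        best = max((word_sim(w, t) for t in tgt), default=0.0)
        total += best if best >= min_sim else -penalty
    return max(0.0, total / len(src))


def sentence_similarity(s1, s2):
    """Symmetric score on the 0 to 5 STS scale."""
    t1, t2 = tokenize(s1), tokenize(s2)
    return 5.0 * (directed(t1, t2) + directed(t2, t1)) / 2
```

Scoring in both directions and averaging keeps the measure symmetric, so a short sentence contained in a long one does not get full marks.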

The results

Here’s how our three runs ranked (out of 86) on each of the four different data sets and on the overall task (mean).

Rank of each of our three systems (out of 86 runs)

dataset        PairingWords   Galactus   Saiyan
Headlines      3              7          1
OnWN glosses   4              11         35
FNWN glosses   1              3          2
SMT            8              11         16
mean           1              2          3

Over the next two weeks we will write a short system paper for *SEM 2013, the Second Joint Conference on Lexical and Computational Semantics.

 

[1] Eneko Agirre, Daniel Cer, Mona Diab and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proc. 6th Int. Workshop on Semantic Evaluation (SemEval 2012), in conjunction with the First Joint Conf. on Lexical and Computational Semantics (*SEM 2012), Montreal, Canada.

[2] P. Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proc. 14th Int. Joint Conf. on Artificial Intelligence.

Google Wikilinks corpus

March 8th, 2013, by Tim Finin, posted in Google, Machine Learning, NLP, Semantic Web

Google released the Wikilinks Corpus, a collection of 40M disambiguated mentions from 10M web pages to 3M Wikipedia pages. This data can be used to train systems that do entity linking and cross-document co-reference, problems that Google researchers attacked with an earlier version of this data (see Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models).

You can download the data as ten 175MB files, and some additional tools are available from UMASS.

This is yet another example of the important role that Wikipedia continues to play in building a common, machine useable semantic substrate for human conceptualizations.

Computing word and phrase similarity

January 10th, 2013, by Tim Finin, posted in AI, Machine Learning, NLP, Semantic Web

Computing semantic similarity between words and phrases has important applications in natural language processing, information retrieval, and artificial intelligence. There are two prevailing approaches to computing word similarity, based on either a thesaurus (e.g., WordNet) or statistics from a large corpus. We provide a hybrid approach combining the two methods that is demonstrated on a web site through two services: one that returns a similarity score for two words or phrases and another that takes a word and shows a ranked list of the most similar words.

Our statistical method is based on distributional similarity and Latent Semantic Analysis. We further complement it with semantic relations extracted from WordNet. The whole process is automatic and can be trained using different corpora. We assume the semantics of a phrase is compositional on its component words and apply an algorithm to compute similarity between two phrases using word similarity.
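The Python sketch below illustrates the general shape of a hybrid measure like this, under strong simplifications. Raw co-occurrence counts stand in for the LSA-reduced vectors, a plain set of synonym pairs stands in for the WordNet relations, and the boost formula is an assumption made for the example rather than the one we use.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window`
    positions of it (a simple distributional representation)."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_sim(a, b, vectors, synonyms, boost=0.5):
    """Corpus-based similarity, raised when a thesaurus relates the words.

    `synonyms` is a set of frozenset pairs; `boost` is an assumed weight.
    """
    if a == b:
        return 1.0
    sim = cosine(vectors.get(a, Counter()), vectors.get(b, Counter()))
    if frozenset((a, b)) in synonyms:
        sim += boost * (1.0 - sim)  # move part of the way toward 1
    return sim
```

The thesaurus adjustment helps most for word pairs that are related in meaning but too rare in the corpus to have reliable context statistics.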

The algorithms, implementation and data for this work were developed by Lushan Han as part of his research on developing easier ways to query linked open data collections. It was supported by grants from AFOSR (FA9550-08-1-0265), NSF (IIS-1250627) and a gift from Microsoft. Contact umbcsim at cs.umbc.edu for more information.

Google releases dataset linking strings and concepts

May 19th, 2012, by Tim Finin, posted in AI, Google, KR, NLP, Ontologies, Semantic Web, Wikipedia

Yesterday Google announced a very interesting resource with 175M short, unique text strings that were used to refer to one of 7.6M Wikipedia articles. This should be very useful for research on information extraction from text.

“We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia’s groupings of articles into hierarchical categories.

The data set contains triples, each consisting of (i) text, a short, raw natural language string; (ii) url, a related concept, represented by an English Wikipedia article’s canonical location; and (iii) count, an integer indicating the number of times text has been observed connected with the concept’s url. Our database thus includes weights that measure degrees of association.”
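One obvious use of the (text, url, count) triples is to estimate how likely each concept is for a given string. Here is a minimal Python sketch; it assumes the triples have already been converted to one tab-separated line each, which is a layout chosen for the example and not necessarily that of the released files.

```python
from collections import defaultdict


def load_triples(lines):
    """Build {text: {url: count}} from 'text<TAB>url<TAB>count' lines.

    Lines that do not have exactly three fields are skipped.
    """
    by_text = defaultdict(dict)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        text, url, count = parts
        by_text[text][url] = by_text[text].get(url, 0) + int(count)
    return by_text


def concept_probabilities(by_text, text):
    """Relative frequency of each concept url for the given string."""
    counts = by_text.get(text, {})
    total = sum(counts.values())
    return {url: c / total for url, c in counts.items()} if total else {}
```

Normalizing the counts this way turns the weights into a simple popularity prior over the candidate concepts for a string.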

The details of the data and how it was constructed are in an LREC 2012 paper by Valentin Spitkovsky and Angel Chang, A Cross-Lingual Dictionary for English Wikipedia Concepts. Get the data here.

Google Knowledge Graph: first impressions

May 19th, 2012, by Tim Finin, posted in AI, Google, KR, NLP, Ontologies, Semantic Web

Google’s Knowledge Graph showed up for me this morning; it’s been slowly rolling out since the announcement on Wednesday. It builds on lots of research from human language technology (e.g., entity recognition and linking) and the semantic web (graphs of linked data). The slogan, “things not strings”, is brilliant and easily understood.

My first impression is that it’s fast, useful and a great accomplishment but leaves lots of room for improvement and expansion. That last bit is a good thing, at least for those of us in the R&D community. Here are some comments based on some initial experimentation.

GKG only works on searches that are simple entity mentions like people, places, organizations. It doesn’t do products (Toyota Camry), events (World War II), or diseases (diabetes) but does recognize that ‘Mercury’ could be a planet or an element.

It’s a bit aggressive about linking: when searching for “John Smith” it zeros in on the 17th century English explorer. Poor Professor Michael Jordan never gets a chance, and providing context by adding Berkeley just suppresses the GKG sidebar. “Mitt” goes right to you know who. “George Bush” does lead to a disambiguation sidebar, though. Given that GKG doesn’t seem to allow for context information, the only disambiguating evidence it has is popularity (i.e., PageRank).

Speaking of context, the GKG results seem not to draw on user-specific information, like my location or past search history. When I search for “Columbia” from my location here in Maryland, it suggests “Columbia University” and “Columbia, South Carolina” and not “Columbia, Maryland” which is just five miles away from me.

Places include not just GPEs (geo-political entities) but also locations (Mars, Patapsco River) and facilities (MOMA, Empire State Building). To the GKG, the White House is just a place.

Organizations seem like a weak spot. It recognizes schools (UCLA) but company mentions seem not to be directly handled, not even for “Google”. A search for “NBA” suggests three “people associated with NBA” and “National Basketball Association” is not recognized. Forget finding out about the Cult of the Dead Cow.

Mike Bergman has some insights based on his exploration of the GKG in Deconstructing the Google Knowledge Graph.

The use of structured and semi-structured knowledge in search is an exciting area. I expect we will see much more of this showing up in search engines, including Bing.

True Knowledge launches Evi question answering mobile app

January 29th, 2012, by Tim Finin, posted in Agents, AI, Mobile Computing, NLP, Semantic Web

UK semantic technology company True Knowledge has released Evi, a mobile app that competes with Siri.

The mobile app is available on the Android Market and on iTunes. You can pose queries to either by speaking or typing. The Android app uses Google’s ASR speech technology and the iTunes app uses Nuance.

True Knowledge has been developing a natural language question answering system since 2007. You can query True Knowledge online via a Web interface. Try the following links for some examples:

The Evi app has a number of additional features beyond the Web-based True Knowledge QA system and these will probably be expanded on in the months to come.

See the Technology Review story, New Virtual Helper Challenges Siri, for more information.

Ten years of words from ebiquity papers

September 16th, 2011, by Tim Finin, posted in Ebiquity, NLP, Semantic Web

Here’s a word cloud that visualizes the 200 most significant words extracted from over 400 papers from our research group over the past ten years. Significance was estimated by tf-idf where the idf data is from a collection of newswire articles (thanks Paul!). The word cloud was created with Wordle.
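For anyone who wants to try something similar, here is a rough Python sketch of ranking words by tf-idf when the idf comes from a separate background collection. It is not the script used for the cloud: whitespace tokenization, add-one smoothing of the document frequencies and the function names are all assumptions made for the example.

```python
import math
from collections import Counter


def document_frequencies(background_docs):
    """Number of background documents in which each word occurs."""
    df = Counter()
    for doc in background_docs:
        df.update(set(doc.lower().split()))
    return df


def top_words(paper_texts, background_docs, k=200):
    """Rank words in paper_texts by tf * idf, with idf taken from
    background_docs, and return the k highest-scoring words."""
    n = len(background_docs)
    df = document_frequencies(background_docs)
    tf = Counter(w for text in paper_texts for w in text.lower().split())
    scores = {
        # Add-one smoothing keeps idf finite for words that never
        # appear in the background collection.
        w: count * math.log((1 + n) / (1 + df.get(w, 0)))
        for w, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Taking idf from newswire rather than from the papers themselves means that words common in our papers but rare in general text score highly.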

Mid-Atlantic student colloquium on speech, language and learning

September 2nd, 2011, by Tim Finin, posted in AI, Conferences, KR, Machine Learning, NLP

The First Mid-Atlantic Student Colloquium on Speech, Language and Learning is a one-day event to be held at the Johns Hopkins University in Baltimore on Friday, 23 September 2011. Its goal is to bring together students taking computational approaches to speech, language, and learning, so that they can introduce their research to the local student community, give and receive feedback, and engage each other in collaborative discussion. Attendance is open to all and free but space is limited, so online registration is requested by September 16. The program runs from 10:00am to 5:00pm and will include oral presentations, poster sessions, and breakout sessions.

Mid-Atlantic Student Colloquium on Speech, Language and Learning, 23 Sept 2011

July 13th, 2011, by Tim Finin, posted in AI, Machine Learning, NLP, Semantic Web

The Mid-Atlantic Student Colloquium on Speech, Language and Learning is a one day, free event bringing together faculty, researchers and students from universities in the Mid-Atlantic area working in Speech/Language/ML. The colloquium is an opportunity for students to present preliminary or completed work and to network with other students, faculty and researchers working in related fields. The event will be held in Baltimore MD at the Johns Hopkins University on Friday 23 September 2011.

Students are encouraged to submit one-page abstracts by Monday, August 15 describing ongoing, planned, or completed research projects, including previously published results and negative results. Student research in any field applying computational methods to any aspect of human language, including speech and learning, from all areas of computer science, linguistics, engineering, neuroscience, information science, and related fields, is welcome. Submissions and presentations must be made by students or postdocs. See the call for papers for more information.

Accepted submissions will be presented as posters and each will also be given a one-minute presentation during a poster spotlight session. A small number of submissions will be selected to be presented as talks, on the basis of diversity and general interest.

Student-led breakout sessions of one hour will also be held to discuss papers on topics of interest and stimulate interaction and discussion. Topics and suggested papers for breakout sessions should be submitted by students alongside abstracts.

The event is sponsored by the Human Language Technology Center of Excellence and the Center for Language and Speech Processing at the Johns Hopkins University.
