UMBC ebiquity

Archive for the 'NLP' Category

talk: Topic Modeling for Analyzing Document Collection, 11am Mon 5/16

May 12th, 2016, by Tim Finin, posted in Datamining, High performance computing, Machine Learning, NLP


Topic Modeling for Analyzing Document Collection

Mitsunori Ogihara
Computer Science, University of Miami

11:00am Monday, 16 May 2016, ITE 325b, UMBC

Topic modeling (in particular, Latent Dirichlet Allocation) is a technique for analyzing a large collection of documents. In topic modeling we view each document as a frequency vector over a vocabulary and each topic as a static distribution over the vocabulary. Given a desired number, K, of document classes, a topic modeling algorithm attempts to estimate concurrently the K static distributions and, for each document, how much each of the K classes contributes. Mathematically, this is the problem of approximating the matrix generated by stacking the frequency vectors by the product of two non-negative matrices, where both the column dimension of the first matrix and the row dimension of the second matrix are equal to K. Topic modeling has recently been gaining popularity for analyzing large collections of documents.
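The matrix view described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses plain non-negative matrix factorization with Lee and Seung's multiplicative updates, which is not LDA itself, and the toy count matrix and the function names (`nmf`, `frobenius_error`) are made up for the example rather than taken from the talk.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius distance between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Approximate V (docs x vocab) by W (docs x k) times H (k x vocab).

    W holds how much each of the k topics contributes to each document;
    H holds the (unnormalized) weight of each vocabulary word in each topic.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


# Toy corpus: 4 documents over a 4-word vocabulary with two obvious "topics".
docs = [[2.0, 2.0, 0.0, 0.0],
        [4.0, 4.0, 0.0, 0.0],
        [0.0, 0.0, 3.0, 3.0],
        [0.0, 0.0, 1.0, 1.0]]
doc_topics, topic_words = nmf(docs, k=2)
print("reconstruction error:", frobenius_error(docs, doc_topics, topic_words))
```

Both factors stay non-negative because the updates only multiply non-negative numbers, which is what lets the rows of the second matrix be read as topics once they are normalized.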

In this talk I will present some examples of applying topic modeling: (1) a small sentiment analysis of a small collection of short patient surveys, (2) exploratory content analysis of a large collection of letters, (3) document classification based upon topics and other linguistic features, and (4) exploratory analysis of a large collection of literary works. I will speak not only about the topic modeling steps themselves but also about all the preprocessing steps for preparing the documents for topic modeling.

Mitsunori Ogihara is a Professor of Computer Science at the University of Miami, Coral Gables, Florida. There he directs the Data Mining Group in the Center for Computational Science, a university-wide organization for providing resources and consultation for large-scale computation. He has published three books and approximately 190 papers in conferences and journals. He is on the editorial board for Theory of Computing Systems and International Journal of Foundations of Computer Science. Ogihara received a Ph.D. in Information Sciences from Tokyo Institute of Technology in 1993 and was a tenure-track/tenured faculty member in the Department of Computer Science at the University of Rochester from 1994 to 2007.

Representing and Reasoning with Temporal Properties/Relations in OWL/RDF

May 1st, 2016, by Tim Finin, posted in KR, NLP, Ontologies, Semantic Web

Representing and Reasoning with Temporal
Properties/Relations in OWL/RDF

Clare Grasso

10:30-11:30 Monday, 2 May 2016, ITE346

OWL ontologies offer the means for modeling real-world domains by representing their high-level concepts, properties and interrelationships. These concepts and their properties are connected by means of binary relations. However, this assumes that the model of the domain is either a set of static objects and relationships that do not change over time, or a snapshot of these objects at a particular point in time. In general, relationships between objects that change over time (dynamic properties) are not binary relations, since they involve a temporal interval in addition to the object and the subject. Representing and querying information evolving in time requires careful consideration of how to use OWL constructs to model dynamic relationships and how the semantics and reasoning capabilities within that architecture are affected.
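To make the point about non-binary relations concrete, here is a minimal Python sketch, not an OWL encoding and not necessarily the approach taken in the talk. It simply attaches a validity interval to each subject/predicate/object statement, similar in spirit to reifying the relation as an n-ary one. The class `TemporalFact`, the helper `holds_at` and the `ex:` identifiers are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class TemporalFact:
    """A relation instance that carries the interval during which it holds."""
    subject: str
    predicate: str
    obj: str
    start: date
    end: Optional[date] = None  # None means the relation still holds


def holds_at(facts: List[TemporalFact], subject: str, predicate: str,
             when: date) -> List[str]:
    """Return the objects related to `subject` by `predicate` on day `when`.

    Intervals are treated as half-open: start <= when < end.
    """
    return [f.obj for f in facts
            if f.subject == subject
            and f.predicate == predicate
            and f.start <= when
            and (f.end is None or when < f.end)]


# Hypothetical data: the employer of ex:alice changes over time, so a single
# static triple (ex:alice, ex:worksFor, ...) could not capture both facts.
facts = [
    TemporalFact("ex:alice", "ex:worksFor", "ex:AcmeCorp",
                 date(2010, 1, 1), date(2014, 6, 1)),
    TemporalFact("ex:alice", "ex:worksFor", "ex:Initech", date(2014, 6, 1)),
]
print(holds_at(facts, "ex:alice", "ex:worksFor", date(2012, 3, 15)))
print(holds_at(facts, "ex:alice", "ex:worksFor", date(2016, 5, 2)))
```

The query has to take a time argument, which is the practical consequence of the relation no longer being binary.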

Image description using deep neural networks

February 27th, 2016, by Tim Finin, posted in AI, Machine Learning, NLP

Image description using deep neural networks

Sunil Gandhi
10:30 am, Monday, February 29, 2016 ITE 346

With the explosion of image data on the internet, there has been a need for automatic generation of image descriptions. In this project we use deep neural networks to extract vectors from images and we use those vectors to generate text that describes the image. The model that we built makes use of the pre-trained VGGNET, a model for image classification, and a recurrent neural network (RNN) for language modelling. The combination of the two neural networks provides a multimodal embedding between image vectors and word vectors. We trained the model on 8000 images from the Flickr8k dataset and we present our results on test images downloaded from the Internet. We provide a web service for image description generation that takes the image URL as input and provides an image description and image categories as output. Through our service, a user can correct the description automatically generated by the system so that we can improve our model using the corrected descriptions.

Sunil Gandhi is a Computer Science Ph.D. student at UMBC who is part of the Cognition, Robotics and Learning (CORAL) lab.

Alexa, get my coffee: Using the Amazon Echo in Research

December 3rd, 2015, by Tim Finin, posted in AI, NLP, Semantic Web

“Alexa, get my coffee”:
Using the Amazon Echo in Research

Megan Zimmerman

10:30am Monday, 7 December 2015, ITE 346

The Amazon Echo is a remarkable example of language-controlled, user-centric technology, but also a great example of how far such devices have to go before they will fulfill the longstanding promise of intelligent assistance. In this talk, we will describe the Interactive Robotics and Language Lab’s work with the Echo, with an emphasis on the practical aspects of getting it set up for development and adding new capabilities. We will demonstrate adding a simple new interaction, and then lead a brainstorming session on future research applications.

Megan Zimmerman is a UMBC undergrad majoring in computer science working on interpreting language about tasks at varying levels of abstraction, with a focus on interpreting abstract statements as possible task instructions in assistive technology.

talk: Introduction to Deep Learning

November 20th, 2015, by Tim Finin, posted in Machine Learning, NLP

Introduction to Deep Learning

Zhiguang Wang and Hang Gao

10:00am Monday, 23 November 2015, ITE 346

Deep learning has been a hot topic and all over the news lately. It was introduced with the ambition of moving Machine Learning closer to Artificial Intelligence, one of its original goals. Since the concept was introduced, various relevant algorithms have been proposed and have achieved significant success in their corresponding areas. This talk aims to provide a brief overview of the most common deep learning algorithms, along with their applications to different tasks.

In this talk, Steve (Zhiguang Wang) will give a brief introduction to applications of deep learning algorithms in computer vision and speech, cover some basic viewpoints on training methods and on attacking the non-convexity of deep neural nets, and touch on some miscellaneous aspects of deep learning.

Hang Gao will then talk about common applications of deep learning algorithms in Natural Language Processing, covering semantic, syntactic and sentiment analysis. He will also discuss the limits of current applications of deep learning algorithms in NLP and offer some ideas on possible future trends.

Knowledge Extraction from Cloud Service Level Agreements

November 1st, 2015, by Tim Finin, posted in cloud computing, NLP, Policy

Sudip Mittal, Karuna Pande Joshi, Claudia Pearce, and Anupam Joshi, Parallelizing Natural Language Techniques for Knowledge Extraction from Cloud Service Level Agreements, IEEE International Conference on Big Data, October, 2015.

To efficiently utilize their cloud based services, consumers have to continuously monitor and manage the Service Level Agreements (SLA) that define the service performance measures. Currently this is still a time and labor intensive process since the SLAs are primarily stored as text documents. We have significantly automated the process of extracting, managing and monitoring cloud SLAs using natural language processing techniques and Semantic Web technologies. In this paper we describe our prototype system that uses a Hadoop cluster to extract knowledge from unstructured legal text documents. For this prototype we have considered publicly available SLA/terms of service documents of various cloud providers. We use established natural language processing techniques in parallel to speed up cloud legal knowledge base creation. Our system considerably speeds up knowledge base creation and can also be used in other domains that have unstructured data.

The KELVIN Information Extraction System

October 30th, 2015, by Tim Finin, posted in NLP, Semantic Web

In this week’s ebiquity lab meeting (10:30am Monday Nov 2), Tim Finin will describe recent work on the Kelvin information extraction system and its performance in two tasks in the 2015 NIST Text Analysis Conference. Kelvin has been under development at the JHU Human Language Technology Center of Excellence for several years. Kelvin reads documents in several languages and extracts entities and the relations between them. This year it was used for the Coldstart Knowledge Base Population and Trilingual Entity Discovery and Linking tasks. Key components in the tasks are a system for cross-document coreference and another that links entities to entries in the Freebase knowledge base.

Lyrics Augmented Multi-modal Music Recommendation

October 29th, 2015, by Tim Finin, posted in Machine Learning, NLP, RDF, Semantic Web

Lyrics Augmented Multi-modal
Music Recommendation

Abhay Kashyap

1:00pm Friday 30 October, ITE 325b

In an increasingly mobile and connected world, digital music consumption has rapidly increased. More recently, faster and cheaper mobile bandwidth has given the average mobile user the potential to access large troves of music through streaming services like Spotify and Google Music that boast catalogs with tens of millions of songs. At this scale, effective music recommendation is critical for music discovery and personalized user experience.

Recommenders that rely on collaborative information suffer from two major problems: the long tail problem, which is induced by popularity bias, and the cold start problem caused by new items with no data. In such cases, they fall back on content to compute similarity. For music, content based features can be divided into acoustic and textual domains. Acoustic features are extracted from the audio signal while textual features come from song metadata, lyrical content, collaborative tags and associated web text.

Research in content based music similarity has largely been focused on the acoustic domain, while text based features have been limited to metadata, tags and shallow methods for web text and lyrics. Song lyrics house information about the sentiment and topic of a song that cannot be easily extracted from the audio. Past work has shown that even shallow lyrical features improved on audio-only features and, in some tasks like mood classification, outperformed them. In addition, lyrics are easily available, which makes them a valuable resource and warrants a deeper analysis.

The goal of this research is to fill the lyrical gap in existing music recommender systems. The first step is to build algorithms to extract and represent the meaning and emotion contained in the song’s lyrics. The next step is to effectively combine lyrical features with acoustic and collaborative information to build a multi-modal recommendation engine.

For this work, the genre is restricted to Rap because it is a lyrics-centric genre and techniques built for Rap can be generalized to other genres. It was also the highest streamed genre in 2014, accounting for 28.5% of all music streamed. Rap lyrics are scraped from dedicated lyrics websites like and while the semantic knowledge base comprising artists, albums and song metadata come from the MusicBrainz project. Acoustic features are directly used from EchoNest while collaborative information like tags, plays, co-plays etc. come from

Preliminary work involved extraction of compositional style features like rhyme patterns and density, vocabulary size, simile and profanity usage from over 10,000 songs by over 150 artists. These features are available for users to browse and explore through interactive visualizations on Song semantics were represented using off-the-shelf neural language based vector models (doc2vec). Future work will involve building novel language models for lyrics and latent representations for attributes that are driven by collaborative information for multi-modal recommendation.

Committee: Drs. Tim Finin (Chair), Anupam Joshi, Pranam Kolari (WalmartLabs), Cynthia Matuszek and Tim Oates

Robust Semantic Text Similarity Using LSA, Machine Learning and Linguistic Resources

October 24th, 2015, by Tim Finin, posted in AI, NLP

Abhay Kashyap, Lushan Han, Roberto Yus, Jennifer Sleeman, Taneeya Satyapanich, Sunil Gandhi and Tim Finin, Robust Semantic Text Similarity Using LSA, Machine Learning and Linguistic Resources, Language Resources and Evaluation, Springer, to appear.

Semantic textual similarity is a measure of the degree of semantic equivalence between two pieces of text. We describe the SemSim system and its performance in the *SEM 2013 and SemEval-2014 tasks on semantic textual similarity. At the core of our system lies a robust distributional word similarity component that combines Latent Semantic Analysis and machine learning augmented with data from several linguistic resources. We used a simple term alignment algorithm to handle longer pieces of text. Additional wrappers and resources were used to handle task specific challenges that include processing Spanish text, comparing text sequences of different lengths, handling informal words and phrases, and matching words with sense definitions. In the *SEM 2013 task on Semantic Textual Similarity, our best performing system ranked first among the 89 submitted runs. In the SemEval-2014 task on Multilingual Semantic Textual Similarity, we ranked a close second in both the English and Spanish subtasks. In the SemEval-2014 task on Cross-Level Semantic Similarity, we ranked first in the Sentence-Phrase, Phrase-Word, and Word-Sense subtasks and second in the Paragraph-Sentence subtask.
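As a rough illustration of how a term alignment step can lift a word similarity measure to whole sentences, here is a small Python sketch. It is not the SemSim implementation: the word similarity used here is a toy character-bigram Dice coefficient standing in for the LSA-based component, and the names `word_sim`, `align_score` and `text_similarity` are invented for the example.

```python
def bigrams(word):
    """Set of character bigrams of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def word_sim(a, b):
    """Toy word similarity in [0, 1]: Dice coefficient over character bigrams.

    A placeholder assumption; a real system would plug in a distributional
    similarity model here.
    """
    if a == b:
        return 1.0
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))


def align_score(src, tgt, sim):
    """Align each source term with its most similar target term and average."""
    if not src or not tgt:
        return 0.0
    return sum(max(sim(s, t) for t in tgt) for s in src) / len(src)


def text_similarity(text1, text2, sim=word_sim):
    """Symmetric similarity: mean of the two directional alignment scores."""
    a = text1.lower().split()
    b = text2.lower().split()
    return (align_score(a, b, sim) + align_score(b, a, sim)) / 2


print(text_similarity("the cat sat", "the cats sat"))
print(text_similarity("the cat sat", "stock prices fell"))
```

Averaging both directions keeps the score symmetric and penalizes a short text that matches only a small part of a longer one.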

Demystifying Word2Vec: A Hands-on Tutorial

October 16th, 2015, by Tim Finin, posted in Big data, Machine Learning, NLP

Demystifying Word2Vec – A Hands-on Tutorial

Abhay Kashyap

10:30am Monday, 19 October 2015 **ITE 456**

In the world of NLP, Word2Vec is one of the coolest kids in town! But what exactly is it and how does it work? More importantly, how is it used/useful?

For the first 10-15 minutes, we will go over distributional and distributed representations of words and the neural language model behind Word2Vec. We will also briefly look at doc2vec, the extension of Word2Vec for longer pieces of text.
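To give a feel for what the model learns from, the sketch below generates the (center word, context word) pairs that the skip-gram variant of Word2Vec trains on. It is a simplified illustration in plain Python, not the tutorial's code: it uses a fixed window and skips the subsampling and randomized window size of real implementations, and the function name `skipgram_pairs` is made up.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "we will get our feet wet".split()
for center, context in skipgram_pairs(sentence, window=1):
    print(center, "->", context)
```

Words that keep appearing with the same context words end up with similar vectors, which is the distributional idea behind the distributed representation.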

For the remainder of the time (45-60 minutes), we will get our feet wet by running Word2Vec on a dataset which will then be followed by discussions about potential ways it can be useful for your own work.

What to bring – Any computing machine with Python installed, lots of curiosity and some delicious snacks for me maybe? We will use the excellent gensim package for python to run Word2Vec along with cython to speed things up. If you aren’t familiar with Python or don’t like it, no worries! It’s really just 5-6 lines of code! The training dataset will be provided. If you wish to bring your own, that’s cool too.

NOTE: We will hold this week’s Ebiquity meeting in ITE 456.

Beyond NER: Towards Semantics in Clinical Text

September 29th, 2015, by Tim Finin, posted in NLP, Ontologies, RDF, Semantic Web

Clare Grasso, Anupam Joshi and Eliot Siegel, Beyond NER: Towards Semantics in Clinical Text, Biomedical Data Mining, Modeling, and Semantic Integration (BDM2I); co-located with the 14th International Semantic Web Conference (ISWC 2015), Bethlehem, PA.

While clinical text NLP systems have become very effective in recognizing named entities in clinical text and mapping them to standardized terminologies in the normalization process, there remains a gap in the ability of extractors to combine entities together into a complete semantic representation of medical concepts that contain multiple attributes each of which has its own set of allowed named entities or values. Furthermore, additional domain knowledge may be required to determine the semantics of particular tokens in the text that take on special meanings in relation to this concept. This research proposes an approach that provides ontological mappings of the surface forms of medical concepts that are of the UMLS semantic class signs/symptoms. The mappings are used to extract and encode the constituent set of named entities into interoperable semantic structures that can be linked to other structured and unstructured data for reuse in research and analysis.

Hot Stuff at ColdStart

June 8th, 2015, by Tim Finin, posted in AI, KR, NLP, Ontologies

Cold Start

Coldstart is a task in the NIST Text Analysis Conference’s Knowledge Base Population suite that combines entity linking and slot filling to populate an empty knowledge base using a predefined ontology for the facts and relations. This paper describes a system developed by the Human Language Technology Center of Excellence at Johns Hopkins University for the 2014 Coldstart task.

Tim Finin, Paul McNamee, Dawn Lawrie, James Mayfield and Craig Harman, Hot Stuff at Cold Start: HLTCOE participation at TAC 2014, 7th Text Analysis Conference, National Institute of Standards and Technology, Nov. 2014.

The JHU HLTCOE participated in the Cold Start task in this year’s Text Analysis Conference Knowledge Base Population evaluation. This is our third year of participation in the task, and we continued our research with the KELVIN system. We submitted experimental variants that explore use of forward-chaining inference, slightly more aggressive entity clustering, refined multiple within-document coreference, and prioritization of relations extracted from news sources.
