November 17th, 2017
Discovering Scientific Influence using Cross-Domain Dynamic Topic Modeling
We describe an approach using dynamic topic modeling to model influence and predict future trends in a scientific discipline. Our study focuses on climate change and uses assessment reports of the Intergovernmental Panel on Climate Change (IPCC) and the papers they cite. Since 1990, an IPCC report has been published every five years, each comprising four separate volumes with many chapters. Each report cites tens of thousands of research papers, which comprise a correlated dataset of temporally grounded documents. We use a custom dynamic topic modeling algorithm to generate topics for both datasets and apply cross-domain analytics to identify the correlations between the IPCC chapters and their cited documents. The approach reveals both the influence of the cited research on the reports and how previous research citations have evolved over time. For the IPCC use case, the report topic model used 410 documents and a vocabulary of 5911 terms, while the citations topic model was based on 200K research papers and a vocabulary of more than 25K terms. We show that our approach can predict the importance of its extracted topics on future IPCC assessments through the use of cross-domain correlations, Jensen-Shannon divergences and cluster analytics.
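As a rough illustration of the Jensen-Shannon divergence mentioned in the abstract, the following sketch compares topic-word distributions from two domains and pairs each report topic with its least-divergent citation topic. This is not the authors' implementation: the function names, the toy four-term vocabulary and the probabilities are all assumptions made for the example.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions over the same vocabulary."""
    # The midpoint distribution is nonzero wherever p or q is nonzero,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def closest_topics(report_topics, citation_topics):
    """For each report topic, return the index of the least-divergent citation topic."""
    return [
        min(range(len(citation_topics)),
            key=lambda j: js_divergence(r, citation_topics[j]))
        for r in report_topics
    ]

# Toy topic-word distributions over a four-term vocabulary (made-up numbers).
report_topics = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
citation_topics = [
    [0.1, 0.1, 0.2, 0.6],
    [0.6, 0.2, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
]
matches = closest_topics(report_topics, citation_topics)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (distributions with no shared support), which makes values comparable across topic pairs.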
September 5th, 2017
Cognitive Assistance for Automating the Analysis of the Federal Acquisition Regulations System
Government regulations are critical to understanding how to do business with a government entity and receive other benefits. However, government regulations are also notoriously long and organized in ways that can be confusing for novice users. Developing cognitive assistance tools that remove some of the burden from human users is of potential benefit to a variety of users. The volume of data found in United States federal government regulation suggests a multiple-step approach: process the data into machine readable text, create an automated legal knowledge base capturing various facts and rules, and eventually build a legal question and answer system to acquire understanding from various regulations and provisions. The work discussed in this paper represents our initial efforts to build a framework for the Federal Acquisition Regulations System (Title 48, Code of Federal Regulations) in order to create an efficient legal knowledge base representing relationships between various legal elements, semantically similar terminologies, deontic expressions and cross-referenced legal facts and rules.
July 16th, 2017
Deep Representation of Lyrical Style and Semantics for Music Recommendation
Abhay L. Kashyap
11:00-1:00 Thursday, 20 July 2017, ITE 346
In the age of music streaming, effective recommendations are important for music discovery and a personalized user experience. Collaborative filtering based recommenders suffer from popularity bias and cold-start problems, which are commonly mitigated by content features. For music, research on content-based methods has focused mainly on the acoustic domain, while lyrical content has received little attention. Lyrics contain information about a song’s topic and sentiment that cannot be easily extracted from the audio. This is especially important for lyrics-centric genres like Rap, which was the most streamed genre in 2016. The goal of this dissertation is to explore and evaluate different lyrical content features that could be useful for content, context and emotion based models for music recommendation systems.
With Rap as the primary use case, this dissertation focuses on featurizing two main aspects of lyrics: their artistic style of composition and their semantic content. For lyrical style, a suite of high-level rhyme density features is extracted in addition to literary features like the use of figurative language, profanity and vocabulary strength. In contrast to these engineered features, Convolutional Neural Networks (CNN) are used to automatically learn rhyme patterns and other relevant features. For semantics, lyrics are represented using both traditional IR techniques and the more recent neural embedding methods.
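To make the idea of a traditional IR representation of lyrics concrete, here is a minimal sketch that uses TF-IDF weighting and cosine similarity. The abstract does not say which weighting scheme the dissertation used, so TF-IDF is an assumption, and the tokenized snippets and function names are invented for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf-idf weight} dictionary."""
    n = len(docs)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up, already-tokenized lyric snippets used only for illustration.
lyrics = [
    "money cars money fame".split(),
    "cars fame money streets".split(),
    "love heart love tears".split(),
]
vectors = tfidf_vectors(lyrics)
sim_01 = cosine(vectors[0], vectors[1])  # snippets that share vocabulary
sim_02 = cosine(vectors[0], vectors[2])  # snippets with no shared terms
```

Sparse vectors like these only capture exact word overlap, which is one reason to compare them against neural embeddings that can relate different words with similar meaning.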
These lyrical features are evaluated for artist identification and compared with artist and song similarity measures from a real-world collaborative filtering based recommendation system from Last.fm. It is shown that both rhyme and literary features serve as strong indicators to characterize artists, with feature learning methods like CNNs achieving comparable results. For artist and song similarity, a strong relationship was observed between these features and the way users consume music, while neural embedding methods significantly outperformed LSA. Finally, this work is accompanied by a web application, Rapalytics.com, that is dedicated to visualizing all these lyrical features and has been featured in a number of media outlets, most notably Vox, attn: and Metro.
Committee: Drs. Tim Finin (chair), Anupam Joshi, Tim Oates, Cynthia Matuszek and Pranam Kolari (Walmart Labs)
June 27th, 2017
Ph.D. Dissertation Defense
Dynamic Data Assimilation for Topic Modeling
9:00am Thursday, 29 June 2017, ITE 325b, UMBC
Understanding how a particular discipline such as climate science evolves over time has received renewed interest. By understanding this evolution, predicting the future direction of that discipline becomes more achievable. Dynamic Topic Modeling (DTM) has been applied to a number of disciplines to model topic evolution as a means to learn how a particular scientific discipline and its underlying concepts are changing. Understanding how a discipline evolves, and its internal and external influences, can be complicated by how the information retrieved over time is integrated. Different techniques are used to integrate sources of information; however, less research has been dedicated to understanding how to integrate these sources over time. The method of data assimilation is commonly used in a number of scientific disciplines to both understand and make predictions of various phenomena, using numerical models and assimilated observational data over time.
In this dissertation, I introduce a novel algorithm for scientific data assimilation, called Dynamic Data Assimilation for Topic Modeling (DDATM), which uses a new cross-domain divergence method (CDDM) and DTM. By using DDATM, observational data in the form of full-text research papers can be assimilated over time starting from an initial model. DDATM can be used as a way to integrate data from multiple sources and, due to its robustness, can exploit the assimilating observational information to better tolerate missing model information. When compared with a DTM model, the assimilated model is shown to have better performance using standard topic modeling measures, including perplexity and topic coherence. The DDATM method is suitable for prediction and results in higher likelihood for subsequent documents. DDATM is able to overcome missing information during the assimilation process when compared with a DTM model. CDDM generalizes as a method that can also bring together multiple disciplines into one cohesive model enabling the identification of related concepts and documents across disciplines and time periods. Finally, grounding the topic modeling process with an ontology improves the quality of the topics and enables a more granular understanding of concept relatedness and cross-domain influence.
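Perplexity, one of the standard measures named above, can be sketched as follows. The function takes the probability a model assigns to each word of held-out text; the probabilities for the two models below are made-up values and do not represent DTM or DDATM results.

```python
import math

def perplexity(word_probs):
    """Perplexity of held-out text given the model probability of each observed word."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    # Perplexity is the exponential of the negative mean log-likelihood per word.
    log_likelihood = sum(math.log(p) for p in word_probs)
    return math.exp(-log_likelihood / len(word_probs))

# Made-up per-word probabilities from two hypothetical models on the same held-out text.
model_a_probs = [0.01, 0.02, 0.005, 0.01]
model_b_probs = [0.02, 0.04, 0.01, 0.02]

# Lower perplexity means the model assigns higher likelihood to the held-out text.
model_b_is_better = perplexity(model_b_probs) < perplexity(model_a_probs)
```

A model that assigns every word probability 1/k has perplexity k, so the value can be read as the effective number of equally likely word choices the model faces.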
The results of this dissertation are demonstrated and evaluated by applying DDATM to 30 years of reports from the Intergovernmental Panel on Climate Change (IPCC) along with more than 150,000 documents that they cite to show the evolution of the physical basis of climate change.
Committee Members: Drs. Tim Finin (co-advisor), Milton Halem (co-advisor), Anupam Joshi, Tim Oates, Cynthia Matuszek, Mark Cane, Rafael Alonso
June 16th, 2017
UMBC Data Science Graduate Programs
UMBC’s Data Science Master’s program prepares students from a wide range of disciplinary backgrounds for careers in data science. In the core courses, students will gain a thorough understanding of data science through classes that highlight machine learning, data analysis, data management, ethical and legal considerations, and more.
Students will develop an in-depth understanding of the basic computing principles behind data science, including, but not limited to, data ingestion, curation and cleaning and the 4Vs of data science: Volume, Variety, Velocity and Veracity, as well as the implicit fifth V, Value. By applying principles of data science to the analysis of problems within specific domains expressed through the program pathways, students will gain practical, real-world, industry-relevant experience.
The MPS in Data Science is an industry-recognized credential and the program prepares students with the technical and management skills that they need to succeed in the workplace.
For more information and to apply online, see the Data Science MPS site.
June 15th, 2017
The topic of this month’s Data Science MD meetup is Getting Started with NLP, Sentiment Analysis and OpenNLP. The meeting will be 6:30-9:00pm, Monday, June 19 in Building 200 Room E100 at the JHU Applied Physics Laboratory. The meeting starts with networking and food and features talks by two practitioners.
Brian Sacash (Deloitte & Touche): NLP and Sentiment Analysis
Natural Language Processing, the analysis of language, can be challenging if you don’t know where to start. Brian will walk through the Natural Language Tool Kit (NLTK), a Python library built for language analysis, and cover its core functionality. Through live coding he will demonstrate how to build a simple sentiment analysis engine from scratch.
Daniel Russ (NIH): It Takes a Village To Solve A Problem in Data Science
The talk will discuss a scientific case study in data science: computer-based occupational coding of free text job histories taken during epidemiological research studies. It begins with a rationale for occupational coding, then covers how the coding is performed and how SOCcer is built on top of Apache OpenNLP. Throughout the talk, I will try to emphasize the importance of working as an interdisciplinary team.
See the meetup announcement to RSVP and get directions and more information.
May 15th, 2017
Ph.D. Dissertation Proposal
Modeling and Extracting information about Cybersecurity Events from Text
Tuesday, 16 May 2017, ITE 325, UMBC
People rely on the Internet to carry out many of their daily activities such as banking, ordering food and socializing with their family and friends. The technology facilitates our lives, but also comes with many problems, including cybercrimes, stolen data and identity theft. With the large and increasing number of transactions done every day, the frequency of cybercrime events is also increasing. Since the number of security-related events is too high for manual review and monitoring, we need to train machines to be able to detect and gather data about potential cybersecurity threats. To support machines that can identify and understand threats, we need standard models to store the cybersecurity information and information extraction systems that can collect information to populate the models with data from text.
This dissertation will make two major contributions. The first is to extend our current cybersecurity ontologies with better models for relevant events, from atomic events like a login attempt, to an extended but related series of events that make up a campaign, to generalized events, such as an increase in denial-of-service attacks originating from a particular region of the world targeted at U.S. financial institutions. The second is the design and implementation of an event extraction system that can extract information about cybersecurity events from text and populate a knowledge graph using our cybersecurity event ontology. We will extend our previous work on event extraction that detected human activity events from news and discussion forums. A new set of features and learning algorithms will be introduced to improve the performance and adapt the system to the cybersecurity domain. We believe that this dissertation will be useful for cybersecurity management in the future. It will quickly extract cybersecurity events from text and fill in the event ontology.
Committee: Drs. Tim Finin (chair), Anupam Joshi, Tim Oates and Karuna Joshi
May 15th, 2017
Jennifer Sleeman, Milton Halem, Tim Finin, and Mark Cane, Modeling the Evolution of Climate Change Assessment Research Using Dynamic Topic Models and Cross-Domain Divergence Maps, AAAI Spring Symposium on AI for Social Good, AAAI Press, March, 2017.
Climate change is an important social issue and the subject of much research, both to understand the history of the Earth’s changing climate and to foresee what changes to expect in the future. Approximately every five years starting in 1990 the Intergovernmental Panel on Climate Change (IPCC) publishes a set of reports that cover the current state of climate change research, how this research will impact the world, risks, and approaches to mitigate the effects of climate change. Each report supports its findings with hundreds of thousands of citations to scientific journals and reviews by governmental policy makers. Analyzing trends in the cited documents over the past 30 years provides insights into both an evolving scientific field and the climate change phenomenon itself. Presented in this paper are results of dynamic topic modeling to model the evolution of these climate change reports and their supporting research citations over a 30 year time period. Using this technique shows how the research influences the assessment reports and how trends based on these influences can affect future assessment reports. This is done by calculating cross-domain divergences between the citation domain and the assessment report domain and by clustering documents between domains. This approach could be applied to other social problems with similar structure such as disaster recovery.
May 13th, 2017
Sudip Mittal, Aditi Gupta, Karuna Pande Joshi, Claudia Pearce and Anupam Joshi, A Question and Answering System for Management of Cloud Service Level Agreements, IEEE International Conference on Cloud Computing, June 2017.
One of the key challenges faced by consumers is to efficiently manage and monitor the quality of cloud services. To manage service performance, consumers have to validate rules embedded in cloud legal contracts, such as Service Level Agreements (SLA) and Privacy Policies, that are available as text documents. Currently this analysis requires significant time and manual labor and is thus inefficient. We propose a cognitive assistant that can be used to manage cloud legal documents by automatically extracting knowledge (terms, rules, constraints) from them and reasoning over it to validate service performance. In this paper, we present this Question and Answering (Q&A) system that can be used to analyze and obtain information from the SLA documents. We have created a knowledgebase of Cloud SLAs from various providers which forms the underlying repository of our Q&A system. We utilized techniques from natural language processing and semantic web (RDF, SPARQL and Fuseki server) to build our framework. We also present sample queries on how a consumer can compute metrics such as service credit.
March 17th, 2017
The Semantics Toolkit
Paul Cuddihy and Justin McHugh
GE Global Research Center, Niskayuna, NY
10:00-11:00 Tuesday, 4 April 2017, ITE 346, UMBC
Paul Cuddihy is a senior computer scientist and software systems architect in AI and Learning Systems at the GE Global Research Center in Niskayuna, NY. He earned an M.S. in Computer Science from Rochester Institute of Technology. The focus of his twenty-year career at GE Research has ranged from machine learning for medical imaging equipment diagnostics, monitoring and diagnostic techniques for commercial aircraft engines, and modeling techniques for monitoring seniors living independently in their own homes, to parallel execution of simulation and prediction tasks and big data ontologies. He is one of the creators of the open source software “Semantics Toolkit” (SemTK), which provides a simplified interface to the semantic tech stack, opening its use to a broader set of users by providing features such as drag-and-drop query generation and data ingestion. Paul holds over twenty U.S. patents.
Justin McHugh is a computer scientist and software systems architect working in the AI and Learning Systems group at GE Global Research in Niskayuna, NY. Justin attended the State University of New York at Albany, where he earned an M.S. in computer science. He worked as a systems architect and programmer for large scale reporting before moving into the research sector. In the six years since, he has worked on complex system integration, Big Data systems and knowledge representation/querying systems. Justin is one of the architects and creators of SemTK (the Semantics Toolkit), a toolkit aimed at making the power of the semantic web stack available to programmers, automation and subject matter experts without their having to be deeply invested in the workings of the Semantic Web.
December 23rd, 2016
One way of understanding the evolution of science within a particular scientific discipline is by studying the temporal influences that research publications had on that discipline. We provide a methodology for conducting such an analysis by employing cross-domain topic modeling and local cluster mappings of those publications with the historical texts to understand exactly when and how they influenced the discipline. We apply our method to the Intergovernmental Panel on Climate Change (IPCC) Assessment Reports and the citations therein. The IPCC reports were compiled by thousands of Earth scientists, with assessments issued approximately every five years over a 30 year span, and together they cite over 200,000 research papers.
December 9th, 2016
Understanding the Logical and Semantic Structure of Large Documents
11:00-1:00 Monday, 12 December 2016, ITE325b, UMBC
Up-to-the-minute language understanding approaches are mostly focused on small documents such as newswire articles, blog posts, product reviews and discussion forum entries. Understanding and extracting information from large documents such as legal documents, reports, business opportunities, proposals and technical manuals is still a challenging task. The reason behind this challenge is that the documents may be multi-themed, complex and cover diverse topics.
We aim to automatically identify and classify a document’s sections and subsections, infer their structure and annotate them with semantic labels to understand the semantic structure of a document. Understanding a document’s structure in this way will significantly benefit and inform a variety of applications such as information extraction and retrieval, document categorization and clustering, document summarization, fact and relation extraction, text analysis and question answering.
Committee: Drs. Tim Finin (Chair), Anupam Joshi, Tim Oates, Cynthia Matuszek, James Mayfield (JHU)