Automated Data Augmentation via Wikidata Relationships
Oyesh Singh, UMBC
10:30-11:30 Monday, 21 October 2019, ITE 346
As machine learning models grow more complex, the need for data is greater than ever. To fill this gap in situations where annotated data is scarce, we look to the ocean of free data in Wikipedia and other Wikimedia resources. Wikipedia has an enormous amount of data in many languages, along with the knowledge graph defined in Wikidata. In this presentation, I will explain how we used Wikipedia and Wikidata data to boost the performance of BERT models for named entity recognition.
Understanding the Logical and Semantic Structure of Large Documents
Muhammad Mahbubur Rahman
11:00am Wednesday, 30 May 2018, ITE 325b
Understanding and extracting information from large documents, such as business opportunities, academic articles, medical documents and technical reports, poses challenges not present in short documents. These challenges arise because large documents may be multi-themed, complex, noisy and cover diverse topics. This dissertation describes a framework that can analyze large documents and help people and computer systems locate desired information in them. It aims to automatically identify and classify different sections of documents and understand their purpose within the document. A key contribution of this research is modeling and extracting the logical and semantic structure of electronic documents using deep learning techniques. The effectiveness and robustness of the framework are evaluated through extensive experiments on arXiv and requests for proposals datasets.
Committee Members: Drs. Tim Finin (Chair), Anupam Joshi, Tim Oates, Cynthia Matuszek, James Mayfield (JHU)
We describe the systems developed by the UMBC team for 2018 SemEval Task 8, SecureNLP (Semantic Extraction from CybersecUrity REports using Natural Language Processing). We participated in three of the sub-tasks: (1) classifying sentences as being relevant or irrelevant to malware, (2) predicting token labels for sentences, and (4) predicting attribute labels from the Malware Attribute Enumeration and Characterization vocabulary for defining malware characteristics. We achieved F1 scores of 50.34/18.0 (dev/test), 22.23 (test-data), and 31.98 (test-data) for sub-tasks 1, 2 and 4, respectively. We also make our cybersecurity embeddings publicly available at https://bit.ly/cybr2vec.
2018 Mid-Atlantic Student Colloquium on Speech, Language and Learning
The 2018 Mid-Atlantic Student Colloquium on Speech, Language and Learning (MASC-SLL) is a student-run, one-day event on speech, language & machine learning research to be held at the University of Maryland, Baltimore County (UMBC) from 10:00am to 6:00pm on Saturday May 12. There is no registration charge and lunch and refreshments will be provided. Students, postdocs, faculty and researchers from universities & industry are invited to participate and network with other researchers working in related fields.
Students and postdocs are encouraged to submit abstracts describing ongoing, planned, or completed research projects, including previously published results and negative results. Research in any field applying computational methods to any aspect of human language, including speech and learning, from all areas of computer science, linguistics, engineering, neuroscience, information science, and related fields is welcome. Submissions and presentations must be made by students or postdocs. Accepted submissions will be presented as either posters or talks.
Jennifer Sleeman receives AI for Earth grant from Microsoft
Visiting Assistant Professor Jennifer Sleeman (Ph.D. ’17) has been awarded a grant from Microsoft as part of its ‘AI for Earth’ program. Dr. Sleeman will use the grant to continue her research on developing algorithms to model how scientific disciplines such as climate change evolve and predict future trends by analyzing the text of articles and reports and the papers they cite.
AI for Earth is a Microsoft program aimed at empowering people and organizations to solve global environmental challenges by increasing access to AI tools and educational opportunities, while accelerating innovation. Via the Azure for Research AI for Earth award program, Microsoft provides selected researchers and organizations access to its cloud and AI computing resources to accelerate, improve and expand work on climate change, agriculture, biodiversity and/or water challenges.
UMBC is among the first grant recipients of AI for Earth, which launched in July 2017. The grant process was competitive and selective, and the grant was awarded in recognition of the potential of the work and the power of AI to accelerate progress.
As part of her dissertation research, Dr. Sleeman developed algorithms using dynamic topic modeling to understand influence and predict future trends in a scientific discipline. She applied this to the field of climate change and used assessment reports of the Intergovernmental Panel on Climate Change (IPCC) and the papers they cite. Since 1990, an IPCC report has been published every five years that includes four separate volumes, each of which has many chapters. Each report cites tens of thousands of research papers, which comprise a correlated dataset of temporally grounded documents. Her custom dynamic topic modeling algorithm identified topics for both datasets and applied cross-domain analytics to identify the correlations between the IPCC chapters and their cited documents. The approach reveals both the influence of the cited research on the reports and how previous research citations have evolved over time.
In this week’s meeting, Srishty Saha, Michael Aebig and Jiayong Lin will talk about their work on extracting knowledge from the US FAR System.
Automated Knowledge Extraction from the Federal Acquisition Regulations System
Srishty Saha, Michael Aebig and Jiayong Lin
11am-12pm Monday, 25 September 2017, ITE346, UMBC
The Federal Acquisition Regulations System (FARS) within the Code of Federal Regulations (CFR) includes facts and rules for individuals and organizations seeking to do business with the US Federal government. Parsing and extracting knowledge from such lengthy regulation documents is currently done manually and is time- and labor-intensive. Hence, developing a cognitive assistant for automated analysis of such legal documents has become a necessity. We are developing a semantically rich legal knowledge base representing legal entities and their relationships, semantically similar terminologies, deontic expressions and cross-referenced legal facts and rules.
Paul Cuddihy and Justin McHugh
GE Global Research Center, Niskayuna, NY
10:00-11:00 Tuesday, 4 April 2017, ITE 346, UMBC
Paul Cuddihy is a senior computer scientist and software systems architect in AI and Learning Systems at the GE Global Research Center in Niskayuna, NY. He earned an M.S. in Computer Science from Rochester Institute of Technology. The focus of his twenty-year career at GE Research has ranged from machine learning for medical imaging equipment diagnostics, monitoring and diagnostic techniques for commercial aircraft engines, and modeling techniques for monitoring seniors living independently in their own homes, to parallel execution of simulation and prediction tasks and big data ontologies. He is one of the creators of the open source software “Semantics Toolkit” (SemTK), which provides a simplified interface to the semantic tech stack, opening its use to a broader set of users by providing features such as drag-and-drop query generation and data ingestion. Paul holds over twenty U.S. patents.
Justin McHugh is a computer scientist and software systems architect working in the AI and Learning Systems group at GE Global Research in Niskayuna, NY. Justin attended the State University of New York at Albany, where he earned an M.S. in computer science. He worked as a systems architect and programmer for large scale reporting before moving into the research sector. In the six years since, he has worked on complex system integration, Big Data systems and knowledge representation/querying systems. Justin is one of the architects and creators of SemTK (the Semantics Toolkit), a toolkit aimed at making the power of the semantic web stack available to programmers, automation and subject matter experts without their having to be deeply invested in the workings of the Semantic Web.
11:00-1:00 Monday, 12 December 2016, ITE325b, UMBC
Current language understanding approaches are mostly focused on small documents such as newswire articles, blog posts, product reviews and discussion forum entries. Understanding and extracting information from large documents such as legal documents, reports, business opportunities, proposals and technical manuals is still a challenging task. The reason behind this challenge is that the documents may be multi-themed, complex and cover diverse topics.
We aim to automatically identify and classify a document’s sections and subsections, infer their structure and annotate them with semantic labels to understand the semantic structure of a document. Understanding a document’s structure in this way will significantly benefit and inform a variety of applications such as information extraction and retrieval, document categorization and clustering, document summarization, fact and relation extraction, text analysis and question answering.
Committee: Drs. Tim Finin (Chair), Anupam Joshi, Tim Oates, Cynthia Matuszek, James Mayfield (JHU)
1:00-3:00pm Wednesday, 30 November 2016, ITE 325b, UMBC
Knowledge graphs are structured representations of facts where nodes are real-world entities or events and edges are the associations between pairs of entities. Knowledge graphs can be constructed using automatic or manual techniques. Manual techniques construct high quality knowledge graphs but are expensive, time consuming and not scalable. Hence, automatic information extraction techniques are used to create scalable knowledge graphs, but the extracted information can be of poor quality due to the presence of dubious facts.
An extracted fact is dubious if it is incorrect, inexact or correct but lacks evidence. A fact might be dubious because of the errors made by NLP extraction techniques, improper design consideration of the internal components of the system, choice of learning techniques (semi-supervised or unsupervised), relatively poor quality of heuristics or the syntactic complexity of underlying text. A preliminary analysis of several knowledge extraction systems (CMU’s NELL and JHU’s KELVIN) and observations from the literature suggest that dubious facts can be identified, diagnosed and managed. In this dissertation, I will explore approaches to identify and repair such dubious facts from a knowledge graph using several complementary approaches, including linguistic analysis, common sense reasoning, and entity linking.
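The definitions above can be illustrated with a minimal sketch. Everything in it is hypothetical and is not taken from NELL, KELVIN or the dissertation: the `Fact` fields, the example triples and the `min_confidence` threshold are assumptions for illustration. Because correctness cannot be checked without ground truth, the sketch uses a low extractor confidence as a stand-in for "incorrect or inexact" and an empty evidence list for "lacks evidence".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One edge of a knowledge graph: (subject, relation, object)."""
    subject: str
    relation: str
    obj: str
    confidence: float = 1.0   # hypothetical extractor score in [0, 1]
    evidence: tuple = ()      # hypothetical pointers to supporting text


def is_dubious(fact, min_confidence=0.8):
    """Flag a fact whose score is low or that has no supporting evidence.

    The threshold is an assumed value, not one from the source.
    """
    return fact.confidence < min_confidence or len(fact.evidence) == 0


# Toy extracted facts (invented for this example).
facts = [
    Fact("UMBC", "locatedIn", "Maryland", 0.95, ("sentence 12",)),
    Fact("UMBC", "locatedIn", "Ohio", 0.40, ("sentence 7",)),
    Fact("Baltimore", "locatedIn", "Maryland", 0.90, ()),
]

dubious = [f for f in facts if is_dubious(f)]
for f in dubious:
    print(f.subject, f.relation, f.obj)
```

A filter like this only identifies candidates; diagnosing and repairing them would need the further analyses the abstract lists, such as linguistic analysis, common sense reasoning and entity linking.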
Committee: Drs. Tim Finin (Chair), Anupam Joshi, Tim Oates, Paul McNamee (JHU), Partha Talukdar (IISc, India)
“Alexa, get my coffee”:
Using the Amazon Echo in Research
10:30am Monday, 7 December 2015, ITE 346
The Amazon Echo is a remarkable example of language-controlled, user-centric technology, but also a great example of how far such devices have to go before they will fulfill the longstanding promise of intelligent assistance. In this talk, we will describe the Interactive Robotics and Language Lab’s work with the Echo, with an emphasis on the practical aspects of getting it set up for development and adding new capabilities. We will demonstrate adding a simple new interaction, and then lead a brainstorming session on future research applications.
Megan Zimmerman is a UMBC undergrad majoring in computer science working on interpreting language about tasks at varying levels of abstraction, with a focus on interpreting abstract statements as possible task instructions in assistive technology.
Extracting Structured Summaries
from Text Documents
Dr. Zareen Syed
Research Assistant Professor, UMBC
10:30am, Monday, 9 November 2015, ITE 346, UMBC
In this talk, Dr. Syed will present unsupervised approaches for automatically extracting structured summaries composed of slots and fillers (attributes and values) and important facts from articles, thus effectively reducing the amount of time and effort spent on gathering intelligence by humans using traditional keyword based search approaches. The approach first extracts important concepts from text documents and links them to unique concepts in the Wikitology knowledge base. It then exploits the types associated with the linked concepts to discover candidate slots and fillers. Finally, it applies specialized approaches for ranking and filtering slots to select the most relevant slots to include in the structured summary.
Compared with the state of the art, Dr. Syed’s approach is unrestricted, i.e., it does not require a manually crafted catalogue of slots or relations of interest that may vary over different domains. Unlike Natural Language Processing (NLP) based approaches that require well-formed sentences, the approach can be applied to semi-structured text. Furthermore, NLP based approaches for fact extraction extract lexical facts and sentences that require further processing for disambiguating and linking to unique entities and concepts in a knowledge base, whereas in Dr. Syed’s approach, concept linking is done as a first step in the discovery process. Linking concepts to a knowledge base provides the additional advantage that the terms can be explicitly linked or mapped to semantic concepts in other ontologies and are thus available for reasoning in more sophisticated language understanding systems.
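The three steps described above (link concepts, derive candidate slots from concept types, rank and filter slots) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not Dr. Syed's implementation: the `KB` dictionary is a hypothetical stand-in for Wikitology, linking is reduced to exact token lookup, and ranking by mention count is a simple substitute for the specialized ranking approaches mentioned in the abstract.

```python
from collections import Counter

# Hypothetical toy knowledge base standing in for Wikitology.
KB = {
    "umbc": {"concept": "University_of_Maryland,_Baltimore_County",
             "types": ["University"]},
    "baltimore": {"concept": "Baltimore", "types": ["City"]},
    "annapolis": {"concept": "Annapolis", "types": ["City"]},
    "maryland": {"concept": "Maryland", "types": ["State"]},
}


def link_concepts(text):
    """Step 1: map surface tokens to unique KB concepts (exact match only)."""
    tokens = [t.strip(".,;:").lower() for t in text.split()]
    return [KB[t] for t in tokens if t in KB]


def candidate_slots(linked):
    """Step 2: use concept types as slots and linked concepts as fillers."""
    slots = {}
    for entry in linked:
        for type_name in entry["types"]:
            fillers = slots.setdefault(type_name, [])
            if entry["concept"] not in fillers:
                fillers.append(entry["concept"])
    return slots


def rank_slots(slots, linked, top_k=2):
    """Step 3: keep the top_k slots, ranked by number of supporting mentions.

    Ties are broken alphabetically; both choices are assumptions.
    """
    counts = Counter(t for entry in linked for t in entry["types"])
    ranked = sorted(slots, key=lambda s: (-counts[s], s))
    return {s: slots[s] for s in ranked[:top_k]}


text = "UMBC is in Baltimore, Maryland. Annapolis is the capital of Maryland."
linked = link_concepts(text)
summary = rank_slots(candidate_slots(linked), linked)
print(summary)
```

Because linking happens first, every filler in the resulting summary is already a knowledge base identifier rather than a raw string, which is the property the abstract highlights.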