UMBC eBiquity Blog

PhD Proposal: Understanding the Logical and Semantic Structure of Large Documents

Tim Finin, 8:54am 9 December 2016

Dissertation Proposal

Understanding the Logical and Semantic Structure of Large Documents

Muhammad Mahbubur Rahman

11:00-1:00 Monday, 12 December 2016, ITE325b, UMBC

Current language understanding approaches focus mostly on small documents, such as newswire articles, blog posts, product reviews and discussion forum entries. Understanding and extracting information from large documents, such as legal documents, reports, business opportunities, proposals and technical manuals, is still a challenging task, because these documents may be multi-themed, complex and cover diverse topics.

We aim to automatically identify and classify a document’s sections and subsections, infer their structure and annotate them with semantic labels in order to understand the document’s semantic structure. This understanding of document structure will significantly benefit and inform a variety of applications, such as information extraction and retrieval, document categorization and clustering, document summarization, fact and relation extraction, text analysis and question answering.
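
As a hedged illustration of one building block of this task, the sketch below detects numbered section headings in plain text and infers their nesting depth from the numbering; the heading pattern and sample text are assumptions for illustration, not part of the proposed system.

```python
import re

# Matches headings like "2.1.3 Deliverables"; an illustrative pattern only.
HEADING = re.compile(r'^(\d+(?:\.\d+)*)\s+(.+)$')

def extract_outline(text):
    """Return (depth, number, title) tuples for numbered headings."""
    outline = []
    for line in text.splitlines():
        m = HEADING.match(line.strip())
        if m:
            number, title = m.groups()
            depth = number.count('.') + 1   # "2.1" -> depth 2
            outline.append((depth, number, title))
    return outline

sample = """1 Introduction
Some text.
1.1 Background
2 Statement of Work
2.1 Deliverables"""

for depth, number, title in extract_outline(sample):
    print('  ' * (depth - 1) + number + ' ' + title)
```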

Committee: Drs. Tim Finin (Chair), Anupam Joshi, Tim Oates, Cynthia Matuszek, James Mayfield (JHU)


 

PhD Proposal: Ankur Padia, Dealing with Dubious Facts in Knowledge Graphs

Tim Finin, 9:25pm 29 November 2016

Dissertation Proposal

Dealing with Dubious Facts in Knowledge Graphs

Ankur Padia

1:00-3:00pm Wednesday, 30 November 2016, ITE 325b, UMBC

Knowledge graphs are structured representations of facts in which nodes are real-world entities or events and edges are associations between pairs of entities. Knowledge graphs can be constructed using automatic or manual techniques. Manual techniques produce high-quality knowledge graphs but are expensive, time consuming and not scalable. Hence, automatic information extraction techniques are used to create scalable knowledge graphs, but the extracted information can be of poor quality due to the presence of dubious facts.

An extracted fact is dubious if it is incorrect, inexact, or correct but lacking evidence. A fact might be dubious because of errors made by NLP extraction techniques, improper design of a system’s internal components, the choice of learning techniques (semi-supervised or unsupervised), poor-quality heuristics or the syntactic complexity of the underlying text. A preliminary analysis of several knowledge extraction systems (CMU’s NELL and JHU’s KELVIN) and observations from the literature suggest that dubious facts can be identified, diagnosed and managed. In this dissertation, I will explore approaches to identify and repair such dubious facts in a knowledge graph using several complementary approaches, including linguistic analysis, common sense reasoning and entity linking.
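
A minimal sketch of the kind of triage this suggests, under assumed inputs: each extracted fact carries an extractor confidence and a count of independent supporting sentences, and a fact is flagged as dubious when either is too low. The Fact fields and thresholds are illustrative, not the proposal’s actual criteria.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    confidence: float     # extractor-assigned confidence in [0, 1]
    evidence_count: int   # independent source sentences supporting the fact

def is_dubious(fact, min_confidence=0.7, min_evidence=2):
    # a correct fact that lacks evidence is still treated as dubious
    return fact.confidence < min_confidence or fact.evidence_count < min_evidence

facts = [Fact('BarackObama', 'bornIn', 'Hawaii', 0.95, 4),
         Fact('BarackObama', 'bornIn', 'Kenya', 0.40, 1)]
print([f for f in facts if is_dubious(f)])   # flags the second fact
```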

Committee: Drs. Tim Finin (Chair), Anupam Joshi, Tim Oates, Paul McNamee (JHU), Partha Talukdar (IISc, India)


 

Understanding Large Documents

Tim Finin, 10:49am 28 November 2016

In this week’s ebiquity meeting, Muhammad Mahbubur Rahman will talk about his work on understanding large documents, such as business RFPs.

Large Document Understanding

Muhammad Mahbubur Rahman

Current language understanding approaches focus mostly on small documents, such as newswire articles, blog posts, product reviews and discussion forum entries. Understanding and extracting information from large documents, such as legal documents, reports, business opportunities, proposals and technical manuals, is still a challenging task, because these documents may be multi-themed, complex and cover diverse topics.

We aim to automatically identify and classify a document’s sections and subsections, infer their structure and annotate them with semantic labels in order to understand the document’s semantic structure. This understanding of document structure will significantly benefit and inform a variety of applications, such as information extraction and retrieval, document categorization and clustering, document summarization, fact and relation extraction, text analysis and question answering.


 

PhD proposal: Sandeep Nair Narayanan, Cognitive Analytics Framework to Secure Internet of Things

Tim Finin, 12:47pm 26 November 2016

Dissertation Proposal

Cognitive Analytics Framework to Secure Internet of Things

Sandeep Nair Narayanan

1:00-3:30pm, Monday, 28 November 2016, ITE 325b

Recent years have seen the rapid growth and widespread adoption of the Internet of Things in a wide range of domains, including smart homes, healthcare, automotive, smart farming and smart grids. The IoT ecosystem consists of devices like sensors, actuators and control systems connected over heterogeneous networks. The connected devices can come from different vendors and differ in power requirements, processing capabilities and other respects. As such, many security features aren’t implemented on devices with limited processing capabilities, and the level of security practice followed during their development can also vary. The lack of over-the-air firmware updates poses another serious security threat, given these devices’ long-term deployments, and device malfunction is yet another threat that should be considered. Hence, it is imperative to have an external entity that monitors the ecosystem and detects attacks and anomalies.

In this thesis, we propose a security framework for the IoT using cognitive techniques. While anomaly detection has been employed in various domains, challenges such as the need for an online approach, resource constraints, heterogeneity and distributed data collection are unique to the IoT and its predecessors, like wireless sensor networks. Our framework will have an underlying knowledge base holding domain-specific information, a hybrid context generation module that generates complex contexts, and a fast reasoning engine that performs logical reasoning to detect anomalous activities. When raw sensor data arrives, the hybrid context generation module queries the knowledge base and generates different simple local contexts using various statistical and machine learning models. The inferencing engine then infers global complex contexts and detects anomalous activities using knowledge from streaming facts and domain-specific rules encoded in the ontology we will create. We will evaluate our techniques by realizing and validating them in the vehicular domain.
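
A minimal sketch of the pipeline just described, with assumed interfaces: per-sensor logic turns raw vehicle data into simple local contexts, and rules (standing in for the ontology’s domain-specific rules) flag context combinations that indicate anomalies. The contexts, rules and thresholds are illustrative assumptions.

```python
def generate_contexts(reading):
    """Map a raw sensor reading to simple, local contexts."""
    contexts = set()
    if reading['speed_kmh'] > 5:
        contexts.add('vehicle_moving')
    if reading['door_open']:
        contexts.add('door_open')
    return contexts

# Domain rules: context combinations that must not co-occur.
ANOMALY_RULES = [
    ({'vehicle_moving', 'door_open'}, 'door open while moving'),
]

def detect_anomalies(reading):
    contexts = generate_contexts(reading)
    return [msg for bad, msg in ANOMALY_RULES if bad <= contexts]

print(detect_anomalies({'speed_kmh': 60, 'door_open': True}))
# ['door open while moving']
```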

Committee: Drs. Anupam Joshi (Chair), Tim Finin, Nilanjan Banerjee, Yelena Yesha, Wenjia Li (NYIT), Filip Perich (Google)


 

Dealing with Dubious Facts in Knowledge Graphs

Tim Finin, 10:47am 22 November 2016

In this week’s meeting, Ankur Padia will talk about his work on the problem of identifying and managing ‘dubious facts’ extracted from text and added to a knowledge graph.

Dealing with Dubious Facts in Knowledge Graphs

Ankur Padia

Knowledge graphs are used to represent real-world facts and events, with entities as nodes and relations as labeled edges. Generally, a knowledge graph is constructed automatically by extracting facts from a text corpus using information extraction (IE) techniques. Such IE techniques are scalable but often extract low-quality (or dubious) facts due to errors caused by NLP libraries, the internal components of an extraction system, the choice of learning techniques, heuristics, and the syntactic complexity of the underlying text. We wish to explore techniques to process such dubious facts and improve the quality of a knowledge graph.


 

Dynamic Topic Modeling to Infer the Influence of Research Citations on IPCC Assessment Reports

Tim Finin, 1:40pm 19 November 2016

A temporal analysis of the 200,000 documents cited in thirty years’ worth of Intergovernmental Panel on Climate Change (IPCC) assessment reports sheds light on how climate change research is evolving.

Jenifer Sleeman, Milton Halem, Tim Finin and Mark Cane, Dynamic Topic Modeling to Infer the Influence of Research Citations on IPCC Assessment Reports, Big Data Challenges, Research, and Technologies in the Earth and Planetary Sciences Workshop, IEEE Int. Conf. on Big Data, December 2016.

A common Big Data problem is the need to integrate large temporal data sets from various data sources into one comprehensive structure. Having the ability to correlate evolving facts between data sources can be especially useful in supporting a number of desired application functions, such as inference and influence identification. As a real-world application we use climate change publications related to the Intergovernmental Panel on Climate Change, which publishes climate change assessment reports every five years and currently has over 25 years of published content. These reports often reference thousands of research papers. We use dynamic topic modeling as a basis for combining the report and citation domains into one structure, and are able to correlate documents between the two domains to understand how the research has influenced the reports and how this influence has changed over time. In this use case, the report topic model used 410 documents with a vocabulary of 5,911 terms, while the citation topic model used close to 200,000 research papers with a vocabulary of 25,154 terms.
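
As a hedged sketch of the core idea, the code below uses a static LDA model from gensim (a stand-in for the paper’s dynamic topic modeling) trained on report text, infers topic mixtures for cited papers in the same topic space, and correlates documents by cosine similarity of those mixtures. The toy corpora are placeholders, not the IPCC data.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

reports = [['sea', 'level', 'rise'], ['carbon', 'emission', 'rise']]
citations = [['sea', 'level', 'observation'], ['emission', 'policy']]

dictionary = Dictionary(reports + citations)
lda = LdaModel(corpus=[dictionary.doc2bow(d) for d in reports],
               id2word=dictionary, num_topics=2, random_state=0)

def topic_vector(doc):
    """Topic mixture of a document under the shared model."""
    dist = lda.get_document_topics(dictionary.doc2bow(doc),
                                   minimum_probability=0.0)
    return np.array([p for _, p in dist])

r = topic_vector(reports[0])
for c_doc in citations:   # similarity of each citation to a report
    c = topic_vector(c_doc)
    print(r @ c / (np.linalg.norm(r) * np.linalg.norm(c)))
```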


 

Capturing policies for fine-grained access control on mobile devices

Tim Finin, 8:33am 8 November 2016

In this week’s ebiquity meeting (11:30 8 Nov. 2016) Prajit Das will present his work on capturing policies for fine-grained access control on mobile devices.

As of 2016, there are more mobile devices than humans on earth. Today, mobile devices are a critical part of our lives and often hold sensitive corporate and personal data. As a result, they are a lucrative target for attackers, and managing data privacy and security on mobile devices has become a vital issue. Existing access control mechanisms in most devices are restrictive and inadequate: they do not take into account the context of a device and its user when making decisions. In many cases, the access granted to a subject should change with the device’s context. Such fine-grained, context-sensitive access control policies also have to be personalized. In this paper, we present Mithril, a system that uses policies represented with Semantic Web technologies and captured through user feedback to handle access control on mobile devices. We present an iterative feedback process to capture user-specific policies, along with a policy violation metric that lets us decide when the capture process is complete.
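
A hedged sketch of what such a violation metric could look like: the fraction of recent access control decisions the user overrides, with capture considered complete once it falls below a threshold. The field name and threshold are illustrative assumptions, not Mithril’s actual metric.

```python
def violation_rate(decisions):
    """decisions: dicts with a boolean 'user_overrode' flag."""
    if not decisions:
        return 0.0
    return sum(d['user_overrode'] for d in decisions) / len(decisions)

def capture_complete(decisions, threshold=0.05):
    return violation_rate(decisions) < threshold

history = [{'user_overrode': False}] * 19 + [{'user_overrode': True}]
print(violation_rate(history), capture_complete(history))  # 0.05 False
```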

 

Inferring Relations in Knowledge Graphs with Tensor Decomposition

Tim Finin, 10:44pm 6 November 2016

Ankur Padia, Kostantinos Kalpakis, and Tim Finin, Inferring Relations in Multi-relational Knowledge Graphs with Tensor Decomposition, IEEE BigData, Dec. 2016.

Multi-relational data, like knowledge graphs, are generated from multiple data sources by extracting entities and their relationships. We often want to include inferred, implicit or likely relationships that are not explicitly stated, which can be viewed as link prediction in a graph. Tensor decomposition models have been shown to produce state-of-the-art results on link prediction tasks. We describe a simple but novel extension to an existing tensor decomposition model that predicts missing links using similarity among tensor slices, unlike existing tensor decomposition models, which assume that each slice contributes equally to predicting links. Our extended model outperforms both the original tensor decomposition and its non-negative variant in an evaluation on several datasets.
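
A toy numpy sketch of the idea, not the paper’s model: reconstruct each relation slice at low rank (truncated SVD stands in for the actual decomposition), then score links in a slice as a mixture of all slices’ reconstructions weighted by inter-slice similarity, so that a sparse relation borrows evidence from similar ones.

```python
import numpy as np

def slice_similarity(X):
    """Cosine similarity between flattened relation slices."""
    flat = X.reshape(X.shape[0], -1).astype(float)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = flat / norms
    return unit @ unit.T

def predict_links(X, rank=2):
    """X: (k, n, n) tensor; returns similarity-weighted link scores."""
    k = X.shape[0]
    recon = np.empty_like(X, dtype=float)
    for r in range(k):
        U, s, Vt = np.linalg.svd(X[r].astype(float), full_matrices=False)
        recon[r] = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    S = slice_similarity(X)
    S = S / S.sum(axis=1, keepdims=True)   # normalize weights per slice
    return np.einsum('rs,sij->rij', S, recon)

X = np.zeros((2, 4, 4))
X[0, 0, 1] = X[0, 1, 2] = 1     # relation 0
X[1, 0, 1] = X[1, 2, 3] = 1     # a correlated relation 1
print(predict_links(X)[1, 1, 2])  # link (1,2) in slice 1 gains support
```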


 

Knowledge for Cybersecurity

Tim Finin, 8:48am 17 October 2016

In this week’s ebiquity meeting (11:30am 10/18, ITE346), Sudip Mittal will talk on Knowledge for Cybersecurity.

In the broad domain of security, analysts and policy makers need knowledge about the state of the world to make critical decisions, operational/tactical as well as strategic. This knowledge has to be extracted from different sources and then represented in a form that enables further analysis and decision making. Some of the data underlying this knowledge is in textual sources traditionally associated with Open Source Intelligence (OSINT); other data is present in hidden sources like dark web vulnerability markets. Today, this is a mostly manual process. We wish to automate it by taking data from a variety of sources, extracting, representing and integrating the knowledge present, and then using the resulting knowledge graph to create various semantic agents that add value to the cybersecurity infrastructure.
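
A minimal sketch of the target representation, with assumed names and namespace: extracted facts stored as RDF triples that downstream semantic agents can query (using the rdflib library here purely for illustration).

```python
from rdflib import Graph, Literal, Namespace

CYBER = Namespace('http://example.org/cyber#')   # illustrative namespace
g = Graph()
g.bind('cyber', CYBER)

# e.g., facts extracted from an OSINT feed or a dark web listing
g.add((CYBER.vuln_1234, CYBER.affects, CYBER.ProductX))
g.add((CYBER.vuln_1234, CYBER.mentionedIn, Literal('dark web market post')))

# a semantic agent can now query the integrated graph
for vuln, _, product in g.triples((None, CYBER.affects, None)):
    print(vuln, 'affects', product)
```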


 

Managing Cloud Storage Obliviously

Tim Finin, 1:29pm 24 May 2016

Vaishali Narkhede, Karuna Pande Joshi, Tim Finin, Seung Geol Choi, Adam Aviv and Daniel S. Roche, Managing Cloud Storage Obliviously, International Conference on Cloud Computing, IEEE Computer Society, June 2016.

Consumers want to ensure that their enterprise data is stored securely and obliviously on the cloud, such that neither the data objects nor their access patterns are revealed to anyone, including the cloud provider, in the public cloud environment. We have created a detailed ontology describing oblivious cloud storage models and the role-based access controls that should be in place to manage this risk. We have developed an algorithm to store cloud data using an oblivious data structure defined in this paper, and have implemented the ObliviCloudManager application, which allows users to manage their cloud data by validating it before storing it in an oblivious data structure. Our application uses a role-based access control model and collection-based document management to store and retrieve data efficiently. Cloud consumers can use our system to define policies for storing data obliviously and manage storage on untrusted cloud platforms, even if they are unfamiliar with the underlying technology and concepts of oblivious data structures.
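
The simplest construction that hides access patterns, shown below as a baseline sketch rather than the paper’s data structure: the client reads and rewrites every block on every access, so the server cannot tell which object was actually wanted. The toy XOR cipher stands in for a real client-side cipher, and real schemes re-encrypt with fresh randomness.

```python
import os

KEY = os.urandom(16)

def xor_cipher(data: bytes) -> bytes:
    # toy stand-in for a real client-side cipher
    return bytes(b ^ KEY[i % len(KEY)] for i, b in enumerate(data))

def oblivious_read(store: dict, wanted: str):
    """Touch every block so the server sees a uniform access pattern."""
    result = None
    for name in store:
        plain = xor_cipher(store[name])   # decrypt
        if name == wanted:
            result = plain
        store[name] = xor_cipher(plain)   # rewrite so all accesses look alike
    return result

store = {'a': xor_cipher(b'alpha'), 'b': xor_cipher(b'beta')}
print(oblivious_read(store, 'b'))   # b'beta'
```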


 

Streamlining Management of Multiple Cloud Services

Tim Finin, 9:46pm 22 May 2016

Aditi Gupta, Sudip Mittal, Karuna Pande Joshi, Claudia Pearce and Anupam Joshi, Streamlining Management of Multiple Cloud Services, IEEE International Conference on Cloud Computing, June 2016.

With the increase in the number of cloud services and service providers, manual analysis of Service Level Agreements (SLAs), comparison between different service offerings and conformance regulation has become a difficult task for customers. Cloud SLAs are policy documents describing the legal agreement between cloud providers and customers. An SLA specifies commitments on availability and service performance, penalties associated with violations, and procedures for customers to receive compensation in case of service disruptions. The aim of our research is to develop technology solutions for automated cloud service management using Semantic Web and text mining techniques. In this paper we discuss in detail the challenges in automating cloud services management and present our preliminary work on extracting knowledge from the SLAs of different cloud services. We extracted two types of information from the SLA documents that can be useful to end users. First, the relationship between service commitments and financial credits, which we represented by enhancing the cloud service ontology proposed in our previous research. Second, rules in the form of obligations and permissions, extracted from SLAs using modal and deontic logic formalizations. For our analysis, we considered six publicly available SLA documents from different cloud computing service providers.
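
A hedged sketch of the second extraction step: classifying SLA sentences as obligations or permissions from their modal verbs. The modal lists and sample sentences are illustrative; the paper formalizes such rules with modal and deontic logic rather than this simple heuristic.

```python
import re

OBLIGATION = re.compile(r'\b(shall|must|is required to)\b', re.I)
PERMISSION = re.compile(r'\b(may|can|is permitted to)\b', re.I)

def classify(sentence):
    """Label a sentence with its (assumed) deontic modality."""
    if OBLIGATION.search(sentence):
        return 'obligation'
    if PERMISSION.search(sentence):
        return 'permission'
    return 'other'

sla = ['The provider shall credit 10% of the monthly fee.',
       'The customer may submit a claim within 30 days.']
print([(s, classify(s)) for s in sla])
# [('...shall...', 'obligation'), ('...may...', 'permission')]
```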