As cybersecurity-related threats continue to increase, understanding how the field is changing over time can give insight into combating new threats and understanding historical events. We show how to apply dynamic topic models to a set of cybersecurity documents to understand how the concepts found in them are changing over time. We correlate two data sets: the first relates to specific exploits and the second to cybersecurity research. We use Wikipedia concepts to provide a basis for concept phrase extraction and show how using concepts to provide context improves the quality of the topic model. We represent the results of the dynamic topic model as a knowledge graph that could be used for inference or information discovery.
We introduce reinforcement quantum annealing (RQA), a scheme in which an intelligent agent searches the space of Hamiltonians and interacts with a quantum annealer that plays the role of the stochastic environment in learning automata. At each iteration of RQA, after analyzing results (samples) from the previous iteration, the agent adjusts the penalty of unsatisfied constraints and re-casts the given problem to a new Ising Hamiltonian. As a proof of concept, we propose a novel approach for casting the problem of Boolean satisfiability (SAT) to Ising Hamiltonians and show how to apply RQA to increase the probability of finding the global optimum. Our experimental results on two benchmark SAT problems (factoring pseudo-prime numbers and random SAT with phase transitions), using a D-Wave 2000Q quantum processor, demonstrate that RQA finds notably better solutions with fewer samples than the best-known techniques in the realm of quantum annealing.
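The reinforcement loop can be sketched in a few lines. The sketch below is a toy illustration, not the paper's implementation: the example QUBO encodings, the brute-force `anneal` stand-in for the D-Wave sampler, and the penalty-doubling update rule are all hypothetical choices.

```python
import itertools

def energy(qubo, x):
    """QUBO energy of a binary assignment x (constant terms dropped)."""
    return sum(w * x[i] * x[j] for (i, j), w in qubo.items())

def build_qubo(constraints, penalties):
    """Sum penalty-weighted constraint QUBOs into one Hamiltonian."""
    qubo = {}
    for c, p in zip(constraints, penalties):
        for key, w in c.items():
            qubo[key] = qubo.get(key, 0.0) + p * w
    return qubo

def anneal(qubo, n):
    """Brute-force stand-in for the quantum annealer (fine for tiny n)."""
    return min(itertools.product([0, 1], repeat=n), key=lambda x: energy(qubo, x))

def rqa(constraints, checks, n, iterations=5):
    """Reinforcement loop: double the penalty of each violated constraint,
    re-cast the Hamiltonian, and sample again."""
    penalties = [1.0] * len(constraints)
    x = None
    for _ in range(iterations):
        x = anneal(build_qubo(constraints, penalties), n)
        violated = [k for k, chk in enumerate(checks) if not chk(x)]
        if not violated:
            break
        for k in violated:
            penalties[k] *= 2.0
    return x

# Two toy constraints over (x0, x1): "x0 = 1" and "exactly one of x0, x1 is 1",
# each pre-encoded as a QUBO with its constant terms dropped.
constraints = [{(0, 0): -1.0},
               {(0, 0): -1.0, (1, 1): -1.0, (0, 1): 2.0}]
checks = [lambda x: x[0] == 1,
          lambda x: x[0] + x[1] == 1]
```

The agent-environment interaction is what distinguishes this from plain penalty-method annealing: the penalties are learned from samples rather than fixed in advance.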
Automated Data Augmentation via Wikidata Relationships
Oyesh Singh, UMBC 10:30-11:30 Monday, 21 October 2019, ITE 346
With the increase in complexity of machine learning models, there is more need for data than ever. To address this scarcity of annotated data, we look to the ocean of free data in Wikipedia and other Wikimedia resources. Wikipedia has an enormous amount of data in many languages, along with the knowledge graph defined in Wikidata. In this presentation, I will explain how we used Wikipedia and Wikidata data to boost the performance of BERT models for named entity recognition.
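The augmentation step can be illustrated with a minimal sketch: given a tagged sentence, swap each entity mention for alternative names (e.g., Wikidata "also known as" labels) to produce new labeled examples. The `aliases` input and the single-token swap below are hypothetical simplifications, not the actual pipeline.

```python
def augment(tokens, tags, aliases):
    """Create extra NER training examples by replacing each entity token
    that begins a span (a B- tag) with its known aliases.
    `aliases` maps a surface form to alternative names, e.g. gathered from
    Wikidata labels (hypothetical input here)."""
    out = []
    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        if tag.startswith("B-") and tok in aliases:
            for alt in aliases[tok]:
                # Keep the tag sequence; only the surface form changes.
                out.append((tokens[:i] + [alt] + tokens[i + 1:], tags))
    return out
```

A real system would also need to re-tokenize multi-word aliases and realign the BIO tags; this sketch sidesteps that by treating each alias as one token.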
Compressive sensing is a novel approach that linearly samples sparse or compressible signals at a rate far below the Nyquist-Shannon sampling rate and outperforms traditional signal processing techniques in acquiring and reconstructing such signals. Compressive sensing with matrix uncertainty is an extension of the standard compressive sensing problem that appears in various applications, including cognitive radio sensing, antenna calibration, and deconvolution. The original compressive sensing problem is NP-hard, so traditional techniques, such as convex and nonconvex relaxations and greedy algorithms, apply stringent constraints on the measurement matrix to handle the problem indirectly in the realm of classical computing.
We propose well-posed approaches for both the binary compressive sensing and binary compressive sensing with matrix uncertainty problems that are tractable by quantum annealers. Our approach formulates an Ising model whose ground state represents a sparse solution to the binary compressive sensing problem, and then employs an alternating minimization scheme to tackle the matrix uncertainty variant. This setting requires only that the considered problem have a unique solution for a successful recovery process, so the required conditions on the measurement matrix are notably looser. As a proof of concept, we demonstrate the applicability of the proposed approach on D-Wave quantum annealers; our method can also be adapted to other modern computing platforms, such as adiabatic quantum computers in general, CMOS annealers, optical parametric oscillators, and neuromorphic computing.
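The QUBO construction for the binary case can be sketched directly: minimizing ||y - Ax||^2 over x in {0,1}^n expands to a quadratic form with diagonal terms (A^T A)_ii - 2(A^T y)_i and couplers 2(A^T A)_ij, since x_i^2 = x_i for binary variables. This is a generic sketch of that standard expansion, not the paper's full method.

```python
def cs_qubo(A, y):
    """QUBO for min ||y - A x||^2 over binary x (constant ||y||^2 dropped).
    Expanding the norm: x^T A^T A x - 2 y^T A x; x_i^2 = x_i for binary x."""
    m, n = len(A), len(A[0])
    AtA = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
           for i in range(n)]
    Aty = [sum(A[k][i] * y[k] for k in range(m)) for i in range(n)]
    Q = {}
    for i in range(n):
        Q[(i, i)] = AtA[i][i] - 2 * Aty[i]        # linear (diagonal) terms
        for j in range(i + 1, n):
            Q[(i, j)] = 2 * AtA[i][j]             # pairwise couplers
    return Q

def qubo_energy(Q, x):
    return sum(w * x[i] * x[j] for (i, j), w in Q.items())
```

On an annealer, the ground state of this QUBO is the least-squares binary solution; the sparsity and uniqueness conditions discussed above determine when that ground state is the true signal.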
To effectively identify and filter out attacks from known sources such as botnets, spammers, and virus-infected systems, organizations increasingly procure services that determine the reputation of IP addresses. The adoption of encryption protocols like TLS 1.2 and 1.3 aggravates this problem, owing to the higher cost of decryption needed to examine traffic contents. Currently, most IP reputation services provide blacklists by analyzing malware and spam records. However, newer but similar IP addresses used by the same attackers need not be present in such lists, and attacks from them will be missed. In this paper, we present Dynamic Attribute based Reputation (DAbR), a Euclidean distance-based technique that generates reputation scores for IP addresses by assimilating metadata from known bad IP addresses. The approach is based on our observation that many bad IPs share similar attributes and on the need for a lightweight reputation-scoring technique. DAbR generates reputation scores for IP addresses on a 0-10 scale that represents their trustworthiness based on known bad IP address attributes. The reputation scores, when used in conjunction with a policy enforcement module, can provide high-performance, non-privacy-invasive malicious traffic filtering. To evaluate DAbR, we calculated reputation scores on a dataset of 87k IP addresses and used them to classify IP addresses as good or bad based on a threshold. An F1 score of 78% on this classification task demonstrates our technique's performance.
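As a rough illustration of the distance-based scoring idea (the paper's actual attribute set and score mapping are not reproduced here), one can map the Euclidean distance from an IP's attribute vector to its nearest known-bad neighbor onto a 0-10 scale; the linear mapping and the `max_dist` cutoff below are hypothetical parameters.

```python
import math

def reputation(attrs, bad_vectors, max_dist):
    """Hypothetical DAbR-style score: 0 means the IP's attributes sit on top
    of a known-bad vector, 10 means it is at least `max_dist` away.
    `attrs` and each entry of `bad_vectors` are numeric feature vectors
    (e.g., encoded ASN, geolocation, hosting type)."""
    d = min(math.dist(attrs, b) for b in bad_vectors)   # nearest bad neighbor
    return round(10 * min(d, max_dist) / max_dist, 2)   # clamp, scale to 0-10
```

A policy module would then compare this score against a threshold to admit or drop traffic without inspecting payloads, which is what keeps the technique lightweight and non-privacy-invasive.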
Ontology-Grounded Topic Modeling for Climate Science Research
Jennifer Sleeman, Milton Halem and Tim Finin, Ontology-Grounded Topic Modeling for Climate Science Research, Semantic Web for Social Good Workshop, Int. Semantic Web Conf., Monterey, Oct. 2018. (Selected as best paper), to appear, Emerging Topics in Semantic Technologies, E. Demidova, A.J. Zaveri, E. Simperl (Eds.), AKA Verlag Berlin, 2018.
In scientific disciplines where research findings have a strong impact on society, reducing the amount of time it takes to understand, synthesize and exploit the research is invaluable. Topic modeling is an effective technique for summarizing a collection of documents to find the main themes among them and to classify other documents that have a similar mixture of co-occurring words. We show how grounding a topic model with an ontology, extracted from a glossary of important domain phrases, improves the topics generated and makes them easier to understand. We apply and evaluate this method to the climate science domain. The result improves the topics generated and supports faster research understanding, discovery of social networks among researchers, and automatic ontology generation.
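One simple way to ground a topic model in a glossary, sketched below with hypothetical inputs, is to merge multi-word domain phrases into single tokens before topic modeling, so that important concepts survive as units in the learned topics instead of being split into ambiguous single words.

```python
def ground_phrases(text, glossary):
    """Replace each multi-word glossary phrase with an underscore-joined
    token so a bag-of-words topic model treats it as one concept.
    Longest phrases are substituted first so sub-phrases do not clobber them."""
    for phrase in sorted(glossary, key=len, reverse=True):
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text
```

After this preprocessing, a phrase like "sea level rise" contributes one meaningful term to a topic rather than three generic ones, which is one mechanism by which ontology grounding can make topics easier to interpret.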
Understanding large, structured documents like scholarly articles, requests for proposals or business reports is a complex and difficult task. It involves discovering a document's overall purpose and subject(s), understanding the function and meaning of its sections and subsections, and extracting low-level entities and facts about them. In this research, we present a deep-learning-based document ontology to capture the general-purpose semantic structure and domain-specific semantic concepts from a large number of academic articles and business documents. The ontology can describe different functional parts of a document, which can be used to enhance semantic indexing for a better understanding by humans and machines. We evaluate our models through extensive experiments on datasets of scholarly articles from arXiv and request for proposal documents.
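For illustration only, the task of labeling a section's function can be mimicked with a trivial keyword baseline; the actual system uses deep learning models, and the cue words below are hypothetical.

```python
# Toy baseline for mapping a section header to a functional label.
# Cue words are illustrative, not the ontology's real vocabulary.
SECTION_CUES = {
    "abstract": "abstract",
    "method": "methodology",
    "approach": "methodology",
    "result": "results",
    "conclusion": "conclusion",
}

def label_section(header):
    """Return the first matching functional label for a header, else 'other'."""
    h = header.lower()
    for cue, label in SECTION_CUES.items():
        if cue in h:
            return label
    return "other"
```

The point of the deep models is precisely to go beyond such surface cues, handling headers like "Our Framework" or unlabeled sections that keyword matching cannot classify.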
Open Information Extraction for Code-Mixed Hindi-English Social Media Data
1:00pm Monday, 2 July 2018, ITE 325b, UMBC
Open domain relation extraction (Angeli, Premkumar, & Manning 2015) is the process of finding relation triples. While a number of systems are available for open information extraction (Open IE) in a single language, traditional Open IE systems are not well suited to content that contains multiple languages in a single utterance. In this thesis, we have extended an existing code-mixed corpus (Das, Jamatia, & Gambck 2015) by finding and annotating relation triples in Open IE fashion. Using this newly annotated corpus, we experimented with a seq2seq neural network (Zhang, Duh, & Durme 2017) for finding relation triples. As prerequisites for the relation extraction pipeline, we developed a part-of-speech tagger and a named entity and predicate recognizer for code-mixed content. We experimented with various approaches, including conditional random fields (CRF), averaged perceptrons, and deep neural networks. To the best of our knowledge, this is the first relation extraction system for any code-mixed natural language. We achieved promising results for all of the components, which could be improved in the future with more code-mixed data.
Committee: Drs. Frank Ferraro (Chair), Tim Finin, Hamed Pirsiavash, Bryan Wilkinson
Understanding the Logical and Semantic Structure of Large Documents
Muhammad Mahbubur Rahman
11:00am Wednesday, 30 May 2018, ITE 325b
Understanding and extracting information from large documents, such as business opportunities, academic articles, medical documents and technical reports, poses challenges not present in short documents, because large documents may be multi-themed, complex, noisy and cover diverse topics. This dissertation describes a framework that can analyze large documents and help people and computer systems locate desired information in them. It aims to automatically identify and classify different sections of documents and understand their purpose within the document. A key contribution of this research is modeling and extracting the logical and semantic structure of electronic documents using deep learning techniques. The effectiveness and robustness of the framework is evaluated through extensive experiments on arXiv and requests for proposals datasets.
Committee Members: Drs. Tim Finin (Chair), Anupam Joshi, Tim Oates, Cynthia Matuszek, James Mayfield (JHU)
Preventing Poisoning Attacks on Threat Intelligence Systems
Nitika Khurana, Graduate Student, UMBC
11:00-12:00 Monday, 23 April 2018, ITE346, UMBC
As AI systems become more ubiquitous, securing them becomes an emerging challenge. Over the years, with the surge in online social media use and the data available for analysis, AI systems have been built to extract, represent and use this information. The credibility of this information extracted from open sources, however, can often be questionable. Malicious or incorrect information can cause a loss of money, reputation, and resources, and in certain situations can pose a threat to human life. In this paper, we determine the credibility of Reddit posts by estimating their reputation score to ensure the validity of information ingested by AI systems. We also maintain the provenance of the output generated to ensure information and source reliability and to identify the background data that caused an attack. We demonstrate our approach in the cybersecurity domain, where security analysts use these systems to determine possible threats by analyzing data scattered across social media websites, forums, blogs, etc.
We describe the systems developed by the UMBC team for 2018 SemEval Task 8, SecureNLP (Semantic Extraction from CybersecUrity REports using Natural Language Processing). We participated in three of the sub-tasks: (1) classifying sentences as being relevant or irrelevant to malware, (2) predicting token labels for sentences, and (4) predicting attribute labels from the Malware Attribute Enumeration and Characterization vocabulary for defining malware characteristics. We achieved F1 scores of 50.34/18.0 (dev/test), 22.23 (test-data), and 31.98 (test-data) for sub-tasks 1, 2 and 4, respectively. We also make our cybersecurity embeddings publicly available at https://bit.ly/cybr2vec.
Cognitively Rich Framework to Automate Extraction and Representation of Legal Knowledge
Srishty Saha, UMBC
11-12 Monday, 16 April 2018, ITE 346
With the explosive growth in cloud-based services, businesses increasingly maintain large datasets containing information about their consumers to provide a seamless user experience. To ensure the privacy and security of these datasets, regulatory bodies have specified rules and compliance policies that organizations must adhere to. These regulatory policies are currently available as text documents that are not machine processable and so require extensive manual effort to monitor continuously to ensure data compliance. We have developed a cognitive framework to automatically parse and extract knowledge from legal documents and represent it using an ontology. The legal ontology captures key entities and their relations, the provenance of legal policies, and cross-references between semantically similar legal facts and rules. We have applied this framework to the United States government's Code of Federal Regulations (CFR), which includes facts and rules for individuals and organizations seeking to do business with the US Federal government.
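As a toy illustration of rule extraction from regulatory text (not the framework's actual parser), a single deontic pattern can pull obligation triples out of simple sentences; real CFR prose requires far more sophisticated parsing.

```python
import re

def extract_obligation(sentence):
    """Toy pattern: '<party> shall/must <action>' -> (party, modality, action).
    Returns None when the sentence does not match the pattern."""
    m = re.match(r"(.+?)\s+(shall|must)\s+(.+)\.?$", sentence.strip())
    if not m:
        return None
    return (m.group(1), m.group(2), m.group(3).rstrip("."))
```

Triples like these, once normalized against ontology classes (party, obligation, action), are what allow compliance rules to be queried and cross-referenced automatically.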