As cybersecurity-related threats continue to increase, understanding how the field is changing over time can give insight into combating new threats and understanding historical events. We show how to apply dynamic topic models to a set of cybersecurity documents to understand how the concepts found in them are changing over time. We correlate two different data sets: the first relates to specific exploits and the second to cybersecurity research. We use Wikipedia concepts as a basis for concept phrase extraction and show how using concepts to provide context improves the quality of the topic model. We represent the results of the dynamic topic model as a knowledge graph that could be used for inference or information discovery.
In my knowledge graph class yesterday we talked about the SPARQL query language and I illustrated it with DBpedia queries, including an example getting data about the movie Double Indemnity. I had brought a Google Assistant device and used it to compare its answers to those from DBpedia. When I asked the Google Assistant “Who starred in the film Double Indemnity”, the first person it mentioned was Raymond Chandler. I knew this was wrong, since he was one of its screenwriters, not an actor, and shared an Academy Award nomination for the screenplay. DBpedia’s data was correct and did not list Chandler as one of the actors.
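A query of the kind described above might look like the following minimal sketch. It only builds the SPARQL text and a request URL and does not contact the server. The resource name Double_Indemnity and the format parameter are assumptions about how the public DBpedia endpoint is set up, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def starring_query(resource):
    """Build a SPARQL query listing the dbo:starring values of a DBpedia resource."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX dbr: <http://dbpedia.org/resource/>\n"
        "SELECT ?actor WHERE {\n"
        f"  dbr:{resource} dbo:starring ?actor .\n"
        "}"
    )


def request_url(query):
    """Encode the query as a GET request URL (no network access happens here)."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return DBPEDIA_ENDPOINT + "?" + urlencode(params)


# Assumption: the film's DBpedia resource is named Double_Indemnity.
query = starring_query("Double_Indemnity")
url = request_url(query)
print(query)
```

Fetching the URL with any HTTP client should return the bindings for ?actor as JSON, which can then be compared with the assistant's spoken answer.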
I did not feel too bad about this: we shouldn’t expect perfect accuracy in these huge, general-purpose knowledge graphs, and at least Chandler played an important role in making the film.
After class I looked at the Wikidata page for Double Indemnity (Q478209) and saw that it did list Chandler as an actor. I take this as evidence that Google’s Knowledge Graph got this incorrect fact from Wikidata, or perhaps from a precursor, Freebase.
The good news 🙂 is that Wikidata had flagged the fact that Chandler (Q180377) was a cast member in Double Indemnity with a “potential issue”. Clicking on this revealed that Chandler was not known to have an occupation that the “cast member” property (P161) expects. The expected set includes twelve types, such as actor, opera singer, comedian, and ballet dancer. Wikidata lists Chandler’s occupations as screenwriter, novelist, writer, and poet.
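The check behind that flag can be sketched in a few lines. This is an illustration of the idea, not Wikidata's actual constraint machinery: the data structures and the function name are hypothetical, the expected-occupation set holds only the four types named above (not all twelve), and the Barbara Stanwyck entry is illustrative data added for contrast.

```python
# Occupations a cast-member value is expected to have. Only the four types
# named in the post are listed, so this set is deliberately incomplete.
EXPECTED_OCCUPATIONS = {"actor", "opera singer", "comedian", "ballet dancer"}

# Illustrative data; the dictionary layout is a hypothetical stand-in for
# the occupation statements stored in the knowledge graph.
occupations = {
    "Raymond Chandler": {"screenwriter", "novelist", "writer", "poet"},
    "Barbara Stanwyck": {"actor"},
}
cast_members = ["Barbara Stanwyck", "Raymond Chandler"]


def potential_issues(cast, occupations, expected):
    """Return cast members with no occupation in the expected set."""
    return [
        person
        for person in cast
        if not (occupations.get(person, set()) & expected)
    ]


issues = potential_issues(cast_members, occupations, EXPECTED_OCCUPATIONS)
print(issues)
```

A person with no recorded occupations at all is also flagged, which matches the "not known to have" wording of the issue.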
More good news 😀 is that the Wikidata fact had provenance information in the form of a reference stating that it came from CSFD (Q3561957), a “Czech and Slovak web project providing a movie database”. Following the link Wikidata provided led me eventually to the resource, which allowed me to search for and find its Double Indemnity entry. Indeed, it lists Raymond Chandler as one of the movie’s Hrají. All that was left to do was to ask for a translation, which confirmed that Hrají means “starring”.
Case closed? Well, not quite. What remains is fixing the problem.
The final good news 🙂 is that it’s easy to edit or delete an incorrect fact in Wikidata. I plan to delete the incorrect fact in class next Monday. Over the weekend, I’ll look into options for adding an annotation so that the incorrect ČSFD source for Chandler being a cast member is ignored.
Some possible bad news 🙁 is that public knowledge graphs like Wikidata might be exploited by unscrupulous groups or individuals in the future to promote false or biased information. Wikipedia is reasonably resilient to this, but the problem may be harder to manage for public knowledge graphs, which get much of their data from other sources that could be manipulated.
Compressive sensing is a novel approach that linearly samples sparse or compressible signals at a rate much below the Nyquist-Shannon sampling rate and outperforms traditional signal processing techniques in acquiring and reconstructing such signals. Compressive sensing with matrix uncertainty is an extension of the standard compressive sensing problem that appears in various applications including, but not limited to, cognitive radio sensing, antenna calibration, and deconvolution. The original problem of compressive sensing is NP-hard, so the traditional techniques, such as convex and nonconvex relaxations and greedy algorithms, apply stringent constraints on the measurement matrix to indirectly handle this problem in the realm of classical computing.
We propose well-posed approaches for both binary compressive sensing and binary compressive sensing with matrix uncertainty problems that are tractable by quantum annealers. Our approach formulates an Ising model whose ground state represents a sparse solution for the binary compressive sensing problem and then employs an alternating minimization scheme to tackle the binary compressive sensing with matrix uncertainty problem. This setting only requires the solution uniqueness of the considered problem to have a successful recovery process, and therefore the required conditions on the measurement matrix are notably looser. As a proof of concept, we demonstrate the applicability of the proposed approach on D-Wave quantum annealers; however, our method can be adapted to other modern computing paradigms, such as adiabatic quantum computers in general, CMOS annealers, optical parametric oscillators, and neuromorphic computing.
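The idea of encoding binary compressive sensing as an energy minimization can be illustrated with a small sketch. The abstract does not give the exact formulation, so the objective here is an assumption: the squared residual ||Ax - y||^2 plus a sparsity penalty lam * sum(x), written as a QUBO. A QUBO is equivalent to an Ising model under the substitution x = (1 + s) / 2. Exhaustive search stands in for the annealer, and the matrix, penalty weight, and function names are all hypothetical.

```python
from itertools import product


def build_qubo(A, y, lam):
    """Return Q with x^T Q x = ||Ax - y||^2 + lam * sum(x) - y^T y for binary x."""
    m, n = len(A), len(A[0])
    # Quadratic part: A^T A.
    Q = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        # For binary x, x_i^2 == x_i, so linear terms fold into the diagonal.
        Q[i][i] += lam - 2 * sum(A[k][i] * y[k] for k in range(m))
    return Q


def energy(Q, x):
    """Evaluate x^T Q x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def ground_state(Q):
    """Brute-force minimizer; a stand-in for an annealer on tiny problems."""
    return min(product((0, 1), repeat=len(Q)), key=lambda x: energy(Q, x))


# Toy instance: 2 measurements of a 5-dimensional binary signal.
A = [[1, 2, 3, 5, 8],
     [1, -1, 2, -2, 1]]
x_true = (0, 1, 0, 0, 1)
y = [sum(a * x for a, x in zip(row, x_true)) for row in A]

Q = build_qubo(A, y, lam=0.1)
x_hat = ground_state(Q)
print(x_hat)
```

In this toy instance x_true is the only binary vector that reproduces y, so recovery depends only on solution uniqueness, in the spirit of the claim above. The brute-force search is exponential in the signal length and is only meant for checking small cases.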
Understanding large, structured documents like scholarly articles, requests for proposals or business reports is a complex and difficult task. It involves discovering a document’s overall purpose and subject(s), understanding the function and meaning of its sections and subsections, and extracting low-level entities and facts about them. In this research, we present a deep learning-based document ontology to capture the general-purpose semantic structure and domain-specific semantic concepts from a large number of academic articles and business documents. The ontology is able to describe different functional parts of a document, which can be used to enhance semantic indexing for a better understanding by human beings and machines. We evaluate our models through extensive experiments on datasets of scholarly articles from arXiv and Request for Proposal documents.
2018 Mid-Atlantic Student Colloquium on Speech, Language and Learning
The 2018 Mid-Atlantic Student Colloquium on Speech, Language and Learning (MASC-SLL) is a student-run, one-day event on speech, language & machine learning research to be held at the University of Maryland, Baltimore County (UMBC) from 10:00am to 6:00pm on Saturday, May 12. There is no registration charge, and lunch and refreshments will be provided. Students, postdocs, faculty and researchers from universities & industry are invited to participate and network with other researchers working in related fields.
Students and postdocs are encouraged to submit abstracts describing ongoing, planned, or completed research projects, including previously published results and negative results. Research in any field applying computational methods to any aspect of human language, including speech and learning, from all areas of computer science, linguistics, engineering, neuroscience, information science, and related fields is welcome. Submissions and presentations must be made by students or postdocs. Accepted submissions will be presented as either posters or talks.
My dissertation research is developing an approach to identify and explain errors in a knowledge graph constructed by extracting entities and relations from text. Information extraction systems can automatically construct knowledge graphs from a large collection of documents, which might be drawn from news articles, Web pages, social media posts or discussion forums. The language understanding task is challenging and current extraction systems introduce many kinds of errors. Previous work on improving the quality of knowledge graphs uses additional evidence from background knowledge bases or Web searches. Such approaches are difficult to apply when emerging entities are present and/or only one knowledge graph is available. To address the problem, I am using multiple complementary techniques including entity linking, common sense reasoning, and linguistic analysis.
We describe an approach using dynamic topic modeling to model influence and predict future trends in a scientific discipline. Our study focuses on climate change and uses assessment reports of the Intergovernmental Panel on Climate Change (IPCC) and the papers they cite. Since 1990, an IPCC report has been published every five years that includes four separate volumes, each of which has many chapters. Each report cites tens of thousands of research papers, which comprise a correlated dataset of temporally grounded documents. We use a custom dynamic topic modeling algorithm to generate topics for both datasets and apply cross-domain analytics to identify the correlations between the IPCC chapters and their cited documents. The approach reveals both the influence of the cited research on the reports and how previous research citations have evolved over time. For the IPCC use case, the report topic model used 410 documents and a vocabulary of 5911 terms, while the citations topic model was based on 200K research papers and a vocabulary of more than 25K terms. We show that our approach can predict the importance of its extracted topics on future IPCC assessments through the use of cross-domain correlations, Jensen-Shannon divergences and cluster analytics.
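For readers unfamiliar with the Jensen-Shannon divergence, here is a minimal sketch of how it compares two probability distributions, such as the term distributions of a report topic and a citation topic. The abstract does not say exactly which distributions are compared, so that pairing is an assumption, and the two example distributions are made up for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical term distributions over a shared three-word vocabulary.
report_topic = [0.7, 0.2, 0.1]
citation_topic = [0.6, 0.3, 0.1]

print(js_divergence(report_topic, citation_topic))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (distributions with no overlap), and it is symmetric in its arguments, which makes it convenient for matching topics across two models.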
The Spatial Data on the Web Working Group has published a W3C Recommendation of the Time Ontology in OWL specification. The ontology provides a vocabulary for expressing facts about relations among instants and intervals, together with information about durations, and about temporal position including date-time information. Time positions and durations may be expressed using either the conventional Gregorian calendar and clock, or using another temporal reference system such as Unix-time, geologic time, or different calendars.
The Ontology Summit is an annual series of online and in-person events that involves the ontology community and communities related to each year’s topic. The topic chosen for the 2018 Ontology Summit will be Ontologies in Context, which the summit describes as follows.
“In general, a context is defined to be the circumstances that form the setting for an event, statement, or idea, and in terms of which it can be fully understood and assessed. Some examples of synonyms include circumstances, conditions, factors, state of affairs, situation, background, scene, setting, and frame of reference. There are many meanings of “context” in general, and also for ontologies in particular. The summit this year will survey these meanings and identify the research problems that must be solved so that contexts can succeed in achieving the full understanding and assessment of an ontology.”
Each year’s Summit comprises a series of both online and face-to-face events that span about three months. These include a vigorous three-month online discourse on the theme, online panel discussions, and research activities, which culminate in a two-day face-to-face workshop and symposium.
Over the next two months, there will be a sequence of weekly online meetings to discuss, plan and develop the 2018 topic. The summit itself will start in January with weekly online sessions of invited speakers. Visit the 2018 Ontology Summit site for more information and to see how you can participate in the planning sessions.
Government regulations are critical to understanding how to do business with a government entity and receive other benefits. However, government regulations are also notoriously long and organized in ways that can be confusing for novice users. Developing cognitive assistance tools that remove some of the burden from human users is of potential benefit to a variety of users. The volume of data found in United States federal government regulation suggests a multiple-step approach: process the data into machine-readable text, create an automated legal knowledge base capturing various facts and rules, and eventually build a legal question and answer system to acquire understanding from various regulations and provisions. Our work discussed in this paper represents our initial efforts to build a framework for the Federal Acquisition Regulations System (Title 48, Code of Federal Regulations) in order to create an efficient legal knowledge base representing relationships between various legal elements, semantically similar terminologies, deontic expressions and cross-referenced legal facts and rules.
Digital Twin models are computerized clones of physical assets that can be used for in-depth analysis. Industrial production lines tend to have multiple sensors to generate near real-time status information for production. Industrial Internet of Things datasets are difficult to analyze for valuable insights such as points of failure, estimated overhead, etc. In this paper we introduce a simple way of formalizing knowledge coming from sensors in industrial production lines as digital twin models. We present a way to extract and infer knowledge from large-scale production line data and to enhance manufacturing process management with reasoning capabilities by introducing a semantic query mechanism. Our system primarily utilizes a graph-based query language equivalent to conjunctive queries and has been enriched with inference rules.
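To make "conjunctive queries enriched with inference rules" concrete, here is a minimal sketch over a list of subject-predicate-object triples. It is a generic illustration and not the system described in the paper: the sensor and machine names, the predicates, and the single rule are all hypothetical.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check it agrees with an earlier binding.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            # Constant that does not match.
            return None
    return extended


def query(patterns, triples):
    """Evaluate a conjunctive query: every pattern must match under one binding."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


def apply_rule(body, head, triples):
    """Forward-chain one rule (body implies head) until no new facts appear."""
    facts = list(triples)
    changed = True
    while changed:
        changed = False
        for binding in query(body, facts):
            derived = tuple(binding.get(term, term) for term in head)
            if derived not in facts:
                facts.append(derived)
                changed = True
    return facts


# Hypothetical production-line facts.
triples = [
    ("sensor7", "attachedTo", "press1"),
    ("sensor7", "status", "fault"),
    ("sensor9", "attachedTo", "press2"),
    ("sensor9", "status", "ok"),
]

# Hypothetical rule: a machine with a faulty sensor attached is at risk.
facts = apply_rule(
    body=[("?s", "attachedTo", "?m"), ("?s", "status", "fault")],
    head=("?m", "status", "atRisk"),
    triples=triples,
)
at_risk = query([("?m", "status", "atRisk")], facts)
print(at_risk)
```

The query for at-risk machines only succeeds after the rule has run, since no input triple states that fact directly. That is the sense in which inference rules enrich plain conjunctive querying.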
UMBC’s Data Science Master’s program prepares students from a wide range of disciplinary backgrounds for careers in data science. In the core courses, students will gain a thorough understanding of data science through classes that highlight machine learning, data analysis, data management, ethical and legal considerations, and more.
Students will develop an in-depth understanding of the basic computing principles behind data science, including, but not limited to, data ingestion, curation and cleaning and the 4Vs of data science: Volume, Variety, Velocity, Veracity, as well as the implicit 5th V, Value. Through applying principles of data science to the analysis of problems within specific domains expressed through the program pathways, students will gain practical, real-world, industry-relevant experience.
The MPS in Data Science is an industry-recognized credential and the program prepares students with the technical and management skills that they need to succeed in the workplace.