UMBC eBiquity Blog
Tim Finin, 11:30am 29 November 2015
10:30am, Monday 30 November 2015, ITE 346
Online social media is a powerful platform for disseminating information during real-world events. Beyond the challenges of volume, variety and velocity of content generated on online social media, veracity poses a much greater challenge for effective utilization of this content by citizens, organizations, and authorities. Veracity of information refers to the trustworthiness, credibility, accuracy and completeness of the content. This work addresses the challenge of veracity, or trustworthiness, of content posted on social media. We focus on Twitter, one of the most popular microblogging web services today. We provide an in-depth analysis of misinformation spread on Twitter during real-world events. We show the effectiveness of automated techniques for detecting misinformation on Twitter using a combination of content, metadata, network, user profile and temporal features. We developed and deployed a novel framework, TweetCred, that indicates the trustworthiness or credibility of tweets posted during events. TweetCred, which was available as a browser plug-in, was installed and used by real Twitter users.
Dr. Aditi Gupta is a research associate in the Computer Science and Electrical Engineering Department at UMBC. She received her Ph.D. from the Indraprastha Institute of Information Technology, Delhi (IIIT-Delhi) in 2015 for her dissertation on designing and evaluating techniques to mitigate misinformation spread on microblogging web services.
Tim Finin, 7:08pm 28 November 2015
Ph.D. Dissertation Defense
Computer Science and Electrical Engineering
University of Maryland, Baltimore County
Rapid Plan Adaptation Through Offline
Analysis of Potential Plan Disruptors
Robert H. Holder, III
9:00am Wednesday, 9 December 2015, ITE 325b
Computing solutions to intractable planning problems is particularly problematic in dynamic, real-time domains. For example, visitation planning problems, such as a delivery truck that must deliver packages to various locations, can be mapped to a Traveling Salesman Problem (TSP). The TSP is an NP-complete problem, so planners must use heuristics to find solutions to any significantly large problem instance, and even then finding a solution can take a long time. Planners that solve the dynamic variant, the Dynamic Traveling Salesman Problem (DTSP), calculate an efficient route to visit a set of potentially changing locations. When new locations become known, DTSP planners typically use heuristics to add them to the previously computed route. Depending on the placement and quantity of these new locations, the efficiency of this adapted, approximate solution can vary significantly. Solving a DTSP in real time thus requires choosing between a TSP planner, which produces a relatively good but slowly generated solution, and a DTSP planner, which produces a lower-quality solution relatively quickly.
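As a rough illustration of the kind of heuristic a DTSP planner might use to add a newly known location to an existing route, here is a minimal sketch of cheapest insertion. This is a common textbook heuristic chosen for illustration, not necessarily the one used in the dissertation; the function names and the sample coordinates are assumptions.

```python
import math

def tour_length(tour, coords):
    """Length of the closed tour that visits the stops in the given order."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def cheapest_insertion(tour, coords, new_stop):
    """Return a copy of tour with new_stop placed on the edge where it adds the least length."""
    c = coords[new_stop]
    best_pos, best_added = None, math.inf
    for i in range(len(tour)):
        a = coords[tour[i]]
        b = coords[tour[(i + 1) % len(tour)]]
        # Extra distance from detouring a -> c -> b instead of going a -> b.
        added = math.dist(a, c) + math.dist(c, b) - math.dist(a, b)
        if added < best_added:
            best_pos, best_added = i + 1, added
    return tour[:best_pos] + [new_stop] + tour[best_pos:]

# Hypothetical delivery stops on a 4x4 square, plus one new stop E.
coords = {"A": (0, 0), "B": (4, 0), "C": (4, 4), "D": (0, 4), "E": (2, 0)}
route = ["A", "B", "C", "D"]
adapted = cheapest_insertion(route, coords, "E")
print(adapted, tour_length(adapted, coords))
```

The insertion is fast (one pass over the edges of the route) but purely local, which is why the quality of the adapted route depends on where and how many new locations appear.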
Instead of quickly generating approximate solutions or slowly generating better solutions at runtime, this dissertation introduces an alternate approach of precomputing a library of high-quality solutions prior to runtime. One could imagine a library containing a high-quality solution for every potential problem instance consisting of potential new locations, but this approach obviously does not scale with increasing problem complexity. Because complex domains preclude creating a comprehensive library, I instead choose a subset of all possible plans to include. Strategic plan selection will ensure that the library contains appropriate plans for future scenarios.
Committee: Drs. Marie desJardins (co-chair), Tim Finin (co-chair), Tim Oates, Donald Miner, R. Scott Cost
Tim Finin, 10:44am 21 November 2015
Log files comprise a record of different events happening in various applications, operating systems and even in network devices. Originally they were used to record information for diagnostic and debugging purposes. Nowadays, logs are also used to track events for auditing and forensics in case of malicious activities or system attacks. Various software systems such as intrusion detection systems, web servers, anti-virus and anti-malware systems, firewalls and network devices generate logs with useful information that can be used to protect against such attacks. Analyzing log files can help in proactively avoiding attacks against the systems. While there are existing tools that do a good job when the format of log files is known, the challenge lies in cases where log files are from unknown devices and of unknown formats. We propose a framework that takes any log file and automatically produces a semantic interpretation as a set of RDF Linked Data triples. The framework splits a log file into columns using regular expression-based or dictionary-based classifiers. Leveraging and modifying our existing work on inferring the semantics of tables, we identify every column from a log file and map it to concepts either from a general purpose KB like DBpedia or domain specific ontologies such as IDS. We also identify relationships between various columns in such log files. Converting large and verbose log files into such semantic representations will help in better search, integration and rich reasoning over the data.
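To make the column-classification idea concrete, here is a minimal sketch, assuming whitespace-separated log entries and a handful of regular-expression classifiers. The patterns, the `http://example.org/log/` namespace and the function names are all hypothetical; the framework described above maps columns to DBpedia or domain ontology concepts, which this sketch does not attempt.

```python
import re

# Hypothetical regular-expression classifiers: (column label, pattern).
CLASSIFIERS = [
    ("ipAddress", re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")),
    ("timestamp", re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}$")),
    ("httpStatus", re.compile(r"^[1-5]\d{2}$")),
]

def classify(value):
    """Return the label of the first classifier whose pattern matches, else 'unknown'."""
    for label, pattern in CLASSIFIERS:
        if pattern.match(value):
            return label
    return "unknown"

def line_to_triples(line, line_no, base="http://example.org/log/"):
    """Turn one log line into N-Triples-style strings, one triple per column."""
    subject = f"<{base}entry/{line_no}>"
    triples = []
    for col, value in enumerate(line.split()):
        label = classify(value)
        # Unrecognized columns fall back to a positional predicate.
        name = label if label != "unknown" else f"column{col}"
        literal = value.replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} <{base}{name}> "{literal}" .')
    return triples

sample = "2015-11-21T10:44:00 192.168.1.10 404 GET"
for triple in line_to_triples(sample, 1):
    print(triple)
```

A dictionary-based classifier could be added to `CLASSIFIERS` in the same spirit, for example by testing membership in a set of known HTTP methods.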
Tim Finin, 11:46am 20 November 2015
Introduction to Deep Learning
Zhiguang Wang and Hang Gao
10:00am Monday, 23 November 2015, ITE 346
Deep learning has been a hot topic and all over the news lately. It was introduced with the ambition of moving Machine Learning closer to Artificial Intelligence, one of its original goals. Since the concept was introduced, various related algorithms have been proposed and have achieved significant success in their respective areas. This talk aims to provide a brief overview of the most common deep learning algorithms, along with their applications to different tasks.
In this talk, Steve (Zhiguang Wang) will give a brief introduction to the application of deep learning algorithms in computer vision and speech, some basic viewpoints on training methods and on attacking the non-convexity in deep neural nets, along with some miscellaneous topics in deep learning.
Hang Gao will then talk about common applications of deep learning algorithms in Natural Language Processing, covering semantic, syntactic and sentiment analysis. He will also discuss the limits of current applications of deep learning algorithms in NLP and offer some ideas on possible future trends.
Tim Finin, 5:59pm 8 November 2015
In this report, we describe the Unified Cyber Security ontology (UCO) to support situational awareness in cyber security systems. The ontology is an effort to incorporate and integrate heterogeneous information available from different cyber security systems and the most commonly used cyber security standards for information sharing and exchange. The ontology has also been mapped to a number of existing cyber security ontologies as well as concepts in the Linked Open Data cloud. Similar to DBpedia, which serves as the core of the Linked Open Data cloud, we envision UCO serving as the core of a specialized cyber security Linked Open Data cloud that will evolve and grow over time as additional cyber security data sets become available. We also present a prototype system and concrete use-cases supported by the UCO ontology. To the best of our knowledge, this is the first cyber security ontology that has been mapped to general world ontologies to support broader and diverse security use-cases. We compare the resulting ontology with previous efforts, discuss its strengths and limitations, and describe potential future work directions.
Tim Finin, 9:48pm 5 November 2015
Extracting Structured Summaries
from Text Documents
Dr. Zareen Syed
Research Assistant Professor, UMBC
10:30am, Monday, 9 November 2015, ITE 346, UMBC
In this talk, Dr. Syed will present unsupervised approaches for automatically extracting structured summaries composed of slots and fillers (attributes and values) and important facts from articles, thus effectively reducing the amount of time and effort spent on gathering intelligence by humans using traditional keyword based search approaches. The approach first extracts important concepts from text documents and links them to unique concepts in the Wikitology knowledge base. It then exploits the types associated with the linked concepts to discover candidate slots and fillers. Finally, it applies specialized approaches for ranking and filtering slots to select the most relevant slots to include in the structured summary.
Compared with the state of the art, Dr. Syed’s approach is unrestricted, i.e., it does not require a manually crafted catalogue of slots or relations of interest, which may vary across domains. Unlike Natural Language Processing (NLP) based approaches that require well-formed sentences, the approach can be applied to semi-structured text. Furthermore, NLP based approaches for fact extraction extract lexical facts and sentences that require further processing for disambiguating and linking to unique entities and concepts in a knowledge base, whereas in Dr. Syed’s approach, concept linking is done as a first step in the discovery process. Linking concepts to a knowledge base provides the additional advantage that the terms can be explicitly linked or mapped to semantic concepts in other ontologies and are thus available for reasoning in more sophisticated language understanding systems.
Tim Finin, 9:45am 1 November 2015
To efficiently utilize their cloud based services, consumers have to continuously monitor and manage the Service Level Agreements (SLA) that define the service performance measures. Currently this is still a time and labor intensive process since the SLAs are primarily stored as text documents. We have significantly automated the process of extracting, managing and monitoring cloud SLAs using natural language processing techniques and Semantic Web technologies. In this paper we describe our prototype system that uses a Hadoop cluster to extract knowledge from unstructured legal text documents. For this prototype we have considered publicly available SLA/terms of service documents of various cloud providers. We use established natural language processing techniques in parallel to speed up cloud legal knowledge base creation. Our system considerably speeds up knowledge base creation and can also be used in other domains that have unstructured data.
Tim Finin, 10:41pm 30 October 2015
In this week’s ebiquity lab meeting (10:30am Monday Nov 2), Tim Finin will describe recent work on the Kelvin information extraction system and its performance in two tasks in the 2015 NIST Text Analysis Conference. Kelvin has been under development at the JHU Human Language Technology Center of Excellence for several years. Kelvin reads documents in several languages and extracts entities and relations between them. This year it was used for the Cold Start Knowledge Base Population and Trilingual Entity Discovery and Linking tasks. Key components in the tasks are a system for cross-document coreference and another that links entities to entries in the Freebase knowledge base.
Tim Finin, 12:57pm 29 October 2015
Lyrics Augmented Multi-modal
1:00pm Friday 30 October, ITE 325b
In an increasingly mobile and connected world, digital music consumption has rapidly increased. More recently, faster and cheaper mobile bandwidth has given the average mobile user the potential to access large troves of music through streaming services like Spotify and Google Music that boast catalogs with tens of millions of songs. At this scale, effective music recommendation is critical for music discovery and personalized user experience.
Recommenders that rely on collaborative information suffer from two major problems: the long tail problem, which is induced by popularity bias, and the cold start problem caused by new items with no data. In such cases, they fall back on content to compute similarity. For music, content based features can be divided into acoustic and textual domains. Acoustic features are extracted from the audio signal while textual features come from song metadata, lyrical content, collaborative tags and associated web text.
Research in content based music similarity has largely focused on the acoustic domain, while text based features have been limited to metadata, tags and shallow methods for web text and lyrics. Song lyrics house information about the sentiment and topic of a song that cannot be easily extracted from the audio. Past work has shown that even shallow lyrical features improved on audio-only features and, in some tasks like mood classification, outperformed them. In addition, lyrics are easily available, which makes them a valuable resource and warrants a deeper analysis.
The goal of this research is to fill the lyrical gap in existing music recommender systems. The first step is to build algorithms to extract and represent the meaning and emotion contained in the song’s lyrics. The next step is to effectively combine lyrical features with acoustic and collaborative information to build a multi-modal recommendation engine.
For this work, the genre is restricted to Rap because it is a lyrics-centric genre and techniques built for Rap can be generalized to other genres. It was also the highest streamed genre in 2014, accounting for 28.5% of all music streamed. Rap lyrics are scraped from dedicated lyrics websites like ohhla.com and genius.com, while the semantic knowledge base comprising artist, album and song metadata comes from the MusicBrainz project. Acoustic features are taken directly from EchoNest, while collaborative information like tags, plays, co-plays etc. comes from Last.fm.
Preliminary work involved extraction of compositional style features like rhyme patterns and density, vocabulary size, simile and profanity usage from over 10,000 songs by over 150 artists. These features are available for users to browse and explore through interactive visualizations on Rapalytics.com. Song semantics were represented using off-the-shelf neural language based vector models (doc2vec). Future work will involve building novel language models for lyrics and latent representations for attributes that are driven by collaborative information for multi-modal recommendation.
Committee: Drs. Tim Finin (Chair), Anupam Joshi, Pranam Kolari (WalmartLabs), Cynthia Matuszek and Tim Oates
Tim Finin, 5:32pm 25 October 2015
In this week’s ebiquity meeting (10:30am Monday, 26 October 2015 in ITE346 at UMBC), Sandeep Nair will talk about his research on securing the cyber-physical systems in modern vehicles.
Vehicles have changed from purely mechanical devices that simply obey commands into smarter Sensor-ECU-Actuator systems that sense their surroundings and take appropriate actions. A modern car has around forty to a hundred different ECUs, possibly communicating with one another, to make intelligent decisions. Recently, however, there has been a lot of buzz in the research community about hacking and taking control of vehicles, and the literature documents different ways to do so. In this talk, we will first discuss what makes this kind of hacking possible. We will then cover the different logical ways to carry it out and discuss some proposed mechanisms to protect against it. Finally, we propose a context-aware mechanism that can detect these unsafe behaviors in the vehicle and describe the challenges associated with it.
Tim Finin, 12:34am 24 October 2015
Abhay Kashyap, Lushan Han, Roberto Yus, Jennifer Sleeman, Taneeya Satyapanich, Sunil Gandhi and Tim Finin, Robust Semantic Text Similarity Using LSA, Machine Learning and Linguistic Resources, Language Resources and Evaluation, Springer, to appear.
Semantic textual similarity is a measure of the degree of semantic equivalence between two pieces of text. We describe the SemSim system and its performance in the *SEM 2013 and SemEval-2014 tasks on semantic textual similarity. At the core of our system lies a robust distributional word similarity component that combines Latent Semantic Analysis and machine learning augmented with data from several linguistic resources. We used a simple term alignment algorithm to handle longer pieces of text. Additional wrappers and resources were used to handle task specific challenges that include processing Spanish text, comparing text sequences of different lengths, handling informal words and phrases, and matching words with sense definitions. In the *SEM 2013 task on Semantic Textual Similarity, our best performing system ranked first among the 89 submitted runs. In the SemEval-2014 task on Multilingual Semantic Textual Similarity, we ranked a close second in both the English and Spanish subtasks. In the SemEval-2014 task on Cross-Level Semantic Similarity, we ranked first in Sentence-Phrase, Phrase-Word, and Word-Sense subtasks and second in the Paragraph-Sentence subtask.
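The idea of scoring longer texts by aligning their terms can be sketched roughly as follows. This is only an illustration under stated assumptions: the `word_sim` function is a stand-in that uses exact matches plus a small hypothetical lookup table, where the system described above uses an LSA-based word similarity model, and the symmetric averaging is one simple choice, not necessarily the paper's formula.

```python
def word_sim(a, b, table):
    """Stand-in word similarity: 1.0 for identical words, else a table lookup (default 0.0)."""
    if a == b:
        return 1.0
    return table.get(frozenset((a, b)), 0.0)

def align_score(src, dst, table):
    """Align each source term to its most similar target term and average the scores."""
    if not src or not dst:
        return 0.0
    return sum(max(word_sim(s, d, table) for d in dst) for s in src) / len(src)

def text_similarity(text1, text2, table):
    """Symmetric similarity: mean of the two directional alignment scores."""
    a, b = text1.lower().split(), text2.lower().split()
    return 0.5 * (align_score(a, b, table) + align_score(b, a, table))

# Hypothetical similarity table; a real system would compute these values.
table = {frozenset(("car", "automobile")): 0.9}
score = text_similarity("The car stopped", "The automobile stopped", table)
print(round(score, 3))
```

Averaging both directions keeps a short text that is fully covered by a much longer one from receiving a perfect score.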
Tim Finin, 7:38pm 16 October 2015
Demystifying Word2Vec – A Hands-on Tutorial
10:30am Monday, 19 October 2015 **ITE 456**
In the world of NLP, Word2Vec is one of the coolest kids in town! But what exactly is it and how does it work? More importantly, how is it used/useful?
For the first 10-15 minutes, we will go over distributional and distributed representations of words and the neural language model behind Word2Vec. We will also briefly look at doc2vec, the extension of Word2Vec for longer pieces of text.
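For a feel of what a distributional representation is, here is a minimal count-based sketch: each word is represented by the counts of the words that appear near it, and words are compared with cosine similarity. This is not Word2Vec itself, which learns dense vectors with a neural model; the toy sentences, the window size and the function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sentence[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy corpus: "cat" and "dog" share contexts, "stocks" does not.
sentences = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
    ["stocks", "fell", "sharply", "today"],
]
vecs = cooccurrence_vectors(sentences)
print(cosine(vecs["cat"], vecs["dog"]), cosine(vecs["cat"], vecs["stocks"]))
```

Words that occur in similar contexts end up with similar vectors, which is the intuition that distributed models such as Word2Vec build on with much lower-dimensional learned vectors.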
For the remainder of the time (45-60 minutes), we will get our feet wet by running Word2Vec on a dataset which will then be followed by discussions about potential ways it can be useful for your own work.
What to bring – Any computing machine with Python installed, lots of curiosity and some delicious snacks for me maybe? We will use the excellent gensim package for Python to run Word2Vec along with Cython to speed things up. If you aren’t familiar with Python or don’t like it, no worries! It’s really just 5-6 lines of code! The training dataset will be provided. If you wish to bring your own, that’s cool too.
NOTE: We will hold this week’s Ebiquity meeting in ITE 456.