Archive for the 'Machine Learning' Category
May 12th, 2016, by Tim Finin, posted in Datamining, High performance computing, Machine Learning, NLP
Topic Modeling for Analyzing Document Collection
Computer Science, University of Miami
11:00am Monday, 16 May 2016, ITE 325b, UMBC
Topic modeling (in particular, Latent Dirichlet Allocation) is a technique for analyzing a large collection of documents. In topic modeling we view each document as a frequency vector over a vocabulary and each topic as a static distribution over the vocabulary. Given a desired number, K, of document classes, a topic modeling algorithm attempts to estimate concurrently the K static distributions and, for each document, how much each of the K classes contributes. Mathematically, this is the problem of approximating the matrix generated by stacking the frequency vectors by the product of two non-negative matrices, where both the column dimension of the first matrix and the row dimension of the second matrix are equal to K. Topic modeling has recently been gaining popularity for analyzing large collections of documents.
In this talk I will present some examples of applying topic modeling: (1) a sentiment analysis of a small collection of short patient surveys, (2) exploratory content analysis of a large collection of letters, (3) document classification based upon topics and other linguistic features, and (4) exploratory analysis of a large collection of literary works. I will discuss not only the topic modeling steps themselves but also all the preprocessing steps for preparing the documents for topic modeling.
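The matrix view described above can be sketched in a few lines of Python. This is only an illustration: it uses plain non-negative matrix factorization with multiplicative updates rather than LDA itself, and the toy document-term matrix, the choice K = 2, and the iteration count are all assumptions made for the example, not details from the talk.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Squared Frobenius norm of V - W H (the approximation error)."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Approximate V (documents x vocabulary) by W (documents x k) times H (k x vocabulary)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Strictly positive random start, so every entry stays non-negative.
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num = matmul(Wt, V)
        den = matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num = matmul(V, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return W, H

# Made-up word counts: 4 documents over a 6-word vocabulary, two obvious themes.
V = [
    [3, 2, 4, 0, 0, 0],
    [2, 3, 3, 0, 0, 0],
    [0, 0, 0, 4, 2, 3],
    [0, 0, 0, 3, 3, 2],
]
W, H = nmf(V, k=2)
print("approximation error:", frobenius_error(V, W, H))
```

Each row of H plays the role of a topic (a weighting over the vocabulary) and each row of W says how strongly a document draws on each topic. A real analysis would more likely use an LDA implementation from an established library than this hand-rolled factorization.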
Mitsunori Ogihara is a Professor of Computer Science at the University of Miami, Coral Gables, Florida. There he directs the Data Mining Group in the Center for Computational Science, a university-wide organization for providing resources and consultation for large-scale computation. He has published three books and approximately 190 papers in conferences and journals. He is on the editorial board for Theory of Computing Systems and International Journal of Foundations of Computer Science. Ogihara received a Ph.D. in Information Sciences from Tokyo Institute of Technology in 1993 and was a tenure-track/tenured faculty member in the Department of Computer Science at the University of Rochester from 1994 to 2007.
May 8th, 2016, by Tim Finin, posted in cybersecurity, Machine Learning, Security
Vehicles can be considered a specialized form of Cyber Physical System, with sensors, ECUs and actuators working together to produce coherent behavior. With the advent of external connectivity, a larger attack surface has opened up, which affects not only the passengers inside vehicles but also the people around them. One of the main causes of this increased attack surface is that advanced systems are built on top of old and less secure common bus frameworks that lack basic authentication mechanisms. To make such systems more secure, we approach the issue as a data analytics problem of detecting anomalous states. To accomplish that, we collected data flowing between different components of real vehicles and, using a Hidden Markov Model, we detect malicious behaviors and issue alerts while a vehicle is in operation. Our evaluations, using a single parameter and two parameters together, provide enough evidence that such techniques could be successfully used to detect anomalies in vehicles. Moreover, our method could be used in new vehicles as well as older ones.
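As a rough illustration of how a Hidden Markov Model can flag anomalous behavior, the sketch below scores a window of discretized readings with the forward algorithm and raises an alert when the average log-likelihood is low. Everything specific here is assumed for the example: the two hidden states, the three observation symbols, the probabilities, and the threshold are invented, whereas the work described above learns its model from real vehicle data.

```python
import math

# Hypothetical model: 2 hidden driving states, 3 observation symbols
# (0 = small change in a reading, 1 = moderate change, 2 = large jump).
START = [0.8, 0.2]
TRANS = [[0.9, 0.1],
         [0.3, 0.7]]
EMIT = [[0.85, 0.14, 0.01],
        [0.30, 0.65, 0.05]]

def forward_log_likelihood(obs, start, trans, emit):
    """Log-probability of an observation sequence under a discrete HMM
    (forward algorithm, rescaled at each step to avoid underflow)."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under the model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik

def is_anomalous(obs, threshold=-1.5):
    """Flag a window whose average per-observation log-likelihood is below the threshold."""
    return forward_log_likelihood(obs, START, TRANS, EMIT) / len(obs) < threshold

normal_window = [0, 0, 1, 0, 0, 1, 1, 0]
suspicious_window = [2, 0, 2, 2, 0, 2, 2, 2]
print(is_anomalous(normal_window), is_anomalous(suspicious_window))
```

In practice the threshold would be tuned on held-out normal traffic, and the observation symbols would come from discretizing real bus signals.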
March 27th, 2016, by Tim Finin, posted in cybersecurity, Machine Learning, Mobile Computing, Security
Down the rabbit hole: An Android system call study
Prajit Kumar Das
10:30 am, Monday, March 28, 2016 ITE 346
App permissions and application sandboxing are the fundamental security mechanisms that protect user data on mobile platforms. We have worked on permission analytics before and concluded that just studying an app’s requested access rights (permissions) isn’t enough to understand potential data breaches. Techniques like privilege escalation have previously been used to gain further access to a user and her data on mobile platforms like Android. Static code analysis and dynamic code execution may be studied to gather further insight into an app’s behavior. However, there is a need to study such behavior at the lowest level of code execution, that is, system calls. The system call is the fundamental interface between an application and the Linux kernel. In our current project, we are studying the system calls made by apps to gain a better understanding of their behavior.
February 27th, 2016, by Tim Finin, posted in AI, Machine Learning, NLP
Image description using deep neural networks
10:30 am, Monday, February 29, 2016 ITE 346
With the explosion of image data on the Internet, there has been a need for automatic generation of image descriptions. In this project we use deep neural networks to extract vectors from images and use them to generate text that describes the image. The model that we built makes use of the pre-trained VGGNET, a model for image classification, and a recurrent neural network (RNN) for language modelling. The combination of the two neural networks provides a multimodal embedding between image vectors and word vectors. We trained the model on 8000 images from the Flickr8k dataset and we present our results on test images downloaded from the Internet. We provide a web service for image description generation that takes an image URL as input and provides an image description and image categories as output. Through our service, a user can correct the description automatically generated by the system so that we can improve our model using the corrected descriptions.
Sunil Gandhi is a Computer Science Ph.D. student at UMBC who is part of the Cognition, Robotics, and Learning (CORAL) research lab.
February 15th, 2016, by Tim Finin, posted in Machine Learning
Developmental Memetic Algorithms: A Fast and
Efficient Approach for Optimization Applications
10:30am, Monday, 22 February 2016, ITE 346
A memetic algorithm, as a hybrid strategy, is an intelligent optimization method for problem solving. These algorithms are similar in nature to genetic algorithms in that they follow evolutionary strategies, but they also incorporate a refinement phase during which they learn about the problem and the search space. The efficiency of these algorithms depends on the nature and architecture of the imitation operator used. In this presentation, after a brief introduction, the pros and cons of employing memetic algorithms will be discussed. Afterwards, developmental memetic algorithms will be proposed as an approach for reducing the costs of using standard memetic algorithms. A developmental memetic algorithm is an adaptive memetic algorithm in which the influence factor of the environment on the learning abilities of each individual is set adaptively. This translates into a level of autonomous behavior once individuals have gained some experience. Simulation results on benchmark functions showed that this adaptive approach can increase the quality of the results and decrease the computation time simultaneously. The adaptive memetic algorithm also shows better stability when compared with the classic memetic algorithm.
December 28th, 2015, by Tim Finin, posted in Big data, cybersecurity, Datamining, Machine Learning, Security
Vehicles are becoming more and more connected. This opens up a larger attack surface, which affects not only the passengers inside vehicles but also the people around them. These vulnerabilities exist because modern systems are built on the old and comparatively less secure CAN bus framework, which lacks even basic authentication. Since a new protocol can only help future vehicles and not older ones, our approach treats the issue as a data analytics problem and uses machine learning techniques to secure cars. We develop a hidden Markov model to detect anomalous states from real data collected from vehicles. Using this model, we are able to detect anomalies and issue alerts while a vehicle is in operation. Our model could be integrated as a plug-n-play device in all new and old cars.
November 29th, 2015, by Tim Finin, posted in Machine Learning, Semantic Web, Social media, Web
10:30am, Monday 30 November 2015, ITE 346
Online social media is a powerful platform for the dissemination of information during real-world events. Beyond the challenges of volume, variety and velocity of content generated on online social media, veracity poses a much greater challenge for effective utilization of this content by citizens, organizations, and authorities. Veracity of information refers to the trustworthiness / credibility / accuracy / completeness of the content. This work addressed the challenge of veracity or trustworthiness of content posted on social media. We focus our work on Twitter, which is one of the most popular microblogging web services today. We provided an in-depth analysis of misinformation spread on Twitter during real-world events. We showed the effectiveness of automated techniques to detect misinformation on Twitter using a combination of content, meta-data, network, user profile and temporal features. We developed and deployed a novel framework, TweetCred, for providing an indication of the trustworthiness / credibility of tweets posted during events. TweetCred, which was available as a browser plug-in, was installed and used by real Twitter users.
Dr. Aditi Gupta is a research associate in the Computer Science and Electrical Engineering Department at UMBC. She received her Ph.D. from the Indraprastha Institute of Information Technology, Delhi (IIIT-Delhi) in 2015 for her dissertation on designing and evaluating techniques to mitigate misinformation spread on microblogging web services.
November 21st, 2015, by Tim Finin, posted in Machine Learning, Semantic Web
Log files comprise a record of different events happening in various applications, operating systems and even network devices. Originally they were used to record information for diagnostic and debugging purposes. Nowadays, logs are also used to track events for auditing and forensics in case of malicious activities or system attacks. Various software systems, such as intrusion detection systems, web servers, anti-virus and anti-malware systems, firewalls and network devices, generate logs with useful information that can be used to protect against such attacks. Analyzing log files can help in proactively avoiding attacks against the systems. While there are existing tools that do a good job when the format of log files is known, the challenge lies in cases where log files are from unknown devices and of unknown formats. We propose a framework that takes any log file and automatically produces a semantic interpretation as a set of RDF Linked Data triples. The framework splits a log file into columns using regular expression-based or dictionary-based classifiers. Leveraging and modifying our existing work on inferring the semantics of tables, we identify every column in a log file and map it to concepts either from a general-purpose KB like DBpedia or from domain-specific ontologies such as IDS. We also identify relationships between various columns in such log files. Converting large and verbose log files into such semantic representations will help in better search, integration and rich reasoning over the data.
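The regular expression-based step can be pictured with a small Python sketch. It is a simplification under stated assumptions: the field patterns, the labels, and the `http://example.org/log#` namespace are placeholders invented for the example, it classifies whitespace-separated tokens one line at a time rather than whole columns, and it does not attempt the mapping to DBpedia or domain ontologies that the framework performs.

```python
import re

# Hypothetical field classifiers; the first matching pattern wins.
CLASSIFIERS = [
    ("ipAddress", re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")),
    ("timestamp", re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z?$")),
    ("httpMethod", re.compile(r"^(GET|POST|PUT|DELETE|HEAD)$")),
    ("statusCode", re.compile(r"^[1-5]\d{2}$")),
    ("resourcePath", re.compile(r"^/\S*$")),
]

EX = "http://example.org/log#"  # placeholder namespace, not a real ontology

def classify(token):
    """Return the label of the first pattern that matches the token."""
    for label, pattern in CLASSIFIERS:
        if pattern.match(token):
            return label
    return "unknownField"

def escape_literal(text):
    """Escape backslashes and double quotes for use inside a quoted literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')

def line_to_triples(line, line_no):
    """Turn one whitespace-separated log line into N-Triples-style statements."""
    subject = f"<{EX}entry{line_no}>"
    triples = []
    for token in line.split():
        predicate = f"<{EX}{classify(token)}>"
        triples.append(f'{subject} {predicate} "{escape_literal(token)}" .')
    return triples

sample = "2015-11-21T10:30:00Z 192.168.1.7 GET /index.html 200"
for triple in line_to_triples(sample, 1):
    print(triple)
```

A dictionary-based classifier could slot into `classify` in the same way, for example by checking a token against a list of known protocol or service names before falling back to `unknownField`.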
November 20th, 2015, by Tim Finin, posted in Machine Learning, NLP
Introduction to Deep Learning
Zhiguang Wang and Hang Gao
10:00am Monday, 23 November 2015, ITE 346
Deep learning has been a hot topic and all over the news lately. It was introduced with the ambition of moving Machine Learning closer to Artificial Intelligence, one of its original goals. Since the introduction of the concept of deep learning, various relevant algorithms have been proposed and have achieved significant success in their corresponding areas. This talk aims to provide a brief overview of the most common deep learning algorithms, along with their application to different tasks.
In this talk, Steve (Zhiguang Wang) will give a brief introduction to the application of deep learning algorithms in computer vision and speech, some basic viewpoints on training methods and on attacking the non-convexity in deep neural nets, along with some miscellaneous topics in deep learning.
Hang Gao will then talk about common applications of deep learning algorithms in Natural Language Processing, covering semantic, syntactic and sentiment analysis. He will also discuss the limits of current applications of deep learning algorithms in NLP and provide some ideas on possible future trends.
October 29th, 2015, by Tim Finin, posted in Machine Learning, NLP, RDF, Semantic Web
Lyrics Augmented Multi-modal
1:00pm Friday 30 October, ITE 325b
In an increasingly mobile and connected world, digital music consumption has rapidly increased. More recently, faster and cheaper mobile bandwidth has given the average mobile user the potential to access large troves of music through streaming services like Spotify and Google Music that boast catalogs with tens of millions of songs. At this scale, effective music recommendation is critical for music discovery and personalized user experience.
Recommenders that rely on collaborative information suffer from two major problems: the long tail problem, which is induced by popularity bias, and the cold start problem caused by new items with no data. In such cases, they fall back on content to compute similarity. For music, content based features can be divided into acoustic and textual domains. Acoustic features are extracted from the audio signal while textual features come from song metadata, lyrical content, collaborative tags and associated web text.
Research in content based music similarity has largely been focused on the acoustic domain, while text based features have been limited to metadata, tags and shallow methods for web text and lyrics. Song lyrics house information about the sentiment and topic of a song that cannot be easily extracted from the audio. Past work has shown that even shallow lyrical features improved on audio-only features and, in some tasks like mood classification, outperformed them. In addition, lyrics are easily available, which makes them a valuable resource and warrants a deeper analysis.
The goal of this research is to fill the lyrical gap in existing music recommender systems. The first step is to build algorithms to extract and represent the meaning and emotion contained in the song’s lyrics. The next step is to effectively combine lyrical features with acoustic and collaborative information to build a multi-modal recommendation engine.
For this work, the genre is restricted to Rap because it is a lyrics-centric genre and techniques built for Rap can be generalized to other genres. It was also the highest streamed genre in 2014, accounting for 28.5% of all music streamed. Rap lyrics are scraped from dedicated lyrics websites like ohhla.com and genius.com, while the semantic knowledge base comprising artist, album and song metadata comes from the MusicBrainz project. Acoustic features are used directly from EchoNest, while collaborative information like tags, plays, co-plays etc. comes from Last.fm.
Preliminary work involved extraction of compositional style features like rhyme patterns and density, vocabulary size, and simile and profanity usage from over 10,000 songs by over 150 artists. These features are available for users to browse and explore through interactive visualizations on Rapalytics.com. Song semantics were represented using off-the-shelf neural language based vector models (doc2vec). Future work will involve building novel language models for lyrics and latent representations for attributes that are driven by collaborative information for multi-modal recommendation.
Committee: Drs. Tim Finin (Chair), Anupam Joshi, Pranam Kolari (WalmartLabs), Cynthia Matuszek and Tim Oates
October 16th, 2015, by Tim Finin, posted in Big data, Machine Learning, NLP
Demystifying Word2Vec – A Hands-on Tutorial
10:30am Monday, 19 October 2015 **ITE 456**
In the world of NLP, Word2Vec is one of the coolest kids in town! But what exactly is it and how does it work? More importantly, how is it used/useful?
For the first 10-15 minutes, we will go over distributional and distributed representations of words and the neural language model behind Word2Vec. We will also briefly look at doc2vec, the extension of Word2Vec for longer pieces of text.
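To make the idea of a distributional representation concrete, here is a minimal Python sketch that represents each word by counts of the words appearing near it and compares words by cosine similarity. It is not Word2Vec: Word2Vec learns dense distributed vectors with a neural model, whereas this only counts co-occurrences. The tiny corpus and the window size are made up for the example.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up toy corpus, already tokenized.
sentences = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the mouse".split(),
    "the dog chases the ball".split(),
    "stocks fell on the market".split(),
]
vectors = cooccurrence_vectors(sentences)
print("cat vs dog:   ", cosine(vectors["cat"], vectors["dog"]))
print("cat vs market:", cosine(vectors["cat"], vectors["market"]))
```

Words that occur in similar contexts ("cat" and "dog") end up with more similar vectors than words that do not ("cat" and "market"), which is the intuition that Word2Vec builds on with learned, lower-dimensional vectors.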
For the remainder of the time (45-60 minutes), we will get our feet wet by running Word2Vec on a dataset which will then be followed by discussions about potential ways it can be useful for your own work.
What to bring – Any computing machine with Python installed, lots of curiosity and some delicious snacks for me maybe? We will use the excellent gensim package for Python to run Word2Vec along with Cython to speed things up. If you aren’t familiar with Python or don’t like it, no worries! It’s really just 5-6 lines of code! The training dataset will be provided. If you wish to bring your own, that’s cool too.
NOTE: We will hold this week’s Ebiquity meeting in ITE 456.
September 26th, 2015, by Tim Finin, posted in cybersecurity, Machine Learning, Privacy, Security
Is your personal data at risk?
App analytics to the rescue
10:30am Monday, 28 September 2015, ITE 346
According to VirusTotal, a prominent virus and malware tool, the Google Play Store has a few thousand apps from major malware families. Given such a revelation, access control systems for mobile data management have reached a state of critical importance. We propose the development of a system that would help us detect the pathways by which users’ data is being stolen from their mobile devices. We use a multi-layered approach which includes app metadata analysis, understanding code patterns, and detecting and eventually controlling dynamic data flow when such an app is installed on a mobile device. In this presentation we focus on the first part of our work and discuss the merits and flaws of our unsupervised learning mechanism for detecting possible malicious behavior by apps in the Google Play Store.