UMBC ebiquity
Big data

Archive for the 'Big data' Category

Jennifer Sleeman dissertation defense: Dynamic Data Assimilation for Topic Modeling

June 27th, 2017, by Tim Finin, posted in Big data, Earth science, Machine Learning, NLP, Ontologies, Semantic Web

Ph.D. Dissertation Defense

Dynamic Data Assimilation for Topic Modeling

Jennifer Sleeman
9:00am Thursday, 29 June 2017, ITE 325b, UMBC

Understanding how a particular discipline such as climate science evolves over time has received renewed interest. By understanding this evolution, predicting the future direction of that discipline becomes more achievable. Dynamic Topic Modeling (DTM) has been applied to a number of disciplines to model topic evolution as a means to learn how a particular scientific discipline and its underlying concepts are changing. Understanding how a discipline evolves, and its internal and external influences, can be complicated by how the information retrieved over time is integrated. Different techniques are used to integrate sources of information; however, less research has been dedicated to understanding how to integrate these sources over time. The method of data assimilation is commonly used in a number of scientific disciplines to both understand and make predictions of various phenomena, using numerical models and assimilated observational data over time.

In this dissertation, I introduce a novel algorithm for scientific data assimilation, called Dynamic Data Assimilation for Topic Modeling (DDATM), which uses a new cross-domain divergence method (CDDM) and DTM. By using DDATM, observational data in the form of full-text research papers can be assimilated over time starting from an initial model. DDATM can be used as a way to integrate data from multiple sources and, due to its robustness, can exploit the assimilating observational information to better tolerate missing model information. When compared with a DTM model, the assimilated model is shown to have better performance using standard topic modeling measures, including perplexity and topic coherence. The DDATM method is suitable for prediction and results in higher likelihood for subsequent documents. DDATM is able to overcome missing information during the assimilation process when compared with a DTM model. CDDM generalizes as a method that can also bring together multiple disciplines into one cohesive model enabling the identification of related concepts and documents across disciplines and time periods. Finally, grounding the topic modeling process with an ontology improves the quality of the topics and enables a more granular understanding of concept relatedness and cross-domain influence.

The results of this dissertation are demonstrated and evaluated by applying DDATM to 30 years of reports from the Intergovernmental Panel on Climate Change (IPCC) along with more than 150,000 documents that they cite to show the evolution of the physical basis of climate change.
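For readers unfamiliar with perplexity, one of the measures named in the abstract, here is a minimal sketch of how it is commonly computed from per-token log probabilities on held-out text. The function name and the probabilities for `model_a` and `model_b` are made up for illustration; they are not results from the dissertation.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum of per-token natural-log probabilities)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical probabilities that two models assign to the same four held-out tokens.
model_a = [math.log(p) for p in (0.10, 0.05, 0.20, 0.10)]
model_b = [math.log(p) for p in (0.20, 0.10, 0.25, 0.20)]

# The model that assigns higher probability to held-out text has lower perplexity.
print(perplexity(model_a), perplexity(model_b))
```

Lower perplexity means the model was less "surprised" by unseen documents, which is why it is paired with higher likelihood for subsequent documents.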

Committee Members: Drs. Tim Finin (co-advisor), Milton Halem (co-advisor), Anupam Joshi, Tim Oates, Cynthia Matuszek, Mark Cane, Rafael Alonso

UMBC Data Science Graduate Program Starts Fall 2017

June 16th, 2017, by Tim Finin, posted in Big data, Data Science, Database, Datamining, KR, Machine Learning, NLP


UMBC Data Science Graduate Programs

UMBC’s Data Science Master’s program prepares students from a wide range of disciplinary backgrounds for careers in data science. In the core courses, students will gain a thorough understanding of data science through classes that highlight machine learning, data analysis, data management, ethical and legal considerations, and more.

Students will develop an in-depth understanding of the basic computing principles behind data science, including, but not limited to, data ingestion, curation, and cleaning, and the 4Vs of data science (Volume, Variety, Velocity, and Veracity), as well as the implicit fifth V, Value. By applying principles of data science to the analysis of problems within specific domains expressed through the program pathways, students will gain practical, real-world, industry-relevant experience.

The MPS in Data Science is an industry-recognized credential and the program prepares students with the technical and management skills that they need to succeed in the workplace.

For more information and to apply online, see the Data Science MPS site.

Large Scale Cross Domain Temporal Topic Modeling for Climate Change Research

December 23rd, 2016, by Tim Finin, posted in Big data, Machine Learning, NLP

Jennifer Sleeman, Milton Halem, Tim Finin, Mark Cane, Advanced Large Scale Cross Domain Temporal Topic Modeling Algorithms to Infer the Influence of Recent Research on IPCC Assessment Reports (poster), American Geophysical Union Fall Meeting 2016, American Geophysical Union, December 2016.

One way of understanding the evolution of science within a particular scientific discipline is by studying the temporal influences that research publications had on that discipline. We provide a methodology for conducting such an analysis by employing cross-domain topic modeling and local cluster mappings of those publications with the historical texts to understand exactly when and how they influenced the discipline. We apply our method to the Intergovernmental Panel on Climate Change (IPCC) Assessment Reports and the citations therein. The IPCC reports were compiled by thousands of Earth scientists, with assessments issued approximately every five years over a 30-year span, and cite over 200,000 research papers.

talk: A Hybrid Task Graph Scheduler API, Tim Blattner, UMBC

April 24th, 2016, by Tim Finin, posted in Big data, High performance computing

A Hybrid Task Graph Scheduler API

Tim Blattner, UMBC

10:30am Monday, 25 April 2016, ITE 346

Scalability of applications is a key requirement to gaining performance in hybrid computing. Scheduling code to utilize the parallelism is difficult, particularly when dealing with dependencies, memory management, data motion, and processor occupancy. The Hybrid Task Graph Scheduler (HTGS) API increases programmer productivity in developing hybrid applications by creating a multiple-producer, multiple-consumer workflow system. HTGS improves upon existing task graph solutions with its design of execution pipelines, which enables multi-GPU computation through data decomposition and task graph clustering that are bound to physical GPUs. The HTGS API is also capable of managing dependencies between tasks, represents CPU and GPU memories independently, overlaps disk I/O and memory transfers, and utilizes all available compute resources. We demonstrate the HTGS API by comparing a hybrid microscopy image stitching application with and without HTGS. Using HTGS in image stitching reduces code size by ~25% and yields favorable performance compared to image stitching without HTGS.
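To make the "multiple-producer, multiple-consumer workflow" idea concrete, here is a small sketch using only the Python standard library. It is not the HTGS API and does not touch GPUs; the function `run_pipeline`, the tile values, and the squaring step are placeholders chosen for illustration.

```python
import queue
import threading

def run_pipeline(tiles, n_producers=2, n_consumers=3):
    """Push work items through a bounded queue shared by several producers and consumers."""
    work = queue.Queue(maxsize=4)    # bounded, so producers block when consumers fall behind
    results = []
    results_lock = threading.Lock()
    stop = object()                  # sentinel telling a consumer to exit

    def producer(chunk):
        for tile in chunk:
            work.put(tile)

    def consumer():
        while True:
            tile = work.get()
            if tile is stop:
                break
            value = tile * tile      # stand-in for real per-tile computation
            with results_lock:
                results.append(value)

    # Split the input round-robin across producers.
    chunks = [tiles[i::n_producers] for i in range(n_producers)]
    producers = [threading.Thread(target=producer, args=(c,)) for c in chunks]
    consumers = [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in consumers:
        work.put(stop)               # one sentinel per consumer, queued after all real work
    for t in consumers:
        t.join()
    return sorted(results)           # completion order is nondeterministic, so sort

print(run_pipeline(list(range(10))))
```

The bounded queue is the piece that keeps memory use in check: producers stall instead of piling up work faster than consumers can process it.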

Using Data Analytics to Detect Anomalous States in Vehicles

December 28th, 2015, by Tim Finin, posted in Big data, cybersecurity, Datamining, Machine Learning, Security


Sandeep Nair, Sudip Mittal and Anupam Joshi, Using Data Analytics to Detect Anomalous States in Vehicles, Technical Report, December 2015.

Vehicles are becoming more and more connected, which opens up a larger attack surface that affects not only the passengers inside vehicles but also the people around them. These vulnerabilities exist because modern systems are built on the comparatively less secure and old CAN bus framework, which lacks even basic authentication. Since a new protocol can only help future vehicles and not older vehicles, our approach tries to solve the issue as a data analytics problem and use machine learning techniques to secure cars. We develop a hidden Markov model to detect anomalous states from real data collected from vehicles. Using this model, we are able to detect anomalies and issue alerts while a vehicle is in operation. Our model could be integrated as a plug-n-play device in all new and old cars.
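As a rough illustration of how a hidden Markov model can flag anomalous behavior, the sketch below scores a window of discrete observations with the forward algorithm and raises a flag when the average per-step log-likelihood is low. This is one common way to use an HMM for anomaly detection, not necessarily the report's method; the two-state model, the three observation symbols, and the threshold are all invented for the example.

```python
import math

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: return log P(obs) under a discrete HMM."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_p = 0.0
    for step, o in enumerate(obs):
        if step > 0:
            # Propagate one step through the transition matrix, then weight by emission.
            alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
                     for s in range(n)]
        total = sum(alpha)
        if total == 0.0:
            return float("-inf")     # the model considers this sequence impossible
        log_p += math.log(total)
        alpha = [a / total for a in alpha]   # rescale to avoid underflow
    return log_p

def is_anomalous(obs, start, trans, emit, threshold):
    """Flag a window whose average per-step log-likelihood falls below threshold."""
    return log_likelihood(obs, start, trans, emit) / len(obs) < threshold

# Hypothetical two-state model; all numbers are illustrative, not from the report.
START = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.3, 0.7]]
EMIT = [[0.80, 0.19, 0.01],    # state 0 rarely emits symbol 2
        [0.20, 0.79, 0.01]]    # state 1 rarely emits symbol 2
THRESHOLD = -3.0               # hypothetical; would be tuned on normal driving data

print(is_anomalous([0, 0, 1, 1, 0], START, TRANS, EMIT, THRESHOLD))
print(is_anomalous([2, 2, 2, 2, 2], START, TRANS, EMIT, THRESHOLD))
```

In practice the model parameters would be learned from recorded vehicle data, and the threshold chosen to balance missed anomalies against false alerts.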

Demystifying Word2Vec: A Hands-on Tutorial

October 16th, 2015, by Tim Finin, posted in Big data, Machine Learning, NLP

Demystifying Word2Vec – A Hands-on Tutorial

Abhay Kashyap

10:30am Monday, 19 October 2015, ITE 456

In the world of NLP, Word2Vec is one of the coolest kids in town! But what exactly is it and how does it work? More importantly, how is it used/useful?

For the first 10-15 minutes, we will go over distributional and distributed representations of words and the neural language model behind Word2Vec. We will also briefly look at doc2vec, the extension of Word2Vec for longer pieces of text.

For the remainder of the time (45-60 minutes), we will get our feet wet by running Word2Vec on a dataset which will then be followed by discussions about potential ways it can be useful for your own work.

What to bring – Any computing machine with Python installed, lots of curiosity and some delicious snacks for me maybe? We will use the excellent gensim package for Python to run Word2Vec along with Cython to speed things up. If you aren’t familiar with Python or don’t like it, no worries! It’s really just 5-6 lines of code! The training dataset will be provided. If you wish to bring your own, that’s cool too.
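If you want a feel for the "distributional" idea before the session, here is a tiny count-based sketch that needs nothing beyond the standard library. It is not Word2Vec and does not use gensim: it just represents each word by the words that appear near it and compares words by cosine similarity. The toy corpus and function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Represent each word by counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Toy corpus, invented for this example.
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases mice".split(),
    "the dog chases cars".split(),
]
vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["cat"], vecs["dog"]))   # words used in similar contexts
print(cosine(vecs["cat"], vecs["milk"]))  # words used in different contexts
```

Word2Vec replaces these sparse counts with short dense vectors learned by a small neural model, but the intuition is the same: words that keep similar company end up close together.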

NOTE: We will hold this week’s Ebiquity meeting in ITE 456.

Querying RDF Data with Text Annotated Graphs

June 6th, 2015, by Tim Finin, posted in Big data, Database, Machine Learning, RDF, Semantic Web

New paper: Lushan Han, Tim Finin, Anupam Joshi and Doreen Cheng, Querying RDF Data with Text Annotated Graphs, 27th International Conference on Scientific and Statistical Database Management, San Diego, June 2015.

Scientists and casual users need better ways to query RDF databases or Linked Open Data. Using the SPARQL query language requires not only mastering its syntax and semantics but also understanding the RDF data model, the ontology used, and URIs for entities of interest. Natural language query systems are a powerful approach, but current techniques are brittle in addressing the ambiguity and complexity of natural language and require expensive labor to supply the extensive domain knowledge they need. We introduce a compromise in which users give a graphical “skeleton” for a query and annotate it with freely chosen words, phrases and entity names. We describe a framework for interpreting these “schema-agnostic queries” over open domain RDF data that automatically translates them to SPARQL queries. The framework uses semantic textual similarity to find mapping candidates and uses statistical approaches to learn domain knowledge for disambiguation, thus avoiding the expensive human effort required by natural language interface systems. We demonstrate the feasibility of the approach with an implementation that performs well in an evaluation on DBpedia data.

Discovering and Querying Hybrid Linked Data

June 5th, 2015, by Tim Finin, posted in Big data, KR, Machine Learning, Semantic Web


New paper: Zareen Syed, Tim Finin, Muhammad Rahman, James Kukla and Jeehye Yun, Discovering and Querying Hybrid Linked Data, Third Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data, held in conjunction with the 12th Extended Semantic Web Conference, Portoroz Slovenia, June 2015.

In this paper, we present a unified framework for discovering and querying hybrid linked data. We describe our approach to developing a natural language query interface for a hybrid knowledge base Wikitology, and present that as a case study for accessing hybrid information sources with structured and unstructured data through natural language queries. We evaluate our system on a publicly available dataset and demonstrate improvements over a baseline system. We describe limitations of our approach and also discuss cases where our system can complement other structured data querying systems by retrieving additional answers not available in structured sources.

talk: Amit Sheth on Transforming Big data into Smart Data, 11a Tue 5/26

May 17th, 2015, by Tim Finin, posted in Big data, Semantic Web

Transforming big data into smart data:
deriving value via harnessing volume, variety
and velocity using semantics and semantic web

Professor Amit Sheth
Wright State University

11:00am Tuesday, 26 May 2015, ITE 325, UMBC

Big Data has captured a lot of interest in industry, with the emphasis on the challenges of the four Vs of Big Data: Volume, Variety, Velocity, and Veracity, and their applications to drive value for businesses. In this talk, I will describe Smart Data that is realized by extracting value from Big Data, to benefit not just large companies but each individual. If my child is an asthma patient, for all the data relevant to my child with the four V-challenges, what I care about is simply, "How is her current health, and what is the risk of having an asthma attack in her current situation (now and today), especially if that risk has changed?" As I will show, Smart Data that gives such personalized and actionable information will need to utilize multimodal data and their metadata, use domain specific knowledge, employ semantics and intelligent processing, and go beyond traditional reliance on Machine Learning and NLP. I will motivate the need for a synergistic combination of techniques similar to the close interworking of the top brain and the bottom brain in the cognitive models. I will present a couple of Smart Data applications in development at Kno.e.sis from the domains of personalized health, health informatics, social data for social good, energy, disaster response, and smart city.

Amit Sheth is an Educator, Researcher and Entrepreneur. He is the LexisNexis Ohio Eminent Scholar, an IEEE Fellow, and the executive director of Kno.e.sis, the Ohio Center of Excellence in Knowledge-enabled Computing at Wright State University. In World Wide Web (WWW), it is placed among the top ten universities in the world based on 10-year impact. Prof. Sheth is a well-cited computer scientist (h-index = 87, >30,000 citations), and appears among the top 1-3 authors in World Wide Web (Microsoft Academic Search). He has founded two companies, and several commercial products and deployed systems have resulted from his research. His students are exceptionally successful; ten out of 18 past PhD students have 1,000+ citations each.

Host: Yelena Yesha

talk: Studying Internet Latency via TCP Queries to DNS, 1:30pm Fri 2/27

February 15th, 2015, by Tim Finin, posted in Big data

ACM Tech Talk

Studying Internet Latency via TCP Queries to DNS

Dr. Yannis Labrou
Principal Data Architect, Verisign

1:30-2:30pm Friday, 27 February 2015, ITE 456, UMBC

Every day Verisign processes upwards of 100 billion authoritative DNS requests for .COM and .NET from all corners of the earth. The vast majority of these requests are via the UDP protocol. Because UDP is connectionless, it is impossible to passively estimate the latency of the UDP-based requests. A very small percentage of these requests, though, are over TCP, thus providing the means to estimate the latency of specific requests and paths for a subset of the hosts that interact with Verisign’s network infrastructure.

In this work, we combine this relatively small number of datapoints from TCP (on the order of a few hundred million per day) with the much larger dataset of all DNS requests. Our focus is the process of data analysis of real world, imperfect data at very large scale with the goals of understanding network latency at an unprecedented magnitude, identifying large volume, high latency clients and improving their latency. We discuss the techniques we used for data selection and analysis and we present the results of a variety of analyses, such as deriving regional and country patterns, estimations for query latency for different countries and network locations, and techniques for identifying high latency clients.

It is important to note that latency results we will report are based on passive measurements from, essentially, the entire Internet. For this experiment we do not have control over the client side — where they are, which software, their configuration, their network congestion. This is significantly different from latency studied in any active measurement infrastructure such as Planet Lab, RIPE Atlas, Thousand Eyes, Catchpoint, etc.
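The abstract does not say exactly how latency is derived from the TCP requests. One common passive technique on the server side is to time the TCP handshake: the gap between sending the SYN-ACK and receiving the client's ACK approximates one round trip. The sketch below assumes that approach and then summarizes by group with a median; the record layout, the group labels "AA" and "BB", and the timestamps are hypothetical.

```python
from collections import defaultdict
from statistics import median

def handshake_rtt_ms(synack_sent_s, ack_received_s):
    """Round-trip estimate in milliseconds from two server-side timestamps (seconds)."""
    return (ack_received_s - synack_sent_s) * 1000.0

def median_rtt_by_group(records):
    """records: iterable of (group, synack_sent_s, ack_received_s) tuples."""
    groups = defaultdict(list)
    for group, synack_sent_s, ack_received_s in records:
        rtt = handshake_rtt_ms(synack_sent_s, ack_received_s)
        if rtt >= 0:                 # drop records with inconsistent timestamps
            groups[group].append(rtt)
    return {g: median(values) for g, values in groups.items()}

# Hypothetical handshake records; "AA" and "BB" stand in for countries or networks.
records = [
    ("AA", 10.000, 10.040),
    ("AA", 11.000, 11.060),
    ("AA", 12.000, 12.050),
    ("BB", 5.000, 5.200),
    ("BB", 6.000, 5.900),            # ACK before SYN-ACK: bad data, filtered out
]
print(median_rtt_by_group(records))
```

A median is used here rather than a mean because real-world, imperfect measurements tend to include extreme outliers that would otherwise dominate the summary.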


Dr. Yannis Labrou is Principal Data Architect at Verisign Labs where he leads efforts to create value from the wealth of data that Verisign’s operations generate every day. He brings to Verisign 20 years of experience in conceiving, creating and bringing to fruition innovations; combining thinking big with laboring through the pains of materializing ideas. He has done so in an academic environment, at a startup company, while conducting government and DoD/DARPA sponsored research and for a global Fortune 200 company.

Before joining Verisign, Dr. Labrou was a Senior Researcher at Fujitsu Laboratories of America; Director of Technology and member of the executive staff of PowerMarket, an enterprise application software start-up company; and a Research Assistant Professor at UMBC. He received his Ph.D. in Computer Science from UMBC, where his research focused on software agents, and a Diploma in Physics from the University of Athens, Greece. He has authored more than 40 peer-reviewed publications, with almost 4000 citations, and he has been awarded 14 patents from the USPTO. His current research focus is data through the entire lifecycle from generation to monetization.

2015 Ontology Summit: Internet of Things: Toward Smart Networked Systems and Societies

January 14th, 2015, by Tim Finin, posted in Agents, AI, Big data, Ontologies, Semantic Web, Web

The Internet of Things (IoT) is the interconnection of uniquely identifiable embedded computing devices within the existing Internet infrastructure.

The theme of the 2015 Ontology Summit is Internet of Things: Toward Smart Networked Systems and Societies. The Ontology Summit is an annual series of events (first started by Ontolog and NIST in 2006) that involve the ontology community and communities related to each year’s theme.

The 2015 Summit will hold a virtual discourse over the next three months via mailing lists and online panel sessions held as augmented conference calls. The Summit will culminate in a two-day face-to-face workshop on 13-14 April 2015 in Arlington, VA. The Summit’s goal is to explore how ontologies can play a significant role in the realization of smart networked systems and societies in the Internet of Things.

The Summit’s initial launch session will take place from 12:30pm to 2:00pm EDT on Thursday, January 15th and will include overview presentations from each of the four technical tracks. See the 2015 Ontology Summit for more information, the schedule and details on how to participate in these free and open events.

Exploring the meanings of geek vs. nerd

January 3rd, 2015, by Tim Finin, posted in Big data, NLP

Mark Liberman pointed out a nice use of pointwise mutual information (PMI) to explore the difference in meaning of geek vs. nerd, done last year by Burr Settles using Twitter data.

Settles’s original post, On “Geek” Versus “Nerd”, has a brief, but good, explanation of the method and data.
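For a quick sense of what PMI measures, here is a minimal sketch computed from raw counts. PMI compares how often two words actually occur together with how often they would if they were independent. The counts below are invented for illustration and are not taken from Settles's data.

```python
import math

def pmi(joint_count, word_count, context_count, total):
    """log2( P(word, context) / (P(word) * P(context)) ) estimated from raw counts."""
    p_joint = joint_count / total
    p_word = word_count / total
    p_context = context_count / total
    return math.log2(p_joint / (p_word * p_context))

# Hypothetical counts over 1,000 tweets: "geek" appears in 100, some other
# word appears in 50, and the two appear together in 20.
print(round(pmi(20, 100, 50, 1000), 6))

# If they co-occurred only as often as chance predicts (5 tweets), PMI is about zero.
print(round(pmi(5, 100, 50, 1000), 6))
```

A positive score means the two words show up together more than chance would predict; comparing a word's score with "geek" against its score with "nerd" shows which of the two it leans toward.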
