May 12th, 2016
Topic Modeling for Analyzing Document Collection
Computer Science, University of Miami
11:00am Monday, 16 May 2016, ITE 325b, UMBC
Topic modeling (in particular, Latent Dirichlet Allocation) is a technique for analyzing a large collection of documents. In topic modeling we view each document as a frequency vector over a vocabulary and each topic as a static distribution over the vocabulary. Given a desired number, K, of document classes, a topic modeling algorithm attempts to estimate concurrently K static distributions and, for each document, how much each of the K classes contributes. Mathematically, this is the problem of approximating the matrix generated by stacking the frequency vectors as the product of two non-negative matrices, where both the column dimension of the first matrix and the row dimension of the second matrix are equal to K. Topic modeling has recently been gaining popularity for analyzing large collections of documents.
In this talk I will present some examples of applying topic modeling: (1) a sentiment analysis of a small collection of short patient surveys, (2) exploratory content analysis of a large collection of letters, (3) document classification based upon topics and other linguistic features, and (4) exploratory analysis of a large collection of literary works. I will speak not only about the topic modeling steps themselves but also about all the preprocessing steps for preparing the documents for topic modeling.
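The matrix view in the abstract can be made concrete with a small sketch. The code below is not the speaker's method and is not LDA inference; it is a plain non-negative matrix factorization using multiplicative updates (a swapped-in, well-known technique), written with only the Python standard library. The vocabulary, the count matrix, and all function names are hypothetical and chosen for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Approximate V (documents x words) by W (documents x k) times H (k x words).

    Both factors start positive and stay non-negative because every update
    multiplies an entry by a ratio of non-negative numbers.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_words)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_words)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n_docs)]
    return w, h

def frobenius_error(v, w, h):
    """Frobenius norm of V - WH, i.e. how far the product is from the data."""
    approx = matmul(w, h)
    return sum((a - b) ** 2
               for row_v, row_a in zip(v, approx)
               for a, b in zip(row_v, row_a)) ** 0.5

# Hypothetical toy data: rows are documents, columns are word counts.
vocabulary = ["gene", "cell", "dna", "market", "stock", "trade"]
counts = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 4, 2, 1],
    [0, 0, 0, 1, 3, 2],
]
doc_topic, topic_word = nmf(counts, k=2)
```

Here `doc_topic` plays the role of "how much each of the K classes contributes" to each document, and each row of `topic_word` is an unnormalized weighting over the vocabulary. A probabilistic model such as LDA adds priors and normalization on top of this picture.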
Mitsunori Ogihara is a Professor of Computer Science at the University of Miami, Coral Gables, Florida. There he directs the Data Mining Group in the Center for Computational Science, a university-wide organization for providing resources and consultation for large-scale computation. He has published three books and approximately 190 papers in conferences and journals. He is on the editorial board for Theory of Computing Systems and International Journal of Foundations of Computer Science. Ogihara received a Ph.D. in Information Sciences from Tokyo Institute of Technology in 1993 and was a tenure-track/tenured faculty member in the Department of Computer Science at the University of Rochester from 1994 to 2007.
April 24th, 2016
A Hybrid Task Graph Scheduler API
Tim Blattner, UMBC
10:30am Monday, 25 April 2016, ITE 346
Scalability of applications is a key requirement for gaining performance in hybrid computing. Scheduling code to utilize the parallelism is difficult, particularly when dealing with dependencies, memory management, data motion, and processor occupancy. The Hybrid Task Graph Scheduler (HTGS) API increases programmer productivity in developing hybrid applications by creating a multiple-producer, multiple-consumer workflow system. HTGS improves upon existing task graph solutions with its design of execution pipelines, which enable multi-GPU computation through data decomposition and task graph clustering that are bound to physical GPUs. The HTGS API is also capable of managing dependencies between tasks, represents CPU and GPU memories independently, overlaps disk I/O and memory transfers, and utilizes all available compute resources. We demonstrate the HTGS API by comparing a hybrid microscopy image stitching application with and without HTGS. Using HTGS in image stitching reduces code size by ~25% and shows favorable performance compared to image stitching without HTGS.
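The multiple-producer, multiple-consumer workflow idea can be illustrated with a generic sketch. This is not the HTGS API and makes no claim about how HTGS is implemented; it only shows, with Python threads and queues, how tasks connected by queues let several workers per stage run concurrently. All names (`run_task`, `run_pipeline`, `STOP`) are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of a stream

def run_task(fn, inbox, outbox, n_workers):
    """Start n_workers threads that apply fn to items from inbox and put results on outbox."""
    def worker():
        while True:
            item = inbox.get()
            if item is STOP:
                inbox.put(STOP)  # pass the sentinel on so sibling workers also stop
                break
            outbox.put(fn(item))
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    return threads

def run_pipeline(items, stages):
    """Run items through stages, given as a list of (function, worker_count) pairs."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    groups = [run_task(fn, queues[i], queues[i + 1], n)
              for i, (fn, n) in enumerate(stages)]
    for item in items:
        queues[0].put(item)
    queues[0].put(STOP)
    for i, group in enumerate(groups):
        for t in group:
            t.join()
        queues[i + 1].put(STOP)  # this stage is done, so close its output stream
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    return results

# Two stages: three workers square each number, two workers add one.
outputs = sorted(run_pipeline(range(5), [(lambda x: x * x, 3), (lambda x: x + 1, 2)]))
```

Results arrive in no particular order because workers in a stage run concurrently, which is why the example sorts them. A real hybrid scheduler also has to handle memory management and data movement between CPU and GPU, which this sketch leaves out.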
January 20th, 2015
UMBC CSEE alumni Don Miner and Brandon Wilson have started a Meetup group for Hadoop users in and around the Baltimore area to discuss Hadoop technology and use cases.
Apache Hadoop is one of the most popular open-source tools used to harness clusters of computers to process, analyze or learn from massive amounts of data. Whether you are new to Hadoop or an experienced user, this is a great opportunity to improve your knowledge and network with others in the Baltimore computing technology community.
The first meeting will be held from 7:00pm to 9:30pm on Thursday, 19 February 2015 at AOL/Advertising.com at 1020 Hull St #100, Baltimore, MD (map). Join the group here.
January 17th, 2015
Facebook’s AI Research (FAIR) group has released open-source, optimized deep-learning modules for the open-source Torch development environment for numerics, machine learning, and computer vision, with a particular emphasis on deep learning and convolutional nets.
The release includes GPU-optimized modules for large convolutional nets and networks with sparse activations that are commonly used in NLP applications.
See fbcunn for installation instructions, documentation and examples to train classifiers and iTorch for an IPython Kernel for Torch.
February 8th, 2014
In the first Ebiquity meeting of the semester, Vlad Korolev will talk about his work on using RDF to capture, represent and use provenance information for big data experiments.
PROB: A tool for Tracking Provenance and Reproducibility of Big Data Experiments
10-11:30am, ITE346, UMBC
Reproducibility of computations and data provenance are very important goals to achieve in order to improve the quality of one’s research. Unfortunately, despite some efforts made in the past, it is still very hard to reproduce computational experiments with a high degree of certainty. The Big Data phenomenon in recent years makes this goal even harder to achieve. In this work, we propose a tool that helps researchers improve the reproducibility of their experiments through automated keeping of provenance records.
December 13th, 2012
The Center for Hybrid Multicore Productivity Research is a collaborative research center sponsored by the National Science Foundation with two university partners (UMBC and University of California San Diego), six government members, and seven industry members. The Center's research is focused on addressing productivity, performance, and scalability issues in meeting the insatiable computational demands of its members' applications through the continuous evolution of multicore architectures and open source tools.
As part of its annual industrial advisory board meeting next week, the center will hold an afternoon of public tutorials from 1:00pm to 4:00pm on Monday, 17 December 2012 in room 456 of the ITE building at UMBC. The tutorials will be presented by students doing research sponsored by the Center and feature some of the underlying technologies being used and some of their applications. The tutorials are:
- GPGPUs – Tim Blattner and Fahad Zafa
- Cloud Policies – Karuna Joshi
- Human Sensors Networks – Oleg Aulov
- Machine Learning Disaster Warnings – Han Dong
- Graph 500 – Tyler Simon
- HBase – Phuong Nyguen
The tutorial talks are free and open to the public. If you plan to attend, please RSVP by email to Dr. Valerie L. Thomas, email@example.com.
October 1st, 2011
mincemeat.py is a super-lightweight, open source Python implementation of the popular MapReduce distributed computing framework that depends only on the Python Standard Library.
Just install the single source file on a set of machines and invoke the script on them with a password (for authentication) and the IP address of the host, and your workers are good to go. Then, using the same package, run a simple server program that defines map, reduce and your data source.
While it’s only 350 lines of Python, the package looks great for teaching or experimenting with the MapReduce concept as well as being potentially useful if you work in Python.
September 20th, 2011
In this week’s ebiquity meeting (10:30am Tue 9/20 in ITE 325b) we will dive right into writing MapReduce programs, skipping all the gory details about Hadoop setup and MapReduce theory. In one hour, we will write a MapReduce Java program using Eclipse to create an inverted index, test it on a local box, and run it on an already set up Hadoop cluster. If we have time, we will also see how to do the same using Python instead of Java.
You are encouraged to do the following before the meeting if you want to code along.
- Review the Yahoo Introduction to MapReduce tutorial
- Download a free virtual machine image with Hadoop pre-installed, so you can get started quickly. Options are available for Linux, Windows and Mac OS X.
- Make sure you have JDK 1.6.x and Eclipse (or your favourite IDE) installed on your laptop.
- If you are planning to code along during the demo, download the latest stable release of Hadoop (0.20.2).
- Some people have been having problems with Cloudera’s 64 bit VM image. If you do, try this virtual machine from Yahoo Developer Network that contains a pre-installed Hadoop 0.20.
- Even if you are not able to get the VM running for now, you can still run the program(s) locally on your laptop using Eclipse.
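For readers who want to try the inverted-index exercise locally before the meeting, here is a small Python sketch. It assumes a tab-separated, line-oriented mapper and reducer in the style commonly used with Hadoop Streaming; the post does not say which Python approach the session will use, so that format, the function names, and the sample documents are all assumptions. The `sorted` call stands in for the framework's shuffle-and-sort step.

```python
from itertools import groupby

def mapper(lines):
    """Turn 'doc_id<TAB>text' lines into 'word<TAB>doc_id' lines."""
    for line in lines:
        doc_id, _, text = line.rstrip("\n").partition("\t")
        for word in set(text.lower().split()):
            yield f"{word}\t{doc_id}"

def reducer(sorted_lines):
    """Turn sorted 'word<TAB>doc_id' lines into 'word<TAB>doc1,doc2,...' lines."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        doc_ids = sorted({doc_id for _, doc_id in group})
        yield f"{word}\t{','.join(doc_ids)}"

def build_index(lines):
    """Run map, sort, and reduce in a single process."""
    return list(reducer(sorted(mapper(lines))))

# Hypothetical input documents.
documents = [
    "d1\thadoop runs mapreduce jobs",
    "d2\tmapreduce builds an inverted index",
    "d3\thadoop stores the index",
]
index = dict(line.split("\t") for line in build_index(documents))
```

The reducer relies on its input being sorted by word, which is why `build_index` sorts the mapper output first.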
February 24th, 2011
There will be a free CloudCamp meeting in Baltimore from 6:00pm to 10:00pm Wednesday March 9th at the Baltimore Marriott Waterfront. CloudCamps are participant-driven unconferences where users of Cloud Computing technologies meet to network and share ideas, experiences, challenges and solutions. The event is free but participants are asked to register to ensure there is enough food and refreshments.
Here is the current, tentative schedule:
6:00pm – Registration & Networking (food/drink)
6:30pm – Opening Introductions
6:45pm – Lightning Talks (5 minutes each)
7:30pm – Unpanel
8:00pm – Organize Unconference
8:15pm – Unconference Breakout Session Round 1
9:00pm – Unconference Breakout Session Round 2
9:45pm – Wrap-up
10:00pm – Find somewhere for post-event networking
Contact the organizers if you are interested in giving a five minute lightning talk or leading a breakout session.
October 28th, 2010
China’s Tianhe-1A is being recognized as the world’s fastest supercomputer. It has 7168 NVIDIA Tesla GPUs and achieved a Linpack score of 2.507 petaflops, a 40% speedup over Oak Ridge National Lab’s Jaguar, the previous top machine. Today’s WSJ has an article:
“Supercomputers are massive machines that help tackle the toughest scientific problems, including simulating commercial products like new drugs as well as defense-related applications such as weapons design and breaking codes. The field has long been led by U.S. technology companies and national laboratories, which operate systems that have consistently topped lists of the fastest machines in the world.
But Nvidia says the new system in Tianjin—which is being formally announced Thursday at an event in China—was able to reach 2.5 petaflops. That is a measure of calculating speed ordinarily translated into a thousand trillion operations per second. It is more than 40% higher than the mark set last June by a system called Jaguar at Oak Ridge National Laboratory that previously stood at No. 1 on a twice-yearly ranking of the 500 fastest supercomputers.”
The NYT and HPCwire also have good overview articles. The HPCwire article points out that the Tianhe-1A has a relatively low Linpack efficiency compared to the Jaguar.
“Although the Linpack performance is a stunning 2.5 petaflops, the system left a lot of potential FLOPS in the machine. Its peak performance is 4.7 petaflops, yielding a Linpack efficiency of just over 50 percent. To date, this is a rather typical Linpack yield for GPGPU-accelerated supers. Because the GPUs are stuck on the relatively slow PCIe bus, the overhead of sending calculations to the graphics processors chews up quite a few cycles on both the CPUs and GPUs. By contrast, the CPU-only Jaguar has a Linpack/peak efficiency of 75 percent. Even so, Tianhe-1A draws just 4 megawatts of power, while Jaguar uses nearly 7 megawatts and yields 30 percent less Linpack.”
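The efficiency and power figures in the excerpt can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted above. Jaguar's Linpack score is not stated directly, so it is derived here from the "30 percent less" remark, and its power draw is rounded to the quoted "nearly 7 megawatts"; both are assumptions.

```python
# Figures quoted in the post and the HPCwire excerpt.
tianhe_linpack_pf = 2.507  # measured Linpack score, petaflops
tianhe_peak_pf = 4.7       # theoretical peak, petaflops
tianhe_power_mw = 4.0      # megawatts
jaguar_power_mw = 7.0      # assumption: "nearly 7 megawatts" rounded to 7

# Assumption: Jaguar's Linpack score derived from "yields 30 percent less Linpack".
jaguar_linpack_pf = tianhe_linpack_pf * (1 - 0.30)

# Linpack efficiency is the measured score divided by the theoretical peak.
tianhe_efficiency = tianhe_linpack_pf / tianhe_peak_pf

# Performance per megawatt for each machine, and the ratio between them.
tianhe_pf_per_mw = tianhe_linpack_pf / tianhe_power_mw
jaguar_pf_per_mw = jaguar_linpack_pf / jaguar_power_mw
power_advantage = tianhe_pf_per_mw / jaguar_pf_per_mw

print(f"Tianhe-1A Linpack efficiency: {tianhe_efficiency:.1%}")
print(f"Linpack per megawatt, Tianhe-1A vs Jaguar: {power_advantage:.2f}x")
```

Under these assumptions the efficiency comes out at roughly 53 percent, matching "just over 50 percent", and Tianhe-1A delivers about 2.5 times as much Linpack performance per megawatt as Jaguar.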
The (unofficial) “official” list of the fastest supercomputers is TOP500, which seems to be inaccessible at the moment, due no doubt to the heavy load caused by the news stories above. The TOP500 list is due for a refresh next month.
September 11th, 2010
UMBC’s Multicore Computational Center will host the Second Workshop on Frontiers of Multi-Core Computing on 22-23 September 2010. The workshop will involve a wide range of people from universities, industry and government who will exchange ideas, discuss issues, and develop the strategies for coping with the challenges of parallel and multicore computing.
“Multi- (e.g., Intel Westmere and IBM Power7) and many-core (e.g., NVIDIA Tesla and AMD FireStream GPUs) microprocessors are enabling more compute- and data-intensive computation in desktop computers, clusters, and leadership supercomputers. However efficient utilization of these microprocessors is still a very challenging issue. Their differing architectures require significantly different programming paradigms when adapting real-world applications. The actual porting costs are actively debated, as well as the relative performance between GPUs and CPUs.”
The workshop is free but those interested should register online. See the workshop schedule for details on presentations and timing.
November 10th, 2009
The Economist has been running a series of online Oxford Union style debates on topical issues — CEO pay, healthcare, climate change, etc. The latest one is on cloud computing: This house believes that the cloud can’t be entirely trusted.
In his opening remarks, moderator Ludwig Siegele says
“The participants in this debate, including the three guest speakers, all agree that computing is moving into the cloud. “We are experiencing a disruptive moment in the history of technology, with the expansion of the role of the internet and the advent of cloud-based computing”, says Stephen Elop, president of Microsoft’s business division, which generates about a third of the firm’s revenues ($13 billion) and more than half of its profits ($4.5 billion) in the most recent quarter. Marc Benioff, chief executive of Salesforce.com, the world’s largest SaaS provider with over $1.2 billion in sales in the past 12 months, is no less bullish: ‘Like the shift [from the mainframe to the client/server architecture] that roiled our industry in decades past, the transition to cloud computing is happening now because of major discontinuities in cost, value and function.'”
While the debate’s proposition suggests that security or privacy is its focus, it’s really a broader argument about how software services will be delivered in the future in which security is just one aspect.
“Whether and to what extent companies and consumers elect to hand their computing over to others, of course, depends on how much they trust the cloud. And customers still have many questions. How reliable are such services? What about privacy? Don’t I lose too much control? What if Salesforce.com, for instance, changes its service in a way I do not like? Are such web-based services really cheaper than traditional software? And how easy is it to get my data if I want to change providers? Are there open technical standards that would make this easier?”