Empowering Scientific Discovery by Distributed Data Mining on the Grid Infrastructure

Friday, September 28, 2007, 9:30am - 11:30am

325b ITE

The grid-based computing paradigm has attracted much attention in recent years. The sharing of distributed computing resources (such as software, hardware, data, and sensors) is an important aspect of grid computing. Computational Grids focus on methods for handling compute-intensive tasks, while Data Grids are geared toward data-intensive computing. Grid-based computing has been put to use in several scientific disciplines such as astronomy, engineering, climate studies, ecology, biology, and health sciences. For example, in astronomy, breakthroughs in telescope, detector, and computer technology allow sky surveys to produce terabytes of images and catalogs, which are typically stored in data grids.

Extraction of meaningful knowledge from data grids requires the development of sophisticated data mining techniques. This dissertation investigates Distributed Data Mining (DDM) techniques for the grid infrastructure. We study data grids in the astronomy domain as a representative example. The gigantic, heterogeneous, geographically distributed repositories of the astronomy sky surveys pose challenges to the data miner, since most off-the-shelf data mining systems require the data to be downloaded to a single location before further analysis. This imposes serious scalability constraints on the data mining system and fundamentally hinders the scientific discovery process. To enable astronomers to tap the richness of sky survey catalogs, we describe a system for Distributed Exploration of Massive Astronomical Catalogs (DEMAC), which contains algorithms for distributed (1) Principal Component Analysis (PCA), enabling dimension reduction of correlated astrophysical parameters; (2) Outlier Detection, for identification of "interesting" galaxies; and (3) Classification of astronomical sources.
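
As a rough illustration of why PCA need not require downloading all data to one location, the sketch below shows one generic way to compute it from per-site summaries. It is not the DEMAC algorithm. It assumes, purely for illustration, that every site holds different rows (objects) with the same columns (parameters), and the function names local_summary and combine_pca are hypothetical.

```python
import numpy as np

def local_summary(X):
    """Run at each site: row count, column sums, and the Gram matrix X^T X.

    Only these small summaries leave the site, never the raw rows.
    """
    X = np.asarray(X, dtype=float)
    return X.shape[0], X.sum(axis=0), X.T @ X

def combine_pca(summaries, k):
    """Run at a coordinator: merge site summaries and return the top-k
    eigenvalues and eigenvectors of the global sample covariance matrix."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    gram = sum(s[2] for s in summaries)
    mean = total / n
    # sum_i (x_i - mean)(x_i - mean)^T  ==  X^T X - n * mean mean^T
    cov = (gram - n * np.outer(mean, mean)) / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending order
    order = np.argsort(eigvals)[::-1][:k]       # keep the k largest
    return eigvals[order], eigvecs[:, order]
```

Under this assumption, each site transmits only a d-by-d matrix, a length-d vector, and a count for d parameters, regardless of how many rows it stores. Catalogs that hold different parameters for the same objects would need a different approach.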

The demand for catalog data from the astronomy community has been growing rapidly. This has ushered in new mechanisms to support scalable performance. One such mechanism allows users to download and locally manage different parts of the overall repository, resulting in partial images of the data in distributed environments. Collaboration among users with such personalized databases (MyDBs) results in the formation of distributed peer-to-peer (P2P) networks. We investigate strategies for data transfer and mining in peer-to-peer MyDB environments.

Hosted by: Hillol Kargupta
