Empowering Scientific Discovery by Distributed Data Mining on the Grid Infrastructure
Friday, September 28, 2007, 9:30am - 11:30am
Extracting meaningful knowledge from data grids requires sophisticated data mining techniques. This dissertation investigates Distributed Data Mining (DDM) techniques for the grid infrastructure, using data grids in the astronomy domain as a representative example. The gigantic, heterogeneous, geographically distributed repositories of astronomical sky surveys pose challenges to the data miner, since most off-the-shelf data mining systems require the data to be downloaded to a single location before any analysis can begin. This requirement imposes serious scalability constraints on the data mining system and fundamentally hinders the scientific discovery process. To enable astronomers to tap the richness of sky survey catalogs, we describe a system for Distributed Exploration of Massive Astronomical Catalogs (DEMAC), which contains distributed algorithms for (1) Principal Component Analysis (PCA), enabling dimension reduction of correlated astrophysical parameters; (2) Outlier Detection, for identification of "interesting" galaxies; and (3) Classification of astronomical sources.
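To illustrate why PCA does not require downloading the data to a single location, the following is a minimal sketch, not the DEMAC algorithm itself. It assumes the simplest possible setting, in which each site holds different rows over the same set of attributes. Each site then ships only a small summary (row count, column sums, and sum of outer products), and the coordinator merges the summaries into a global covariance matrix. The function names (local_stats, merge_stats, top_component) and the toy data are hypothetical.

```python
import math

def local_stats(rows):
    """Per-site summary: row count, column sums, and sum of outer products."""
    d = len(rows[0])
    sums = [0.0] * d
    outer = [[0.0] * d for _ in range(d)]
    for r in rows:
        for i in range(d):
            sums[i] += r[i]
            for j in range(d):
                outer[i][j] += r[i] * r[j]
    return len(rows), sums, outer

def merge_stats(stats):
    """Combine per-site summaries into the global sample covariance matrix."""
    d = len(stats[0][1])
    n = sum(s[0] for s in stats)
    sums = [sum(s[1][i] for s in stats) for i in range(d)]
    outer = [[sum(s[2][i][j] for s in stats) for j in range(d)] for i in range(d)]
    mean = [x / n for x in sums]
    return [[(outer[i][j] - n * mean[i] * mean[j]) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iters=200):
    """Leading principal direction, found by power iteration."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical toy data: two sites, two correlated attributes per object.
site_a = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
site_b = [(4.0, 8.1), (5.0, 9.8)]

# Only the summaries cross the network, never the raw rows.
cov = merge_stats([local_stats(site_a), local_stats(site_b)])
pc1 = top_component(cov)
```

The merged covariance is identical to the one a centralized computation would produce, while the communication cost depends on the number of attributes rather than the number of catalog entries. Settings in which different sites hold different attributes of the same objects need a different approach and are not covered by this sketch.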
Demand for catalog data from the astronomy community has been growing rapidly, which has ushered in new mechanisms to support scalable performance. One such mechanism allows users to download and locally manage different parts of the overall repository, resulting in partial images of the data in distributed environments. Collaboration among users with such personalized databases (MyDBs) results in the formation of distributed peer-to-peer (P2P) networks. We investigate strategies for data transfer and mining in peer-to-peer MyDB environments.