Probabilistic Approximate Algorithms for Distributed Data Mining in Peer-to-Peer Networks
Monday, April 28, 2008, 11:15am
The dissertation introduces the concept of approximate local algorithms that can estimate data mining result within desired accuracy boundary with user-specified probabilistic guarantee by operating within a spatial locality of the executioner-node. As a foundation of probabilistic approximation in P2P network, a random-walk based uniform data sampling approach is proposed, that removes the bias and dependence in sampling caused by varying degrees of connectivity and sizes of data shared. Then the sampling technique is applied to develop approximate local algorithms for solving the specific data mining problem of K-means clustering and frequent itemset mining in the context of P2P network. Two K-means clustering algorithms are developed, one of which extends the concept of centralized K-means algorithm to distributed dynamic peer-to-peer environment, while the other provides probabilistic guarantee on accuracy of clustering result in a static P2P network. A frequent itemset mining algorithm is developed as a direct application of the uniform data sampling technique that discovers most of the frequent itemsets with high probability using bounded communication.
The main contribution of this research work is to introduce the concept of approximate local algorithms for data mining in P2P network that provides probabilistic guarantee of accuracy. It builds a basic tool for approximate data analysis in P2P network, a uniform data sampling technique, and develops communication efficient approximate local algorithms for mining data distributed in such network. The algorithms developed here provide data mining results within desired accuracy level and probabilistic guarantee, and shows good scalability with low communication overhead.