Probabilistic Approximate Algorithms for Distributed Data Mining in Peer-to-Peer Networks

by

Monday, April 28, 2008, 11:15am

datamining, dissertation, p2p, privacy

Peer-to-peer(P2P) computing is emerging as a new distributed computing paradigm for novel applications that involves exchange of information among peers with little centralized coordination. Analyzing data distributed in P2P networks requires peer-to-peer data mining algorithms that can mine the data without data centralization. However, replicating result of centralized data mining in an exact fashion is often communication-wise expensive. Approximate algorithms can be a realistic and communication-efficient alternative in this case.This dissertation concentrates on developing approximate data mining algorithms suitable for P2P networks, that closely estimates the result of centralized data mining algorithm with probabilistic guarantee using minimal communication.

The dissertation introduces the concept of approximate local algorithms that can estimate data mining result within desired accuracy boundary with user-specified probabilistic guarantee by operating within a spatial locality of the executioner-node. As a foundation of probabilistic approximation in P2P network, a random-walk based uniform data sampling approach is proposed, that removes the bias and dependence in sampling caused by varying degrees of connectivity and sizes of data shared. Then the sampling technique is applied to develop approximate local algorithms for solving the specific data mining problem of K-means clustering and frequent itemset mining in the context of P2P network. Two K-means clustering algorithms are developed, one of which extends the concept of centralized K-means algorithm to distributed dynamic peer-to-peer environment, while the other provides probabilistic guarantee on accuracy of clustering result in a static P2P network. A frequent itemset mining algorithm is developed as a direct application of the uniform data sampling technique that discovers most of the frequent itemsets with high probability using bounded communication.

The main contribution of this research work is to introduce the concept of approximate local algorithms for data mining in P2P network that provides probabilistic guarantee of accuracy. It builds a basic tool for approximate data analysis in P2P network, a uniform data sampling technique, and develops communication efficient approximate local algorithms for mining data distributed in such network. The algorithms developed here provide data mining results within desired accuracy level and probabilistic guarantee, and shows good scalability with low communication overhead.

Hillol Kargupta

OWL Tweet

UMBC ebiquity