| UMBC ebiquity |
Probabilistic Approximate Algorithms for Distributed Data Mining in Peer-to-Peer NetworksTweetSpeaker: Souptik Datta Start: Monday, April 28, 2008, 11:15AM Abstract: Peer-to-peer(P2P) computing is emerging as a new distributed computing
paradigm for novel applications that involves exchange of information
among peers with little centralized coordination. Analyzing data
distributed in P2P networks requires peer-to-peer data mining algorithms
that can mine the data without data centralization. However,
replicating result of centralized data mining in an exact fashion is
often communication-wise expensive. Approximate algorithms can be a
realistic and communication-efficient alternative in this case.This
dissertation concentrates on developing approximate data mining
algorithms suitable for P2P networks, that closely estimates the result
of centralized data mining algorithm with probabilistic guarantee using
minimal communication.
The dissertation introduces the concept of approximate local algorithms
that can estimate data mining result within desired accuracy boundary
with user-specified probabilistic guarantee by operating within a
spatial locality of the executioner-node. As a foundation of
probabilistic approximation in P2P network, a random-walk based uniform
data sampling approach is proposed, that removes the bias and dependence
in sampling caused by varying degrees of connectivity and sizes of data
shared. Then the sampling technique is applied to develop approximate
local algorithms for solving the specific data mining problem of K-means
clustering and frequent
itemset mining in the context of P2P network. Two K-means clustering
algorithms are developed, one of which extends the concept of
centralized K-means algorithm to distributed dynamic peer-to-peer
environment, while the other provides probabilistic guarantee
on accuracy of clustering result in a static P2P network. A frequent
itemset mining algorithm is developed as a direct application of the
uniform data sampling technique that discovers most of the frequent
itemsets with high probability using bounded communication.
The main contribution of this research work is to introduce the concept
of approximate local algorithms for data mining in P2P network that
provides probabilistic guarantee of accuracy. It builds a basic tool
for approximate data analysis in P2P network, a uniform
data sampling technique, and develops communication efficient
approximate local algorithms for mining data distributed in such
network. The algorithms developed here provide data mining results
within desired accuracy level and probabilistic guarantee, and
shows good scalability with low communication overhead.
Tags: dissertation, datamining, p2p, privacy Host: Hillol Kargupta , |