The dissertation introduces the concept of approximate local algorithms
that can estimate data mining result within desired accuracy boundary
with user-specified probabilistic guarantee by operating within a
spatial locality of the executioner-node. As a foundation of
probabilistic approximation in P2P network, a random-walk based uniform
data sampling approach is proposed, that removes the bias and dependence
in sampling caused by varying degrees of connectivity and sizes of data
shared. Then the sampling technique is applied to develop approximate
local algorithms for solving the specific data mining problem of K-means
clustering and frequent
itemset mining in the context of P2P network. Two K-means clustering
algorithms are developed, one of which extends the concept of
centralized K-means algorithm to distributed dynamic peer-to-peer
environment, while the other provides probabilistic guarantee
on accuracy of clustering result in a static P2P network. A frequent
itemset mining algorithm is developed as a direct application of the
uniform data sampling technique that discovers most of the frequent
itemsets with high probability using bounded communication.
The main contribution of this research work is to introduce the concept
of approximate local algorithms for data mining in P2P network that
provides probabilistic guarantee of accuracy. It builds a basic tool
for approximate data analysis in P2P network, a uniform
data sampling technique, and develops communication efficient
approximate local algorithms for mining data distributed in such
network. The algorithms developed here provide data mining results
within desired accuracy level and probabilistic guarantee, and
shows good scalability with low communication overhead.
]]>