]>
In this dissertation we develop a novel framework for privacy preserving distributed data mining that allows personalization of privacy requirements for individuals in a large distributed system and removes certain assumptions regarding participant behavior, thereby making the framework efficient and real-world adaptable.
First, we propose the idea of personalized privacy for individuals in a large distributed system based on the fact that privacy is a social concept. Different parties in a distributed computing environment have varied privacy requirements for their data, and also varying availability of computation and communication resources. Therefore, we model privacy as a multi-objective optimization function where each party attempts to find the optimal choice between two conflicting objectives --- (i) maximizing the data privacy, and (ii) minimizing the cost associated with the privacy guarantee. Each party optimizes its own objective to define the privacy model parameter that satisfies its privacy and cost requirements and then participates in the collaborative computation.
Secondly, to address the issue of assumptions regarding user behavior in cryptography-based privacy preservation techniques, we formulate privacy preserving distributed data mining as a game. The participating entities are the players of the game and the strategies they adopt in communicating their data, doing necessary computations and attacking others’ data to reveal personal information, decide the result of the game in terms of the quality of the data mining results. Knowing that, in the absence of a supervisor, the tendency of any player in this game would be to cheat, we design penalizing mechanism and blend it with the distributed data mining algorithm for getting a self-correcting system that forces parties to follow the protocol and not cheat.
The framework that we have proposed is independent of the choice of the privacy model for the distributed computation and also applicable to any privacy preserving data mining application involving multi-party function evaluation in a distributed environment. To demonstrate the working of our framework, we have adapted it to work for some real life distributed data mining applications such as web advertisement ranking, distributed feature selection, and online similarity identification in browsing patterns. We have designed mechanisms for privacy preserving sum computation and inner product computation in a distributed environment and adapted the framework to work for Bayes optimal model of privacy and $epsilon$-differential privacy model. We have simulated the working of the distributed applications and presented experimental results for each of the algorithms developed, using the Distributed Data Mining Toolkit (DDMT) developed by the DIADIC laboratory at UMBC.