Privacy Preserving Distributed Data Mining: A Multi-objective Optimization and Algorithmic Game-theoretic Approach

by

Wednesday, September 16, 2009, 9:30am - Wednesday, September 16, 2009, 11:30am

325 ITE

datamining, privacy

Use of technology for data collection and analysis has seen an unprecedented growth in the last couple of decades. Individuals and organizations generate huge amount of data through everyday activities. This data is either centralized for pattern identification or mined in a distributed fashion for efficient knowledge discovery and collaborative computation. This, obviously, has raised serious concerns about privacy issues. The data mining community has responded to this challenge by developing a new breed of algorithms that are privacy preserving. Specifically, cryptographic techniques for secure multi-party function evaluation form the class of privacy preserving data mining algorithms for distributed computation environments. However, these algorithms require all participants in the distributed system to follow a monolithic privacy model and also make strong assumptions about the behavior of participating entities. These conditions do not necessarily hold true in practice. Therefore, most of the existing work in privacy preserving distributed data mining fail to serve the purpose when applied to large real-world distributed data mining applications.

In this dissertation we develop a novel framework for privacy preserving distributed data mining that allows personalization of privacy requirements for individuals in a large distributed system and removes certain assumptions regarding participant behavior, thereby making the framework efficient and real-world adaptable.

First, we propose the idea of personalized privacy for individuals in a large distributed system based on the fact that privacy is a social concept. Different parties in a distributed computing environment have varied privacy requirements for their data, and also varying availability of computation and communication resources. Therefore, we model privacy as a multi-objective optimization function where each party attempts to find the optimal choice between two conflicting objectives --- (i) maximizing the data privacy, and (ii) minimizing the cost associated with the privacy guarantee. Each party optimizes its own objective to define the privacy model parameter that satisfies its privacy and cost requirements and then participates in the collaborative computation.

Secondly, to address the issue of assumptions regarding user behavior in cryptography-based privacy preservation techniques, we formulate privacy preserving distributed data mining as a game. The participating entities are the players of the game and the strategies they adopt in communicating their data, doing necessary computations and attacking others’ data to reveal personal information, decide the result of the game in terms of the quality of the data mining results. Knowing that, in the absence of a supervisor, the tendency of any player in this game would be to cheat, we design penalizing mechanism and blend it with the distributed data mining algorithm for getting a self-correcting system that forces parties to follow the protocol and not cheat.

The framework that we have proposed is independent of the choice of the privacy model for the distributed computation and also applicable to any privacy preserving data mining application involving multi-party function evaluation in a distributed environment. To demonstrate the working of our framework, we have adapted it to work for some real life distributed data mining applications such as web advertisement ranking, distributed feature selection, and online similarity identification in browsing patterns. We have designed mechanisms for privacy preserving sum computation and inner product computation in a distributed environment and adapted the framework to work for Bayes optimal model of privacy and $epsilon$-differential privacy model. We have simulated the working of the distributed applications and presented experimental results for each of the algorithms developed, using the Distributed Data Mining Toolkit (DDMT) developed by the DIADIC laboratory at UMBC.

Committee

  • Dr. Hillol Kargupta (chair), CSEE
  • Dr. Tim Oates, CSEE
  • Dr. Tim Finin, CSEE
  • Dr. Aryya Gangopadhyay, IS
  • Dr. Tom Armstrong, Math

Hillol Kargupta

OWL Tweet

UMBC ebiquity