Cloud Computing : Data Privacy Preserving Methods
for Outsourced Computation
Under the assumption of untrusted cloud providers, outsourcing data intensive computation
such as database services and data mining to public clouds raises privacy concerns.
primary purpose of this assignment is to understand the data privacy problem with
outsourced computation in public clouds and implement several methods to mitigate this
problem. This project consists of two parts. (1) Multiplication perturbation methods for
outsourced data mining; and (2) Crypto-index for outsourced database services.
Part 1: Multiplication perturbation for outsourced data mining.
The idea of data perturbation is to change the original values of a dataset that is prepared
for outsourced data mining, so that the untrusted cloud provider cannot figure out any
original value while working on the perturbed data. The basic principle is to preserve both
data privacy and data utility – the data owner can still obtain high quality data mining
models with the perturbed data. In this experiment, we will be looking at one convenient
method – geometric data perturbation (GDP). GDP is designed to preserve the utility for
Euclidean-distance based data mining models such as kNN classifiers, linear classifiers,
Support Vector Machine (SVM) classifiers (with certain kernels), and clustering with
Euclidean distance as the similarity measure. In this project, we consider only classification
Let’s define GDP first. The data for mining is often a table with k columns and n rows,
denated as a k x n matrix. Let x be any record of the table, which is represented as a kelement
column vector. Let R be a secret random orthogonal matrix (so that R times R’s
transpose is the identity matrix I) and t be a secret random k-element column vector, both
of which are shared by all original records. Let d be a randomly generated k-element noise
vector, individually for each record, drawn from a normal distribution N(0, σ2
), where the
is determined by the user. Let y be the perturbed vector for the original vector x.
The geometric data perturbation is defined as
Classification modeling is to find the best function that can predict the label of each
record, by learning from a set of given examples in the form (record, label) – we call it
training data. The given training datasets are in the SVM format. Each line of the data file
represents a record, which looks like
where index:value indicates the value for the specific dimension “index” of the record. We
will use three datasets in the experiments: breast cancer, diabetes, and heart disease. You
can use this Python program or this Java program for reading the data.
When you apply GDP to protect classification modeling, you perturb only the records but
keep the label unchanged. As a result, each row the outsourced training data looks like
(perturbed record, label). In this way, the cloud side can still work on the perturbed records
label index:value index:value ..to train a classification model, which should have close accuracy as the one trained with the
original data (the accuracy is dependent on the setting of sigma). However, the cloud
provider cannot observe the original record, neither can they figure out the original
classification model, as long as the perturbation parameters R and t are not disclosed (under
certain security assumptions).
You will need to implement the GDP perturbation method and conduct some experiments to
understand the important features. Specifically, you need to finish the following tasks.
You need to implement the GDP perturbation program that can be run with the
following command line format.
where sigma_seting is the sigma value for generating the noise vector d, input_file is
one of the given data in SVM format, output_file is the perturbed data. Python and
Java are recommended for implementing the perturbation method.
A few classification methods: k nearest neighbor, SVM, and AdaBoost are given for
experiments, which are located at nimbus17:/home/hadoop/project4/. They can be
used in the following way.
The program has simplified all the parameter settings for training the classifiers. For
example, it uses five-fold cross-validation; the setting k of knnc is limited to less than
21; adaboost uses 500 trees; and svm parameters are not optimized. In the output of
the algorithms, you will find the accuracy and standard deviation, which will be
needed in your experiments. You will need to run several experiments to see the
accuracy of the models trained with (1) the original data, (2) the perturbed datasets
generated with sigma value of the noise distribution N(0, σ2
) set to 0.05, 0.1, 0.15,
0.2, 0.25, respectively. Replace the “training_data” file with the corresponding files
when you run the tests.
Answer the questions:
Question 1.1 Implement the GDP algorithm and paste your GDP implementation code here.
Question 1.2 Conduct the accuracy evaluation on the three datasets and report your
experimental results with the following table for each dataset. Each cell represents the
accuracy and standard deviation (e.g., in the form 0.82+/-0.02) for the corresponding
classification method and the sigma setting. Note that SVM output does not have standard
deviation, which is fine. Intepret the table and write down your observations and
gdp sigma_setting original_data_file perturbed_data_file
./adaboost -o training_data
./svm-train -v 5 training_datamethods
Question 1.3 The provided classificaiton program has hidden the details of “crossvalidation”.
The default setting is set to five folds. Find the definition of cross-validation from
the Web. Describe the procedure of cross-validation with your own words, and answer why
we need cross-validation.
Question 1.4 Under what assumptions the security of the GDP perturbation method is
guaranteed? What are the potential attacks? – describe a few that you can think of.
Question 1.5 How much time did you spend on Question 1.1 and 1.2, respectively?
Question 1.6 How useful is this task to your understanding of the problem of privacy
preserving outsourced mining in the cloud?
Part 2: Crypto-index for range query processing
Database management is resource consuming. An appealing solution is to export database
services to the cloud, which raises privacy concerns, too. Among the existing approaches,
crypto-index is a simple solution (not so secure) that transforms a domain to a set of
random IDs. It is easy to process queries on transformed data. However, the result may
have low precision. In this task, you will implement a simple crypto-index method and
experiment wtih single-dimensional range query on transformed data.
For simplicity, let’s work on the continuous data. Assume one dimensional of the data, for
example, the three datasets you have seen, is in the domain [min, max]. The crypto-index
method partitions the domain into a few bins and use a random ID to represent each bin.
Correspondingly, when you process a range query, say finding records with dimension i in
the range [u, v], you will transform the query to the transformed data domain as “finding
records with dimension i in the set of bin IDs [id1, id2,…]“.Let’s assume the input data is in the SVM format. The following transformation program will
transform one dimension of the data. Its incomplete python implementation is given as a
reference (you should revise test the code before using it, and you can implement your own
Java version as well if you prefer Java).
where dimension_i is the selected dimension for transformation (starting from 0), bin_width
is a percentage of the domain, i.e., 0.1 or 10%, indicating the whole domain is partitioned
into 10 equi-width bins, bin_encoding_file contains the bin encoding information for
and output_data is the file with values in dimension_i replaced with random_IDs according
to the bin encoding scheme. Now you need to implement other programs as follows.
Question 2.1 Develop a query evaluation program, the usage of which is
where [start, end] of dimension_i is the original range for search, bin_encoding is the one
generated from the previous program, and transformed_data is the output_data from the
previous program. The program will print the precision of the query, which is defined as (the
number of records exactly in [start, end])/(the number of records returned by query
processing on transformed data). Post the code here.
Question 2.2 Develop a simple script that processes a number n, e.g., n=10, of randomly
generated ranges for your queries. Note these ranges should be within the domain of
dimension_i. Compute the average query result precision of the n ranges, for different
bin_width settings: 0.02, 0.04, 0.06, 0.08, 0.1. You should try your script on the three
datasets. Report the result precision and standard deviation (e.g., 0.82+/-0.05) in the
following table, and also post your script here.
bin_width 0.02 0.04 0.06 0.08 0.1
crypto-index-transdata input_data dimenion_i bin_width bin_encoding_file output_data
start_1 end_1 random_ID1
start_2 end_2 random_ID2
crypto-index-query original_data transformed_data bin_encoding_file dimenion_i start endQuestion 2.3 Discuss the potential attacks to the crypto-index method that you can come
Question 2.4 How much time did you spend on each of the tasks 2.1-2.3, respectively?
Deliverables Turn in the PDF report that answers all the questions, including the code for