Privacy and confidentiality are of great concern in our era of Big Data. In this post, I want to discuss one formal approach to privacy, called differential privacy. The idea was invented by Dwork, McSherry, Nissim and Smith (2006). A nice review by Cynthia Dwork can be found here.
1. What Is It?
There is a long history of people proposing methods for releasing information while preserving privacy. But whenever someone released information, some smart cryptographer would find a way to crack the privacy.
A famous example is the Netflix Prize. Quoting from Wikipedia:
“Netflix has offered $1,000,000 prize for a 10% improvement in its recommendation system. Netflix has also released a training dataset for the competing developers to train their systems. While releasing this dataset they had provided a disclaimer: To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly-assigned ids. Netflix is not the only available movie rating portal on the web; there are many including IMDB. In IMDB also individuals can register and rate movies, moreover they have the option of not keeping their details anonymous. Narayanan and Shmatikov had cleverly linked the Netflix anonymized training database with the Imdb database (using the date of rating by a user) to partly de-anonymize the Netflix training database.”
The problem is that you have to protect against all other information that a malicious agent might have access to.
Suppose we have data database consisting of data points where . We want to release some sort of statistical summary of the database but we do not want to compromise any individual in the database. Consider an algorithm that takes a database as input and then outputs some random quantity . We say that two databases and are neighboring databases if they differ in one element. In that case we write .
The algorithm satisfies differential privacy if, for all sets , and all pairs ,
The intuition is this. Suppose you are considering whether or not to allow yourself to be included in a database. If differential privacy holds, then the output barely depends on whether or not you agree to be in the database. More precisely, the distribution of the output has low dependence on any one individual.
Suppose, to be concrete, that . Let be a real-valued function of the database, such as the sample mean. Define the sensitivity
where the supremum is over all neighboring pairs of data bases. In the case of the mean, .
Now we release
where has a Laplace distribution with density . It then can be shown easily that -differential privacy holds. (There are other ways to achieve differential privacy but this is the simplest.)
It is also possible to release a differentially private database. To do this, we first construct a differentially private density estimator . The we draw a sample . The resulting data base is also differentially private. See this and this for example.
3. What’s the Catch?
The field of differential privacy has grown very quickly. Do a Google search and you will find a goldmine of interesting papers.
But there is one problem. If the data are high-dimensional and you want to release a differentially private database, you will inevitably end up adding lots of noise. (You can see this by considering the simple but common situation of a multinomial distribution with many categories. The real data will have many empty categories but differential privacy forces you to fill in these zeroes with non-trivial probability.)
In short, differential privacy does not work well with high-dimensional data.
I attended a workshop on privacy at IPAM. The workshop was interesting because it brought together two very different groups of people. One group was computer scientists whose background was mostly in cryptography. The other group was applied statisticians. There was an interesting intellectual tension. The cryptographers wanted the strongest possible privacy guarantees. The statisticians wanted practical methods that worked well with large, messy datasets. I had sympathy for both positions.
I’m not an expert in this area but my current feeling is that differential privacy is too strong for many realistic settings. You end up outputting mostly noise. But I don’t yet know a good way to weaken differential privacy without giving up too much.
This is a fascinating area because it is of great practical interest, there is lots of interesting theory and it brings together theoretical computer science and statistics.
If you’re interested, I reiterate my suggestion to do a Google search on differential privacy. There’s a lot of stuff out there.
Cynthia Dwork, Frank McSherry, Kobbi Nissim, Adam Smith (2006). Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography Conference (TCC), Springer, 2006.
Cynthia Dwork (2006). Differential Privacy. International Colloquium on Automata, Languages and Programming (ICALP) 2006, p. 1-12.